Michael Flor
Educational Testing Service, Text, Language and Computation, Faculty Member
Research Interests: NLP, Computational Linguistics, Natural Language Processing, English Language, Languages and Linguistics, Second Language Acquisition, Language Acquisition, Language Education, Language, Linguistics, Cognitive Linguistics, Applied Linguistics, Corpus Linguistics, Text Mining, Discourse Analysis, Language Teaching, Computer Assisted Language Learning, English Literature, N-Grams, Metaphor, Grammatical and Lexical Cohesion, Lexical Semantics, Quantitative Narrative Analysis, Lexicology, Computational Linguistics & NLP, Distributional Semantics, Text Complexity, Construction Grammar, Conceptual Metaphor Theory, Learner Corpora, and Second Language Writing
It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. The enhancement of systems to detect and deal with these kinds of strategies is often an iterative process, whereby as new strategies come to light they need to be evaluated and effective mechanisms built into the automated scoring systems to handle them. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite essentially being nonsense.
We found that a classifier built on Babel-generated essays and good-faith essays, using features from the automated scoring engine, can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that, when we integrated this classifier into the automated scoring engine, it flagged very few responses submitted operationally (76 of 434,656). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. The measure of lexical-semantic cohesion shows promise in being able to distinguish Babel-generated essays from good-faith essays.
Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and add it to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a certain degree, depending on task. This points to future work that would generalize the capability to detect semantic incoherence in essays.
Systems and methods described herein utilize supervised machine learning to generate a figure-of-speech prediction model for classifying content words in a running text as either figurative (e.g., a metaphor, simile, etc.) or non-figurative (i.e., literal). The prediction model may extract and analyze any number of features in making its prediction, including a topic model feature, unigram feature, part-of-speech feature, concreteness feature, concreteness difference feature, literal context feature, non-literal context feature, and off-topic feature, each of which is described in detail herein. Since the use of figures of speech in writing may signal content sophistication, the figure-of-speech prediction model allows scoring engines to further take into consideration a text's use of figures of speech when generating a score.
This paper presents an exploratory study on large-scale detection of idiomatic expressions in essays written by non-native speakers of English. We describe a computational search procedure for automatic detection of idiom-candidate phrases in essay texts. The study used a corpus of essays written during a standardized examination of English language proficiency. Automatically flagged candidate expressions were manually annotated for idiomaticity. The study found that idioms are widely used in EFL essays. The study also showed that a search algorithm that accommodates the syntactic and lexical flexibility of idioms can increase the recall of idiom instances by 30%, though it also increases the number of false positives.
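The flexible search idea described above can be sketched as lemma-sequence matching with bounded gaps. This is a minimal illustration, not the paper's actual procedure; the lemmatized input and the gap limit are assumptions.

```python
# Minimal sketch of flexible idiom matching: an idiom pattern is a sequence
# of lemmas; we allow up to `gap` intervening tokens so that inflected and
# syntactically modified variants ("kicked the proverbial bucket") still
# match. The lemmatized input is assumed to come from an upstream pipeline.

def find_idiom(lemmas, pattern, gap=2):
    """Return the (start, end) span of the first flexible match, or None."""
    n = len(lemmas)
    for start in range(n):
        if lemmas[start] != pattern[0]:
            continue
        matched_end = start
        ok = True
        for target in pattern[1:]:
            # search within the allowed gap after the last matched token
            window = lemmas[matched_end + 1 : matched_end + 2 + gap]
            if target in window:
                matched_end += 1 + window.index(target)
            else:
                ok = False
                break
        if ok:
            return (start, matched_end + 1)
    return None

sent = ["he", "kick", "the", "proverbial", "bucket", "yesterday"]
print(find_idiom(sent, ["kick", "the", "bucket"]))  # → (1, 5)
```

Allowing gaps is exactly what trades extra recall for extra false positives, as the abstract notes: a larger `gap` matches more modified variants but also more accidental co-occurrences.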
We present two NLP components for the Story Cloze Task – dictionary-based sentiment analysis and lexical cohesion. While previous research found no contribution from sentiment analysis to the accuracy on this task, we demonstrate that sentiment is an important aspect. We describe a new approach, using a rule that estimates sentiment congruence in a story. Our sentiment-based system achieves strong results on this task. Our lexical cohesion system achieves accuracy comparable to previously published baseline results. A combination of the two systems achieves better accuracy than published baselines. We argue that sentiment analysis should be considered an integral part of narrative comprehension.
The rate of attendance of students with learning disabilities at higher education institutions is increasing, elevating the importance of ensuring both access and validity with respect to testing accommodations for graduate entrance exams. In this paper, we examine fairness issues surrounding two testing accommodations: (a) extended time and (b) spell-checkers, framing the discussion around the considerations proposed by Phillips (1994) for determining the appropriateness of testing accommodations. We address this issue through our review of the fairness considerations discussed in Phillips and the extant literature on accommodations for college students. We also evaluate empirical results of a study conducted using data from the Analytical Writing section of the revised Graduate Record Examination administered to both students with LD and students without LD. Our study findings stemming from our evaluation of the literature and the results of our empirical study using GR...
Graduate school recommendations are an important part of admissions in higher education, and natural language processing may be able to provide objective and consistent analyses of recommendation texts to complement readings by faculty and admissions staff. However, these sorts of high-stakes, personal recommendations are different from the product and service reviews studied in much of the research on sentiment analysis. In this report, we develop an approach for analyzing recommendations and evaluate the approach on four tasks: (a) identifying which sentences are actually about the student, (b) measuring specificity, (c) measuring sentiment, and (d) predicting recommender ratings. We find substantial agreement with human annotations and analyze the effects of different types of features.
In this paper we present an application of associative lexical cohesion to the analysis of text complexity as determined by expert-assigned US school grade levels. Lexical cohesion in a text is represented as a distribution of pairwise positive normalized mutual information values. Our quantitative measure of lexical cohesion is Lexical Tightness (LT), computed as the average of such values per text. It represents the degree to which a text tends to use words that are highly inter-associated in the language. LT is inversely correlated with grade levels and adds significantly to the amount of explained variance when estimating grade level with a readability formula. In general, simpler texts are more lexically cohesive and complex texts are less cohesive. We further demonstrate that lexical tightness is a very robust measure: we compute lexical tightness for a whole text and also across segmental units of a text. While texts are more cohesive at the sentence level than at the paragraph or whole-text levels, the same systematic variation of lexical tightness with grade level is observed at all levels of segmentation. Measuring text cohesion at various levels uncovers a specific genre effect: informational texts are significantly more cohesive than literary texts, across all grade levels.
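The Lexical Tightness computation lends itself to a short sketch: average the pairwise association scores over all content-word type pairs in a text. The association table below is a hypothetical stand-in for normalized PMI values estimated from a large corpus, not the paper's actual resource.

```python
# Sketch of Lexical Tightness (LT): average pairwise word association over
# the content-word types of a text. NPMI below is a hypothetical table of
# normalized-PMI scores; a real system would estimate these from corpus
# co-occurrence counts.

from itertools import combinations

NPMI = {  # hypothetical normalized-PMI scores, roughly in [-1, 1]
    frozenset(("doctor", "nurse")): 0.62,
    frozenset(("doctor", "hospital")): 0.55,
    frozenset(("nurse", "hospital")): 0.58,
    frozenset(("doctor", "banana")): -0.10,
}

def lexical_tightness(content_words, pmi_table, default=0.0):
    """Average association over all pairs of distinct content-word types."""
    types = sorted(set(content_words))
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    scores = [pmi_table.get(frozenset(p), default) for p in pairs]
    return sum(scores) / len(scores)

print(lexical_tightness(["doctor", "nurse", "hospital"], NPMI))
```

On this toy input the three pairs average to about 0.58; a text mixing unrelated words ("doctor", "banana") would score lower, which is the intuition behind simpler texts being lexically tighter.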
Research Interests: Lexicology, Natural Language Processing, Computational Linguistics, Applied Linguistics, Statistical Modeling of Language, Lexical Semantics, Computational Linguistics & NLP, Text Analysis, Computational Lexicography, Digital Libraries, Web Information Retrieval, Semantics, Language Technology, Text Complexity, Readability, Distributional Semantics, Halliday, M.A.K., Word Associations, Lexical Complexity, Grammatical and Lexical Cohesion, Lexical Cohesion, Language Models, and Lexical Association
This article describes TextEvaluator, a comprehensive text-analysis system designed to help teachers, textbook publishers, test developers, and literacy researchers select reading materials that are consistent with the text-complexity goals outlined in the Common Core State Standards. Three particular aspects of the TextEvaluator measurement approach are highlighted: (1) attending to relevant reader and task considerations, (2) expanding construct coverage beyond the two dimensions of text variation traditionally assessed by readability metrics, and (3) addressing two potential threats to tool validity: genre bias and blueprint bias. We argue that systems that are attentive to these particular measurement issues may be more effective at helping users achieve a key goal of the new Standards: ensuring that students are challenged to read texts at steadily increasing complexity levels as they progress through school, so that all students acquire the advanced reading skills needed for success in college and careers.
We present an automated system that computes multi-cue associations and generates associated-word suggestions, using lexical co-occurrence data from a large corpus of English texts. The system performs expansion of cue words to their inflectional variants, retrieves candidate words from corpus data, finds maximal associations between candidates and cues, computes an aggregate score for each candidate, and outputs an n-best list of candidates. We present experiments using several measures of statistical association, two methods of score aggregation, ablation of resources, and additional filters applied to retrieved candidates. The system achieves 18.6% precision on the COGALEX-4 shared task data. Results with additional evaluation methods are presented. We also describe an annotation experiment which suggests that the shared task may underestimate the appropriateness of candidate words produced by the corpus-based system.
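The retrieve-score-aggregate pipeline in the abstract can be sketched as follows. The association table, the aggregation functions, and all scores are illustrative assumptions, not the system's actual resources.

```python
# Sketch of multi-cue associative lookup: for each candidate word, take its
# association strength with each cue, then aggregate the per-cue scores into
# one ranking score. ASSOC is a hypothetical stand-in for corpus-derived
# association data; sum and product stand in for the two aggregation methods.

from math import prod

ASSOC = {  # hypothetical cue -> candidate association strengths
    "circus": {"clown": 0.8, "tent": 0.6, "bread": 0.1},
    "funny":  {"clown": 0.7, "joke": 0.9, "tent": 0.05},
}

def rank_candidates(cues, assoc, aggregate=sum, n_best=3):
    """Return the n-best candidates under the chosen aggregation."""
    candidates = {c for cue in cues for c in assoc.get(cue, {})}
    scored = []
    for cand in candidates:
        per_cue = [assoc.get(cue, {}).get(cand, 0.0) for cue in cues]
        scored.append((aggregate(per_cue), cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:n_best]]

print(rank_candidates(["circus", "funny"], ASSOC))  # → ['clown', 'joke', 'tent']
```

Note the design difference between the aggregators: summing rewards a candidate strongly tied to any one cue, while a product (`aggregate=prod`) demands some association with every cue, zeroing out candidates unknown to any of them.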
Research Interests: Natural Language Processing, Semantics, Computational Linguistics, Vocabulary, Statistical Modeling of Language, Lexical Semantics, Computational Linguistics & NLP, Lexis, Thesaurus, N-Grams, Lexical Access, Distributional Semantics, Word Associations, Language Models, Tip-of-the-Tongue, and Lexical Association
This annotation study is designed to help us gain an increased understanding of the paraphrase strategies used by native and nonnative English speakers and how these strategies might affect test takers' essay scores. Toward that end, this study examines the types of paraphrase and linguistic modifications used in test-taker responses, along with any differences between native and nonnative English speakers. We are also interested in how these factors might influence the final essay score. Outcomes discussed in this report can be used to inform the development of new e-rater® scoring engine features that capture information related to paraphrase, specifically in nonnative speaker responses to the TOEFL® exam integrated writing task.
We present a suggestive finding regarding the loss of associative texture in the process of machine translation, using comparisons between (a) original and back-translated texts, (b) reference and system translations, and (c) better and worse MT systems. We represent the amount of association in a text using a word association profile: a distribution of pointwise mutual information between all pairs of content word types in a text. We use the average of the distribution, which we term lexical tightness, as a single measure of the amount of association in a text. We show that the lexical tightness of human-composed texts is higher than that of machine-translated materials; human references are tighter than machine translations, and better MT systems produce lexically tighter translations. While the phenomenon of the loss of associative texture has been theoretically predicted by translation scholars, we present a measure capable of quantifying the extent of this phenomenon.
We describe a new representation of the content vocabulary of a text, which we call a word association profile, that captures the proportions of highly associated, mildly associated, unassociated, and dis-associated pairs of words that co-exist in the given text. We illustrate the shape of the distribution and observe variation with genre and target audience. We present a study of the relationship between quality of writing and word association profiles. For a set of essays written by college graduates on a number of general topics, we show that the higher-scoring essays tend to have higher percentages of both highly associated and dis-associated pairs, and lower percentages of mildly associated pairs of words. Finally, we use word association profiles to improve a system for automated scoring of essays.
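The profile described above can be sketched as a simple binning of pairwise PMI values into the four association classes; the thresholds used here are hypothetical, not the ones from the study.

```python
# Sketch of a word association profile: bin the pairwise PMI values of a
# text's content-word pairs into four classes. The thresholds are
# illustrative placeholders, not the paper's calibrated cutoffs.

def association_profile(pmi_values, high=2.0, low=0.5):
    """Proportions of highly / mildly / un- / dis-associated pairs."""
    bins = {"high": 0, "mild": 0, "none": 0, "dis": 0}
    for v in pmi_values:
        if v >= high:
            bins["high"] += 1
        elif v >= low:
            bins["mild"] += 1
        elif v >= 0:
            bins["none"] += 1
        else:
            bins["dis"] += 1
    total = len(pmi_values) or 1
    return {k: count / total for k, count in bins.items()}

print(association_profile([3.1, 1.2, 0.1, -0.7]))
```

The abstract's finding would then read as: higher-scoring essays shift mass toward the "high" and "dis" bins and away from the "mild" bin.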
Flor M. (2012). Four types of context for automatic spelling correction. Traitement Automatique des Langues (TAL), 53:3, 61-99.
This paper presents an investigation on using four types of contextual information for improving the accuracy of automatic correction of single-token non-word misspellings. The task is framed as contextually-informed re-ranking of correction candidates. Immediate local context is captured by word n-gram statistics from a Web-scale language model. The second approach measures how well a candidate correction fits in the semantic fabric of the local lexical neighborhood, using a very large Distributional Semantic Model. In the third approach, recognizing a misspelling as an instance of a recurring word can be useful for re-ranking. The fourth approach looks at context beyond the text itself: if the approximate topic can be known in advance, spelling correction can be biased towards the topic. Effectiveness of the proposed methods is demonstrated with an annotated corpus of 3,000 student essays from international high-stakes English language assessments. The paper also describes an implemented system that achieves high accuracy on this task.
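The re-ranking framing can be sketched as a weighted combination of a base correction score with contextual fit scores. All candidate scores and weights below are hypothetical placeholders; the actual system derives such evidence from a Web-scale language model and a distributional semantic model.

```python
# Sketch of contextually-informed re-ranking of spelling-correction
# candidates: a base channel score (e.g., edit-distance likelihood) is
# combined with context scores such as local n-gram fit and semantic fit.
# All scores and weights here are hypothetical stand-ins.

def rerank(candidates, weights=(1.0, 1.0, 1.0)):
    """candidates: list of (word, channel, ngram_fit, semantic_fit) tuples."""
    w_ch, w_ng, w_sem = weights
    scored = [
        (w_ch * ch + w_ng * ng + w_sem * sem, word)
        for word, ch, ng, sem in candidates
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored]

# Misspelling "peice" in "a peice of cake": context should prefer "piece".
cands = [
    ("peace", 0.6, 0.1, 0.2),  # closer in spelling, poor contextual fit
    ("piece", 0.5, 0.9, 0.8),  # strong n-gram and semantic fit
]
print(rerank(cands))  # → ['piece', 'peace']
```

The point of the sketch is that context scores can overturn the channel-model ranking: "peace" wins on spelling similarity alone, but "piece" wins once contextual fit is weighed in.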
This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern-matching searches (with wildcards or regular expressions) on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
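The core trie idea behind such an n-gram store can be sketched in a few lines. This toy structure only illustrates shared-prefix storage with counts; TrendStream's compressed, updateable, memory-mapped binary format is far more sophisticated.

```python
# Toy trie over word n-grams with counts. N-grams sharing a prefix
# ("new york city", "new york times") share trie nodes, which is the
# storage-saving idea behind trie-based n-gram architectures.

class NGramTrie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram, count=1):
        """Walk (creating nodes as needed) and accumulate the count."""
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NGramTrie())
        node.count += count

    def lookup(self, ngram):
        """Return the stored count, or 0 if the n-gram is absent."""
        node = self
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

trie = NGramTrie()
trie.add(("new", "york", "city"), 5)
trie.add(("new", "york", "times"), 3)
print(trie.lookup(("new", "york", "city")))  # → 5
```

Because every prefix is itself a node, prefix queries and wildcard-style traversals fall out of the same structure, with no separate index, which mirrors the abstract's claim about pattern matching without additional indexing.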
Research Interests: Information Retrieval, Digital Humanities, Computational Linguistics, Data Compression, Corpus Linguistics, Optimization Techniques, Computational Linguistics & NLP, N-Grams, Textual Data Compression, Language Models, Statistical Language Models, and Suffix Trees
The Common Core Standards call for students to be exposed to a much greater level of text complexity than has been the norm in schools for the past 40 years. Textbook publishers, teachers, and assessment developers are being asked to refocus materials and methods to ensure that students are challenged to read texts at steadily increasing complexity levels as they progress through school, so that all students remain on track to achieve college and career readiness by the end of 12th grade. Although automated text analysis tools have been proposed as one method for helping educators achieve this goal, research suggests that existing tools are subject to three limitations: inadequate construct coverage, overly narrow criterion variables, and inappropriate treatment of genre effects. Modeling approaches developed to address these limitations are described. Recommended approaches are incorporated into a new text analysis system called SourceRater. Validity analyses implemented on an independent sample of texts suggest that, compared to existing approaches, SourceRater's estimates of text complexity are more reflective of the complexity classifications given in the new standards. Implications for the development of learning progressions designed to help educators organize curriculum, instruction, and assessment in reading are discussed.
