It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. Enhancing systems to detect and deal with such strategies is often an iterative process: as new strategies come to light, they must be evaluated and effective mechanisms built into the automated scoring systems to handle them. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite essentially being nonsense.

We found that a classifier trained on Babel-generated essays and good-faith essays, using features from the automated scoring engine, can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that when we integrated this classifier into the automated scoring engine, it flagged very few responses submitted operationally (76 of 434,656). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. A measure of lexical-semantic cohesion shows promise in being able to distinguish Babel-generated essays from good-faith essays.
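The flagging step described above can be sketched as a threshold on a cohesion score. This is a hypothetical illustration, not the paper's system: the association table, the cohesion measure (fraction of adjacent content-word pairs attested in the table), and the threshold are all toy stand-ins for corpus-derived statistics.

```python
# Hypothetical sketch: flag machine-generated gibberish using a
# lexical-semantic cohesion score. The association table is a toy
# stand-in for corpus-based co-occurrence statistics.
ASSOCIATIONS = {
    ("climate", "warming"), ("students", "school"), ("essay", "writing"),
}

def cohesion(words):
    """Fraction of adjacent word pairs attested in the association table."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for p in pairs if p in ASSOCIATIONS or p[::-1] in ASSOCIATIONS)
    return hits / len(pairs)

def flag_incoherent(words, threshold=0.2):
    """Return True when the text's cohesion falls below the threshold."""
    return cohesion(words) < threshold

coherent = ["students", "school", "essay", "writing"]
babel_like = ["ontological", "turnip", "cascade", "whereas"]
```

A Babel-style word salad produces almost no attested pairs and falls below the threshold, while an ordinary on-topic sentence does not.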

Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and add it to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a certain degree, depending on task. This points to future work that would generalize the capability to detect semantic incoherence in essays.
Systems and methods described herein utilize supervised machine learning to generate a figure-of-speech prediction model for classifying content words in a running text as either figurative (e.g., as a metaphor, simile, etc.) or non-figurative (i.e., literal). The prediction model may extract and analyze any number of features in making its prediction, including a topic model feature, unigram feature, part-of-speech feature, concreteness feature, concreteness difference feature, literal context feature, non-literal context feature, and off-topic feature, each of which is described in detail herein. Since the use of figures of speech in writing may signal content sophistication, the figure-of-speech prediction model allows scoring engines to further take into consideration a text's use of figures of speech when generating a score.
We present a novel rule-based system for automatic generation of factual questions from sentences, using semantic role labeling (SRL) as the main form of text analysis. The system is capable of generating both wh-questions and yes/no questions from the same semantic analysis. We present an extensive evaluation of the system and compare it to a recent neural network architecture for question generation. The SRL-based system outperforms the neural system in both average quality and variety of generated questions.
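The rule-based idea can be sketched from a single SRL-style analysis: a predicate plus labeled arguments yields both wh-questions and a yes/no question. The role inventory, rules, and example analysis below are illustrative only, not the system described above.

```python
# Hypothetical sketch of rule-based question generation from one
# SRL-style analysis (predicate plus labeled arguments).
def generate_questions(srl):
    agent, pred = srl["ARG0"], srl["predicate"]
    patient, lemma = srl["ARG1"], srl["lemma"]
    wh_agent = f"Who {pred} {patient}?"          # question the agent slot
    wh_patient = f"What did {agent} {lemma}?"    # question the patient slot
    yes_no = f"Did {agent} {lemma} {patient}?"   # same analysis, yes/no form
    return [wh_agent, wh_patient, yes_no]

analysis = {"ARG0": "the committee", "predicate": "approved",
            "lemma": "approve", "ARG1": "the proposal"}
```

The same semantic analysis feeds all three rules, mirroring the paper's point that wh- and yes/no questions come from one SRL pass.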
This paper presents an exploratory study on large-scale detection of idiomatic expressions in essays written by non-native speakers of English. We describe a computational search procedure for automatic detection of idiom-candidate phrases in essay texts. The study used a corpus of essays written during a standardized examination of English language proficiency. Automatically flagged candidate expressions were manually annotated for idiomaticity. The study found that idioms are widely used in EFL essays. The study also showed that a search algorithm that accommodates the syntactic and lexical flexibility of idioms can increase the recall of idiom instances by 30%, though it also increases the number of false positives.
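A search that tolerates idiom flexibility can be sketched with a regular expression that allows inflected forms and a bounded number of intervening tokens, so that "spilled all the beans" still matches the lemma sequence "spill the bean". This is a simplified sketch of the general idea, not the paper's search procedure.

```python
# Hypothetical sketch of flexible idiom matching: each lemma may carry a
# short inflectional suffix, and up to max_gap arbitrary tokens may
# appear between idiom words.
import re

def idiom_pattern(lemmas, max_gap=2):
    word = lambda l: r"\b" + re.escape(l) + r"\w{0,3}"  # lemma + suffix
    gap = r"(?:\s+\w+){0,%d}" % max_gap                 # optional inserted words
    return re.compile((gap + r"\s+").join(word(l) for l in lemmas), re.I)

pat = idiom_pattern(["spill", "the", "bean"])
```

The gap allowance buys recall (modified idioms are found) at the cost of precision, which is exactly the trade-off reported above.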
This work lays the foundation for automated assessments of narrative quality in student writing. We first manually score essays for narrative-relevant traits and sub-traits, and measure inter-annotator agreement. We then explore linguistic features that are indicative of good narrative writing and use them to build an automated scoring system. Experiments show that our features are more effective in scoring specific aspects of narrative quality than a state-of-the-art feature set.
Conversations in collaborative problem-solving activities can be used to probe the collaboration skills of the team members. Annotating the conversations into different collaboration skills by human raters is laborious and time consuming. In this report, we describe our work on developing an automated annotation system, CPS-rater, for conversational data from collaborative activities. The linear-chain conditional random field method is used to model the sequential dependencies between the turns of the conversations, and the resulting automated annotation system outperforms systems that do not model the sequential dependency.
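The benefit of modeling sequential dependency between turns can be shown with a minimal first-order Viterbi decoder that combines per-turn (emission) scores with transition scores between adjacent labels. The labels and scores below are toy values for illustration, not the CPS-rater model.

```python
# Minimal sketch of sequence labeling over dialogue turns: Viterbi
# decoding with emission scores per turn and transition scores between
# adjacent labels.
def viterbi(emissions, transitions, labels):
    # emissions: list of {label: score}; transitions: {(prev, cur): score}
    best = {l: emissions[0].get(l, 0.0) for l in labels}
    back = []
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            nxt[cur] = (best[prev] + transitions.get((prev, cur), 0.0)
                        + em.get(cur, 0.0))
            ptr[cur] = prev
        best = nxt
        back.append(ptr)
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]

labels = ["question", "answer"]
# Turn 2 is ambiguous on its own (0.5 vs 0.4); the question->answer
# transition resolves it.
emissions = [{"question": 2.0, "answer": 0.0},
             {"question": 0.5, "answer": 0.4}]
transitions = {("question", "answer"): 1.0}
```

Without the transition score, the second turn would be labeled "question" on emission scores alone; the sequential model flips it to "answer".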
The Common Core Standards call for students to be exposed to a much greater level of text complexity than has been the norm in schools for the past 40 years. Textbook publishers, teachers, and assessment developers are being asked to refocus materials and methods to ensure that students are challenged to read texts at steadily increasing complexity levels as they progress through school so that all students remain on track to achieve college and career readiness by the end of 12th grade. Although automated text analysis tools have been proposed as one method for helping educators achieve this goal, research suggests that existing tools are subject to three limitations: inadequate construct coverage, overly narrow criterion variables, and inappropriate treatment of genre effects. Modeling approaches developed to address these limitations are described. Recommended approaches are incorporated into a new text analysis system called SourceRater. Validity analyses implemented on an independent sample of texts suggest that, compared to existing approaches, SourceRater’s estimates of text complexity are more reflective of the complexity classifications given in the new standards. Implications for the development of learning progressions designed to help educators organize curriculum, instruction, and assessment in reading are discussed.
The report is the first systematic evaluation of the sentence equivalence item type introduced by the GRE® revised General Test. We adopt a validity framework to guide our investigation based on Kane’s approach to validation, whereby a hierarchy of inferences that should be documented to support score meaning and interpretation is evaluated. We present evidence relevant to the generalization inference as well as evidence of construct representation. We analyzed the pool of sentence equivalence items in three studies. The first and second studies focused on the generalization inference and sought to document the construction principles behind the sentence equivalence items, specifically the nature of the vocabulary tested. The third study focused on construct representation and evaluated the contribution of the stem, the keys, and the distractors to item difficulty. We concluded that the vocabulary tested by the sentence equivalence items is appropriate given the purpose of the GRE, namely, to assist in the selection of graduate students. The difficulty of the items was shown to be, in part, a function of the familiarity of the vocabulary as well as the context in which the vocabulary is tested, which we argue is positive validity evidence.
We present two NLP components for the Story Cloze Task – dictionary-based sentiment analysis and lexical cohesion. While previous research found no contribution from sentiment analysis to the accuracy on this task, we demonstrate that sentiment is an important aspect. We describe a new approach, using a rule that estimates sentiment congruence in a story. Our sentiment-based system achieves strong results on this task. Our lexical cohesion system achieves accuracy comparable to previously published baseline results. A combination of the two systems achieves better accuracy than published baselines. We argue that sentiment analysis should be considered an integral part of narrative comprehension.
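The sentiment-congruence rule can be sketched as follows: compute a dictionary-based sentiment score for the story body, then prefer the candidate ending whose sentiment points the same way. The lexicon and the congruence rule below are toy stand-ins for the components described above.

```python
# Hypothetical sketch of sentiment congruence for a cloze-style ending
# choice. The lexicon is a toy stand-in for a sentiment dictionary.
LEXICON = {"happy": 1, "won": 1, "great": 1, "sad": -1, "lost": -1, "cried": -1}

def sentiment(tokens):
    """Sum of dictionary polarities over the tokens."""
    return sum(LEXICON.get(t, 0) for t in tokens)

def pick_ending(story_tokens, endings):
    s = sentiment(story_tokens)
    # prefer the ending whose sentiment is congruent with the story's
    return max(endings, key=lambda e: sentiment(e) * (1 if s >= 0 else -1))

story = ["she", "won", "the", "great", "prize"]
endings = [["she", "cried", "sad"], ["she", "was", "happy"]]
```

A positive story body selects the positive ending, which is the congruence intuition the system formalizes.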
We investigate the effectiveness of semantic generalizations/classifications for capturing the regularities of the behavior of verbs in terms of their metaphoricity. Starting from orthographic word unigrams, we experiment with various ways of defining semantic classes for verbs (grammatical, resource-based, distributional) and measure the effectiveness of these classes for classifying all verbs in a running text as metaphorical or non-metaphorical.
We present a novel situational task that integrates collaborative problem solving behavior with testing in a science domain. Participants engage in discourse, which is used to evaluate their collaborative skills. We present initial experiments for automatic classification of such discourse, using a novel classification schema. Considerable accuracy is achieved with just lexical features. A speech-act classifier, trained on out-of-domain data, can also be helpful.
Developments in the educational landscape have spurred greater interest in the problem of automatically scoring short answer questions. A recent shared task on this topic revealed a fundamental divide in the modeling approaches that have been applied to this problem, with the best-performing systems split between those that employ a knowledge engineering approach and those that almost solely leverage lexical information (as opposed to higher-level syntactic information) in assigning a score to a given response. This paper aims to introduce the NLP community to the largest corpus currently available for short-answer scoring, provide an overview of methods used in the shared task using this data, and explore the extent to which more syntactically-informed features can contribute to the short answer scoring task in a way that avoids the question-specific manual effort of the knowledge engineering approach.
The rate at which students with learning disabilities enter higher education institutions is increasing, elevating the importance of ensuring both access and validity with respect to testing accommodations for graduate entrance exams. In this paper, we examine fairness issues surrounding two testing accommodations, the use of (a) extended time and (b) spell-checkers, framing the discussion around the considerations proposed by Phillips (1994) for determining the appropriateness of testing accommodations. We address this issue through our review of fairness considerations discussed in Phillips and the extant literature on accommodations for college students. We also evaluate empirical results of a study conducted using data from the Analytical Writing section of the revised Graduate Record Examination administered to both students with LD and students without LD. Our study findings stemming from our evaluation of the literature and the results of our empirical study using GR...
In this paper, we address the problem of quantifying the overall extent to which a test-taker's essay deals with the topic it is assigned (the prompt). We experiment with a number of models for word topicality and a number of approaches for aggregating word-level indices into text-level ones. All models are evaluated for their ability to predict the holistic quality of essays. We show that the best text-topicality model provides a significant improvement in a state-of-the-art essay scoring system. We also show that the findings on the relative merits of different models generalize well across three different datasets.
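The two moving parts, a word-level topicality model and a text-level aggregator, can be sketched minimally: give each word a topicality weight and average the weights over the text. The weight table and the default weight below are hypothetical; the paper's models are corpus-based.

```python
# Hypothetical sketch: word-level topicality weights aggregated into a
# text-level score by averaging. The weight table is a toy stand-in for
# a corpus-derived topicality model.
PROMPT_WEIGHTS = {"pollution": 0.9, "emissions": 0.8, "cars": 0.6}

def word_topicality(word):
    return PROMPT_WEIGHTS.get(word, 0.1)  # small default for off-topic words

def text_topicality(tokens):
    """Average word-level topicality over the text (one aggregation choice)."""
    return sum(word_topicality(t) for t in tokens) / len(tokens)

on_topic = ["pollution", "emissions", "cars"]
off_topic = ["pizza", "holiday", "music"]
```

Averaging is only one aggregation option; the paper's comparison of aggregators is precisely about which such choice predicts essay quality best.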
Graduate school recommendations are an important part of admissions in higher education, and natural language processing may be able to provide objective and consistent analyses of recommendation texts to complement readings by faculty and admissions staff. However, these sorts of high-stakes, personal recommendations are different from the product and service reviews studied in much of the research on sentiment analysis. In this report, we develop an approach for analyzing recommendations and evaluate the approach on four tasks: (a) identifying which sentences are actually about the student, (b) measuring specificity, (c) measuring sentiment, and (d) predicting recommender ratings. We find substantial agreement with human annotations and analyze the effects of different types of features.
We present a supervised machine learning system for word-level classification of all content words in a running text as being metaphorical or non-metaphorical. The system provides a substantial improvement upon a previously published baseline, using re-weighting of the training examples and using features derived from a concreteness database. We observe that while the first manipulation was very effective, the second was only slightly so. Possible reasons for these observations are discussed.
This paper presents a study of misspellings, based on annotated data from the ETS Spelling corpus. The corpus consists of 3,000 essays written by examinees, native (NS) and non-native speakers (NNS) of English, on the writing sections of the GRE® and TOEFL® examinations. We find that the rate of misspellings decreases as writing proficiency (essay score) increases, both in TOEFL and in GRE. The severity of misspellings depends on writing proficiency and not on the NS/NNS distinction. Word length and word frequency strongly influence the production of misspellings, showing patterns associated with proficiency. For word frequency, there is also a clear effect of the NS/NNS distinction.
In this paper we present an application of associative lexical cohesion to the analysis of text complexity as determined by expert-assigned US school grade levels. Lexical cohesion in a text is represented as a distribution of pairwise positive normalized mutual information values. Our quantitative measure of lexical cohesion is Lexical Tightness (LT), computed as the average of such values per text. It represents the degree to which a text tends to use words that are highly inter-associated in the language. LT is inversely correlated with grade levels and adds significantly to the amount of explained variance when estimating grade level with a readability formula. In general, simpler texts are more lexically cohesive and complex texts are less cohesive. We further demonstrate that lexical tightness is a very robust measure. We compute lexical tightness for a whole text and also across segmental units of a text. While texts are more cohesive at the sentence level than at the paragraph or whole-text levels, the same systematic variation of lexical tightness with grade level is observed for all levels of segmentation. Measuring text cohesion at various levels uncovers a specific genre effect: informational texts are significantly more cohesive than literary texts, across all grade levels.
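Lexical tightness, as defined above, is the mean of pairwise word-association values over the content-word pairs of a text. A minimal sketch follows; the joint and marginal probabilities are toy values standing in for corpus-estimated statistics, and normalized PMI is used as the association measure.

```python
# Sketch of lexical tightness: the average of normalized pointwise
# mutual information (NPMI) over all content-word pairs in a text.
# The probability tables are toy stand-ins for corpus statistics.
import math
from itertools import combinations

JOINT = {("doctor", "nurse"): 0.004, ("doctor", "hospital"): 0.006,
         ("nurse", "hospital"): 0.005}
MARGINAL = {"doctor": 0.02, "nurse": 0.02, "hospital": 0.03, "turnip": 0.01}

def npmi(w1, w2):
    p = JOINT.get((w1, w2)) or JOINT.get((w2, w1))
    if p is None:
        return 0.0  # unattested pair: treat as independent
    pmi = math.log(p / (MARGINAL[w1] * MARGINAL[w2]))
    return pmi / -math.log(p)  # normalize into [-1, 1]

def lexical_tightness(content_words):
    pairs = list(combinations(content_words, 2))
    return sum(npmi(a, b) for a, b in pairs) / len(pairs)
```

A text whose words are mutually associated scores higher than one containing an unrelated word, matching the inverse relationship with complexity reported above.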
This article describes TextEvaluator, a comprehensive text-analysis system designed to help teachers, textbook publishers, test developers, and literacy researchers select reading materials that are consistent with the text-complexity goals outlined in the Common Core State Standards. Three particular aspects of the TextEvaluator measurement approach are highlighted: (1) attending to relevant reader and task considerations, (2) expanding construct coverage beyond the two dimensions of text variation traditionally assessed by readability metrics, and (3) addressing two potential threats to tool validity: genre bias and blueprint bias. We argue that systems that are attentive to these particular measurement issues may be more effective at helping users achieve a key goal of the new Standards: ensuring that students are challenged to read texts at steadily increasing complexity levels as they progress through school, so that all students acquire the advanced reading skills needed for success in college and careers.
This research is motivated by the expectation that automated scoring will play an increasingly important role in high stakes educational testing. Therefore, approaches to safeguard the validity of score interpretation under automated scoring should be investigated. This investigation illustrates one approach to study the vulnerability of a scoring engine to construct-irrelevant response strategies (CIRS) based on the substitution of more sophisticated words. That approach is illustrated and evaluated by simulating the effect of a specific strategy with real essays. The results suggest that the strategy had modest effects, although it was effective in improving the scores of a fraction of the lower-scoring essays. The broader implications of the results for quality assurance and control of automated scoring engines are discussed.
We present an automated system that computes multi-cue associations and generates associated-word suggestions, using lexical co-occurrence data from a large corpus of English texts. The system performs expansion of cue words to their inflectional variants, retrieves candidate words from corpus data, finds maximal associations between candidates and cues, computes an aggregate score for each candidate, and outputs an n-best list of candidates. We present experiments using several measures of statistical association, two methods of score aggregation, ablation of resources, and applying additional filters on retrieved candidates. The system achieves 18.6% precision on the COGALEX-4 shared task data. Results with additional evaluation methods are presented. We also describe an annotation experiment which suggests that the shared task may underestimate the appropriateness of candidate words produced by the corpus-based system.
Current approaches to supervised learning of metaphor tend to use sophisticated features and restrict their attention to constructions and contexts where these features apply. In this paper, we describe the development of a supervised learning system to classify all content words in a running text as either being used metaphorically or not. We start by examining the performance of a simple unigram baseline that achieves surprisingly good results for some of the datasets. We then show how the recall of the system can be improved over this strong baseline.
This annotation study is designed to help us gain an increased understanding of the paraphrase strategies used by native and nonnative English speakers and how these strategies might affect test takers’ essay scores. Toward that end, this study examines the types of linguistic modifications used in paraphrase in test-taker responses and the differences that may exist between native and nonnative English speakers. We are also interested in how these factors might influence the final essay score. Outcomes discussed in this report can be used to inform the development of new e-rater® scoring engine features that capture information related to paraphrase, specifically in nonnative speaker responses to the TOEFL® exam integrated writing task.
We present a suggestive finding regarding the loss of associative texture in the process of machine translation, using comparisons between (a) original and back-translated texts, (b) reference and system translations, and (c) better and worse MT systems. We represent the amount of association in a text using a word association profile: a distribution of pointwise mutual information values between all pairs of content word types in a text. We use the average of the distribution, which we term lexical tightness, as a single measure of the amount of association in a text. We show that the lexical tightness of human-composed texts is higher than that of the machine-translated materials; human references are tighter than machine translations, and better MT systems produce lexically tighter translations. While the phenomenon of the loss of associative texture has been theoretically predicted by translation scholars, we present a measure capable of quantifying the extent of this phenomenon.
We describe a new representation of the content vocabulary of a text, which we call a word association profile, that captures the proportions of highly associated, mildly associated, unassociated, and dis-associated pairs of words that co-exist in the given text. We illustrate the shape of the distribution and observe variation with genre and target audience. We present a study of the relationship between quality of writing and word association profiles. For a set of essays written by college graduates on a number of general topics, we show that the higher-scoring essays tend to have higher percentages of both highly associated and dis-associated pairs, and lower percentages of mildly associated pairs of words. Finally, we use word association profiles to improve a system for automated scoring of essays.
Flor M. (2012). Four types of context for automatic spelling correction. Traitement Automatique des Langues (TAL), 53:3, 61-99.
This paper presents an investigation of using four types of contextual information for improving the accuracy of automatic correction of single-token non-word misspellings. The task is framed as contextually informed re-ranking of correction candidates. Immediate local context is captured by word n-gram statistics from a Web-scale language model. The second approach measures how well a candidate correction fits in the semantic fabric of the local lexical neighborhood, using a very large Distributional Semantic Model. In the third approach, recognizing a misspelling as an instance of a recurring word can be useful for re-ranking. The fourth approach looks at context beyond the text itself. If the approximate topic can be known in advance, spelling correction can be biased towards the topic. Effectiveness of the proposed methods is demonstrated with an annotated corpus of 3,000 student essays from international high-stakes English language assessments. The paper also describes an implemented system that achieves high accuracy on this task.
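Contextually informed re-ranking can be sketched in a few lines: each candidate carries a base score (here, edit distance) and a context score from local n-gram statistics, and candidates are re-sorted accordingly. The bigram table below is a toy stand-in for a Web-scale language model, and the tie-breaking scheme is an illustrative choice.

```python
# Hypothetical sketch of contextual re-ranking of spelling-correction
# candidates. The bigram counts are a toy stand-in for a large
# language model.
BIGRAMS = {("piece", "of"): 120, ("peace", "of"): 15, ("of", "cake"): 200}

def context_score(candidate, left, right):
    """How well the candidate fits between its left and right neighbors."""
    return BIGRAMS.get((left, candidate), 0) + BIGRAMS.get((candidate, right), 0)

def rerank(candidates, left, right):
    # candidates: list of (word, edit_distance); prefer lower distance,
    # then higher context score
    return sorted(candidates,
                  key=lambda c: (c[1], -context_score(c[0], left, right)))

cands = [("peace", 1), ("piece", 1)]
best = rerank(cands, left="a", right="of")[0][0]
```

Both candidates are one edit away from a misspelling like "piace", so edit distance alone cannot decide; the local context ("a ___ of") breaks the tie.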
We present a computational notion of Lexical Tightness that measures global cohesion of content words in a text. Lexical tightness represents the degree to which a text tends to use words that are highly inter-associated in the language. We demonstrate the utility of this measure for estimating text complexity as measured by US school grade level designations of texts. Lexical tightness strongly correlates with grade level in a collection of expertly rated reading materials. Lexical tightness captures aspects of prose complexity that are not covered by classic readability indexes, especially for literary texts. We also present initial findings on the utility of this measure for automated estimation of complexity for poetry.
Many existing approaches for measuring text complexity tend to overestimate the complexity levels of informational texts while simultaneously underestimating the complexity levels of literary texts. We present a two-stage estimation technique that successfully addresses this problem. At Stage 1, each text is classified into one or another of three possible genres: informational, literary, or mixed. Next, at Stage 2, a complexity score is generated for each text by applying one or another of three possible prediction models: one optimized for application to informational texts, one optimized for application to literary texts, and one optimized for application to mixed texts. Each model combines lexical, syntactic, and discourse features, as appropriate, to best replicate human complexity judgments. We demonstrate that the resulting text complexity predictions are both unbiased and highly correlated with classifications provided by experienced educators.
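The two-stage scheme can be sketched as a genre classifier followed by genre-specific prediction models. Everything below is illustrative: the single "narrativity" feature, the classification thresholds, and the linear-model weights are hypothetical stand-ins for the trained components described above.

```python
# Hypothetical sketch of two-stage complexity estimation:
# Stage 1 picks a genre, Stage 2 applies a genre-specific linear model.
MODELS = {  # per-genre weights for (avg_word_len, avg_sent_len) plus a bias
    "informational": (0.8, 0.10, 1.0),
    "literary":      (1.2, 0.20, 0.0),
    "mixed":         (1.0, 0.15, 0.5),
}

def classify_genre(features):
    # toy Stage-1 rule on a single narrativity feature
    if features["narrativity"] > 0.6:
        return "literary"
    if features["narrativity"] < 0.3:
        return "informational"
    return "mixed"

def complexity(features):
    w_wl, w_sl, bias = MODELS[classify_genre(features)]
    return (w_wl * features["avg_word_len"]
            + w_sl * features["avg_sent_len"] + bias)

literary = {"narrativity": 0.8, "avg_word_len": 4.2, "avg_sent_len": 12.0}
```

Routing each text to a genre-matched model is what removes the systematic over/underestimation the paper identifies.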
This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern-matching searches (with wildcards or regular expressions) on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
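The idea of serving both literal lookups and wildcard searches from one trie, with no auxiliary index, can be shown in miniature. TrendStream itself is a compressed, memory-mappable binary format; the in-memory class below is only an illustration of the query pattern, not its implementation.

```python
# Minimal n-gram trie: literal lookup and '*' wildcard matching run
# over the same structure, with no separate index.

class NGramTrie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram, count):
        """Walk/create the path for an n-gram and add its count."""
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NGramTrie())
        node.count += count

    def lookup(self, ngram):
        """Literal query: return the stored count, or 0 if absent."""
        node = self
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

    def match(self, pattern, prefix=()):
        """Yield (ngram, count) pairs for a pattern with '*' wildcards."""
        if not pattern:
            if self.count:
                yield prefix, self.count
            return
        head, rest = pattern[0], pattern[1:]
        if head == "*":
            for word, child in self.children.items():
                yield from child.match(rest, prefix + (word,))
        elif head in self.children:
            yield from self.children[head].match(rest, prefix + (head,))

trie = NGramTrie()
trie.add(("new", "york", "city"), 42)
trie.add(("new", "york", "times"), 17)
```

Because a wildcard simply fans out across a node's children, pattern queries reuse the same traversal as literal lookups, which is the property the abstract highlights: no additional indexing.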
In this paper we present a new spell-checking system that utilizes contextual information for automatic correction of non-word misspellings. The system is evaluated with a large corpus of essays written by native and non-native speakers of English in response to the writing prompts of high-stakes standardized tests (TOEFL® and GRE®). We also present comparative evaluations with Aspell and the speller from Microsoft Office 2007. Using context-informed re-ranking of candidate suggestions, our system exhibits superior error-correction results overall and also corrects errors generated by non-native English writers at almost the same rate of success as it does for writers who are native English speakers.
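Context-informed re-ranking of the kind described can be sketched as combining a spelling-similarity score with a language-model fit for the surrounding words. The bigram table, the weighting, and the scoring function below are illustrative assumptions, not the system's actual model.

```python
from difflib import SequenceMatcher

# Toy bigram probabilities; a real system would use a large n-gram model.
BIGRAM = {
    ("piece", "of"): 0.4,
    ("peace", "of"): 0.01,
    ("a", "piece"): 0.3,
    ("a", "peace"): 0.02,
}

def similarity(a, b):
    """String similarity standing in for an edit-distance channel model."""
    return SequenceMatcher(None, a, b).ratio()

def rerank(misspelling, candidates, left, right, lm_weight=1.0):
    """Rank candidates by spelling similarity plus bigram fit in context."""
    def score(cand):
        context = (BIGRAM.get((left, cand), 1e-4)
                   * BIGRAM.get((cand, right), 1e-4))
        return similarity(misspelling, cand) + lm_weight * context
    return sorted(candidates, key=score, reverse=True)

# In "a peice of", both "piece" and "peace" are equally close in
# spelling, but the surrounding words favor "piece".
ranked = rerank("peice", ["peace", "piece"], left="a", right="of")
```

The point of the sketch is that spelling similarity alone cannot separate the two candidates here; only the context term breaks the tie, which is what re-ranking with contextual information adds over a conventional speller.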
The Common Core Standards call for students to be exposed to a much greater level of text complexity than has been the norm in schools for the past 40 years. Textbook publishers, teachers, and assessment developers are being asked to refocus materials and methods to ensure that students are challenged to read texts at steadily increasing complexity levels as they progress through school so that all students remain on track to achieve college and career readiness by the end of 12th grade. Although automated text analysis tools have been proposed as one method for helping educators achieve this goal, research suggests that existing tools are subject to three limitations: inadequate construct coverage, overly narrow criterion variables, and inappropriate treatment of genre effects. Modeling approaches developed to address these limitations are described. Recommended approaches are incorporated into a new text analysis system called SourceRater. Validity analyses implemented on an independent sample of texts suggest that, compared to existing approaches, SourceRater's estimates of text complexity are more reflective of the complexity classifications given in the new standards. Implications for the development of learning progressions designed to help educators organize curriculum, instruction, and assessment in reading are discussed.