2 Thinking Like A Robo-Grader: What The Research Tells Us… Words Matter!

Alise Lamoreaux

While it’s not possible to know the actual code behind the proprietary systems of the major testing companies, it is possible to find research from the people involved in creating the technology driving the industry. Specific algorithms were not discussed in the professional research projects, but general methodology was, especially in the early research studies before proprietary rights were involved. It is also possible to find student projects in artificial intelligence attempting to recreate the Hewlett Foundation’s ASAP competition results. The information provided in this chapter is based on the knowledge available in 2019. Based on the history of the automated testing movement, it is likely that there will be improvements to the systems, but unless there is a major technological advancement, as personal computers and the Internet were in the 1990s, the basics will likely remain the same.

The framework of the features to be evaluated by the automated essay grader is based on assumptions about what indicates good writing. However, essay grading can be plagued with inconsistencies in determining what good writing really involves. Ellis Page began with the premise that there was an intrinsic aspect to good writing that couldn’t be measured by a computer. He called the intrinsic components the “Trins”. He believed that approximations could be developed to represent those intrinsic features and called them the “Proxes”. The concepts he developed are foundational to understanding the coding behind automated essay graders (AEG). The basis for determining good writing has bias built into it. Assumptions or beliefs on the part of the program developers are fundamental to the baseline from which the AEG is created.

Some Common Assumptions Underlying Good Writing:

  • People who read and write more frequently have broader knowledge and larger vocabularies, which correlate with higher essay scores
  • The more people read and write, the greater their exposure to a larger vocabulary and the more thorough their understanding of how to properly use that information in their own writing
  • The number of sentences in an essay equates to the quality of the essay
  • Complex sentences are of more value than simple sentences
  • Longer essays are more likely to have more unique words which shows a bigger vocabulary
  • Short essays have a low word count and fewer words means lesser writing ability
  • Correct spelling indicates a command over language and facility of use
  • Good punctuation is an indicator of a well-structured essay
  • Good essays will use vocabulary similar to that of the high-scoring essays in the data set against which the essay is judged
  • Intrinsic features such as style and fluency cannot be measured, but can be approximated with measurable qualities like sentence length, word length, and essay length

What happens to an essay submitted to an AEG?

The process of “reading” for AEG is about analyzing components within the essay submitted.  The essay will be “tokenized” or broken into individual tokens/features to be assessed. In reviewing the research on the development of AEG, different terminology was used to explain the process associated with the feature selection, but the basics are similar because of the limitations of artificial intelligence at this time.  How the tokens are valued or weighted will vary and this methodology will not be shared based on intellectual property rights and ownership of the data set essays.  There are several different coding platforms for determining the grammatical correctness of sentences and how much variance is allowed from the standard set.  This is another area that is not shared information.  The testing companies don’t reveal the source of their essay data sets or who the human essay scorers were or even whom the human readers may have worked for in the past. The standardization information is proprietary.
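The idea of breaking an essay into tokens can be sketched in a few lines of Python. This is only an illustration, not the method of any testing company: the function names `split_sentences` and `word_tokens` are made up for this example, and the splitting rules are deliberately naive (the student projects discussed in this chapter used the Natural Language Toolkit instead).

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def word_tokens(text):
    """Lowercase word tokens; punctuation and spacing are thrown away."""
    return re.findall(r"[a-z']+", text.lower())

essay = "Dogs are interesting animals. Dogs are friendly to their owners."

sentences = split_sentences(essay)
tokens = word_tokens(essay)

print(sentences)  # the essay as a list of sentence strings
print(tokens)     # the essay as a flat list of lowercase words
```

Once the essay is in this form, the program no longer "sees" paragraphs or layout; it only sees lists of items it can count.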

A glimpse into the process can be seen in the research projects of the students studying artificial intelligence who were trying to recreate the results from the ASAP competition. In their projects they explain their assumptions and the features they selected to measure, as well as how their program compared to the results they were trying to match.

In 2012 at Stanford University, a group of students (Mahanna, Johns, & Apte) reported a final project for their CS229 Machine Learning course involving automated essay grading. The purpose of their project was to develop algorithms to assess and grade essay responses. They used the essays provided for the Hewlett Foundation ASAP competition. In the explanation of the data they used, they stated the essays were written by students from grades 7-10. Each essay was approximately 150-550 words in length. The essays were divided into 8 sets and had different types of essays associated with each set. They used a linear regression model to assess the essays.
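To give a feel for what a linear regression model does, here is a minimal sketch that fits a straight line relating a single feature (word count) to a score. It is a simplification under stated assumptions: the Stanford project combined many features, the helper names `fit_line` and `predict` are hypothetical, and the numbers below are invented toy values, not data from the ASAP essays.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented toy data for illustration only (NOT from the ASAP essay sets).
word_counts = [150, 250, 350, 450, 550]
human_scores = [2, 3, 4, 5, 6]

slope, intercept = fit_line(word_counts, human_scores)

def predict(word_count):
    """Predicted score for a new essay, using only its word count."""
    return slope * word_count + intercept

print(round(predict(300), 2))
```

The model "learns" nothing about meaning. It only finds the line that best matches the numbers it was given, which is why the choice of features matters so much.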

The hypothesis behind their study was that a good essay would involve features such as language fluency and dexterity, diction and vocabulary, structure and organization, orthography, and content. They stated they were unable to test for content. They used the Natural Language Toolkit (NLTK) and Textmining to process the language. The process to prepare the essays for assessment involved removing all placeholders for proper nouns and stripping all the punctuation from the essays.

Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. The students used a model for extracting features regarding the words in the essays called a “Bag-of-Words” (BOW). A bag-of-words is a representation of text that describes the occurrence of words within a document. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur in the document. BOW involves two things: a vocabulary of known words and a measure of the presence of known words. A set of top words was created for the BOW and the “Stop Words” were discarded. Stop words are commonly used words: the, a, of, is, at, and so on. Search engines are commonly programmed to ignore these words as they are deemed irrelevant for searching purposes because they occur frequently in language. To save space and time, stop words are dropped at indexing time. Once the BOW is established, the top words are assigned a numerical value or a “weight”.
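A bag-of-words can be sketched with Python’s standard library. This is a minimal illustration, not the students’ actual code: the stop word list is a tiny sample built from the examples named above, and the names `bag_of_words` and `vectorize` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "is", "at"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Count each remaining word after dropping stop words; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def vectorize(bag, vocabulary):
    """Turn a bag into a list of numbers, one count per known vocabulary word."""
    return [bag[word] for word in vocabulary]

essay = "The dog is friendly. The owner of the dog is at the park."
bag = bag_of_words(essay)
vocabulary = sorted(bag)            # the "known words" for this toy example
vector = vectorize(bag, vocabulary)

print(vocabulary)
print(vector)

# Because order is discarded, these two very different sentences look identical.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same_bag)
```

The last comparison shows the main limitation of the approach: a bag-of-words cannot tell who did what to whom, only which words were present.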

Numerical features were also assigned to total word count per essay, average word length per essay, number of sentences in the essay (which was used to indicate fluency and dexterity), and character count per essay. Various parts of speech (nouns, adjectives, adverbs, and verbs) were determined to be good proxes for vocabulary and diction. The essays were tokenized or split into sentences before the tagging process. Correct spelling is treated as an indicator of command of language, so the ratio of misspelled words was another feature assessed.
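The counting features described above can be sketched as follows. This is an assumption-laden illustration: `essay_features` is a hypothetical name, `KNOWN_WORDS` is a made-up miniature dictionary standing in for a real spelling list, and part-of-speech tagging is left out because it needs a tool such as NLTK.

```python
import re

# Hypothetical miniature dictionary; a real system would use a full word list.
KNOWN_WORDS = {"dogs", "are", "friendly", "wag", "their", "tails"}

def essay_features(text, known_words):
    """Return the simple numeric features an AEG might compute for one essay."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    misspelled = [w for w in words if w.lower() not in known_words]
    n_words = len(words)
    return {
        "char_count": len(text),
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "misspelled_ratio": len(misspelled) / n_words if n_words else 0.0,
    }

essay = "Dogs are frendly. Dogs wag their tails."   # "frendly" is misspelled on purpose
features = essay_features(essay, KNOWN_WORDS)

for name, value in features.items():
    print(name, value)
```

Every value here is a prox in Page’s sense: none of them measures fluency or diction directly, only something countable that is assumed to go along with it.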

The Stanford study also revealed some information about the 8 sets of essays that were provided for use in the study.  Sets 1-2 were persuasive/informative essays and were relatively free from contextual text.  Sets 3-6 expected a critical response after reading a text provided (story, essay, excerpt from a book) and therefore were expected to have more specific content.  Sets 7-8 were narrative essays based on personal experiences or imagination.  The students’ results showed that their model of analysis performed relatively well on sets 1-2, the persuasive essays where the content was easier to control for.  The model suffered on sets where content could vary more and they stated, “Our model does not work well on narrative essays.”

In the fall of 2016, at Harvard University, a group of students (Gupta, Hwang, Lisker, & Loughlin) reported a final project for their CS109A course. They also studied machine learning as it related to automatic essay grading. Like the Stanford study, these students used the essays from the Hewlett Foundation ASAP competition. Their hypothesis was that word count would positively correlate with a good essay and that longer essays were reflective of deeper thinking and stronger content. They assumed a skilled writer would use a greater variety of words.

The study at Harvard used similar methodology to the Stanford students’ study.  The Natural Language Toolkit was used and stop words were removed from indexing as well.  They based their analysis on number of sentences per essay, percent of misspellings, percentages of each part of speech, the likelihood of a word appearing that matched the test data essays, total word count, and unique word count. Their results also showed better scoring results on persuasive essays.

A project from Rice University (Lukic & Acuna) also used the Hewlett Foundation ASAP essays to develop and evaluate an AEG system. Their hypothesis was that numerical data could be a good predictor of an essay score. They measured: word count, character count, average word length, misspelled word count, adjective count, transition/analysis word count, and the total number of times words from the prompt occurred in the essay. As with the projects from Stanford and Harvard, the Natural Language Toolkit was used to tokenize the essays, strip them of punctuation and stop words, and tag parts of speech.

The Rice study featured noun-verb pairs within sentences and gave weight to those. In addition, weight was given to the number of words found in both the prompt and the essay. Nouns were weighted, and it was believed that they would demonstrate a focus surrounding topics and also demonstrate if an essay became “off-topic”. The 200 most frequently occurring words within a data set were selected. Word pairing was evaluated and believed to indicate topical association between words. Nouns that were “personally identifying information” were censored, and censored nouns were stripped from the essay. In the results, the authors felt that censoring nouns with personal identifiers may have affected the noun-verb pairings and thus affected results within the project.
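Two of the Rice features, the overlap between prompt and essay and the list of most frequent words, can be sketched like this. It is only one possible reading of those features: the function names and the sample prompt and essay are invented for illustration, stop word removal is skipped for brevity, and noun-verb pairing is omitted because it requires part-of-speech tagging.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens with punctuation removed."""
    return re.findall(r"[a-z]+", text.lower())

def prompt_word_occurrences(prompt, essay):
    """Count how many essay tokens are words that also appear in the prompt."""
    prompt_vocab = set(tokens(prompt))
    return sum(1 for t in tokens(essay) if t in prompt_vocab)

def top_words(essays, n=200):
    """The n most frequently occurring words across a set of essays."""
    counts = Counter()
    for essay in essays:
        counts.update(tokens(essay))
    return [word for word, _ in counts.most_common(n)]

# Invented sample text for illustration only.
prompt = "Describe how computers affect people."
essay = "Computers help people learn. People use computers daily."

overlap = prompt_word_occurrences(prompt, essay)
frequent = top_words([essay], n=2)

print(overlap)
print(frequent)
```

An essay that never repeats the prompt’s key words would score zero on the overlap feature, which is one way a program might flag writing as “off-topic” without understanding the topic at all.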

Automated essay grading has become a reality. In helping students prepare to take an examination that will be scored by an AEG, it’s important to know what the “rules” of the grader might be. After reviewing several studies, it seems that inferences can be made as to how the AEG will be “reading” the essay. Summing up the results of the projects at Stanford, Harvard, and Rice Universities, the following inferences can be made about the basis of the algorithms used by the automated essay graders and what their measurement capabilities are as of the writing of this book:

  • Lexical complexity rewards bigger vocabulary words and usage of unique words
  • Text complexity is similar to assessing the reading level of the text
  • Proportion of errors in grammar, usage, and mechanics can be rated
  • Essay length matters
  • Tokenizing components and ignoring stop words are part of indexing to “read”
  • Matching the examinee’s essay vocabulary to the test data set’s vocabulary can matter

In February of 2012, Douglas D. Hesse, Executive Director of Writing at the University of Denver, published a paper entitled “Can Computers Grade Writing? Should They?” In this paper, Hesse states that for automated essay graders, “Content analysis is based on vocabulary measures and organizational development scores are fairly related.” AEG depends on chains of words and on the words those words are associated with. Sophistication of vocabulary may be determined by the collection of terms. For example, the word “dog” could be replaced with the word “canine” and the sentence it was used in would rate higher. In addition, the word canine could be viewed as a more unique word because dog is more commonly used.

An ambiguous aspect of the vocabulary usage is the component of grade-level appropriateness.  A teacher/human grader is responding to student work and can judge appropriate vocabulary for the essay.  An AEG has a Bag of Words in some form it is “reading” for, and the words commanding the highest value may or may not be grade-level appropriate.  What is the appropriate grade level for college transition or high school equivalency (HSE) or a workplace? Who determines vocabulary level appropriateness?  It appears that it’s the assumptions of the people writing the computer code behind the AEG and the test set of essay samples that are being compared to as the standard.

Hesse also provides examples of how longer sentence length is rewarded. Longer sentences may be valued as having more “style” than short sentences. According to Hesse, a short sentence combination like, “Dogs are interesting animals. Dogs are friendly to their owners. Dogs show affection by wagging their tails.” would score lower with an AEG than the following sentence, “Friendly to their owners, wagging tails to show affection, dogs are interesting animals.” Chaining words matters. The Bag of Words created for the essay evaluation will contain words associated with other words. For example, owner and wag are words associated with dogs.
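The difference between the two dog passages can be made concrete by measuring average sentence length, one of the proxes mentioned earlier. This is a minimal sketch with a naive sentence splitter; `average_sentence_length` is a hypothetical helper, and no claim is made that any real AEG computes the value exactly this way.

```python
import re

def average_sentence_length(text):
    """Average number of words per sentence, using a naive sentence splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(word_counts) / len(word_counts)

short_version = (
    "Dogs are interesting animals. Dogs are friendly to their owners. "
    "Dogs show affection by wagging their tails."
)
combined_version = (
    "Friendly to their owners, wagging tails to show affection, "
    "dogs are interesting animals."
)

short_avg = average_sentence_length(short_version)
combined_avg = average_sentence_length(combined_version)

print(round(short_avg, 2))   # three sentences of 4, 6, and 7 words
print(combined_avg)          # one sentence of 13 words
```

The three short sentences average under six words each, while the combined sentence has thirteen. A grader that rewards sentence length would favor the second version even though both say nearly the same thing.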

Hesse’s paper seems to support the student projects from Stanford and Harvard that AEG is better suited to grade certain types of writing than others.  He says, “Computer scores tend to be more valid and reliable – in relation to scores from expert human readers – when the tasks are very carefully designed and limited in length. The SAT® writing sample, for example, gives students 25 mins to write on a limited task…”

On the SAT® website (https://collegereadiness.collegeboard.org/sample-questions/essay), the sample prompts provided for test preparation ask students to develop an argument, and thus to examine persuasion, in their response. Once again, this seems to support the Stanford and Harvard student project findings, in which results were more accurate for sets 1-2 from the Hewlett Foundation ASAP essay sets, which were the persuasive essays.

It’s interesting to think about tokenization of an essay. As human readers we are not used to looking at essays or extended responses without paragraphs and punctuation. Thinking like a Robo-grader creates a new word awareness. What words demonstrate sentence complexity? What are the “Sign Posts” or “Cue terms” that indicate organization? It’s not enough to be a “transition word” when being evaluated by an automated essay grader. Word choice matters, but “stop words” do not; yet for human readers, those “commonly used” words are essential parts of communication, not to be ignored.