1 Robo-Grader: Artificial Intelligence As An Automated Essay Grading System, The Backstory
Alise Lamoreaux
The idea of Automated Essay Graders (AEG, or robo-graders) has been around since the early 1960s. A former English teacher, Ellis B. Page, began working on the idea of helping students improve their writing by giving them quick, computer-assisted feedback on their essays. In December of 1964, at the University of Connecticut, Project Essay Grade (PEG®) was born (Page, 1967). At that time, 272 trial essays were written by students in grades 8-12 at an “American High School,” and each was judged by at least four independent teachers. A hypothesis was generated about the variables, also referred to as features, that might influence the teachers’ judgments. The essays were manually entered into an IBM 7040 computer by clerical staff using keypunch cards. The process was time consuming and labor intensive due to the limitations of computers at that time, but the results were impressive.
Page believed that writing could be broken down into what he called a “trin” and a “prox.” The trin was a variable representing a trait of intrinsic interest to the human judge, for example, word choice. The trin was not directly measurable by the computer strategies of the 1960s. The prox was an approximation of, or correlate to, the trin, for example, the proportion of “uncommon words” used by a student (Page, 1967). Thirty variables were identified as criteria for Project Essay Grade (PEG®). Page found that “the overall accuracy of this beginning strategy was startling. The proxes achieved a multiple-correlation of .71 for the first set of essays analyzed, and by chance, achieved the identical coefficient for the second set” (Page, 1967). While the results were impressive, the technology of the time was too cumbersome for practical applications, and computers were not readily accessible to most people. Page’s ideas may have seemed outlandish at the time, but it could be argued that they were prophetic. His work with AEG came decades before students would routinely compose their essays on computers.
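To make the trin/prox distinction concrete, the sketch below shows the general idea in modern Python. It is only an illustration: the essays, scores, and three proxes here are invented for demonstration and are not Page’s actual 30 variables, data, or weights.

```python
# Illustrative sketch only (not Page's actual variables or weights):
# each "prox" below is a measurable stand-in for a "trin" such as diction
# or fluency, and a regression combines the proxes to predict human scores.
import numpy as np
from sklearn.linear_model import LinearRegression

COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def proxes(essay):
    words = essay.lower().split()
    n = max(len(words), 1)
    avg_word_length = sum(len(w) for w in words) / n
    uncommon_ratio = sum(w not in COMMON_WORDS for w in words) / n
    return [n, avg_word_length, uncommon_ratio]  # length, diction, vocabulary proxes

# Hypothetical essays paired with averaged human scores (1-6 scale).
essays = [
    "The cat sat on the mat and it was nice.",
    "Municipal investment in recreational infrastructure yields measurable community benefits.",
    "I liked it a lot and it was good and fun.",
    "Restoring the railroad corridor would broaden access to outdoor recreation.",
    "The warehouse is old but it could be fixed up for people to use.",
]
human_scores = np.array([2.0, 5.5, 1.5, 5.0, 3.0])

X = np.array([proxes(e) for e in essays])
model = LinearRegression().fit(X, human_scores)

# Page's "multiple correlation" is the correlation between the scores the
# proxes predict and the scores the human judges actually gave.
R = np.corrcoef(model.predict(X), human_scores)[0, 1]
print(f"multiple correlation R = {R:.2f}")
```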
Page continued to work on PEG for the next 30 years, and his research consistently showed high correlations between Automated Essay Graders (AEG) and human graders. One study (Page, 1994) analyzed two sets of essays: a group of 495 essays from 1989 and another group of 599 from 1990. The students involved were high school seniors participating in the National Assessment of Educational Progress who were responding to a question about recreational opportunities and whether a city should spend money fixing up old railroad tracks or converting an old warehouse to a new use. Using 20 variables, PEG achieved a correlation of .87 with the scores of the targeted human judges.
In May of 2005, Ellis B. Page passed away at the age of 81. Two years earlier, he had sold Project Essay Grade (PEG®) to a company called Measurement Incorporated. PEG® is currently being used by the State of Utah as the sole essay grader on the state summative writing assessment. According to Measurement Incorporated’s website (www.measurementinc.com), three more states are considering adopting the program. PEG® is also used in 1,000 schools and 3,000 public libraries as a formative assessment tool. Ellis B. Page could be considered the forefather of Automated Essay Graders.
What has changed since Ellis B. Page began Project Essay Grade in 1964? Personal computers and the Internet! The arrival of personal computers in the 1990s transformed what was possible for Automated Essay Graders. With keyboards in the hands of students and the Internet providing a universal platform for submitting text for evaluation (Shermis, Mzumara, Olson, & Harrington, 2001), a new testing industry was born.
In 1997, the Intelligent Essay Assessor® (IEA®) was introduced as another type of automated essay grading system, developed by Thomas Landauer and Peter Foltz. The underlying system had originally been patented in 1989 for indexing documents for information retrieval; that indexing programming was subsequently applied to automated essay grading, and intellectual property rights became a factor in the marketplace of automated essay grading. The Intelligent Essay Assessor® program was designed to use what’s known as Latent Semantic Analysis (LSA), which determines the similarity of words and passages by analyzing bodies of text. Developers using LSA create code that estimates how close the vocabulary of the essay writer is to the targeted vocabulary set (Landauer, Foltz, & Laham, 1998). Like most automated essay grading systems, IEA® indexes documents for information retrieval on features such as the proportion of errors in grammar, the proportion of word usage errors, the proportion of style components, the number of discourse elements, average sentence length, similarity in vocabulary to top-scoring essays, average word length, and total number of words. Typically, these features are clustered into sets, which may include content, word variety, grammar, text complexity, and sentence variety. In addition to measuring observable components of writing, the IEA® system uses an approach that involves specification of vocabulary. Word variety refers to word complexity and word uniqueness; text complexity is similar to determining the reading level of the text. As with Project Essay Grade®, IEA® has reported high correlations with human-scored essays (Landauer, Foltz, & Laham, 1998). IEA® has become the automated grading system used by Pearson VUE. In 2011, Pearson VUE and the American Council on Education (ACE) partnered to launch GED® Testing Services (GEDTS), which provides students with a high school equivalency (HSE) program.
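Because IEA®’s exact pipeline is proprietary, the following sketch only illustrates the general LSA idea described above, using the scikit-learn library and invented essays: a term-document matrix is reduced to a small latent semantic space, and cosine similarity estimates how close a new essay’s vocabulary is to that of high-scoring reference essays.

```python
# Sketch of LSA-style similarity scoring, assuming scikit-learn is available.
# This is not IEA's proprietary pipeline; it only illustrates the general idea:
# project essays into a reduced semantic space and compare a new essay
# against high-scoring reference essays.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

reference_essays = [  # hypothetical top-scoring essays on the prompt
    "The city should restore the railroad tracks to expand recreation.",
    "Converting the warehouse creates lasting recreational opportunities.",
]
new_essay = ["Fixing the old tracks would give residents new places to play."]

# Build a term-document matrix, then reduce it to a small number of
# latent semantic dimensions (the "latent" part of LSA).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(reference_essays + new_essay)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity between the new essay and each reference essay
# approximates how close the writer's vocabulary is to the target set.
scores = cosine_similarity(lsa[-1:], lsa[:-1])
print(scores)
```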
Around the same time that IEA® was being developed, Educational Testing Service (ETS®) was developing the Electronic Essay Rater, known as e-rater®. This system uses a “Hybrid Feature Identification Technique” (Burstein et al., 1998) that includes syntactic structure analysis, rhetorical structure analysis, and topical analysis to score essay responses automatically. The e-rater® system is used to score the GRE® General Test for admission to graduate, business, and law school programs. ETS also provides testing for HiSET® and TOEFL®. The e-rater® measurement system counts discourse cue words (words that help text flow by showing time, cause and effect, contrast, qualification, etc.), the number of complement, subordinate, infinitive, and relative clauses, and the occurrence of modal verbs (would, could, etc.) in order to calculate ratios of syntactic features per sentence and per essay. The structural analysis uses 60 different variables/features, similar to the proxes used in Project Essay Grade®, to create the essay score (Rudner & Gagne, 2001).
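As a rough, unofficial illustration of this kind of surface counting (the word lists and ratios below are guesses chosen for demonstration, not ETS’s actual feature definitions), a small script can tally discourse cue words, clause markers, and modal verbs per sentence:

```python
# Rough sketch of the kind of surface syntactic counting described for
# e-rater(R): discourse cue words, modal verbs, and clause markers tallied
# per sentence. The word lists and ratios are illustrative, not ETS's own.
import re

MODALS = {"would", "could", "should", "might", "may", "can", "must"}
DISCOURSE_CUES = {"however", "therefore", "because", "although", "finally"}
CLAUSE_MARKERS = {"that", "which", "who", "because", "although", "while"}

def syntactic_ratios(essay):
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    words = re.findall(r"[a-z']+", essay.lower())
    n_sent = max(len(sentences), 1)
    return {
        "modals_per_sentence": sum(w in MODALS for w in words) / n_sent,
        "cues_per_sentence": sum(w in DISCOURSE_CUES for w in words) / n_sent,
        "clauses_per_sentence": sum(w in CLAUSE_MARKERS for w in words) / n_sent,
    }

print(syntactic_ratios(
    "The city should fix the tracks because residents would benefit. "
    "However, the warehouse, which is larger, might serve more people."
))
```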
The e-rater® was the initial AEG used for the GMAT® (Graduate Management Admission Test) when the test added an essay component in 1999. In January 2006, ACT, Inc. became responsible for the development and scoring of the written portion of the GMAT® test. At that point, ACT, Inc. partnered with Vantage Learning, and a new automated essay grading system, IntelliMetric™, was introduced for use with the Analytical Writing Assessment. Vantage Learning’s corporate policy treats IntelliMetric™ as an intellectual property asset, and many of the details of this automated essay grader remain trade secrets (Rudner, Garcia, & Welch, 2005). However, the general concepts behind the AEG system used in IntelliMetric™ have been described by Shermis and Burstein in their book, Handbook of Automated Essay Evaluation (2013). According to their research, the IntelliMetric™ model selects from 500 component features (proxes) and clusters them into five sets: content, word variety, grammar, text complexity, and sentence variety.
One thing is true across all the major automated essay grading systems: because the artificial intelligence behind them is proprietary, the exact algorithms, the exact weighting of each system’s features, and exactly how the feature clusters are formed and which features belong to them cannot be known. It’s important for test examinees to find out which automated essay grading system is used by the company administering the test they will take, because that system is the “audience” for the essay being graded. Essays have traditionally been thought of as school-related assignments, something used for college admission or a scholarship application, but the nature of the workplace is changing, and automated essay graders are also used to assess the writing skills of prospective employees. Automated essay graders are impacting more than just academics.
It’s important to remember that AEGs can’t read for understanding when evaluating text; that is currently beyond the capabilities of artificial intelligence. For example, an automated essay reader could not “understand” the following joke:
Did you hear about the Mathematician who is afraid of negative numbers?
He’ll stop at nothing to avoid them.
Or the following play on words:
No matter how much you push the envelope, it will still be stationary.
Artificial intelligence (AI) cannot make inferences or judge cleverness of word choice. Artificial intelligence would not understand that I feel as if I have been chasing squirrels, herding cats, and falling down rabbit holes in the process of tracking down the information used in this book.
Artificial intelligence cannot handle polysemy, so it does not understand whether the word mine is being used as a pronoun, an explosive device, a large hole in the ground from which ore is extracted, or part of the name of the 2009 Kentucky Derby winner, Mine That Bird. It can only count how many times the word shows up in a text. Understanding what automated essay graders can “read,” and how they “read,” is important for helping test examinees learn to think like their audience and write for that audience. But if the details behind the “thought process” of automated essay graders are proprietary, what can be found out about how an AEG thinks? Research is available that provides general details about the major AEG systems currently in use, and, like a puzzle, things become clearer as more pieces are added to the picture.
In early 2012, The William and Flora Hewlett Foundation sponsored a competition for data scientists and machine learning specialists called the Automated Student Assessment Prize (ASAP). The goal of the competition was to “…help solve an important social issue. We need fast, effective and affordable solutions for automated grading of student written essays” (www.kaggle.com). The competition drew 2,500 entries from 250 participants organized into 150 teams. The competitors were provided with essays that had been scored by human readers and that varied in length and in the skill level of the writers. The competition sought the entry that could come closest to the results of the human scorers. “Software scoring programs do not independently assess the merits of an essay; instead they predict, very accurately, how a person would have scored the essay” (www.gettingsmart.com). In May of 2012, a winning team was announced, but no information was provided about the algorithms behind the winning software; that was proprietary information. By the autumn of 2012, however, students studying artificial intelligence at universities in the US began producing final projects for their classes that tried to duplicate the results of the ASAP competition, using the same sample sets of essays used in the competition. Their studies provided many more details about the process of developing automated essay graders.
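A minimal sketch of how such a replication might be set up, assuming scikit-learn and using synthetic stand-ins for the ASAP essays and scores, follows the pattern those student projects describe: train a model on human-scored essays, predict scores for held-out essays, and measure agreement with an ordinal statistic such as quadratic weighted kappa.

```python
# Minimal replication-style sketch: the feature matrix and human scores here
# are randomly generated stand-ins, not the ASAP data or any winning model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
# Hypothetical: 200 essays described by 5 surface features, with human
# scores on a 1-6 scale loosely related to those features.
features = rng.normal(size=(200, 5))
human = np.clip(np.round(3 + features @ rng.normal(size=5)
                         + rng.normal(scale=0.5, size=200)), 1, 6)

train, test = slice(0, 150), slice(150, 200)
model = Ridge().fit(features[train], human[train])
predicted = np.clip(np.round(model.predict(features[test])), 1, 6)

# Quadratic weighted kappa measures how closely the predicted scores
# agree with the human scores on the held-out essays.
kappa = cohen_kappa_score(human[test].astype(int), predicted.astype(int),
                          weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.2f}")
```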