Making better wordlists for elt: Harvesting vocabulary lists from the web using WebBootCat Simon Smith*, Adam Kilgarriff

Making better wordlists for ELT: Harvesting vocabulary lists from the web using WebBootCat

Simon Smith*, Adam Kilgarriff and Scott Sommers*

*English Language Center, Ming Chuan University

Lexical Computing Ltd, UK


In Taiwan, and other Asian countries, students of English expect, and are expected, to memorize a lot of vocabulary: Ming Chuan University (MCU), for example, relies fairly heavily on vocabulary acquisition and retention in its teaching and testing resources. Often, the vocabulary items on the lists that students must learn do not belong to a particular topic, or fit it only loosely, because the items have not been chosen in a principled way.

The present paper reviews the arguments for incidental learning and direct learning of vocabulary in ELT, and shows how a web corpus builder (WebBootCat) can be used to build lists of words that are related to a particular topic in an intuitive and statistically principled way. A small number of seed search terms are used by WebBootCat to generate a corpus of texts on a given topic, and this corpus is searched to find vocabulary items which are salient to the topic.


For students of English in Taiwan, direct learning of wordlists plays a major role. It is clearly important which words are chosen to be on the wordlists, and which words are selected to be used in textbooks, if Taiwanese learners are to acquire language which is meaningful and useful.

In this paper, we first review the arguments for incidental learning and direct learning of vocabulary, and consider how they are played out in English teaching in Taiwan. We examine one particular textbook, and find that its vocabulary is not systematically selected: the words to be learnt are a poor match for the topic of the chapter, for the reading material, and for corpus frequency. We report experiments with WebBootCat (WBC), a software tool which uses Yahoo! web services to harvest linguistic corpora on user-specified subject areas from the World Wide Web. We use WBC to extract from these corpora key vocabulary which can be used to populate wordlists in textbook-writing.

Vocabulary: incidental and direct acquisition

Studies in the acquisition of vocabulary have identified two principal learning strategies, incidental learning (discussed by Nagy, Anderson & Hermann, 1985; Nation & Coady, 1988; Nation, 2001) and direct learning. Research by Nagy and colleagues claimed that learning from context is one of the most significant aspects of incidental learning. This laid the groundwork for the belief that authentic context is a particularly powerful source of incidental language learning (Krashen, 1989; Pitts, White and Krashen, 1989).

While there is little doubt that incidental learning, particularly that acquired through reading, is key to learning the vocabulary necessary for functioning in an English environment, some researchers have argued that this form of acquisition has limitations. This may be especially true for students for whom English skills include academic performance in their coursework, textbook reading, and classroom lectures, as well as test performance (see Chaffin, 1997; Zechmeister et al, 1995). These researchers have argued that an essential role is played by the direct instruction of strategies for learning vocabulary and meaning. Without these, they believe long-term retention of new vocabulary rarely follows. They emphasize the role of dictionaries and other word reference books, and note that direct instruction is important in fostering an interest in words.

Direct acquisition studies recognize that vocabulary can be learnt using tools that bring the learner’s attention into direct contact with the form and meaning of words, such as dictionaries and vocabulary lists. However, the question of how best to use these tools for direct vocabulary acquisition remains unanswered. In Taiwan, and other parts of Asia, the traditional (and intuitively suboptimal) approach has been simply to memorize the vocabulary item along with one or two possible L1 translations.

The memorization of vocabulary items is a pedagogical fact of life for most students of English in Taiwan. Ironically, government policies intended to boost the national standard of communicative language skills have actually encouraged this approach to language learning. Previously, lists of words were presented primarily to students in public secondary schools, but nowadays official attempts to promote language proficiency have resulted in the widespread use of proficiency tests such as the GEPT and TOEIC; consequently there has been an explosion of test preparation classes. In almost every case, these classes emphasize vocabulary acquisition through the memorization of lists rather than the use of communicative tasks or the presentation of authentic examples.

Typically, these lists incorporate vocabulary selected by employees and teachers of test preparation schools. In more professional situations, the selections are derived from word counts of actual standardized tests. In other cases, the lists are populated more or less arbitrarily, with only a vague and unclear match between the items on a given list and the topic it is supposed to represent. Furthermore, items are often demonstrated to students using contrived examples. With such poor models of usage available to students, it is questionable whether even the highest standard of instruction will result in the desired acquisition.

If students are to learn lists of English words, one would rather that they learnt words which are going to be optimally useful to them, and of course the goal of lists such as the CEEC list (a glossary of 6480 words used to help people studying for university entrance exams, described and listed in College Entrance Examination Center (2002)) is to cover the most useful vocabulary. However, it is not easy to assess what the most useful vocabulary is. One strategy is to identify the most common words in a general corpus of English: the commonest words are the ones that students are likely to encounter most often, and so are, at least from a language-understanding perspective, the most useful. If learners are to produce native-like language, then they should be using the words that native speakers use, in similar proportions, so the argument can also be made from a language-production perspective. The matter has been pursued in Japan, and in 2003 the widely-used JACET list of 8000 basic words was revised substantially on the basis of the British National Corpus (Masamichi 2003, Uemura 2005). Su (2006) has explored the relation between (a 2000 word version of) the CEEC list and a range of other lists and corpora. While the verdict in that paper is that the list is largely satisfactory, areas are found in which the corpora and the list do not match.
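The frequency-based strategy, and the idea of checking a wordlist against corpus frequency, can be illustrated with a minimal sketch. It assumes a corpus held as plain strings and a crude regular-expression tokenizer; the function names `frequency_ranked` and `unmatched` are hypothetical, and this is not the procedure by which the CEEC or JACET lists were compiled or evaluated.

```python
import re
from collections import Counter


def frequency_ranked(texts, top_n):
    """Return the top_n most frequent word forms in a list of texts.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z]+", text.lower())
    )
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:top_n]


def unmatched(wordlist, texts, top_n):
    """Return the wordlist items that are NOT among the corpus's top_n words."""
    common = set(frequency_ranked(texts, top_n))
    return sorted(w for w in wordlist if w.lower() not in common)


if __name__ == "__main__":
    # Toy stand-in for a general corpus; a real study would use something like the BNC.
    corpus = ["the cat sat on the mat", "the dog sat"]
    print(frequency_ranked(corpus, 3))
    print(unmatched(["sat", "helmet"], corpus, 3))
```

With a real general corpus in place of the toy one, the second function gives a rough indication of which items on a teaching list fall outside the commonest words. A serious comparison would also need lemmatization, which is omitted here.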

An essential difference between corpus-derived lists and those compiled manually, whether by individual teachers or government bodies, is that data from corpora is authentic. Such measures as personal intuition or experience of the teacher are far too problematic to produce meaningful results, according to Biber & Conrad (2001). Careful statistical examination of corpus data, however, can help us to construct meaningful, topic-related wordlists.

English vocabulary acquisition at Ming Chuan University

Two of the authors, Smith and Sommers, are employed by the English Language Center (ELC) of Ming Chuan University, where the principal task is to teach general English skills to large groups (around 60) of relatively unmotivated university students. English is taught throughout the four years of a typical undergraduate career (in contrast to many Taiwan institutions where one or two years is the norm). There is little evidence to show how much acquisition of English takes place over the four-year period, but certainly there is ample time for boredom to set in among students who are principally interested in the taught offerings of their home departments.

The ELC’s students are assessed twice a semester by centralized achievement tests. Because the teaching of grammar is not emphasized in the ELC, and because it is difficult to assess communicative competence with such large groups of students, the main focuses of these tests are listening comprehension, and familiarity with the unit vocabulary items. Students do not prepare for listening comprehension assessment, but they do prepare for the vocabulary component. They do this by memorizing the unit vocabulary lists, internalizing each item with its Chinese “equivalent”.

The primary teaching material for these courses is an in-house textbook series called East Meets West. EMW presents some topics relevant to students’ lives and potential future careers, and others which are less relevant or useful. There are a number of different types of activity in each unit, but the common core is a specially commissioned text on the unit topic (written by an ELC teacher), and a collection of about 12-14 vocabulary items, occurring in the text, which may or may not be related to the unit topic.

The first unit of EMW 1 is entitled “Getting started at university”, an apparently appropriate topic for beginning freshmen. There is a short reading on the experience of an imaginary freshman called Patricia Lin, reading comprehension questions, pronunciation exercises, pattern practice and a couple of listening exercises, along with a vocabulary section. This is the standard layout of an EMW unit. There are also, as in other units, some activities specifically related to the topic: maps of the MCU campus, of use to new students; locations of MCU departments; suggested English spellings of Chinese family names etc.

When we turn to the list of vocabulary items, shown in Figure 1, we find that little of what is offered is related to “Getting started at university”, or to “university”, or indeed to getting started at anything at all.



Nouns: attendance, course, facilities, helmet, initiative, major, vendor
Verbs: accomplish, consider, improve, tease
Adjectives: challenging, fortunate, impatient, occasional, protective

Figure 1 EMW 1 Unit 1 vocabulary

Only three of the words, all nouns, have an obvious connection to an educational topic. The first verb (accomplish) and the first adjective (challenging) are also likely to occur more often in educational contexts.

With the benefit of hindsight, most would agree that the procedure adopted for populating the vocabulary lists, when EMW was compiled, was flawed. First, a topic-related text was commissioned (in this case the story about “Patricia Lin”) but without a requirement to incorporate topic-related vocabulary into the text. Next, items were selected (in most cases, not by the text writer, but by another editor) which it was deemed students would be less familiar with, and ought to learn. Many of the apparently on-topic items which occurred in the texts (student, university and so on) were ruled out, because the learners would already know them; instead, words from the texts were chosen seemingly at random. Learners are expected to be familiar with this vocabulary in the midterm and final tests.

This seems an unprincipled approach to vocabulary acquisition. One might argue that a better approach might have been to write a text around a list of pre-determined vocabulary items, related to the unit topic. Creating such a list is not a trivial task, though; it is difficult to determine what sort of vocabulary should be included. Textbook writers cannot produce such a list through contemplation and introspection alone. It might be possible to think of a short list of educational terms (major, sophomore, classmate, campus and the like), and a reading text featuring that vocabulary could then be commissioned. However, at least two objections could be raised to that approach.

First, a list produced by introspection would only include items that belong to the domain in the most transparent way, and would miss words which are less obviously topical but nonetheless characteristic of the domain. If, for example, it can be shown that items such as excited, challenging and friend occur more often in texts about “Getting started at university” than they do in texts on other topics, they too are candidates for inclusion in our lists.

Secondly, it would be less straightforward to compile such a list for Unit 2 (“Family and hometown”) or Unit 3 (“English learning and you”), to give just two examples. In these domains, only kinship terms and the jargon of TESOL and Applied Linguistics spring to mind, and neither of these would be useful for MCU freshmen.

What is needed is a corpus-based vocabulary generation tool.

WebBootCat, a tool for corpus and wordlist generation

Baroni et al (2006), in a paper which introduces WBC, focused on the tool’s utility as an aid to technical translators. Most translators, Baroni et al note, make regular use of the web as a source of information about technical terms and usages; however, search engine design is not optimized for their use.

The task described in the paper consists of creating a corpus associated with a particular domain, and generating a list of the terms most salient to the domain. All of this information is extracted from the web. The resulting corpus can be expected to be both up to date (the terminology is current), and to be firmly focused on the domain in question (in contrast to offline corpora, such as the BNC, intended for general use).

The basic algorithm is conceptually simple. First, a search is seeded with one or more words selected by the user. These seed words are sent to Yahoo! (Google was used formerly, as mentioned in Baroni et al’s paper), and all the lexical items are extracted from the returned web pages. A substantial amount of filtering is done to exclude web pages which do not consist mostly of running text in the language in question. Measures include rejecting pages containing too many words held on a stop list, and rejecting very short and excessively large web pages; a user interface provides control over these filters.

The resulting corpus may be used in a number of ways. It can be explored in the Sketch Engine, a leading corpus query tool (Kilgarriff et al 2004). The user can also generate keyword lists from it: to do this, all words in the corpus are counted and their frequencies are compared with their frequencies in a general web corpus (the reference corpus). A list is then created of the words whose frequencies are most significantly higher in the harvested corpus than in the reference corpus.

Baroni et al used WBC to generate the list of keyterms related to Machine Translation shown in Figure 2. Most, but not all, of the terms are indeed related to that domain in some way. Similar lists of vocabulary could also be generated on topics of interest to language learners.
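The page filtering and keyword extraction steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WBC's implementation: the text here does not specify the filter thresholds or the keyness statistic, so the size limits, the stop-list ratio and the smoothed ratio of per-million frequencies are all assumed, and the names `keep_page` and `keywords` are hypothetical. Retrieval of pages from a search engine is left out altogether.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keep_page(text, stop_list, min_tokens=5, max_tokens=50_000, max_stop_ratio=0.5):
    """Decide whether a downloaded page is kept for the corpus.

    Rejects very short pages, excessively large pages, and pages in which
    too large a share of the tokens is on the stop list. All three
    thresholds are assumed values, chosen only for illustration.
    """
    tokens = tokenize(text)
    if not tokens or not min_tokens <= len(tokens) <= max_tokens:
        return False
    stopped = sum(1 for t in tokens if t in stop_list)
    return stopped / len(tokens) <= max_stop_ratio


def keywords(focus_texts, reference_texts, top_n=10, smoothing=1.0):
    """Rank the words of the harvested (focus) corpus by keyness.

    Keyness is taken here to be the smoothed ratio of a word's frequency
    per million tokens in the focus corpus to its frequency per million
    in the reference corpus (an assumed statistic).
    """
    focus = Counter(t for text in focus_texts for t in tokenize(text))
    ref = Counter(t for text in reference_texts for t in tokenize(text))
    focus_total, ref_total = sum(focus.values()), sum(ref.values())
    if focus_total == 0 or ref_total == 0:
        raise ValueError("both corpora must contain at least one token")
    scores = {}
    for word, count in focus.items():
        focus_pm = count * 1_000_000 / focus_total
        ref_pm = ref[word] * 1_000_000 / ref_total  # 0 if the word is absent
        scores[word] = (focus_pm + smoothing) / (ref_pm + smoothing)
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    pages = [
        "The freshman met the dean on campus",
        "Each freshman chose a major on campus",
        "too short",
    ]
    reference = ["The dog chased the cat", "A man met the woman on the road"]
    corpus = [p for p in pages if keep_page(p, stop_list={"buy", "cheap"})]
    print(keywords(corpus, reference, top_n=5))
```

Words that are frequent everywhere, such as the, score low because they are just as common in the reference corpus, while words concentrated in the harvested pages rise to the top of the list.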

Figure 2 WBC output (from Baroni et al 2006)

Generating vocabulary lists with WBC

The reader will probably already have compared Figure 2 (the list of keywords related to Machine Translation, generated by WBC) with the vocabulary list (Figure 1) on “Getting started at university”, developed by ELC curriculum writers, and drawn the conclusion that the former contains many relevant items, the latter precious few. Figure 3 shows the keywords extracted for a query to WBC, using the seed words freshman and university, and searching 100 websites which feature those words more prominently than other sites.

A glance at the figure shows that almost all of the words extracted are salient for the domain. Many terms such as graduation, SAT, and transcripts are part of the specialized vocabulary of tertiary education; courses and results probably are not, but are more frequent in that domain than elsewhere.

Figure 3 WBC keywords for corpus seeded with freshman and university

The second unit of EMW 1 is called “Family and Hometown”. That title is a reasonable description of the contents of the unit, which is designed to get students to share, using the target language, information about their backgrounds. The two keywords featured in the unit title seemed a reasonable point of departure for generating a vocabulary list; this was done, and the result is shown in Figure 5. This may be compared with Figure 4, which shows the vocabulary prescribed for that unit of EMW. This vocabulary is barely concerned with the topic at hand at all – this comes as no surprise when it is known that the list was extracted from a story about one person’s life (albeit a very interesting story).
