Corpus and textbook

This unedited piece was initially written in 2005 as an internal document for discussions with our publisher, who had asked for an opinion on the use of corpora with textbooks. As such it was partly exploring our possible routes forward on a project and was never intended for general consumption. In reviewing it, I felt it made some points worth discussing.

A: Listen up, corpus-based syllabuses are really, kinda, like … you know … SO cool.

B: Yeah, sweet.

C: Absolutely.

D: Nah, I mean, that’s totally random.

The application of corpus information to the writing and syllabus design of course materials is an appealing idea on the surface. For course writers, a great deal of the information has already been collected and annotated, and the conclusions have been drawn, in the essential Longman Grammar of Spoken and Written English (LGSWE). You can see the comparative frequency of frequency adverbs, or what the top ten regular past verbs are in newspapers (unsurprisingly, the list is depressingly weighted towards theft, death and destruction). The LGSWE has taken over the prime position on the ELT writer’s bookshelf.

Word frequency in current usage is still overlooked in some textbooks. I shake my head whenever I see the inevitable printed form in an ELT textbook with surname on it. More than ten years ago, I assembled a large pile of real application forms. Both surname and christian name had already totally disappeared, replaced by either last name or family name. The rejection of christian name reflected a multi-cultural society. The rejection of surname was mainly due to simplicity, as it’s an extra word to learn. At a subtle level it also evokes the popular (if historically doubtful) derivation from sire’s name or father’s name, which could upset people from non-standard family set-ups. But it’s clear that even with vocabulary, frequency has its limits. For example, the least frequently spoken day of the week according to the corpus is Tuesday. However, we couldn’t go into classes and say ‘These are the six days of the week. The seventh is much less frequent, so we’ll leave it till pre-intermediate level.’ Students want to learn vocabulary in sets, and frequency becomes unimportant in these restricted sets.

Similarly, students want to meet structures in sets, so it makes sense to teach all the subject pronouns together with each new tense. If you decide that he / she should appear in the unit after I / you, you’ll tie yourself in knots trying to avoid he / she when practising the forms in that unit. In the end, you’ll use he / she in class anyway. And the students already knew the forms existed.

I also wonder how much of the corpus information in student books is really of interest to the students. Most of it seems to be justifying choices to the teacher. I don’t think beginner students need to be told that phone is six times more common than telephone. So what? The course book should have chosen which one to teach, and it does not need to justify its sensible choice of phone. The corpus is playing catch-up with real language anyway, as mobile must be increasing in frequency in the UK, as is cell phone in the USA. I agree that the teacher may need to be persuaded that things are different to their expectations, but the place for doing that is the Teacher’s Book.

Nor do students need to be told that the familiar Grandma is more frequent than the formal Grandmother, and that both are much more frequent than Grandpa / Grandfather. Nor that the more formal Grandfather is more frequent than the familiar Grandpa. Teachers and students rely on the course book to have thought this little conundrum out in advance for them. I would also say that, as students at all the lower levels are far more likely to be using English with comparative strangers, the familiar terms are not especially useful for communication. My grandma sounds somewhat childish when you’re talking to a stranger. Mum and dad are highly frequent in primary level books. Many low-level word lists include mum and dad, but for someone who will be speaking in a foreign language, and therefore to relative strangers, after the age of ten, mother and father are more useful.

Problems come up when corpus frequency starts to dictate the syllabus. Take the example at the top of the page. We know that (it’s) like, kind of / kinda, (you) know come right at the top of spoken frequency lists. They’re not simply language viruses that pass in the night either. Bob Dylan was singing You just kinda wasted my precious time, but don’t think twice, it’s alright way back in 1962. Or all right as textbooks once had it. So an attempt has been made to actually “teach” these devices. This is misguided. You can’t teach learners how to be inarticulate, nor is there any reason to do so. The devices appear in two of our more recent textbooks, Handshake and In English, but only in a receptive sense in listening materials. We also focus on them, in having students practise filtering them out of what they hear. But we never try to teach them to use them. They are in the receptive domain, not in the productive domain. They need to be regarded in the same way as um, er, ah … In communicative terms they are hindrances, and in a second language the best advice is to speak as clearly as you can. If learners mix with native-speaker peers, they will soon acquire them. If this is not going to happen, the expressions need only be understood and filtered out.

A comparable area is teaching learners over-colloquial expressions and words. All native speakers know that a non-native speaker who says My, oh, my. It’s raining cats and dogs doesn’t sound fluent. He or she sounds quaint, and slightly ridiculous. It’s sad that years of enthusiastic study have led him or her to an outdated idiom, long abandoned by native speakers, but that’s inevitable with idioms like this. In Practical English Usage, Michael Swan counsels on swearing:

Using this sort of language generally indicates membership of a group: one most often swears in front of people one knows well; who belong to one’s own social circle, age group etc … So a foreigner who uses swearwords may give the impression of claiming membership of a group that he or she does not belong to.

This applies equally to idiomatic expressions, even ones as short and frequent as kinda, you know, it’s like … One recent corpus-led book has Boy! and Gosh! in presentation dialogues. Apart from sounding like outdated teen talk, exclamations of this sort tend to sound disconcerting from a foreigner. In the case of Boy! it could be positively insulting. If you got into the habit of saying Boy! at the start of sentences, visited the USA and said Boy! Is this pizza good! to an African-American waiter, then the waiter might hear a comma and a question mark instead of two exclamation marks and be deeply offended. It’s best not to attempt to say things in a foreign language which require finely tuned intonation and pitch.

Teen talk consists of language viruses. Some will come and go over a few years, others will stick around and become a permanent part of the language. I’d guess that cool is too firmly entrenched to disappear, but I think that Sweet! and Totally random! will be what writers use in 30 years’ time to make characters sound early 21st century.

The corpus has some surprises on structure too. It turns out that she’s not / you’re not / we’re not … are more frequent in spoken English than she isn’t / you aren’t / we aren’t. This is more so in spoken American English. If that were all, it would simply be interesting, and the majority of courses which prefer she isn’t to she’s not might change in subsequent editions. But it isn’t that simple. It turns out that while ’s not is more frequent after pronouns, isn’t is more frequent after nouns. Bummer.

This then leads to ‘corpus-influenced rules’ such as this one which says:

People use ’s not and ’re not after pronouns.

isn’t and aren’t often follow nouns.

That really gives students something to worry about. The source compounds this new and worrying rule with an example: My boss isn’t strict (how high does strict come in corpus frequency lists?). This is a particularly confusing example, because after the double-s in boss it’s almost impossible to say ’s not, so isn’t is preferable for phonological reasons, not because boss is a noun. The LGSWE makes the point that phonological factors often come into play when choosing between ’s not and isn’t. You could also argue that it’s easier to learn ’s / ’m / ’re + not (one new word) than isn’t / aren’t (two new words), but this is marginal. In any case, they’re probably going to learn isn’t / aren’t for short answers. Phonology affects the choice after there too. It’s hard to say There’re not … and easier to say There aren’t …

This reflects the fact that it’s a question of ease of pronunciation rather than the simplistic grammar “rule” (use ’s not with subject pronouns, use isn’t with nouns). There’s not enough trips better off the tongue than There isn’t enough, which is, I suspect, because There’s not enough is a common lexical chunk. But There isn’t any … falls more easily than There’s not any …

There’re not any is almost impossibly hard to say. That’s why the rule breaks. It’s nearly always There aren’t any …

So what should we teach? In principle we should use the form that sounds clearer, is easier to say and is easier to transfer to the widest range of situations. I’m not in favour of teaching ‘American teen English of 2010’ or ‘Estuary English from the UK’ or ‘RP Southern British English’ or any other restricted variety of English. So you then re-examine the question in the light of English as a Lingua Franca, ELF rather than EFL. That is, what do students need when they’re using English as a means of international communication, which more often than not will be non-native speaker to non-native speaker? The learner will never take part in a native speaker to native speaker dialogue. Logically, there will always be at least one non-native speaker in the conversation.

Even if ’s not is more frequent, as it seems to be between American native speakers, isn’t is still a highly frequent word. Also, isn’t always works, whether following a pronoun or a noun. ’s not doesn’t often follow nouns, and is nearly impossible to say after s, z or x. In the end, isn’t, as the form initially taught to beginners, has more communicative usefulness and is easier to apply in a range of sentences. I would argue that both forms should appear in paradigms, together with the uncontracted form (or, as Michael Lewis prefers, ‘emphatic form’), but that in presentation and practice at the very first level we should choose only one initially, and the argument for isn’t / aren’t is powerful.

Another major syllabus implication at early levels is negation. Do you teach not … any consistently or do you teach no? LGSWE is clear here, and suggests that not negation is the default. It’s apparent from the corpus that no negation has higher frequency with have, and considerably so in spoken American English. But given the whole range of tenses, not negation has wider coverage and is easier to use. After all, they’re going to need any for the question form.

Corpus information has to be taken together with all the other tools we have: structural inventories in order of difficulty (English Grammatical Structure), functional inventories such as the Threshold Level, collocation dictionaries, pronunciation handbooks and so on. Corpus information gives us fresh insight into a range of areas, but it isn’t the only ‘way’, nor the only source of information in making syllabus decisions.

Peter Viney's Blog
