I’m very pleased that my discussion of the “we can’t ever know what a word is” Internet meme has elicited a response from Mark Liberman at Language Log (here). Mark was very systematic in his comments, so I will be very systematic in my responses.
1. Without a careful definition of what you mean by “word” and by “language X”, questions like “how many words are there in language X” are pretty much meaningless, because different definitions will yield very different numbers.
This is very much off the mark. I can measure the distance from the earth to the moon using a variety of techniques, and get different measurements for a variety of reasons. The measurements may differ but they still tell me a great deal about the initial question especially when compared with other measurements (like how far away the sun is in comparison).
The way you have worded your paragraph tells me that if I wanted to examine different languages (say, grouped by language family or geography or whatever) to see if there were big differences in lexicon size, it would be impossible. Are you certain you want to make that argument?
In fact, we are mostly in agreement about the difficulties (see below) but that is not the point of the original post. The original post is about an Internet meme that claims that it is all utterly impossible.
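The measurement point can be made concrete with a toy sketch. Nothing here comes from the original exchange: the two sample texts, the two definitions of "word", and the suffix-stripping rule are all made up for illustration, and the suffix rule is a crude stand-in for real lemmatization. The idea is only that two definitions can give different absolute counts while still agreeing about which sample has the larger vocabulary.

```python
import re

def surface_forms(text):
    """Definition A (assumed): every distinct lowercase letter string is a word."""
    return set(re.findall(r"[a-z]+", text.lower()))

def crude_lemmas(text):
    """Definition B (assumed): strip a few English suffixes before counting.

    This is a toy rule, not a real lemmatizer.
    """
    lemmas = set()
    for form in surface_forms(text):
        for suffix in ("ing", "ed", "s"):
            # Only strip when at least three letters would remain.
            if form.endswith(suffix) and len(form) - len(suffix) >= 3:
                form = form[: -len(suffix)]
                break
        lemmas.add(form)
    return lemmas

# Hypothetical samples standing in for two bodies of text.
small_sample = "the dog walks and the dogs walked"
large_sample = (
    "the dog walks while the cats were sleeping "
    "and a cat sleeps and birds sang songs"
)

# For each sample: (count under definition A, count under definition B).
counts = {
    name: (len(surface_forms(text)), len(crude_lemmas(text)))
    for name, text in (("small", small_sample), ("large", large_sample))
}

for name, (n_forms, n_lemmas) in counts.items():
    print(f"{name}: {n_forms} surface forms, {n_lemmas} crude lemmas")
```

The two definitions disagree on the numbers for each sample, but both rank the larger sample above the smaller one, which is the sense in which differing measurements can still answer a comparative question.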
2. The same thing applies, with the added issue of what you mean by “know”, to the question of “how many words of language X does a specific person know?” Another layer of variation is added by generalizing the question to “how many words of language X does an average four-year-old or 18-year-old know?” There’s an obvious answer, subject to the usual sampling-error problems, but the result is a bit like asking about average income — the mean value may not be very useful in telling you what you really want to know about the distribution.
I agree that mean values are not especially interesting without understanding variance (though you’ve objected to my quest for variance in item one) but this is not really related to anything I’ve said in my post or comments thereon.
3. Most sensible definitions for (1) and (2) above create serious practical difficulties for counting. That is, they define an answer, but the prescribed process for finding it is hard to carry out, and especially hard to automate in a way that produces an accurate result.
Interesting point, and it fits with what a lot of linguists seem to think of language. I don’t happen to subscribe to the approach that it is all too big and mysterious to study systematically.
4. Extrapolating accurately from samples raises its own special problems here…
I can’t find the place where I scorned this.
5. Despite all these difficulties, researchers over the years have gone through the steps of defining carefully what they mean by “word”, “language”, “know”, etc., and then carried out these steps…
Please see my comment regarding a room full of beer-loving linguists. I don’t think I ever said that defining “word” or “meaning” is easy or something that can be done with precision. What I did imply is that comments such as your number 1 (above) are very serious overstatements of the impossibility of it all. More specifically, when we see an entry in a dictionary with dozens of meanings listed, we are not really faced with the question “Is this one word or fifty?”, though we may still be faced with the question “Is this 32 words or 50?”
6. Comparisons across languages are made more difficult by the fact that the most natural and sensible answers to questions like those in (1) tend to be different in different languages. Furthermore, a decision that may have only a small effect on the results in language X, may turn out to change things by an order of magnitude or more in language Y. Again, this doesn’t make it impossible to answer the questions, it just increases yet again the range of sensible values that answers might have.
Yes, it does increase the range of possible values, and I would add this: The degree to which two languages can be compared is very strongly affected by the data collection. Comparing the English lexicon to the lexicons of Central Sudanic languages is impossible because the English dictionaries have hundreds or thousands of authors and centuries of development (if you count the whole written source), while the Central Sudanic language lexicons have between zero and three authors each, decades of study, and were carried out mainly for the purposes of Bible translation. (Mostly, there is no written lexicon at all.)
Laden is radically impatient with all this talk about how it all depends and it’s hard to tell, but his impatience doesn’t change the facts. Nor does it change the fact that there are plenty of attempts to answer such questions…
Actually, that was not the point of the post. I was speaking specifically of a certain meme on the Internet, not linguistics in general.
Laden seems to be aware of these issues — for example, he found the Nagy and Anderson reference — but his goal in the cited post seems to be to make fun of people rather than to clarify the questions and answers. (He suggests, towards the start of his post, that he wants to evaluate claims about the rate of word learning by children — but I couldn’t see any connection between this issue and the rest of his hyper-kinetic complaining about the difficulty of getting a simple answer to the word-counting question.)
Oh dear, I stepped on your field of study and you got all icky about it. I didn’t “find” the reference. It is part of the literature of which I became aware while studying for my PhD in anthropology. And your statement about my goal is essentially correct. It is not true that my goal is what you later imagined it could have been. I’m not sure how I would have managed to write the post you were expecting!
Mark, I appreciate your comments, but you are mostly constructing and attacking a straw man.
Response to comment by Nick Lamb (here):
I presumed from the fact that almost every other word in the “rant” is made up that he’s very conscious of what the problem is with counting words, and is actually using this opportunity to show the reader why this is all very tricky, but has chosen the form of a rant which pretends to assert the contrary. … Excuse me if that was so obvious that everyone already knows it and I missed some hint that Mark dropped.
Nick: No excuses. It was utterly obvious and some people certainly missed it. Glad you didn’t.
I’d like to add that it is probably helpful to understand the commentary in the broader context of the “falsehoods” writing of which it is a small part. My blog is a bit dangerous that way: My posts often do not stand alone but require context. To get the context, you click on the tags near the top of the post and read everything.