Counting Chinese Words

It has been said that “word frequency” is the most important variable in language research, despite the belief held by many that it can’t be used as a variable at all because no one really knows what a word is. (See: Minifalsehood: We can’t tell what a word is!?!? and A run in my stocking …)

A recent study in PLoS ONE looks at a heretofore under-investigated area: word and character use in Chinese.

Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

So, the leading edge is where the mixing happens. The study concludes that subtitle-based word frequencies do a good job of estimating daily language exposure and explain more of the variance in word processing than frequencies based on written texts. Furthermore, this work generated a database that …

… is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
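For readers who want a feel for what such a database holds, here is a minimal sketch in Python. It is not the authors' pipeline. It assumes the subtitles have already been segmented into words (a hard problem in Chinese that this sketch skips entirely), and it takes "contextual diversity" to mean the number of films in which a word appears. The function name `frequency_stats` and the toy data are made up for illustration.

```python
from collections import Counter


def frequency_stats(films):
    """Compute per-word statistics from pre-segmented subtitle text.

    films: dict mapping a film title to a list of words (already segmented).
    Returns a dict mapping each word to its raw count, its frequency per
    million words, and the number of films it appears in.
    """
    word_counts = Counter()   # total occurrences across all films
    film_counts = Counter()   # number of distinct films containing the word

    for words in films.values():
        word_counts.update(words)
        film_counts.update(set(words))  # count each film at most once per word

    total = sum(word_counts.values())
    return {
        word: {
            "count": count,
            "per_million": count * 1_000_000 / total,
            "contextual_diversity": film_counts[word],
        }
        for word, count in word_counts.items()
    }


# Toy corpus, invented for illustration only (not data from the study).
toy_films = {
    "film_a": ["我", "喜欢", "电影", "我"],
    "film_b": ["我", "看", "电影"],
    "film_c": ["你", "好"],
}

stats = frequency_stats(toy_films)
for word, info in sorted(stats.items(), key=lambda kv: -kv[1]["count"]):
    print(word, info["count"], info["contextual_diversity"])
```

The point of keeping two counters is that a word used many times in a single film gets a high raw count but a low contextual diversity, which is the distinction the quoted passage is drawing.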

The most surprising result of this study is probably the degree to which the word frequency data did NOT represent a biased or strange subset of the language. Movies treat certain situations more frequently than others and tend to be thematically “American,” and subtitles are not exactly what is being said on screen. For these and other reasons, it was thought that this would be an interesting but complementary (or at least different) word set. But …

It was only when we saw how well these word frequencies were doing to predict word processing times for thousands of words … that we started to appreciate their potential. Despite their shortcomings, subtitle frequencies are a very good indication of how long participants need to recognize words. They also better predict which words will be known to the participants and which not.

If you are inclined, you can read this study (in English) at PLoS.

Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE, 5(6). DOI: 10.1371/journal.pone.0010729

Greg Laden's Blog
