Tag: Corpus Linguistics

Expat v. Immigrant v. Migrant Post 3: Collocating Adjectives in Corpus

The public discourse around people who move to the U.S. is ugly at the moment (the moment being Build a Wall, Travel Ban, and Zero Tolerance). This series uses dictionaries and corpus linguistics to reflect on how we speak about people that move from one country to another.

We looked at dictionary definitions. We looked at dictionary example sentences. Now we go straight to the horse’s mouth: corpus. We are going to check out which words collocate with immigrant, migrant, and expat in that sexy stallion, the Corpus of Contemporary American English (COCA).

COCA is “the largest freely-available corpus of English, and the only large and balanced corpus of American English.” They had me at free. It’s a big ol’ database with spoken, academic, fiction, newspaper, and magazine texts from American English sources and you can query to your heart’s content.

Confession Time: It’s been a hot minute since my graduate class on language analysis and I honestly forgot near everything I once learned about using corpora, so forgive me and make suggestions if something doesn’t seem right in the search methodology described below.

So what are we looking at here in this table? For each of the People that Move lexical items you can see the top five adjective collocations that go before them. Meaning when I searched for migrant (as a lemma, so it encompassed singular and plural forms), COCA pranced through her database of texts and identified that Mexican was the word that most frequently preceded it, black was second most frequent, etc.
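For the curious, here is a minimal sketch of what a "which adjective comes right before this lemma" count looks like. It is not how COCA works under the hood (I have no idea what she runs on), and the tiny tagged corpus, the `TOY_CORPUS` name, and the `preceding_adjectives` helper are all made up for illustration.

```python
from collections import Counter

# Toy, made-up sample: each sentence is a list of (word, lemma, part-of-speech) triples.
# This is NOT COCA data or COCA's interface; it only illustrates the counting idea.
TOY_CORPUS = [
    [("Mexican", "mexican", "ADJ"), ("migrants", "migrant", "NOUN"), ("arrived", "arrive", "VERB")],
    [("a", "a", "DET"), ("seasonal", "seasonal", "ADJ"), ("migrant", "migrant", "NOUN")],
    [("Mexican", "mexican", "ADJ"), ("migrant", "migrant", "NOUN"), ("workers", "worker", "NOUN")],
    [("the", "the", "DET"), ("migrants", "migrant", "NOUN"), ("left", "leave", "VERB")],
]


def preceding_adjectives(corpus, target_lemma, top_n=5):
    """Count adjectives that sit immediately before any form of target_lemma."""
    counts = Counter()
    for sentence in corpus:
        # Pair each token with the token right after it.
        for (word, _, pos), (_, next_lemma, _) in zip(sentence, sentence[1:]):
            if next_lemma == target_lemma and pos == "ADJ":
                counts[word.lower()] += 1
    return counts.most_common(top_n)


top = preceding_adjectives(TOY_CORPUS, "migrant")
print(top)
```

Matching on the lemma rather than the surface word is what lets "migrant" and "migrants" land in the same bucket.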

Expats are most likely to be qualified with their nationalities, and the top nationalities are Western countries with white majorities. Immigrants are frequently described by their legal status, but also as Mexican. Migrants have a mix of nationality, race, and status. Why don’t we have Mexican expats when Mexican people comprise the largest portion of foreign-born people living in the U.S.? Why must we speak of immigrants primarily in terms of legal status?

Expat v. Immigrant v. Migrant Post 2: Dictionary Examples

The public discourse around people who move to the U.S. is ugly at the moment (the moment being Build a Wall, Travel Ban, and Zero Tolerance). This series (Post 1) uses dictionaries and corpus linguistics to reflect on how we speak about people that move from one country to another.

If like my co-bish, Caitlin, you read dictionaries for fun, you might have noticed that there are frequently example phrases and sentences to provide further context of how the word is used. You might even own a book that composed short stories using only dictionary example sentences and you might read it aloud to your indifferent friend, Kaylin.

Lexicographers select example phrases and sentences from corpora that demonstrate how that word is used in a “typical grammatical and semantic context.” That is to say, the examples are intended to be emblematic of the word’s usage as determined by big data (corpus linguistics). Some online dictionaries even populate example sentences from recent media in addition to the official example. What are typical grammatical and semantic contexts for expat, immigrant, and migrant?

An example of a bad example.

We will stick with the same dictionaries from the first post in this series. American Heritage Dictionary online had no examples (can someone old school look them up for me in your real life dictionary pleeaassee). However, our other two dictionaries give us some food for thought.

Compare New Oxford American Dictionary’s examples for expat and immigrant:

  • ‘American expatriates in London’
  • ‘they found it difficult to expel illegal immigrants.’

The difference is jarring. I am jarred. Expats are from a wealthy country neutrally existing in a cosmopolitan city. Immigrants are without status and an ominous ‘they’ attempts to remove them.

(New Oxford American did not have an official example for migrant, but it had multiple example sentences from what I gather to be news sources though they are not credited. From words spelled commonwealthily and references to Australia, I have surmised that most of these sources are not American, and must be coming from the sister dictionary Oxford Dictionary of English.)

Merriam-Webster doesn’t chap my lips as much.

  • ‘English and American expatriates in the bars of Paris’
  • ‘Millions of immigrants came to America from Europe in the 19th century.’  
  • ‘migrants in search of work on farms’.

The M-W example for immigrant is not negative like NOA’s, so that’s something. As in NOA, expats are from wealthy, Western countries with white majorities hanging out in a foreign city. This particular example was from a sentence about Ernest Hemingway and Gertrude Stein credited to Robert Penn Warren. The example for migrant squares with M-W’s definition: a person who moves regularly for agricultural work. Agricultural labor, a so-called low-skilled job, is not for expats. Expats move abroad to work as writers and NGO staff and businessbishes. Expats have privilege. Migrants move abroad to toil in fields. Migrants are disenfranchised.

Not blaming Mrs. Dictionary for any of this. She is merely the vessel. Tune in for the next post on the source material: corpus.

Emoji Grammar as Beat Gestures

If you’re a Lingua Bish, you probably know about celebrity linguists Dr. Gretchen McCulloch😻 and Dr. Lauren Gawne 😻. In their presentation at the 1st International Workshop on Emoji Understanding and Applications in Social Media in June 2018, they presented their research to answer the question once and for all: Are emojis a language 🤔? But actually, Gretchen and Lauren always use emoji as the plural of emoji (bishes don’t), and their research question was “If languages have grammar and emoji are supposedly a language, then what is their grammar?”

If you try to compare emojis to language, the closest you’ll get is word units. Of all the bits of a language, emojis are most similar to words, but language is so much more than a bunch of words. It has parts of speech and structure (and so many other things). Emojis often affect the tone of text or add a layer of emotion😏, but Lauren and Gretchen think that’s just a small part of it because their effect isn’t always straightforward. To compare emojis to words, they decided to look at the most used word sequences and compare them to the most used emoji sequences. They hypothesized that if emoji sequences are repeated they should be considered “beat” gestures, but what is that even?

Beat Gestures and Emojis

So gestures are a different type of communication🖐. They are not a language and they don’t have grammar. 

a beat gesture and definitely cool

One type of gesture is the “beat” gesture. It is characterized by its absence of meaning and its repetitive nature. You use beat gestures when you talk with your hands👐 and most gestures politicians make during speeches are beat gestures.

not cool and not a beat gesture

However, when a really cool person bobs their open palms up and down in the air above their head, you know it means “raise the roof”, so this is not a beat gesture. It seems like emojis act the same way as beat gestures, often repetitive and often with no inherent meaning unless accompanied by words🤯.

The Emoji Corpus

Gretchen and Lauren used a SwiftKey emoji corpus to check out sequences of two, three, and four emojis. That means that they looked for groups of emojis that often appear together. They looked for the 200 most common sequences and noticed that the top sequences used just one repeated emoji. These were the top 10 sequences in the SwiftKey emoji corpus:

The Word Corpus

Then they used the Corpus of Contemporary American English (COCA) to check out word sequences to compare to the emoji sequences. The COCA contains around 500 million words from things like news outlets and websites👩‍💻. In the 200 most common word sequences, they found almost no repetition. The only times words were repeated were in the cases of “had had” and “very very very.” However, these didn’t even make the top 200. And yes, that could just be because the COCA is formal and perhaps a corpus of informal language would have yielded different results. For example, you might get instances of what linguists call ‘salad-salad reduplication’ (Ghomeshi et al. 2004), as in “it’s salad salad🥗, not ham salad or jello salad”. It’s the same as “OMG you like like them 😲??” or “It’s Saturday. Tonight I’m going out out💃,” but this bish is digressing.
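If you want to picture the counting behind both corpora, here is a rough sketch, assuming each message is already split into tokens (emojis or words). The toy messages and the helper names `top_sequences` and `is_pure_repetition` are made up for illustration. Nothing here comes from SwiftKey or COCA, and it is not the authors' actual pipeline.

```python
from collections import Counter


def top_sequences(messages, sizes=(2, 3, 4), top_n=200):
    """Count every run of 2, 3, or 4 adjacent tokens and return the most common."""
    counts = Counter()
    for tokens in messages:
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(top_n)


def is_pure_repetition(sequence):
    """True when the sequence is one token repeated, like ('😂', '😂')."""
    return len(set(sequence)) == 1


# Made-up toy data, not drawn from SwiftKey or COCA.
emoji_messages = [["😂", "😂", "😂"], ["😂", "😂"], ["❤", "❤"], ["🙈", "🙉", "🙊"]]
word_messages = [["i", "do", "not", "know"], ["i", "do", "not"], ["very", "very"]]

for label, messages in (("emoji", emoji_messages), ("word", word_messages)):
    top = top_sequences(messages, top_n=3)
    repeated = [seq for seq, _ in top if is_pure_repetition(seq)]
    print(label, len(repeated), "of", len(top), "top sequences are pure repetition")
```

Running the same two functions over both kinds of tokens is the whole point: the comparison only works if emoji sequences and word sequences are counted the same way.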

Comparing Words to Emojis

The point is, where words are very rarely repeated in a sequence, it appears that emojis are. You’re probably like, “but I send 2-4 emojis at a time and they don’t repeat.” Ya, you might, but I bet they’re pretty similar like 5 different hearts💝💘💖💗💓, or the hear-no-evil monkeys🙈🙉🙊, or allll the dranks🍾🍹🍸🥃🍷🥂🍺. So ya, sometimes they’re all different, but if so, they’re likely on a theme.

But even though emojis can be more repetitive than speech or writing, most emojis occur next to words and not in sequences. Even where emojis occur without words, it’s mostly just one or two at a time and usually in response to a previous message. Guess who else usually partners with words? You guessed it, beat gestures👊! 

It seems like emojis and beat gestures have a lot in common. Let’s list the ways: 

  1. no grammatical structure
  2. no inherent meaning unless accompanied by words
  3. often repeated
  4. often add emphasis

Maybe emojis and beat gestures should get a room already 👉👌😜.

Conclusion

Basically the idea is just to shift the way we think of emojis. Thinking of them as a new language with grammar won’t get research far. Gretchen and Lauren might be on to something by considering emojis to be a type of gesture. Emojis don’t have their own grammar, but they work with our written grammar. They add emphasis, just like beat gestures do with our spoken grammar. So, it’s unlikely that emojis can ever be a full language. If they ever start exhibiting structural regularities in corpus studies though, and start languagifying, I’m sure Gretchen and Lauren will be there to catch it.

This paper is great for emoji bishes👯‍, anyone who texts📱, corpus bishes, and lingthusiasts👸🏻👸🏿👸🏼👸🏾.

——————————————————————————————————–

McCulloch, Gretchen, and Lauren Gawne. “Emoji Grammar as Beat Gestures.” In: S. Wijeratne, E. Kiciman, H. Saggion, A. Sheth (eds.): Proceedings of the 1st International Workshop on Emoji Understanding and Applications in Social Media (Emoji2018), Stanford, CA, USA, 25-JUN-2018, published at https://ceur-ws.org

Ghomeshi, Jila, et al. “Contrastive Focus Reduplication in English (The Salad-Salad Paper).” Natural Language & Linguistic Theory, vol. 22, no. 2, 2004, pp. 307–357., doi:10.1023/b:nala.0000015789.98638.f9.

Prospects and Challenges of Short-Term Historical Lexicography

My favorite publication is American Speech, a quarterly journal published by Duke University Press. Yes, it’s a little Anglo-centric, but it has my favorite recurring feature Among the New Words. I developed a very close relationship to this feature through my master’s thesis when I used it to comb through and analyze 10 years’ worth of “new words”. That’s around 2500 words and it was an arduous, tedious, fantastic dictionary wonderland that was totally the best and the worst.

Among the New Words, hereafter to be referred to as ATNW, has the lofty mission of documenting new words and uses of words in real time. It is a totally non-traditional style of lexicography. It’s been running regularly since 1941 but had different incarnations as early as 1937. In its nearly 80 years, ATNW has gone from reader-submissions to the internet age. Ben Zimmer and Charles E. Carson decided to look at ATNW’s history and consider its future in the most exciting paper I’ve read all year: Prospects and Challenges of Short-Term Historical Lexicography (2018).

How it all started

In 1933 an English teacher slash Jewish immigrant (slash, from his awesome name, I can only assume refugee from Mordor), Isidor Colodny, started publishing a monthly magazine called Words: A periodical devoted to the study of the origin, history, and etymology of English words. I guess this is the kind of thing people did before Instagram. A couple years later, Isidor (Lord of the 8th ring probably) enlisted Dwight Bolinger, a Spanish teacher with a Ph.D., to write a column called “The Living Language”.

Bolinger noted a very important part of word collection in his introduction to the very first column. He pointed out that new words are often

“…transitory, so that they leave no mark upon the dictionary; and even those which are fortunate enough to make their way into that solemn repository are usually not recorded in such a way as to show just how they came into being, what was their original context, what suggestive power they may have had aside from their literal meaning…”

Which was basically the premise of my whole thesis btdubs. Also, “that solemn repository” is totally what I’m calling dictionaries from now on.

So Bolinger’s original method for The Living Language was to have readers submit new words and words they found to be used in new ways. They were also asked to include information about coinages (unrealistic goals much?). Even with modern resources, we can’t usually accomplish that. Zimmer and Carson use Bolinger’s entry for “hootenanny” as an example of the difficulties of dating coinages pre-internet. Bolinger dated its first use as 1935, but the internet tells us it was used as early as 1906.

Nevertheless, his column reached a broad audience including co-founder of American Speech, H. L. Mencken. Bolinger was invited to join the journal and renamed his column Among the New Words in 1941. A man before his time, he eschewed traditional domestic American life for an international, 3D immersive, freelance experience teaching in Costa Rica and performing his American Speech duties remotely.

Bolinger’s neologism spotting skills were on point. He wrote about -worthy from jump. He noticed that we had gone from seaworthy and trustworthy to all kinds of new worthies like newsworthy, courtworthy, and credit worthy par example. That was 1941. Now we have such beauties as Oscar-worthy, cringe-worthy, and meme-worthy.  

Another thing he got right was that we create new words by pronouncing onomatopoeia as words. He noted ahem and tisk. And that’s totally carried on. Just think of nom nom.

How it all changed

Bolinger passed the torch in 1944 and ATNW met a series of new editors. For the publication’s 50th anniversary, Adele Algeo and her husband John (who were running ATNW at the time) produced a commemorative edition with an overview of the different processes of documenting new words that had been used. Inspired by this, its editors (also Zimmer and Carson + Solomon) did another retrospective for its 75th anniversary.

A lot of methods were used over the years. There was a lot of reading, and submissions by readers in the beginning. In 1997 Wayne Glowka chose the “ask the kids” method by roping in his undergrad students for credit. Also, in the 90s there was this amazing new method created. It was called “electronic database searching.” So, I don’t know… Encarta perhaps? And since 2009, or “The Year of the Tweet” as I call it, access to language changed. The inundation of language from all social media platforms has made tracking neologisms less a matter of collection and more a matter of curation (Zimmer and Carson 2018).

Another cool update is that the publication went digital in 2010. So now when describing a new word, writers can include links to digital media like TV, speeches, music videos, and memes.

The challenges

More access to IRL language use is awesome, but it’s also mo’ words mo’ problems. Ya gotta have a system for using search engines and determining what’s real and what’s just a google algorithm. So, let’s talk about ratchet, shall we? It was the American Dialect Society’s word of the year in 2012. ATNW’s initial treatment of it included these four senses:

  1. (insult) adj Over the top, to the extreme, beyond socially acceptable (1999)
  2. (insult) n Woman who is ratchet (as in sense 1) (1999)
  3. (neutral or positive) n Type of dance in Shreveport, La., or subgroup of rap music associated with the dance (2004)
  4. (positive) adj Excellent, wildly fun, exceeding expectations (2007)

According to ATNW’s initial entry it all basically started with one kickass grandmother in Shreveport, Louisiana. That’s right, innovative wordsmith Anthony Mandigo allegedly used a word he’d learned from his granny as the title of his hot new track to usher in a new style of rap, Ratchet Rap. ATNW speculated that the word could have come from “wretched.”

But wait! After the publishing, a reader found an earlier use of the word (that’s called antedating btw). It was used in its first sense in the 1992 song “I’m So Bad” by UGK, a delightful ditty about S-ing one’s own D as far as I can tell. UGK was from Texas. To this day, that’s all ATNW knows.

All of this illustrates that you can’t just do a google search and call it a dictionary. If the ATNW editors were listeners of rap from 1992 Texas, they would have been able to write a much more informed entry. Clearly, people have been using ratchet since before 1992; UGK didn’t make it up. It also is a lesson on diversity and inclusion because, stop me if I’m wrong, but I have an image in my head of what the editors of ATNW and those solemn repositories have traditionally looked like, what kind of music they’ve listened to, and which regional dialects they’ve used and I’m willing to bet “ratchet” wasn’t in their lexicons.

So, when you conduct your search of “electronic databases” and the like, you need to thoroughly investigate the source (time and place) and look for whoever was producing content at that time. Rarely are words coined out of the blue, so if you can’t find any more instances of the word, call a friend. Someone you know knows someone who knows someone from that area. Sherlock the heck out of that shit!

This article is great for historical linguistics bishes, lexicography bishes, and Ben Zimmer stans. 

—————————————————————————————————————————

Zimmer, Benjamin, and Charles E. Carson. “‘Among the New Words’: The Prospects and Challenges of Short-Term Historical Lexicography.” Dictionaries: Journal of the Dictionary Society of North America, vol. 39, no. 1, 2018, pp. 59–74., doi:10.1353/dic.2018.0010.

Dismantling the Native-speakerarchy Post 2: “The role of vowel quality in ELF misunderstandings”

(This is the second post in the series “Dismantling the Native-speakerarchy.” Check out the first post here.)

It’s time to pull another Jenga block out of the Native-speakerarchy tower. That block is vowel quality in English as a Lingua Franca (ELF) interactions brought to you by the Asian Corpus of English.  

ELF v. EFL

English as a Lingua Franca (ELF) is often defined in juxtaposition to English as a Foreign Language (EFL). Yes, yes, the acronyms are irritatingly similar. Don’t shoot the messenger.

ELF refers to English used by speakers of other languages for intercultural communication. Think a French girl and Thai boy falling in love with English as their medium of communication. Or a Korean businesswoman negotiating with a Chinese board of directors in English. ELF prioritizes intelligibility and acknowledges that users will have variations (dropping articles, using relative pronouns like who and which interchangeably, etc.) that deviate from ‘native-speaker’ norms. The variations are a feature not a bug. A natural occurrence in language patterns, not a deficit.

By contrast, English as a Foreign Language is designed to prepare users for communicating with a ‘native-speaker,’ and implied is an attempt to conform to inner-circle (U.S., U.K. etc.) standards. Think a Japanese student studying English to matriculate in a Canadian university. Deviations from the standard are errors. English language instruction in an EFL model seeks to raise students’ accuracy levels to be accepted in academic and professional settings dominated by ‘native-speakers.’ Individual teachers of EFL might not have that philosophy, but mass market coursebooks, curriculum, assessments, and hiring practices demonstrate the pervasive nature of the ‘native-speaker’ norms.

Back to my bae, ELF. English as a Lingua Franca is a threat to the status of ‘native-speaker’ teachers as the gatekeepers of English AND I AM HERE FOR IT. ELF speakers bring the richness of their accents to English, and they don’t have time for all of English’s quirks. Third person singular ‘s,’ I am lookin’ at you.

The Paper

David Deterding and Nur Raihan Mohamed (2016) used the Asian Corpus of English (ACE) to investigate the impact of vowel quality on intelligibility. ACE is a collection of “naturally occurring, spoken, interactive ELF in Asia.” A veritable playground for ELF fanatics.

The OG ELF fangirl Jennifer Jenkins wrote the literal book on it and identified the Lingua Franca Core: a list of pronunciation features that are necessary to comprehensibility in English. Spoiler alert: it’s a short list. It includes “all the consonants of English apart from the dental fricatives, the distinction between long and short vowels, initial and medial consonant clusters, and the placement of intonational nucleus.” (Deterding and Mohamed, 2016, p. 293).

Lemme ‘splain.

  • Most consonant sounds are necessary for intelligibility. However, the pesky sounds /θ/ as in thot and /ð/ as in that hoe over there are not necessary because substitutions like /f/, /v/, /d/ typically suffice.
  • Short v. long vowels. You know, your sheets v. shits, and your beaches v. bitches, etc. Mastering vowel length is considered important for intelligibility according to Jenkins’ research.
  • Initial and medial consonant clusters. Sounds like /str/, /mp/, /xtr/, /pl/, /scr/, and so on at the beginning of words, and to a lesser extent, in the middle of words, need to be kept intact for the speaker to be comprehensible.
  • Placement of intonational nucleus: This is stress on a syllable in an intonational unit (group of words), and the wrong stress can throw off the listener, so Jenkins includes it in the Lingua Franca Core.

All other pronunciation features are deemed fair game in ELF by Jenkins, including vowel quality, which is what this paper focuses on. Vowel quality refers to what makes vowels sound different from each other: “I must leave the pep rally early to get a pap smear. Pip pip!”

Vowel quality is why JT’s delivery in “It’s Gonna Be Me” spawned this meme: 

Drawing on ACE, Deterding created the Corpus of Misunderstandings, abbreviated CMACE (incidentally, the name of my emo band), with data from exclusively outer and expanding circle English speakers.

This paper is building on Deterding’s earlier 2013 work that determined 86% of misunderstandings in CMACE involved pronunciation. He and Mohamed dig into vowel quality specifically because it was left off the Lingua Franca Core by Jenkins.  

Of the 183 tokens of misunderstanding in the corpus, 98 involved vowel quality. In many of those tokens both vowel length and quality were an issue, but as vowel length is part of the Lingua Franca Core, those tokens were not included in the analysis, leaving 22 tokens of short vowels misheard for other short vowels. Half of these tokens included /æ/ and /ɛ/, referred to as the TRAP and DRESS vowels in the literature, but what we will call the SASS and FEMME vowels.
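For the number bishes, here is a quick back-of-the-envelope version of those counts. The figures are the ones reported above; the variable names are mine, and I am taking “half” literally, which is an assumption.

```python
# Counts as reported in this post's summary of Deterding and Mohamed (2016).
total_tokens = 183          # all tokens of misunderstanding in the corpus
vowel_quality_tokens = 98   # tokens that involved vowel quality
short_vowel_tokens = 22     # short vowel misheard as another short vowel

share_vowel_quality = vowel_quality_tokens / total_tokens
share_short_vowel = short_vowel_tokens / total_tokens
# Assumption: "half of these tokens" is read as exactly half.
trap_dress_tokens = short_vowel_tokens // 2

print(f"{share_vowel_quality:.0%} of all tokens involved vowel quality")
print(f"{share_short_vowel:.0%} of all tokens were short-for-short vowel mishearings")
print(f"{trap_dress_tokens} of those included the TRAP or DRESS vowel")
```

So a bit over half of the misunderstandings touched vowel quality at all, and only about one in eight made it into the close analysis.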

When they analyzed each of the 22 tokens in context, they found other pronunciation features that probably caused the misunderstanding, and that vowel quality was indeed a minor factor. For example, “In Token 5, wrapping was misunderstood as ‘weapon’, but the key factor here was the occurrence of /w/ instead of /r/ at the start of the word” (p.229). Recall that consonant sounds are in the Lingua Franca Core and play a big role in intelligibility.

Conclusion

David Deterding and Nur Raihan Mohamed’s research supports Jenkins’ contention that conforming to ‘native-speaker’ standards in vowel quality is unnecessary for English users to successfully communicate. Let me put on my extrapolation cap because you know how I do. ‘Native-speaker’ English teachers don’t have a pronunciation edge over ‘non native-speaker’ teacher colleagues when it comes to vowel quality. It literally does not matter if someone pronounces it, “Thet’s eccentism, you esshet!”

Check out this article if you are a research bish that wants to see the kind of work that can be done with corpus linguistics. And if you’re an EFL bish or an ELF kween. And if you’re a NNEST.


ACE. 2014. The Asian Corpus of English. Director: Andy Kirkpatrick; Researchers: Wang Lixun, John Patkin, Sophiann Subhan. https://corpus.ied.edu.hk/ace/ (May 26, 2018)

Deterding, D. & Mohamed, N. R. (2016). The role of vowel quality in ELF misunderstandings. Journal of English as a Lingua Franca, 5(3). 291-307.

Jenkins, J. (2000). The phonology of English as an international language. Oxford: Oxford University Press.
