11  Borrowing Detection & Language Contact

Learning Objectives

By the end of this chapter, you will be able to:

  • Distinguish inherited vocabulary from borrowed vocabulary using phonotactic features
  • Identify phonological cues that mark a word as foreign to a language
  • Engineer features for a borrowing detection classifier
  • Evaluate classifier performance on the English loanword detection task
  • Explain why recent vs. ancient borrowings look different
  • Describe the geography and history of major borrowing events in English

11.1 The Norman Invasion, One Word at a Time

On October 14, 1066, the English army lost at Hastings and William the Conqueror gained a throne. The linguistic consequences lasted longer than the battle: for the next three centuries, the language of the ruling class was Norman French. English peasants kept their cows, pigs, sheep, and deer. French-speaking nobles ate beef, pork, mutton, and venison. The two words for each animal are not synonyms: they are the same referent experienced from opposite ends of the social order, encoded in two different vocabularies that English eventually absorbed as one.

This is borrowing at its most spectacular—a sudden, well-documented influx of vocabulary from a prestige language following a conquest. But borrowing also happens quietly, gradually, wherever two language communities are in sustained contact. Scandinavian settlers in northern England gave us sky, skin, egg, window, and the pronouns they, them, their. The Romans, who never conquered Ireland but traded with it, left traces in Celtic languages. Every language is a palimpsest of historical contacts, and the borrowed words are the ink stains.

For a computational system, the question is: can we detect these stains automatically?

11.2 Phonotactics: The Fingerprint of Foreign Words

Every language has implicit rules about which sequences of sounds are permissible. These rules—phonotactics—are acquired in childhood along with the phoneme inventory itself. Native speakers know, without being taught, that English words can begin with str- (string, stripe, strong) but not with tl- (except in some borrowed words), and that German words can end with -cht (Macht, Nacht, Tracht) but English words rarely do.

When a word is borrowed from another language, it often carries phonotactic patterns that are foreign to the borrowing language. Even after phonological adaptation, traces can remain.
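The idea of a phonotactic "fingerprint" can be made concrete with a minimal sketch: collect the word-initial consonant clusters attested in a list of inherited words, then flag any word whose onset never occurs in that list. The word list, the helper names, and the use of spelling as a stand-in for pronunciation are all assumptions made for illustration; a real system would work from phonemic transcriptions and a much larger, etymologically annotated lexicon.

```python
import re

# Hypothetical mini-lexicon of inherited English words (illustration only).
NATIVE_WORDS = ["string", "stripe", "strong", "night", "water", "bring",
                "sleep", "think", "child", "hand", "walk", "mother"]

VOWEL_LETTERS = "aeiouy"  # crude: treats the letter y as a vowel everywhere


def onset(word: str) -> str:
    """Return the run of consonant letters before the first vowel letter."""
    return re.match(rf"[^{VOWEL_LETTERS}]*", word.lower()).group(0)


def learn_onsets(words):
    """Collect the set of word-initial consonant clusters seen in `words`."""
    return {onset(w) for w in words}


def has_unattested_onset(word, known_onsets):
    """True if the word begins with a cluster absent from the native list."""
    return onset(word) not in known_onsets


known = learn_onsets(NATIVE_WORDS)
for candidate in ["strand", "tlingit", "schnitzel", "hound"]:
    print(candidate, has_unattested_onset(candidate, known))
```

With this toy lexicon, strand and hound reuse attested onsets, while tlingit and schnitzel begin with clusters the native list never shows. An unattested onset is only evidence, not proof: with a small word list, plenty of perfectly native onsets will also be missing.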

Table 11.1: Common phonotactic markers of loanword status in English. Patterns marked as ‘foreign’ are statistically more common in borrowed than inherited words.
| Pattern | Likely source | Status | Examples |
|---|---|---|---|
| Initial /v/ | French/Latin (rare in native Germanic) | Foreign signal | valley, veal, vision, virtue |
| Initial /dʒ/ (j-sound) | French/Latin (Old English had no word-initial /dʒ/) | Foreign signal | judge, gentle, giant, jazz |
| Final-syllable stress | French, Italian, other Romance | Foreign signal | café, fiancé, naïve, résumé |
| ph spelling for /f/ | Greek | Foreign signal | phone, photo, philosophy, graph |
| Initial /ʃ/ + consonant clusters (/ʃn/, /ʃt/, /ʃm/) | German, Yiddish | Foreign signal | schnitzel, shtick, schmuck |
| Final -que / -ique | French | Foreign signal | mystique, grotesque, unique |
| Intervocalic /v/ | French/Latin | Ambiguous | novel, level, oval, rival |
| Latinate suffixes -tion, -ment, -ous | French/Latin | Strong foreign signal | nation, movement, gorgeous |
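Several of the markers in Table 11.1 can be approximated with regular expressions over spelling. The sketch below is one possible encoding, not the chapter's feature extractor: the marker names and the patterns are hypothetical, and orthography is only a rough proxy for phonology. Final-syllable stress is omitted because it cannot be read off the spelling, and intervocalic /v/ is omitted because the table marks it as ambiguous.

```python
import re

# Orthographic stand-ins for markers in Table 11.1 (assumed patterns).
MARKERS = {
    # Initial /v/
    "initial_v": re.compile(r"^v"),
    # Initial /dʒ/: spelled j, or g before e/i. Over-triggers on native
    # words with hard g such as "give" and "get".
    "initial_j_sound": re.compile(r"^(j|g[ei])"),
    # ph spelling for /f/
    "ph_for_f": re.compile(r"ph"),
    # Initial /ʃ/ + consonant, spelled sch- or sh- before m, n, t
    "initial_sh_cluster": re.compile(r"^(sch|sh)[mnt]"),
    # Final -que / -ique
    "final_que": re.compile(r"que$"),
    # Latinate suffixes -tion, -ment, -ous
    "latinate_suffix": re.compile(r"(tion|ment|ous)$"),
}


def foreign_markers(word: str) -> list[str]:
    """Return the names of all markers whose pattern matches the word."""
    lowered = word.lower()
    return [name for name, pattern in MARKERS.items() if pattern.search(lowered)]


for word in ["valley", "judge", "philosophy", "schnitzel",
             "mystique", "nation", "night", "give"]:
    print(word, foreign_markers(word))
```

Each marker can serve as one binary column in a feature matrix. The false positive on give shows why spelling-based rules need either a phonemic transcription or a learned weight that can discount unreliable cues.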

11.3 Feature Engineering for Borrowing Detection

Our classifier will use features derived from phonotactics, morphology, and historical sound correspondences.

Listing 11.1
Feature matrix sample:
     word  latin_suffix  greek_element  germanic_pattern  borrowed
0   night           0.0            0.0               1.0         0
1   water           0.0            0.0               0.0         0
2  mother           0.0            0.0               1.0         0
3  father           0.0            0.0               1.0         0
4    hand           0.0            0.0               0.0         0
5   bring           0.0            0.0               0.0         0
6    walk           0.0            0.0               0.0         0
7   sleep           0.0            0.0               0.0         0
8   think           0.0            0.0               1.0         0
9   child           0.0            1.0               0.0         0
Logistic Regression F1: 0.831 ± 0.093
Random Forest F1: 0.819 ± 0.136
Figure 11.1: Logistic regression coefficients for borrowing detection features. Positive = associated with borrowed words; negative = associated with inherited words.
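Scores of the form "0.831 ± 0.093" are typically a mean and standard deviation over several evaluation folds. The sketch below shows how such a summary could be computed from scratch. It assumes the ± in Listing 11.1 is a standard deviation across cross-validation folds, which the listing does not state, and the fold labels and predictions are invented purely for illustration.

```python
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = borrowed, 0 = inherited)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (gold labels, predictions) per fold; not the chapter's data.
folds = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect fold
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one missed loanword
    ([1, 1, 0, 0], [1, 1, 1, 0]),  # one inherited word flagged as a loan
]

scores = [f1_score(gold, pred) for gold, pred in folds]
# stdev() is the sample standard deviation; pstdev() would give the
# population version. Which one a given report uses is worth checking.
print(f"F1: {mean(scores):.3f} ± {stdev(scores):.3f}")
```

A wide ± relative to the difference between two models, as with the two classifiers in Listing 11.1, suggests the gap between them may not be meaningful on a dataset of this size.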

11.4 The Social Life of Borrowed Words

Borrowing is not random. It follows social patterns with remarkable consistency across cultures and time periods.

Prestige borrowing is the most common type in recorded history. When a language community is in contact with another community that has higher social, military, or economic prestige, vocabulary flows from the prestigious language into the lower-prestige one—not the reverse. Norman French provided English with an enormous lexicon of administrative, legal, culinary, and artistic terms precisely because French was the language of power after 1066. English speakers who wanted to participate in the ruling culture learned French words; French speakers rarely needed to learn English ones. The asymmetry of the borrowing records the asymmetry of social power.

Domain-specific borrowing reflects which culture innovated in which domain. English borrowed heavily from Greek for scientific and medical terms because Greek was the language of the Hellenistic scientific tradition. It borrowed from Arabic for mathematical and astronomical terms (algebra, algorithm, zenith, nadir) because the Islamic world preserved and extended Greek mathematics during the European Middle Ages. It borrowed from Italian for musical terms (allegro, soprano, forte, piano) because Italian opera and instrumental music dominated European culture in the seventeenth and eighteenth centuries. Each borrowing stratum is a map of intellectual and cultural contact.

Phonological assimilation over time is the process by which borrowed words are gradually adapted to the phonological patterns of the borrowing language. A word borrowed yesterday sounds foreign; a word borrowed a thousand years ago sounds native. English street (from Latin strata via, paved road) has been adapted so thoroughly to English phonological patterns that it is indistinguishable from inherited Germanic vocabulary by most metrics. Only comparative reconstruction can reveal its origin.

The computational implication: borrowing detection must be calibrated to the expected time depth of the loans being detected. A system trained to detect recent French borrowings in English will fail on ancient Latin borrowings because the features that signal foreignness erode with time.
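One practical way to check this calibration is to report recall separately for each borrowing period instead of a single aggregate number. The sketch below uses hypothetical evaluation records: every word listed is a genuine loan (wine, from Latin vinum, is added here as a second early example), but the predictions are invented for illustration and do not come from the chapter's classifier.

```python
from collections import defaultdict

# (word, borrowing period, did the classifier predict "borrowed"?)
# Predictions are made up to illustrate the bookkeeping.
records = [
    ("café", "recent", True),
    ("genre", "recent", True),
    ("virtue", "medieval", True),
    ("castle", "medieval", False),
    ("street", "ancient", False),
    ("wine", "ancient", False),
]


def recall_by_period(rows):
    """Fraction of true loans detected, grouped by borrowing period."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _word, period, predicted in rows:
        totals[period] += 1
        hits[period] += int(predicted)
    return {period: hits[period] / totals[period] for period in totals}


per_period = recall_by_period(records)
overall = sum(int(predicted) for _, _, predicted in records) / len(records)

print("overall recall:", overall)
for period, value in per_period.items():
    print(period, value)
```

In this toy data the overall recall of 0.5 hides a split between perfect recall on recent loans and zero recall on ancient ones, which is exactly the pattern an aggregate score would conceal.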

11.5 Why Ancient vs. Recent Borrowing Looks Different

Not all loanwords are equally detectable. The key variable is time.

Recent borrowings (the last few centuries) arrive with their foreign phonology largely intact. Café, fiancé, genre, naïve: these words still carry French spellings, French stress patterns, and French phonotactics.

Medieval borrowings (Norman and later medieval French, c. 1066–1400 CE) have been partially adapted to English phonology. Castle, fashion, virtue, vision: these were borrowed roughly 600 to 950 years ago and have been partly “Anglicized,” but many still retain Latinate suffixes or non-Germanic initial consonants.

Ancient borrowings (pre-Old English, Roman-era) are nearly undetectable. The words that came into Proto-Germanic from Latin during the period of Roman-Germanic contact (roughly 100 BCE–400 CE) have undergone all the same sound changes as native vocabulary. Street comes from Latin strata via (paved road), but it has had Germanic-style sound changes applied to it for nearly two thousand years and looks completely native.

Figure 11.2: Detectability of English loanwords by borrowing period. Older borrowings have lower classifier confidence because they have been phonologically assimilated.

11.6 Contact Zones: Where Borrowing Is Heaviest

Borrowing also has a geography. It clusters in contact zones and follows the social patterns described above: vocabulary from prestige languages, trade languages, and languages of conquest spreads faster than vocabulary from isolated or low-prestige communities.

Well-documented contact zones:

  • English: massive French/Latin borrowing after 1066; Old Norse borrowing in the Danelaw
  • Japanese: waves of Chinese borrowing (Sino-Japanese vocabulary), then Dutch borrowing (scientific terms), then English borrowing (modern technology)
  • Swahili: Arabic borrowing along the East African coastal trade routes, then English and French colonial-era borrowing
  • Balkan Sprachbund: Greek, Romanian, Bulgarian, Albanian, and Serbian share syntactic and lexical features despite belonging to different branches of Indo-European, with Turkish (a Turkic language) contributing heavily to their shared vocabulary; a remarkable case of contact-induced convergence

11.7 The Challenge of Fully Assimilated Loans

The hardest borrowing detection problem is the fully assimilated loan: a word borrowed so long ago that it has undergone all the same phonological and morphological changes as native vocabulary. For English, words borrowed into Proto-Germanic from Latin before the Germanic migrations (pre-400 CE) are essentially indistinguishable from inherited vocabulary on phonological grounds alone.

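The only reliable tool for these cases is comparative reconstruction. Because such loans were borrowed before English split from its closest relatives, they often appear in other Germanic languages as well (compare German Straße and Dutch straat beside English street), so distribution within Germanic is not decisive on its own. The stronger clue is a word that matches a Latin or Greek form closely while failing to show the regular sound correspondences expected of inherited cognates, or one that is confined to the branches that were in contact with Rome. Such a word may be an ancient loan. Establishing this requires the full comparative method, which, as we have seen, can now be partially automated.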
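The form-matching part of this test can be approximated with a string-similarity score. The sketch below uses normalized Levenshtein distance between an early attested form and a candidate donor form; the function names and the 0.5 threshold are assumptions chosen for illustration, and surface similarity is only a first filter, not a substitute for checking regular sound correspondences.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def resembles_donor(word: str, donor_form: str, threshold: float = 0.5) -> bool:
    """Flag a word whose form is suspiciously close to a donor-language form."""
    return similarity(word, donor_form) >= threshold


# Old English forms compared with Latin forms.
print("stræt / strata:", round(similarity("stræt", "strata"), 3))  # early loan
print("niht / noctem:", round(similarity("niht", "noctem"), 3))    # inherited cognate
```

Old English stræt stays close to Latin strata, while niht and noctem, which are inherited cognates, have drifted further apart through separate sound changes. That contrast is suggestive only: close cognates and chance resemblances can also score high, which is why the comparative method checks systematic correspondences across many words.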

11.8 Summary

Borrowing detection is a classification problem that exploits the phonotactic foreignness of loanwords relative to a language's inherited vocabulary. Useful features include initial consonant clusters, final-syllable stress, Latin and Greek morphological suffixes, and non-native phoneme sequences; on the English loanword task, the classifiers in Listing 11.1 reach mean F1 scores of about 0.82 to 0.83, with a spread of roughly ±0.1 around those means. The key confounding factor is borrowing age: ancient loans have been phonologically assimilated and are nearly indistinguishable from native vocabulary on surface features alone, so only comparative reconstruction can reliably detect them. Major borrowing events cluster in historical contact zones (post-conquest prestige borrowing, trade-route vocabulary transfer, and colonial-era language spread) and leave characteristic patterns in the phonological profile of the affected language.

11.9 Further Reading

  • Haspelmath, M., & Tadmor, U. (Eds.). (2009). Loanwords in the World’s Languages: A Comparative Handbook. De Gruyter Mouton.
  • Embleton, S. (1986). Statistics in Historical Linguistics. Bochum: Brockmeyer. On probabilistic approaches to cognate and loan detection.
  • Thomason, S. G. (2001). Language Contact: An Introduction. Edinburgh University Press.
  • Youn, H., et al. (2016). On the universal structure of human lexical semantics. PNAS. Cross-linguistic analysis of basic vocabulary.