Self-segregating morphology

From the Logical Languages Wiki

This article contains opinions that may not necessarily reflect the views of the LLWiki or larger loglanger community.

A language is said to have a self-segregating (occasionally self-segmenting) morphology if every utterance in the language can only be broken up into words and morphemes, or parsed, in a single way. The written forms of natural and constructed languages are often self-segregating, but in most languages, it is possible for a spoken phrase to have two or more possible parses. This creates ambiguity. For languages engineered to be unambiguous, it is necessary to define phonological patterns for words in order that no two sequences of words sound identical. All or most loglangs, depending on the definition used, have self-segregating morphologies.

Related terms

The terms monoparsing and audio-visual isomorphism are nearly synonymous with self-segregation, but both denote stronger properties. Monoparsing is synonymous with syntactic unambiguity: every well-formed text has a unique, transparent grammatical structure. Audio-visual isomorphism, or AVI, is an exact one-to-one correspondence of informational content between the spoken and written forms of a language. Orthographic features like italic or bold typefaces disrupt audio-visual isomorphism, since there is no exact spoken analogue for them. Therefore, a language can have a self-segregating morphology but lack AVI.

The problem: Homophones in natural and constructed languages

Often, in natural languages, pairs of homophones exist as the result of sound changes or borrowing. These types of homophones, like dear and deer, are typically absent from engineered languages. However, in most constructed languages, compounding, derivational morphology or phrase formation can produce homophones. An ideal self-segregating morphology will prevent homophones on both the word level and the intra-word level.

In English, word-level homophony occurs in many phrases: attack and a tack are homophonous if pronounced normally, as are euthanasia and youth in Asia. There are in addition a vast number of phrases with subtle phonetic distinctions that can be hard to perceive, such as the sky [ðɪ̈ˈskaɪ̯] and this guy [ðɪs ˈg̊aɪ̯].
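The ambiguity can be made concrete by counting how many ways a sound sequence splits into known words. The following is a minimal sketch, not a model of English phonology: the three transcriptions are simplified and hypothetical (stress and fine vowel quality are ignored), and `count_parses` is an illustrative name.

```python
from functools import lru_cache

def count_parses(utterance, lexicon):
    """Count the ways `utterance` splits into a sequence of lexicon words."""
    words = frozenset(w for w in lexicon if w)  # ignore empty strings

    @lru_cache(maxsize=None)
    def ways(start):
        # Reaching the end of the utterance completes exactly one parse.
        if start == len(utterance):
            return 1
        # Otherwise, try every word that matches at this position.
        return sum(
            ways(start + len(w))
            for w in words
            if utterance.startswith(w, start)
        )

    return ways(0)

# Simplified, hypothetical transcriptions of "a", "tack" and "attack".
LEXICON = {"ə", "tæk", "ətæk"}
print(count_parses("ətæk", LEXICON))  # 2: "a tack" and "attack"
```

A self-segregating morphology is one in which this count is never greater than one for any utterance.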

Morpheme-level homophony occurs frequently in agglutinative languages like Esperanto, in words like fireĝido. This word can be parsed as fi‐reĝido ‘a corrupt prince’ or fireĝ‐ido ‘offspring of a tyrant’.[1]

Engineered languages like Lojban, Latejami and Toaq exemplify the means by which such accidental ambiguities can be prevented. These languages have formulas that describe every possible word. It is, at least in theory, provable that their formulas only generate words that self-segregate.

Self-segregation strategies

Self-segregation strategies (hereafter SS strategies) come in several varieties that can be treated separately, even though their borders are fuzzy and they are often mixed together in the design of a given language. In general, all SS strategies are analogous to types of codes in coding theory. For simplicity, this article will focus on self-segregation at the level of words.

The fixed-length strategy

If every word in a language is the same length, then the language is trivially self-segregating at the word level. (See fixed-length codes.) The same is true for the morpheme level. In some logical languages, all affixes (or at least all nonfinal affixes) are the same length.
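As a minimal sketch of why the fixed-length case is trivial, the function below cuts a stream into equal-sized pieces. The two-letter "words" are made up for illustration and belong to no particular language.

```python
def split_fixed(stream, width):
    """Split `stream` into consecutive words of exactly `width` segments."""
    if width <= 0:
        raise ValueError("width must be positive")
    if len(stream) % width != 0:
        raise ValueError("stream does not contain a whole number of words")
    return [stream[i:i + width] for i in range(0, len(stream), width)]

# Hypothetical language in which every word is one CV syllable.
print(split_fixed("patikumo", 2))  # ['pa', 'ti', 'ku', 'mo']
```

No lexicon is needed: the word boundaries follow from the position alone.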

Variable-length strategies

Strategies in which the length of words or morphemes can vary are more naturalistic, and hence much more common.

AB strategies

These strategies are the most common of all. They are analogous to prefix codes. They work by defining at least two sets of parsing elements, A and B. The elements may be any type of phonological entity, including:

  • phonemes; obstruents vs. sonorants (e.g. Ceqli)
  • syllables; heavy vs light (e.g. Tanbau)
  • tone-bearing sequences (e.g. Toaq)

A self-segregation formula or word-shape formula is defined in terms of A and B. The simplest good formula is A*B, that is, one B element, optionally preceded by any number of A elements. This may also be called the “right-breaking” formula or method. It is common for a good reason: pausing, or cutting off speech, in the middle of a word will never create a new word.
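The right-breaking formula can be sketched in a few lines. The element classes here are an assumption chosen for illustration: B is any of five vowel letters and A is any other letter. The example strings are made up.

```python
import re

VOWELS = set("aeiou")  # assumption: B = vowel letters, A = everything else

def shape(text):
    """Rewrite a string as its sequence of A and B elements."""
    return "".join("B" if ch in VOWELS else "A" for ch in text)

def is_word(text):
    """True if the string fits the right-breaking formula A*B."""
    return re.fullmatch(r"A*B", shape(text)) is not None

def segment_right_breaking(stream):
    """Split a stream into A*B words: every B element closes a word."""
    words, current = [], ""
    for ch in stream:
        current += ch
        if ch in VOWELS:
            words.append(current)
            current = ""
    if current:
        raise ValueError(f"stream ends mid-word: {current!r}")
    return words

print(segment_right_breaking("strakipo"))  # ['stra', 'ki', 'po']

# Cutting a word short never yields another word.
word = "stra"
print(any(is_word(word[:i]) for i in range(1, len(word))))  # False
```

The second check is the prefix-code property: no word is a proper prefix of another word.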

A related strategy uses the “left-breaking” formula AB*: one A element, optionally followed by any number of B elements. This is strictly inferior in one sense, because pausing in the middle of a word can create another word. At the level of syntax, Lojban’s sentence-starter particle, .i, exemplifies the method. In Lojban morphology, too, this method plays a role in self-segregation. Content words (brivla) must have a consonant cluster within the first five segments.[2] The presence of a cluster signifies the approximate left edge of a word.
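For comparison, here is a minimal sketch of the left-breaking formula under the same kind of toy assumption (A is any non-vowel letter, B is any vowel letter; the strings are made up). Each A element opens a new word, and the final lines show the weakness described above.

```python
import re

VOWELS = set("aeiou")  # assumption: B = vowel letters, A = everything else

def shape(text):
    """Rewrite a string as its sequence of A and B elements."""
    return "".join("B" if ch in VOWELS else "A" for ch in text)

def segment_left_breaking(stream):
    """Split a stream into AB* words: every A element opens a word."""
    s = shape(stream)
    if s and s[0] != "A":
        raise ValueError("stream must begin with an A element")
    return [stream[m.start():m.end()] for m in re.finditer(r"AB*", s)]

print(segment_left_breaking("kaopitu"))  # ['kao', 'pi', 'tu']

# Cutting "kao" short leaves fragments that are themselves valid words.
word = "kao"
fragments = [word[:i] for i in range(1, len(word))]
print([f for f in fragments if re.fullmatch(r"AB*", shape(f))])  # ['k', 'ka']
```

A listener therefore cannot tell that a word has ended until the next A element arrives.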

A third common strategy is A+B+: one or more A, followed by one or more B. The minimal word is AB. A word ends after the last B before an A. Ceqli uses this strategy: A is an obstruent consonant such as /p/, /g/ or /z/; B is a sonorant consonant, such as /ŋ/, /l/, /r/ or /w/, or a vowel. Legal words include grin (ABBB), diyan (ABBBB) and starloremi (AABBBBBBBB).[3]
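The A+B+ formula can be sketched as follows. The letter classes are an assumption that covers only the letters in the example words above (vowels plus l, r, w, y, m, n as B; everything else as A); this is not a full account of Ceqli orthography or phonology.

```python
import re

# Assumption: these letters are B elements (vowels and sonorants);
# every other letter is treated as an A element (obstruent).
B_LETTERS = set("aeiou") | set("lrwymn")

def shape(text):
    """Rewrite a string as its sequence of A and B elements."""
    return "".join("B" if ch in B_LETTERS else "A" for ch in text)

def segment(stream):
    """Split a stream into A+B+ words: a word ends at the last B before an A."""
    s = shape(stream)
    # A non-empty stream of whole words must start with A and end with B.
    if s and not (s.startswith("A") and s.endswith("B")):
        raise ValueError("stream is not a sequence of A+B+ words")
    return [stream[m.start():m.end()] for m in re.finditer(r"A+B+", s)]

print(shape("diyan"))                  # ABBBB
print(segment("grindiyanstarloremi"))  # ['grin', 'diyan', 'starloremi']
```

Unlike the right-breaking formula, this one needs one element of lookahead: a word is known to be complete only when the next A element (or the end of the utterance) is heard.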

There are many other word-shape formulas that produce self-segregating words. A variety of formulas are possible with just A and B elements. Sometimes it is simpler to list all patterns, or all up to a certain length, rather than to write the full regex-style formula. Some examples are as follows:



A language has a suboptimal morphology if it lacks one or more patterns that fit the formula. For example, a language might have pattern (1) above, but lack words of the shape AA. This could occur due to the language's phonotactics, or just due to careless design.

Some languages have more than two sets of elements as well. Latejami’s formula uses three sets. Latejami is largely a CV language, but, for the purposes of word-level self-segregation, only consonants matter. Certain consonants are A elements, an apparently arbitrary set: {b c d f j k q r t x z}[4]. Others are B elements: {g m n p s v}. The basic pattern is A+B+, ignoring the vowels. The third element set has only a single phoneme, /l/, so it can be called L. L allows variations on the formula that are useful in the broader context of Latejami morphology.
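The three consonant sets can be turned into a small classifier. This is only a sketch of the basic pattern: the sets are the ones listed above, but the test strings are made up (they are not claimed to be Latejami words), and the variations that L permits are not modeled, since they are not specified here.

```python
import re

A_SET = set("bcdfjkqrtxz")
B_SET = set("gmnpsv")
L_SET = {"l"}

def skeleton(word):
    """Reduce a word to its A/B/L consonant skeleton, skipping other letters."""
    out = []
    for ch in word:
        if ch in A_SET:
            out.append("A")
        elif ch in B_SET:
            out.append("B")
        elif ch in L_SET:
            out.append("L")
        # Vowels (and any unlisted letters) are ignored by assumption.
    return "".join(out)

def fits_basic_pattern(word):
    """True if the consonant skeleton matches the basic pattern A+B+."""
    return re.fullmatch(r"A+B+", skeleton(word)) is not None

# Hypothetical strings, chosen only to exercise the classifier.
for candidate in ("tema", "kodasi", "meka"):
    print(candidate, skeleton(candidate), fits_basic_pattern(candidate))
# tema AB True
# kodasi AAB True
# meka BA False
```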


‘Word-projection’ strategies

These strategies use an element somewhere in a word that “projects” its length some distance ahead. Practically, this must be done in terms of syllables. A word-initial element may signify, for instance, that the word ends after the third next syllable. Or, it might “project” the length of a word out to the next stressed syllable, or the next stressed syllable plus one (i.e. the next posttonic syllable).
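A count-based projection can be sketched like this. Everything in the example is hypothetical: the marker syllables `ta`, `te`, `ti` and the numbers they project are invented for illustration, and the input is assumed to be already divided into syllables.

```python
# Hypothetical word-initial syllables and the number of further
# syllables each one projects before the word ends.
PROJECTION = {"ta": 1, "te": 2, "ti": 3}

def segment_by_projection(syllables):
    """Group syllables into words using the count projected by each initial."""
    words, i = [], 0
    while i < len(syllables):
        head = syllables[i]
        if head not in PROJECTION:
            raise ValueError(f"{head!r} cannot begin a word")
        end = i + 1 + PROJECTION[head]
        if end > len(syllables):
            raise ValueError("utterance ends before the projected word end")
        words.append(syllables[i:end])
        i = end
    return words

utterance = ["te", "ko", "mu", "ta", "ta", "ti", "la", "ne", "so"]
print(segment_by_projection(utterance))
# [['te', 'ko', 'mu'], ['ta', 'ta'], ['ti', 'la', 'ne', 'so']]
```

Note that the second `ta` is read as ordinary word content, not as a marker, because it falls inside the span projected by the first one.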

There are all manner of possibilities, often impractical, including backwards projection and interactions with other SS strategies. This is an area that has not been much explored. Ithkuil IV is so far the only major language to utilize word-projection. It uses so-called parsing adjuncts to delimit words by projection in special circumstances.

In unusual situations (e.g., singing a song) when pitch-accent is unavailable or undesirable as a means of parsing word boundaries and the placement of pauses between words is unrealistic, then a special parsing adjunct of the form ’V’ may be placed before any word to be parsed, where ’V’ represents a single vowel between two glottal stops, the particular vowel indicating the syllabic stress of the following word, as follows:

  • ’a’ indicates the following word is monosyllabic
  • ’e’ indicates the following word bears ultimate stress
  • ’o’ indicates the following word bears penultimate stress
  • ’u’ indicates the following word bears antepenultimate stress[5]
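One possible reading of this scheme, as a sketch only: if a listener can hear which syllable is stressed, the adjunct vowel tells them how many syllables follow the stressed one, and so where the word ends. The function below encodes that interpretation. It is an assumption about how the adjunct would be used in parsing, not a description taken from the Ithkuil documentation, and the input is an abstract list of stress flags, not real Ithkuil words.

```python
# Assumed number of syllables that follow the stressed syllable.
TRAILING = {"e": 0, "o": 1, "u": 2}  # ultimate, penultimate, antepenultimate

def word_length(adjunct_vowel, stressed_flags):
    """Return how many of the upcoming syllables belong to the next word.

    `stressed_flags` holds one boolean per upcoming syllable, True where
    the syllable is heard as stressed.
    """
    if adjunct_vowel == "a":
        return 1  # monosyllabic word
    if adjunct_vowel not in TRAILING:
        raise ValueError(f"unknown adjunct vowel: {adjunct_vowel!r}")
    stressed_at = stressed_flags.index(True)  # ValueError if no stress heard
    length = stressed_at + 1 + TRAILING[adjunct_vowel]
    if length > len(stressed_flags):
        raise ValueError("utterance ends before the projected word end")
    return length

upcoming = [False, True, False, False, True]
print(word_length("e", upcoming))  # 2
print(word_length("o", upcoming))  # 3
print(word_length("u", upcoming))  # 4
```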

‘Out-of-band’ or ‘comma’ strategies

This rather clumsy strategy uses a special element that is found nowhere else in the language to bracket a word, like a comma code. Lojban uses this strategy in many places: the “pauses” that surround names are the foremost example, but the la'o and zoi particles used for bracketing foreign material do something similar. These words introduce a single-use “comma”, a word that the speaker knows ahead of time will not appear in the foreign text, and use it as the end marker of the text.
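The single-use “comma” can be sketched as a pair of functions over word lists. This is a simplification: the obligatory pauses are left out, the delimiter candidates are made-up placeholders, not vetted Lojban words, and `quote` and `unquote` are illustrative names.

```python
def quote(foreign_words, candidates=("zed", "kuot", "fin")):
    """Bracket foreign words with a delimiter that does not occur in them."""
    for delim in candidates:
        if delim not in foreign_words:
            return ["zoi", delim, *foreign_words, delim]
    raise ValueError("every candidate delimiter occurs in the foreign text")

def unquote(tokens):
    """Split tokens into (foreign words, remaining tokens after the quote)."""
    if len(tokens) < 3 or tokens[0] != "zoi":
        raise ValueError("expected: zoi <delimiter> ... <delimiter>")
    delim = tokens[1]
    end = tokens.index(delim, 2)  # ValueError if the quote is never closed
    return tokens[2:end], tokens[end + 1:]

quoted = quote(["zed", "is", "dead"])
print(quoted)                      # ['zoi', 'kuot', 'zed', 'is', 'dead', 'kuot']
print(unquote(quoted + ["rest"]))  # (['zed', 'is', 'dead'], ['rest'])
```

The first candidate is skipped because it appears in the quoted material, which is exactly the speaker's obligation under this strategy.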

Lexical exclusivity

We can use the term lexical exclusivity to describe a situation where self-segregation is ensured in a non-formulaic way, by picking out desired forms by hand and eliminating forms that they conflict with.

Sometimes naturalism or a posteriori faithfulness matters more than formal simplicity in a language. A language designer may use the formula A*B, but want to have a word of the shape ABAB. Let us say A is any consonant, and B is any vowel. The designer wants the word /nagi/ to exist in the language, even though it has the shape ABAB. As long as they ensure that no word /na/ exists, /nagi/ can exist without breaking the SS formula. /na/ and /nagi/ are mutually exclusive, and the designer has chosen to exclude one from the lexicon for the sake of the other. Through lexical exclusivity, the designer has engaged in special pleading.
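A hand-picked lexicon can be checked mechanically for such conflicts. The sketch below uses the Sardinas–Patterson test for unique decodability, a standard technique from coding theory that is brought in here for illustration. It treats each word as a plain string of letters standing in for phonemes, and it only handles a finite word list.

```python
def dangling(prefixes, strings):
    """Non-empty remainders w such that p + w == s for some p and s."""
    return {
        s[len(p):]
        for p in prefixes
        for s in strings
        if s != p and s.startswith(p)
    }

def uniquely_segmentable(lexicon):
    """Sardinas-Patterson test: True if no word sequence has two parses."""
    words = set(lexicon)
    current = dangling(words, words)
    seen = set()
    while current:
        # A dangling remainder that is itself a word means some
        # utterance can be segmented in two different ways.
        if current & words:
            return False
        key = frozenset(current)
        if key in seen:
            return True  # the remainders cycle without hitting a word
        seen.add(key)
        current = dangling(words, current) | dangling(current, words)
    return True

print(uniquely_segmentable({"na", "gi", "nagi"}))  # False
print(uniquely_segmentable({"gi", "nagi"}))        # True
```

Dropping /na/ from the lexicon is what turns the first result into the second.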

It is always possible to expand an SS formula to cover all cases of lexical exclusivity, but this can make the formula too complicated to be illuminating. It seems more intuitive to treat these as exceptions to the rule.


  1. Rye, Justin B. 2021. “Learn Not to Speak Esperanto,” section 07. Accessed 19 June 2021.
  2. Cowan, John W. 2016. The Complete Lojban Language, section 4.3. Fairfax, VA: The Logical Language Group. Accessed 19 June 2021.
  3. May, R. 2017. “The Alphabet and Sounds.” In Ceqli. Accessed 19 June 2021.
  4. Morneau, R. 2007. The Lexical Semantics of a Machine Translation Interlingua, section 2.5.1. On Rick Morneau’s homepage. Accessed 19 June 2021.
  5. Quijada, J. 2021. “Design for the new revision of Ithkuil,” section 2.3. Accessed 19 June 2021.