Do you really mean that?
Do you know what Christmas, footballers and mines have in common? No? Then think about it as you read this post. The solution to the riddle brings us to one of the phenomena that linguistics examines under the heading “semantics”.
Semantics and the challenge of digitizing unstructured information
However, semantics is not only about Christmas, footballers and mines. It is primarily about meaning in general, and especially about the meaning of words and sentences.
Many scientists have tried to find a general principle of meaning. How do you describe meaning, and what is “meaning” in the first place? There have been disputes over this, including the “linguistics wars” of the 1960s and 1970s.
It is the variety of meanings and the different approaches to describing meaning that make semantics so exciting. Anyone who wants to understand (or better: to analyze) a text with the help of computer science must be aware of this diversity. It shows the possibilities as well as the limits of digitization of meaning as it can be found in documents and texts.
While we can map characters and letters directly in digitized form as ASCII or Unicode characters and, for example, count them automatically, the meaning of words and sentences cannot be mapped directly. Texts are also referred to as unstructured documents or unstructured information. But that does not mean that our language and meaning in general are completely unstructured. What makes automated text analysis so difficult is rather that a word, sentence or text is based on several different structural principles, in some cases all at once.
Again and again, attempts are made to disassemble the meaning of words. Why? Because if we knew the smallest elements of meaning, or semantic features, that our language contains, we could assemble the meaning of individual words from these elements. That would be a big help, for example, in determining relationships between words using mathematical principles.
Let us assume, for example, that the features +/- male, +/- female, +/- adult, +/- civil, +/- monarch exist as the smallest elements of meaning. We could describe the meaning of woman and man, queen and king as follows:
|Element of meaning / Meaning||woman||man||queen||king|
The relationship between queen and king can then be described as a transformation, namely QUEEN – female + male = KING.
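The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not an established lexicon: the feature assignments for each word are our own assumptions based on the features named above, and the helper `transform` is hypothetical.

```python
# Toy feature-based semantics: each word is the set of features marked "+".
# The feature assignments are illustrative assumptions, not a standard lexicon.
LEXICON = {
    "woman": frozenset({"female", "adult", "civil"}),
    "man": frozenset({"male", "adult", "civil"}),
    "queen": frozenset({"female", "adult", "monarch"}),
    "king": frozenset({"male", "adult", "monarch"}),
}


def transform(word, remove, add):
    """Return the word whose features equal word - remove + add, or None."""
    target = (LEXICON[word] - {remove}) | {add}
    for candidate, features in LEXICON.items():
        if features == target:
            return candidate
    return None


print(transform("queen", remove="female", add="male"))   # king
print(transform("king", remove="monarch", add="civil"))  # man
```

The same lookup fails as soon as a word has no clean feature set, which is exactly the limit discussed in the next section.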
Similar regularities also exist in the names of animals: mare – stallion – foal; cow – bull – calf; sheep – ram – lamb.
The limits of disassembling the meaning of words
Some meanings, like king and queen, decompose well into features. With other words, however, we quickly come up against the limits of feature-based analysis. Is it a feature of birds that they can fly? If so, then why is a penguin a bird? What exactly distinguishes a dish from a bowl or a cup from a mug? What are the features of the words to buy, democracy, or happiness?
To deal with problems such as the penguin and the cup, semantics uses prototypes. A prototype is a mental representation of an abstract category such as bird or mug and does not necessarily exist in reality. We then compare the specimens that we encounter with this “ideal” bird or mug. More complex situations such as to purchase, which involve multiple participants, things exchanged, intentions, etc., are handled in a similar way.
For the technical processing of semantics, semantha uses language models. These models can mathematically represent the relationships between queen and king, woman and man surprisingly well (see figure below). However, they have difficulties representing more complex relationships (1:M relations are difficult; see “Translating Embeddings for Modeling Multi-relational Data”) or contexts, such as a purchase event.
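How a language model can capture the queen and king relationship mathematically can be sketched with vector arithmetic. The three-dimensional vectors below are hand-made for illustration only; they are not taken from semantha or from any real language model, which would learn vectors with hundreds of dimensions. The function names are our own.

```python
import math

# Hand-made toy vectors (an assumption for illustration only).
VECTORS = {
    "woman": [0.9, 0.1, 0.1],
    "man": [0.1, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "king": [0.1, 0.9, 0.9],
    "princess": [0.9, 0.1, 0.7],
    "prince": [0.1, 0.9, 0.7],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Compute a - b + c and return the closest word other than a, b or c."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("queen", "woman", "man"))  # king
```

Note that each word has exactly one vector here, which works for a one-to-one relation like queen and king but leaves no room for a word that relates to many others in different ways.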
There are some semantic relationships between words that we can observe over and over again. An important relationship is grouping or categorization, e.g. “a hammer is a tool”, “a shirt is an item of clothing”. First of all, umbrella terms such as tool and item of clothing allow us to make sense of things in the world, in this case according to their purpose. When we know that a hammer and a screwdriver are tools, we can better classify the next tool. By using the umbrella term, we can also refer to one or more items, e.g. “Would you hand me the tool, please?” can refer to a hammer or to a saw.
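The categorization relationship can be written down as a simple lookup. The following is a minimal sketch with a hypothetical mini-taxonomy built from the examples above; a real lexical resource would be far larger and allow several levels of umbrella terms.

```python
# Hypothetical mini-taxonomy: each word points to its umbrella term.
UMBRELLA_TERM = {
    "hammer": "tool",
    "screwdriver": "tool",
    "saw": "tool",
    "shirt": "item of clothing",
    "t-shirt": "item of clothing",
}


def share_umbrella_term(a, b):
    """True if both words are known and fall under the same umbrella term."""
    term_a = UMBRELLA_TERM.get(a)
    return term_a is not None and term_a == UMBRELLA_TERM.get(b)


def members(term):
    """All words that the umbrella term can refer to, in alphabetical order."""
    return sorted(w for w, t in UMBRELLA_TERM.items() if t == term)


print(share_umbrella_term("shirt", "t-shirt"))  # True
print(share_umbrella_term("hammer", "shirt"))   # False
print(members("tool"))  # ['hammer', 'saw', 'screwdriver']
```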
semantha determines semantic similarities between sentences or paragraphs. To compare the similarities between words and umbrella terms, we used semantha to compare sentences in which only one word is exchanged, in this case: “The father put the X back on the shelf.” Instead of X we used the words magazine and book; shirt, t-shirt and clothes; as well as tool, hammer and screwdriver. The table in the figure below shows that semantha considers shirt and clothes to be similar to one another, as well as shirt and t-shirt (both sub-terms of the same umbrella term clothes), or at least more similar than hammer and shirt, which have different umbrella terms.
A special semantic relationship exists between (almost) identical words such as laces and shoelaces. We call this synonymy. While absolute equality of meaning is rare (if only because of nuances or associations), there are many words that can be used interchangeably depending on the situation. For example, how many words do you know for wind turbine? Wind turbine, wind power plant, maybe more?
With the help of synonyms and other linguistic means, it is possible to express the same content in completely different ways.
The teacher is happy that school is beginning again.
The teacher is pleased that school is starting again.
One word, many meanings
You must have been thinking all the time about what Christmas, footballers and mines have in common. The solution is: Stollen. The German word Stollen has several meanings depending on the context in which it is used: a Christmas fruit bread, the studs on football boots, and a tunnel in a mine. In semantics this is called polysemy.
Polysemous words can be a problem for the technical processing of semantics. Typically, language models hold only a single mathematical representation per word: every word is represented by one vector, not by one vector for each of its possible meanings.
To illustrate what this means for the semantic similarity of texts, we have given semantha® three text sections each on the topics of football, Christmas and mining, in which the word Stollen occurs. As the table below shows, from the perspective of the language model, the texts do not form pure groups. Instead, there are similarities, for example, between the mine and football texts, but also between Christmas texts and mine texts.
In a counter-experiment, we replaced all occurrences of “Stollen” in the nine texts with similar words (for example, in the mine texts, we used “Gang”, German for passage, instead of “Stollen”). Now semantha no longer recognizes any semantic similarities between the various topics of mining, football and Christmas (see table below).
Polysemy is not uncommon. Nobody can avoid it when writing texts. Current research is therefore concerned with solving the “polysemy problem” in the technical processing of semantics.
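The effect of the experiment and the counter-experiment can be imitated with a very simple stand-in model. The sketch below is not semantha's language model: it uses plain bag-of-words vectors, where each word form gets exactly one dimension regardless of its meaning. The three example sentences, the stopword list and the replacement words are made up for this illustration and are not the texts we used.

```python
import math
from collections import Counter

# All data below is invented for illustration; it is not the original test set.
STOPWORDS = {"a", "the", "into", "his", "with", "through"}
TEXTS = {
    "football": "the player screwed a new Stollen into his boot",
    "christmas": "grandma baked a Stollen with raisins",
    "mining": "the miners dug a Stollen through the rock",
}
REPLACEMENTS = {"football": "stud", "christmas": "cake", "mining": "tunnel"}


def bag_of_words(text):
    """Count content words; one dimension per word form, whatever it means."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter yields 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cross_topic_similarity(texts):
    """Similarity for every pair of different topics."""
    bags = {topic: bag_of_words(text) for topic, text in texts.items()}
    topics = sorted(bags)
    return {
        (s, t): cosine(bags[s], bags[t])
        for i, s in enumerate(topics)
        for t in topics[i + 1:]
    }


with_stollen = cross_topic_similarity(TEXTS)
without_stollen = cross_topic_similarity(
    {topic: text.replace("Stollen", REPLACEMENTS[topic]) for topic, text in TEXTS.items()}
)
print(with_stollen)     # every pair shares "stollen", so every score is above 0
print(without_stollen)  # no shared content words remain, so every score is 0.0
```

In this toy setup the shared word form alone creates similarity between unrelated topics, and the similarity disappears once each occurrence is replaced by a topic-specific word.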
Semantics in the work environment
We would like to know whether you face semantic challenges in your work environment and, if so, what kind. Please send us an email at email@example.com.