semantha goes east – AI speaks many languages
semantha is a system that helps with the analysis of documents. It can be used, for example, to speed up the manual process of checking and validating documents, since semantha analyzes them at the level of their meaning. Another use is automating the clustering and classification of documents based on their content, as opposed to a simple character- or word-based analysis.
Many of the examples on our website, and in our blog, illustrating semantha’s capabilities, are in English. Recently, we have seen an increasing number of requests for a variety of other languages. We have mentioned semantha’s multilingual capabilities elsewhere. In this blog post, we would like to provide more background and examples. So, stay tuned if you are
- considering expanding your business to new geographic areas, such as looking for business partners and customers abroad;
- creating or acquiring new production sites, or otherwise re-evaluating your supply chain;
- looking to increase the understanding of your own supply chain;
to name just a few among the many business processes where multilingual document analysis can help you.
Checking documents in a new language
Let’s assume your business is assessing whether to relocate some of its supply chain to Eastern Europe. For this purpose, a lot of documents, structured and unstructured, usually need to be gathered; some are roughly categorized, others are scrutinized in detail. Since semantha has traditionally been good at unstructured documents (also known as “running text documents”), she can help with documents such as product descriptions, audit reports, contracts, insurance policies, and the like.
One way to use semantha is to provide her with a small set of previously seen examples (in fact, semantha will start working with a single example) – for example, a paragraph containing a no-go condition in a contract – and ask her to find a similar paragraph or text passage in a new document. Since you have been working with semantha, you already have some examples of insurance conditions you want to look out for, and those have been provided in English or German. However, Eastern European languages are quite different from English, to the extent that a native speaker of English might not be able to guess the meaning of a text just by looking at it.
To illustrate, let’s take a look at the Polish text, below. It is an excerpt from the insurance policy your business is assessing. Unless you have a good command of Polish or a similar Eastern European language, you would need to have the entire text translated before you could scan it for relevant clauses.
With semantha, using your English examples, you can combine these steps into one, so you quickly get an idea of whether and where the relevant topics are in the document at hand.
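To make the idea of matching by meaning across languages more concrete, here is a minimal sketch. It assumes that each text passage has already been turned into a numeric "meaning vector" by some multilingual model, and it compares vectors with cosine similarity. This is an illustration only: this post does not describe how semantha computes similarity internally, and the vectors, names (`EXAMPLES`, `best_match`), and values below are made up.

```python
import math

# Hypothetical 3-dimensional "meaning vectors". The numbers are invented for
# illustration; a real system would obtain vectors from a multilingual model.
EXAMPLES = [
    ("STORM", "en", [0.9, 0.1, 0.1]),
    ("EARTHQUAKE", "de", [0.1, 0.9, 0.1]),
    ("SNOW", "en", [0.1, 0.1, 0.9]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(paragraph_vector, examples=EXAMPLES):
    """Return (topic, example_language, score) for the closest example."""
    scored = [
        (topic, lang, cosine_similarity(paragraph_vector, vec))
        for topic, lang, vec in examples
    ]
    return max(scored, key=lambda item: item[2])


# Invented vector standing in for a Polish paragraph about snow load.
polish_paragraph = [0.2, 0.1, 0.8]
topic, language, score = best_match(polish_paragraph)
print(topic, language, round(score, 2))
```

The point of the sketch is that the comparison never looks at the words themselves, only at the vectors, which is why an English example can match a Polish paragraph.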
Let’s assume you always check your insurance contracts for the risks of STORM, EARTHQUAKE, and SNOW. You have shown semantha a few English examples that find these topics in English insurance policies, as well as German examples from German insurance policies. For example, in one of your previous English documents, the following blurb has been identified as relevant to the risk of SNOW:
Weight of snow or ice on roofs:
an accumulation of precipitation that causes collapses for 168 consecutive hours.
Now, you consider expanding to Poland, and you are in the very first phase of screening original Polish insurance policies for your paramount risks. Since you don’t speak Polish, you ask semantha to color-code relevant paragraphs in the Polish documents. You have defined STORM-related passages to appear in purple, EARTHQUAKE paragraphs to appear in orange, and SNOW paragraphs to be highlighted in blue.
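The color scheme described above is essentially a small lookup table from topic to highlight color. As a rough sketch (the names `HIGHLIGHT_COLORS` and `color_code` are our own for this illustration and are not part of semantha), it could be written like this:

```python
# Topic-to-color mapping taken from the scenario in this post.
HIGHLIGHT_COLORS = {
    "STORM": "purple",
    "EARTHQUAKE": "orange",
    "SNOW": "blue",
}


def color_code(paragraph_topics):
    """Map paragraph index -> list of highlight colors.

    `paragraph_topics` maps a paragraph index to the topics found in that
    paragraph. Paragraphs without a known topic are left out of the result.
    """
    result = {}
    for index, topics in paragraph_topics.items():
        colors = [HIGHLIGHT_COLORS[t] for t in topics if t in HIGHLIGHT_COLORS]
        if colors:
            result[index] = colors
    return result


# Hypothetical findings: paragraph 0 has no hits, paragraph 2 has two.
found = {0: [], 1: ["SNOW"], 2: ["STORM", "EARTHQUAKE"]}
print(color_code(found))
```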
You upload the Polish insurance contract to semantha. You can use semantha’s powerful user interface to guide your analysis, or you can quickly download an annotated version of the original document. We use the latter variant to illustrate the results, and focus on the main ingredients.
Screenshot close-up: The annotated document can be opened and viewed in a PDF reader. On the left is the new insurance document, with highlighting on the paragraphs in which semantha found relevant hotspots based on your English and German examples.
At a glance, we see that the Polish insurance policy contains mentions of all three risks we are interested in, which have been retrieved based on cross-lingual similarity to our English and German examples. To illustrate, we find purple, orange, and blue highlighting on page 9 of the multi-page document. Thanks to the comments that semantha attached to these paragraphs, we can jump there directly, and even get an understanding of why these passages were found.
On the right, we can see our matching English and German examples with details, such as topic, language (of the example), match score (to indicate how similar the match is), and even the English or German text on which the match is based.
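If you wanted to process such match details in your own tooling, each match could be represented as a small record with the four fields mentioned above. The sketch below is an assumption on our part, not semantha's actual data format: the class `Match`, the helper names, the 0-to-1 score range, and the score values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    topic: str             # e.g. "SNOW"
    example_language: str  # language of the example, e.g. "en" or "de"
    score: float           # assumed to range from 0 (unrelated) to 1 (identical)
    example_text: str      # the example text on which the match is based


def relevant_matches(matches, min_score=0.5):
    """Keep matches at or above `min_score`, strongest first."""
    kept = [m for m in matches if m.score >= min_score]
    return sorted(kept, key=lambda m: m.score, reverse=True)


def as_comment(match):
    """Render a match as a one-line annotation."""
    return (
        f"{match.topic} [{match.example_language}] "
        f"score={match.score:.2f}: {match.example_text}"
    )


# Invented scores for illustration only.
matches = [
    Match("SNOW", "en", 0.87, "Weight of snow or ice on roofs"),
    Match("STORM", "de", 0.42, "(German example text)"),
    Match("EARTHQUAKE", "en", 0.95, "(English example text)"),
]
for m in relevant_matches(matches):
    print(as_comment(m))
```

Sorting by score puts the most convincing matches first, which is a convenient order for a reviewer who only has time for the top few.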
While the EARTHQUAKE example is quite close to a direct translation, the SNOW example shows that semantha also finds text passages that have a similar meaning across languages. This builds on what semantha was designed to do in the first place, which is to find similar paragraphs in different documents in the same language. To illustrate, the direct translation of the Polish text highlighted as SNOW (blue) into English would be:
72 consecutive hours in relation to the weight of snow or ice on the roof and/or accumulation of precipitation causing the roof to collapse
Now that you know that these hotspots have been found, you can either categorize the Polish document accordingly, or go into a detailed human analysis of exactly those paragraphs or pages that are relevant.
The multilingual version of semantha can analyze documents for hotspots based on examples in other languages. We have shown this using one of the features useful for checking and validating documents. Similarly, and based on the same semantic understanding capabilities, semantha can classify, compare, and search documents across various languages.
When it comes to Eastern European languages, semantha’s linguistic capabilities include Polish, as seen in the example above, but also Ukrainian, for example, as well as many other languages using the Latin, Cyrillic or Greek alphabets. Also, we support many more languages spoken around the globe. Get in touch with us to find out more about using semantha for your next international endeavor.
We thank Grzegorz Wereda for sharing his experience in this field with us. Thanks for that, Grzegorz!
Photos: AdobeStock.com / tomeyk; AdobeStock.com / underwaterstas; AdobeStock.com / zgphotography