For the project, we created a list of lemmas (basic forms) ordered by frequency of use. The lemmas, along with information on their frequency, were extracted from both versions of the corpus (one annotated using the Toygger tagger, the other using the Concraft tagger). Foreign elements, punctuation and numbers were not included. Frequency was calculated only for lemmas (basic forms); therefore, for example, the conjunction żeby and the particle żeby were counted as one. This decision was prompted by uncertainty regarding the inflectional category of many of the words contained in the corpus, in particular those that are indeclinable. Finally, all of the lemmas were combined into a single list, which contains, in separate columns, information on frequency of use in both versions of the corpus. The list contains 286 980 lemmas. However, many of these were misinterpretations of tokens unknown to the Korbeusz morphological analyzer and had to be excluded from the final list. Such misinterpretations result from errors that inevitably appear at every stage of corpus building, such as typos in transliteration. The way in which we removed the misinterpretations from the list is described in the next chapter.
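The counting and merging described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: the tag names `foreign`, `punct` and `num`, the function names, and the toy token lists are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical tag labels; the tagset actually used in the project is not given here.
EXCLUDED_TAGS = {"foreign", "punct", "num"}


def lemma_counts(tokens):
    """Count lemmas only, ignoring the grammatical class assigned to each token."""
    return Counter(lemma for lemma, tag in tokens if tag not in EXCLUDED_TAGS)


def merge_counts(toygger, concraft):
    """Build rows (lemma, Toygger frequency, Concraft frequency).

    Rows are sorted by Toygger frequency (descending), then alphabetically.
    A lemma missing from one version gets a frequency of 0 there.
    """
    lemmas = set(toygger) | set(concraft)
    rows = [(lemma, toygger[lemma], concraft[lemma]) for lemma in lemmas]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy data: żeby as conjunction and as particle collapses into one lemma.
toygger_tokens = [("żeby", "conj"), ("żeby", "part"), ("wina", "subst"), (",", "punct")]
concraft_tokens = [("żeby", "conj"), ("żeby", "part"), ("winąć", "verb"), (",", "punct")]

table = merge_counts(lemma_counts(toygger_tokens), lemma_counts(concraft_tokens))
for row in table:
    print(row)
```

Because the grammatical class is discarded before counting, both readings of żeby contribute to a single row, while a token lemmatized differently by the two taggers produces two rows with a zero in one column.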
Included below are the 200 most common lemmas in the corpus, along with their frequency of use in both versions of the corpus (tagged by the two above-mentioned taggers). The frequencies may differ slightly between the two versions, since some tokens were interpreted differently by each tagger and thus linked to different basic forms. For example, in the same passage the token winę was interpreted by Toygger as a form of the noun wina ‘fault’, but by Concraft as a form of the verb winąć ‘to plait’.
The lemmas are sorted by their frequency of use in the Toygger version.
In keeping with the nature of our project, the list of basic forms from the corpus should adhere to the rules for entries in the Electronic Dictionary of 17th-18th-Century Polish. Therefore, wherever possible, we have omitted the above-mentioned incorrect interpretations, as well as lemmas that would not constitute independent lexemes. The lemmas are, needless to say, transcribed, not transliterated (for more on transliteration and transcription, see “Instruction”).
As mentioned above, we have consistently omitted words tagged as foreign (in foreign languages), punctuation and numerals (including Roman numerals).
Among the 286 980 lemmas, 6198 contained symbols not found in the Polish alphabet: punctuation marks and symbols (e.g. the lemmas to_jest, arcy-biskup, w-tobie, k'myśli, po-, bę-, otwierał', ś^o^, G**, \), numerals (e.g. 6-funtowy, ½, niesie1318), as well as letters from foreign alphabets (e.g. jεy, εkstractu, až). These are often expansions of abbreviations (e.g. to_jest stands for the frequently used abbreviation tj. ‘i.e.’) or the abbreviations themselves, as well as adjectives containing numerals (e.g. 6-funtowy ‘6-pound’). Other than that, many of these lemmas result from incorrect segmentation or transcription or, less commonly, from other errors typical of the various stages of large-scale research. Of these 6198 lemmas, only three were kept on the list: the particles +ż and +że and the adverb +kroć, as they appear in the database of the Korbeusz morphological analyzer. After the deletion of the remaining 6195 basic forms containing symbols not found in the Polish alphabet, 280 785 lemmas were left on the list.
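A filter of this kind might look like the sketch below. It assumes that "the Polish alphabet" means the standard 32 letters (so q, v and x are rejected) and that lemmas are compared after Unicode NFC normalization; neither detail is stated in the project description, and the names `POLISH_LETTERS`, `KEPT_EXCEPTIONS` and `is_clean` are hypothetical.

```python
import unicodedata

# Assumption: the standard 32-letter Polish alphabet, without q, v and x.
POLISH_LETTERS = set("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")

# The three lemmas kept despite containing "+", as listed in the text.
KEPT_EXCEPTIONS = {"+ż", "+że", "+kroć"}


def is_clean(lemma):
    """Return True if the lemma is a kept exception or uses only Polish letters."""
    if lemma in KEPT_EXCEPTIONS:
        return True
    # Normalize so that letters with diacritics are single code points.
    normalized = unicodedata.normalize("NFC", lemma).lower()
    return bool(normalized) and all(ch in POLISH_LETTERS for ch in normalized)


samples = ["to_jest", "arcy-biskup", "6-funtowy", "jεy", "až", "+ż", "+kroć", "Bóg", "zwada"]
kept = [lemma for lemma in samples if is_clean(lemma)]
print(kept)
```

Lowercasing before the check lets capitalized lemmas such as Bóg pass, which matches the decision to keep lemmas starting with a capital letter on the list.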
Next, we removed from the list the 214 600 lemmas assigned to tokens that could not be identified by the Korbeusz morphological analyzer. Most of these were the result of various errors: if a token was not identified by Korbeusz, the taggers had to decide on an interpretation based on patterns taken from hand-tagged material. In such cases, the unmodified form of the token is treated as the lemma. For these reasons, we decided to remove all such lemmas from the list; searching them for possible lexemes would be an additional, time-consuming task and, as such, was not included in the project. Most of the lemmas in question rarely appear in the corpus. Only 3 of them can be found among the first thousand entries on the frequency list, only 5 in the next thousand entries, 4 in the third and 8 in the fourth thousand. It is only much lower on the list that they become more common. More than half of these lemmas have a frequency of one. Therefore, although such lemmas are numerous on the list, the corpus itself contains comparatively few tokens lemmatized in this way.
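The distribution of unidentified lemmas across successive thousands of the frequency list could be tallied along the following lines. This is only an illustration: the input format (a frequency-ordered list of lemmas paired with a flag saying whether the analyzer knew the token) and the function name are assumptions, and the data shown is invented toy data.

```python
from collections import Counter


def unknown_per_block(ranked, block_size=1000):
    """Count lemmas unknown to the analyzer in each block of the ranked list.

    ranked: list of (lemma, known_to_analyzer) pairs in frequency order.
    Returns a Counter mapping block index (0 = first block) to a count.
    """
    counts = Counter()
    for rank, (_lemma, known) in enumerate(ranked):
        if not known:
            counts[rank // block_size] += 1
    return counts


# Toy ranked list with a block size of 2 instead of 1000.
ranked = [("a", True), ("b", False), ("c", True), ("d", True), ("e", False), ("f", False)]
per_block = unknown_per_block(ranked, block_size=2)
print(per_block[0], per_block[1], per_block[2])
```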
After the rejection of the aforementioned lemmas, the final list contains 66 185 entries. It may seem that a great many of the words in the corpus were misidentified, given that more than 220 thousand lemmas were removed from the list. However, misidentified tokens make up merely 4% of all segments in the corpus.
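The figures quoted in this section can be cross-checked with a few lines of arithmetic. The variable names are illustrative; all numbers come from the text above.

```python
initial = 286_980              # lemmas on the combined list
with_foreign_symbols = 6_198   # lemmas containing non-Polish symbols
kept_exceptions = 3            # +ż, +że, +kroć

removed_symbols = with_foreign_symbols - kept_exceptions
after_symbols = initial - removed_symbols

removed_unknown = 214_600      # lemmas of tokens unknown to the analyzer
final = after_symbols - removed_unknown
total_removed = removed_symbols + removed_unknown

print(removed_symbols, after_symbols, final, total_removed)
```

The two removal steps together account for 220 795 lemmas, which is the "more than 220 thousand" mentioned above.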
We have decided to leave lemmas that start with a capital letter on the list, as they constitute a significant part of the corpus, especially those near the top of the list. The list contains 14 595 lemmas starting with at least one capital letter (in total there were many more, but a great number of them were removed as unknown to Korbeusz). It comes as no surprise that the word Bóg ‘God’ is the most common in this category. Also common are proper names and nationalities (e.g. Chrystus ‘Christ’, Turek ‘Turk’, Polak ‘Pole’, Wojciech (Polish first name), Marcin (Polish first name), Mahomet ‘Muhammad’, Rzeczpospolita ‘Republic’, Lwów ‘Lviv’, Potocki (Polish surname), Jowisz ‘Jupiter’, Pegaz ‘Pegasus’), as well as segments that have been interpreted as modern-day acronyms (e.g. BC, SA, CD). Although some common nouns have, for various reasons, been erroneously interpreted as surnames (so that the list contains two lemmas instead of one, e.g. both Zwada and zwada instead of zwada alone), their frequency is generally low.
The final list contains 66 185 lemmas; 200 of these are included below.
Note: when searching the corpus for a given lemma, it is important to select the “reject foreign segments” option. Otherwise, the result will often be higher than the figure recorded on the frequency list.
|Lemma||Number of occurrences - Toygger||Number of occurrences - Concraft|