Miloš Jakubíček is an NLP researcher and software engineer. His research focuses mainly on two fields: effective processing of very large text corpora and parsing of morphologically rich languages. Since 2008 he has been involved in the development of the Sketch Engine corpus management suite on behalf of Lexical Computing, a small research company working at the intersection of corpus and computational linguistics. In 2011 he became director of the Czech branch of Lexical Computing, leading the local development team of Sketch Engine, and in 2014 he became CEO of Lexical Computing. He is a fellow of the NLP Centre at Masaryk University, where his interests lie mainly in syntactic analysis and its practical applications. Co-author |
Miloš Jakubíček will present… From parallel corpora to bilingual terminology: a hybrid approach
Abstract We present a working system for extracting bilingual terminology from parallel corpora, available in the Sketch Engine corpus management system. The extraction process consists of two steps: first, the system extracts all possible terms from the source and target corpora separately, using the existing terminology extraction framework in Sketch Engine; second, it computes co-occurrence statistics for all possible source-target term pairs that are aligned (e.g. at the sentence level) in the parallel corpus. The list of all possible pairs is then sorted according to logDice, the logarithmic variant of the Dice coefficient. The terms are extracted using a rule-based method. The system contains several term grammars (available for Czech, Dutch, English, French, German, Chinese variants, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish) describing the syntactic structure of terms in the given language. The grammars mainly define noun phrases, the most frequent type of term. The underlying formalism for describing the syntactic rules is the Corpus Query Language (CQL) implemented in Sketch Engine. The second phase depends heavily on the parallel corpus and is purely statistical. Once the terms are extracted from both source and target corpora, the system goes through all aligned structures and counts co-occurrences of term pairs. The logDice coefficient is then used to discriminate between salient candidate pairs and pairs that co-occur by pure coincidence. The quality of the resulting candidate lists is discussed and several improvements are proposed in the paper. Below is an example of top candidates for an English-French domain parallel corpus. |
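As a side illustration of the statistical second step described in the abstract (and not the system's actual output or implementation), the following minimal Python sketch counts term co-occurrences over aligned segments and ranks the pairs by logDice. It assumes the commonly cited definition logDice = 14 + log2(2·f_xy / (f_x + f_y)) and assumes that each term is counted at most once per aligned segment; the abstract does not specify either detail. The function names `log_dice` and `rank_term_pairs` and the toy English-French segments are hypothetical, invented only for this example.

```python
import math
from collections import Counter
from itertools import product


def log_dice(f_xy, f_x, f_y):
    """logDice score, assuming the form 14 + log2(2*f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_term_pairs(aligned_segments):
    """Rank source-target term pairs by logDice.

    aligned_segments: iterable of (source_terms, target_terms) tuples,
    one tuple per aligned unit (e.g. a sentence pair). Terms are assumed
    to have been extracted already by some monolingual term extractor.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned_segments:
        # Assumption: a term counts once per segment, however often it repeats.
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        # Every source term co-occurs with every target term in the segment.
        pair_freq.update(product(src_set, tgt_set))
    scored = [
        (src, tgt, log_dice(f_xy, src_freq[src], tgt_freq[tgt]))
        for (src, tgt), f_xy in pair_freq.items()
    ]
    # Highest score first: the most salient candidate translations.
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Made-up toy data, for illustration only.
toy_segments = [
    (["interest rate", "central bank"], ["taux d'intérêt", "banque centrale"]),
    (["interest rate"], ["taux d'intérêt"]),
    (["central bank"], ["banque centrale"]),
]

for src, tgt, score in rank_term_pairs(toy_segments):
    print(f"{score:6.2f}  {src}  <->  {tgt}")
```

In this toy input, pairs that always appear together reach the maximum score of 14, while pairs that share only one of their two segments score lower. This is the sense in which the score separates salient pairs from coincidental ones.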