Click here for the link to the paper.
The paper was presented at the 21st International Conference on Asian Language Processing (IALP 2017).
IALP is a series of conferences with a unique focus on Asian language processing. The conference aims to advance the science and technology of all aspects of Asian language processing by providing a forum where researchers from the different fields of language study all over the world can meet.
- Alfred John Tacorda
- Marvin John Ignacio
- Nathaniel Oco
- Rachel Edita Roxas
Byte pair encoding (BPE) is an approach that segments the corpus so that frequent sequences of characters are merged; as a result, word surface forms are divided into their root words and affixes. On its own, BPE handles out-of-vocabulary words, but it tends not to segment inflected words consistently. Controlled byte pair encoding (CBPE) allowed our word-level neural machine translation (NMT) model to easily recognize inflected words, which are prevalent in morphologically rich languages. It prevented BPE from merging the affixes of a word with the other characters in that word. Our resulting NMT models from CBPE consistently evaluate affixes that BPE could have segmented in varying ways. In our experiments, we considered 119,969 English-Filipino parallel language pairs from an existing dataset, with Filipino as the morphologically rich language. The results show that BPE and CBPE both improved the BLEU scores: from 38.31 to 44.82 and 44.07, respectively, for English→Filipino, and from 32.17 to 35.25 and 35.98, respectively, for Filipino→English. The lower scores for Filipino→English can be attributed to other language characteristics of Filipino, such as free word order, the one-to-many relationship in translating from English to Filipino, and some transliterations in the parallel corpus. CBPE also performed slightly better for English→Filipino than for Filipino→English.
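The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prefix list, the function names (`split_affix`, `learn_bpe`, and so on), and the toy word counts are all assumptions made for illustration. The sketch learns ordinary BPE merges, and optionally keeps a known prefix as its own atomic segment so that no merge can join the affix to the rest of the word.

```python
from collections import Counter

def split_affix(word, prefixes):
    """Split off a known prefix (hypothetical affix list) as its own segment."""
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            # The affix becomes one atomic symbol; the stem stays as characters.
            return [(p,), tuple(word[len(p):])]
    return [tuple(word)]

def count_pairs(vocab):
    """Count adjacent symbol pairs, never looking across segment boundaries."""
    pairs = Counter()
    for segments, freq in vocab.items():
        for seg in segments:
            for a, b in zip(seg, seg[1:]):
                pairs[(a, b)] += freq
    return pairs

def merge_pair(seg, pair):
    """Replace every occurrence of `pair` inside one segment with the merged symbol."""
    out, i = [], 0
    while i < len(seg):
        if i + 1 < len(seg) and (seg[i], seg[i + 1]) == pair:
            out.append(seg[i] + seg[i + 1])
            i += 2
        else:
            out.append(seg[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges, prefixes=()):
    """Learn up to `num_merges` merges; with `prefixes`, affixes are protected."""
    vocab = {tuple(split_affix(w, prefixes)): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {tuple(merge_pair(s, best) for s in segs): f
                 for segs, f in vocab.items()}
    return merges, vocab

def flatten(segments):
    """Flatten a segmented word into a single list of subword symbols."""
    return [sym for seg in segments for sym in seg]

# Toy counts for three Filipino verbs sharing the prefix "nag" (illustrative only).
word_freqs = {"nagluto": 5, "nagbasa": 4, "naglaba": 3}

plain_merges, plain_vocab = learn_bpe(word_freqs, 3)
ctrl_merges, ctrl_vocab = learn_bpe(word_freqs, 3, prefixes={"nag"})

for segs in ctrl_vocab:
    print(flatten(segs))
```

In the plain run, merges are free to cross the boundary between the prefix and the stem, so the same affix may end up segmented differently from word to word. In the controlled run, the prefix is always emitted as one unchanged symbol, which is the consistency property the abstract attributes to CBPE.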