Phoneme Inventory, Trigrams and Geographic Location as Features for Clustering Different Philippine Languages



The paper was presented at the 2016 Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA).

The purpose of Oriental COCOSDA (the Oriental chapter of COCOSDA) is to exchange ideas, share information, and discuss regional matters concerning the creation, utilization, and dissemination of spoken language corpora for Oriental languages, as well as methods for assessing speech recognition and synthesis systems. It also promotes speech research on Oriental languages.


  • Angelica Dela Cruz
  • Nathaniel Oco
  • Leif Romeritch Syliongka
  • Rachel Edita Roxas


In this paper, orthographic, geographic, and phonetic features were explored to cluster 32 Philippine languages and identify closely related languages. For the orthographic feature, we collected religious text documents online and used 100,000 words per language as training data. These words were cleaned, and trigram profiles were generated from them. For the geographic feature, we used the location where each language is spoken. For the phonetic feature, we used the phoneme inventory of each language. The languages were clustered using two algorithms: hierarchical clustering and k-means. Purity was used as the evaluation metric to validate the resulting clusters. For both hierarchical clustering and k-means, the highest purity value of a cluster is 0.67, which indicates that the members of that cluster have similar attributes. As future work, semantic features can be added to improve the data set, and additional languages can be considered.
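
To make the trigram profiles and the purity metric more concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `trigram_profile` and `cluster_purity`, the use of character-level trigrams, the `_` word-boundary padding, and the reference labels (for example, a language subgroup assigned to each language) are all assumptions made for illustration.

```python
from collections import Counter


def trigram_profile(words):
    """Return the relative frequency of each character trigram in a word list.

    Assumption: trigrams are character-level and each word is padded with "_"
    to mark its boundaries. The paper may have used a different scheme.
    """
    counts = Counter()
    for word in words:
        padded = f"_{word.lower()}_"
        # Slide a window of width 3 across the padded word.
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: n / total for gram, n in counts.items()}


def cluster_purity(reference_labels):
    """Return the share of a cluster's members that carry its most common label.

    Assumption: every member of the cluster has a known reference label,
    such as a language subgroup.
    """
    if not reference_labels:
        return 0.0
    most_common_count = Counter(reference_labels).most_common(1)[0][1]
    return most_common_count / len(reference_labels)


# Hypothetical cluster of three languages, two sharing a reference label.
print(round(cluster_purity(["A", "A", "B"]), 2))
print(trigram_profile(["aso"]))
```

Under this definition, a cluster in which two of three members share a reference label has a purity of about 0.67. The labels above are placeholders and are not taken from the paper's data.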
