Developing a Model Corpus for Endangered Languages

Date
2014-07-11
Abstract
The world’s languages are becoming endangered at an unprecedented rate. Linguists estimate that 50-90% of the world’s 6700 languages are endangered. Krauss (1992) predicted that as few as 10% of them may survive to 2100, which would leave only 670 languages. Given the relationship between linguistic, cultural, intellectual, and biological diversity, this study is timely because it contributes to maintaining the diversity of languages. The research builds on the earlier success of corpus linguistics for certain world languages to develop and document the Somali language. As an educational technology study, it examines the phenomenon of endangered languages through corpus linguistics, the study of language in real-world texts. Through analysis and a review of the literature and historical precedents, the study demonstrates how the Somali language, although spoken by millions, can be labelled an endangered language. Using a pragmatic approach and building on the theories and experience of previous corpus construction projects, the study developed an iterative design framework, in five separate but complementary phases, for corpus design in challenging contexts, and drafted a sampling frame for text collection. The research designed and built a Somali language corpus as a working prototype for endangered languages, illustrating how such a corpus could be used for language development and documentation. Phase one collected 22 Somali texts from different genres to produce a general corpus of 865,214 words. Aided by quantitative results generated with the AntConc corpus tools, the study identifies the 20 most frequently used Somali words. Identifying these words has practical implications for teachers, students, curriculum developers, lexicographers, and language policy makers, because high-frequency words have pedagogical significance.
The study also illustrates how corpus tools identify the least commonly used words, with implications for those involved in language documentation and development initiatives. Using type/token ratio analysis, the study indicates that certain genres, such as poetry, have an inherently richer vocabulary, which has implications for lexicographers and others involved in vocabulary and terminology development.
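The three measures the abstract mentions (a frequency list, the least commonly used words, and the type/token ratio) can be illustrated with a minimal Python sketch. This is not the thesis's method, which relied on AntConc. The tokenizer, the function names, and the toy English sample below are assumptions made purely for illustration, and no figures from the Somali corpus are reproduced.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract words (letters, with apostrophes kept inside words).

    Assumption: this simple rule stands in for whatever tokenization a real
    corpus tool applies; it is not AntConc's definition of a token.
    """
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def top_words(tokens, n=20):
    """Return the n most frequent words with their counts."""
    return Counter(tokens).most_common(n)


def least_common_words(tokens):
    """Return words that occur exactly once, sorted alphabetically."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical placeholder text, not drawn from the Somali corpus.
sample = "The cat saw the dog and the dog saw the cat run"
tokens = tokenize(sample)

print(top_words(tokens, 1))        # [('the', 4)]
print(least_common_words(tokens))  # ['and', 'run']
print(type_token_ratio(tokens))    # 0.5 (6 types / 12 tokens)
```

The type/token ratio tends to fall as a text gets longer, so comparisons between genres are more meaningful when the samples are of similar size.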
Keywords
Education
Citation
Hashi, A. (2014). Developing a Model Corpus for Endangered Languages (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/25614