LIVAC Synchronous Corpus

1. Introduction

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous and regular, as well as "Windows"-based, approach to processing and filtering massive volumes of media texts from representative communities in the Pan-Chinese region, including Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, Guangzhou and Shenzhen. The contents are thus deliberately repetitive in most cases, consisting of textual samples drawn from editorials, local and international news, cross-Formosan Strait news, and news on finance, sports and entertainment. By 2020, more than 700 million characters of news media text had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.4 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and its speech communities in the Pan-Chinese region.

 

The "Windows" approach is the most representative feature of LIVAC. It has enabled Chinese media texts from the Pan-Chinese context to be analyzed quantitatively according to attributes such as location, time and subject domain. This has made possible various types of comparative studies, applications in information technology, and the development of related innovative applications. LIVAC also allows longitudinal developments to be taken into account, facilitating Key-Word-in-Context (KWIC) searches and comprehensive study of target words, their underlying concepts and their linguistic structures over 20 years, with region, duration and content domain as variables. Results from the extensive, cumulative data analysis in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names and new words, as well as bi-weekly and annual rosters of media figures. Related applications include the establishment of verb and adjective lexicons; the formulation of sentiment indices to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed Pan-Chinese Media Personalities Rosters) [http://www.livac.org/celebrity.php?lang=en]; and the construction of monthly new word lexicons (LIVAC Annual Pan-Chinese New Word Rosters) [http://www.livac.org/newword.php?lang=en]. On this basis, it has become possible to analyze the emergence, diffusion and transformation of new words and to publish dictionaries of neologisms.
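
As a minimal illustration of how a "Windows"-style query and a KWIC lookup might fit together, the Python sketch below filters texts by region, date range and domain before collecting keyword contexts. The Text record, the in_window and kwic functions, and the three sample texts are all invented for illustration; they are assumptions and do not describe LIVAC's actual data model or query interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Text:
    """Hypothetical record for one already-segmented media text."""
    region: str    # e.g. "Hong Kong"
    day: date      # publication date
    domain: str    # e.g. "finance"
    tokens: tuple  # segmented words, in order


def in_window(text, regions=None, start=None, end=None, domains=None):
    """Return True if the text lies inside the requested window.

    A criterion left as None is not applied.
    """
    if regions is not None and text.region not in regions:
        return False
    if start is not None and text.day < start:
        return False
    if end is not None and text.day > end:
        return False
    if domains is not None and text.domain not in domains:
        return False
    return True


def kwic(texts, keyword, width=3, **window):
    """Yield (region, day, left context, keyword, right context) per hit."""
    for text in texts:
        if not in_window(text, **window):
            continue
        for i, tok in enumerate(text.tokens):
            if tok == keyword:
                left = text.tokens[max(0, i - width):i]
                right = text.tokens[i + 1:i + 1 + width]
                yield (text.region, text.day, left, tok, right)


# Invented sample data, for illustration only.
SAMPLE = [
    Text("Hong Kong", date(2005, 3, 1), "finance", ("股市", "今日", "上升", "百", "點")),
    Text("Beijing", date(2005, 3, 2), "finance", ("股市", "昨日", "下跌")),
    Text("Hong Kong", date(2012, 6, 9), "sports", ("球隊", "今日", "出賽")),
]

# Keyword hits restricted to Hong Kong texts dated up to the end of 2010.
hits = list(kwic(SAMPLE, "今日", width=1,
                 regions={"Hong Kong"}, end=date(2010, 12, 31)))
```

Changing only the window arguments (for example, a different set of regions or a later date range) reuses the same lookup, which is the kind of comparison across place, time and domain that the paragraph above describes.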

 

2. Corpus data processing

a.  Accessing media texts, manual input, etc.

b.  Text unification, including conversion from simplified to traditional Chinese

    characters, stored as Big5 and Unicode versions

c.  Automatic word segmentation, automatic alignment

d.  Manual verification, part-of-speech tagging

e.  Extraction of words, and addition to regional sub-corpora

f.  Combination of regional sub-corpora into the LIVAC corpus
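
Steps b and c above can be sketched in Python as follows. This is only an illustration under stated assumptions: the S2T table and LEXICON are tiny invented samples, "Unicode" is taken here to mean UTF-8, and the segmenter uses forward maximum matching, a common baseline that is not claimed to be the method LIVAC itself uses.

```python
# Toy simplified-to-traditional table (assumption). A real table is far
# larger, and some simplified characters map to several traditional ones
# depending on context, which a per-character lookup cannot resolve.
S2T = {"汉": "漢", "语": "語", "库": "庫", "处": "處"}


def unify(text):
    """Map simplified characters to traditional ones; leave others as-is."""
    return "".join(S2T.get(ch, ch) for ch in text)


def store_versions(text):
    """Return Big5 and UTF-8 byte versions of an already-unified text."""
    return {"big5": text.encode("big5"), "utf-8": text.encode("utf-8")}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each
    position, falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Invented lexicon, for illustration only.
LEXICON = {"漢語", "語料", "語料庫", "處理"}

unified = unify("汉语语料库处理")
versions = store_versions(unified)
words = segment(unified, LEXICON)
```

The automatic output of such a step would still go through the manual verification listed in step d, since dictionary-based matching cannot handle unknown words or ambiguous boundaries on its own.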

 

3. Labeling and Applications

a. categories used include general terms and proper names, such as specific terms

  (general names, surnames, semi-titles, places, organizations, commercial terms,

  other proper names, time, prepositions, locations, etc.); stack-words; loanwords;

  case-words; numerals, etc.

 

b. construction of databases of proper names, places, specific terms, etc.

 

c. release of rosters: "new word rosters", "celebrity rosters", "place name rosters",

   compound words and matched words

 

d. other part-of-speech tagging and sub-databases, such as common nouns,

   numerals, numeral classifiers, various types of verbs, various types of

   adjectives, pronouns, adverbs, prepositions, conjunctions, particles marking

   mood, onomatopoeia, interjections, etc.
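
To show how labeled words could feed the sub-databases and rosters listed above, here is a small Python sketch. The label names, the sample tokens and the min_count threshold are all hypothetical, and the "never seen before this window" rule is an assumed simplification; the actual criteria behind the LIVAC rosters are not described here.

```python
from collections import Counter, defaultdict


def build_subdatabases(tagged_tokens):
    """Group (word, label) pairs into one frequency table per label."""
    tables = defaultdict(Counter)
    for word, label in tagged_tokens:
        tables[label][word] += 1
    return tables


def new_word_candidates(earlier_tokens, window_tokens, min_count=2):
    """Return words seen at least min_count times in the window but never
    in the earlier material, sorted for stable output."""
    known = {word for word, _ in earlier_tokens}
    counts = Counter(word for word, _ in window_tokens if word not in known)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Invented sample data with hypothetical labels, for illustration only.
EARLIER = [("香港", "place"), ("銀行", "organization"), ("上升", "verb")]
WINDOW = [("香港", "place"), ("網購", "verb"), ("網購", "verb"),
          ("上升", "verb"), ("博客", "common_noun")]

tables = build_subdatabases(EARLIER + WINDOW)
candidates = new_word_candidates(EARLIER, WINDOW)
```

Candidates produced this way would only be a starting list; deciding which of them belong on a published roster is an editorial step that the sketch does not model.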

 

4. Applications

 

5. Historical background

LIVAC was launched by the Language Information Sciences Research Centre of the City University of Hong Kong, and was nurtured in part by Chilin (HK) Ltd and its subsidiary, ChiLinStar Ltd in Zhuhai, China, under the aegis of CityU Enterprise Ltd of Hong Kong. From 2010 to 2013, LIVAC was hosted by the Hong Kong Institute of Education's Research Centre on Linguistics and Language Information Sciences. Since July 2013, LIVAC has been exclusively hosted and maintained by Chilin (HK) Ltd.

 

6. Website and online inquiry system

LIVAC website: www.livac.org     Email: livac.org@hotmail.com

 

7.  Some Related Publications (including Chinese and English bibliography)

1. Books

       鄒嘉彥、黎邦洋、陳偉光、王士元(編)(1998),《漢語計量與計算研究》,香港,香港城市大學語言資訊科學研究中心。

       鄒嘉彥、游汝杰(編)(2007),《21世紀華語新詞語詞典》(簡體字版),上海,復旦大學出版社。

       鄒嘉彥、游汝杰(編)(2008),《21世紀華語新詞語詞典》(繁體字版),台灣,麗文出版社。

       鄒嘉彥、游汝杰(編)(2010),《全球華語新詞語詞典》,北京,商務印書館。

       Tsou, B. K., and Kwong, O. Y. (Eds.). (2015). Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series No. 25) [邹嘉彦、邝蔼儿(编)《汉语语料库及语料库语言学》,《中国语言学报》专刊第25期]. Hong Kong: The Chinese University Press.

       Chin, Chi-on Andy, Kwok, Bit-chee, and Tsou, Benjamin K. (Eds.). (2016). Commemorative Essays for Professor Yuen-Ren Chao: Father of Modern Chinese Linguistics. Taiwan: Crane Publishing.

2. Book Chapters

       鄒嘉彥、黎邦洋(2003),〈漢語共時語料庫與資訊開發〉,徐波、孫茂松、靳光瑾編《中文資訊處理若干重要問題》[《973計劃國家語言自然語言理解與知識扢掘》總體刊物](頁147-165),北京,科學出版社。

       Tsou, Benjamin. (2004). "Chinese Language Processing at the Dawn of the 21st Century". In C. R. Huang and W. Lenders (eds.), Language and Linguistics Monograph Series B: Frontiers in Linguistics I, pp. 189-207. Institute of Linguistics, Academia Sinica.

       鄒嘉彥(2005),〈21世紀初的中文處理〉(呂學強翻譯),俞士汶、黃居仁編《計算語言學前瞻》(頁209-258),北京,商務印書館。

       鄒嘉彥、莫宇航(2013),〈漢語書面語的歷史與現狀:海峽兩岸漢語書面語近年演變:以語料庫為出發點〉,馮勝利編《漢語書面語的歷史與現狀》(頁58-75),北京,北京大學出版社。

       Tsou, Benjamin, and Kwong, Olivia. (2015). LIVAC as a Monitoring Corpus for Tracking Trends beyond Linguistics. In Tsou, Benjamin, and Kwong, Olivia (eds.), Linguistic Corpus and Corpus Linguistics in the Chinese Context (Journal of Chinese Linguistics Monograph Series No. 25). Hong Kong: The Chinese University Press, pp. 447-471.

       Tsou, Benjamin. (2016). Skipantism Revisited: Along with Neologisms and Terminological Truncation. In Chin, Chi-on Andy, Kwok, Bit-chee, and Tsou, Benjamin K. (eds.), Commemorative Essays for Professor Yuen-Ren Chao: Father of Modern Chinese Linguistics. Taiwan: Crane Publishing, pp. 343-357.

       Tsou, B. K. (2017). Loanwords in Mandarin Through Other Chinese Dialects. In R. Sybesma, W. Behr, Y. Gu, Z. Handel, C.-T. Huang & J. Myers (Eds.), The Encyclopaedia of Chinese Language and Linguistics (Vol. 2, pp. 641-647). Leiden; Boston: BRILL.

3. Serial Publications

       Tsou, Benjamin, Lin, H.-L., Chan, T., Hu, J.-P., Chew, C.-H. and Tse, J. (1997). "A Synchronous Chinese Language Corpus from Different Speech Communities: Construction and Application". International Journal of Computational Linguistics and Chinese Language Processing, 2(1), pp. 91-104.

       Kwong, Olivia, Tsou, Benjamin, and Lai, Tom. (2004). "Alignment and Extraction of Bilingual Legal Terminology from Context Profiles". Terminology, 10(1), pp. 81-99.

       Kwong, Olivia, and Tsou, Benjamin. (2004). "A Synchronous Corpus-Based Study of Verb-Noun Fluidity in Chinese". Journal of Chinese Language and Computing, 13(3), pp. 227-278.

       Kwong, Olivia, and Tsou, Benjamin. (2005). "A Synchronous Corpus-Based Study on the Usage and Perception of Judgement Terms in the Pan-Chinese Context". International Journal of Computational Linguistics and Chinese Language Processing, 10(4), pp. 519-532.

       Kwong, Olivia, and Tsou, Benjamin. (2006). "Feasibility of Enriching a Chinese Synonym Dictionary with a Synchronous Chinese Corpus". Lecture Notes in Computer Science, 4139, pp. 322-332.

       鄒嘉彥、鄺藹兒、路斌、蔡永富(2011),〈漢語共時語料庫與追蹤語料庫: 語料庫語言學的新方向〉,《中文信息學報: 慶祝中國中文信息學會成立三十周年紀念論文集》,25(6),38-45。