The 29th Pacific Asia Conference on Language, Information and Computation



Keynote talk


Sumita Eiichiro
(NICT, Japan)
Talk Title: Research Activities for Translating Asian Languages

Abstract: This talk will introduce automatic translation projects for Asian languages, wherein we intend to seek greater cooperation.
First, a worldwide speech translation consortium, Universal Speech Translation Advanced Research (U-STAR), is introduced. Speech translation involves the integration of three elements: speech recognition, machine translation, and speech synthesis. To build a speech translation system that covers many languages, including Asian languages, it is therefore a good idea to cooperate with other laboratories that specialize in the languages concerned. The consortium now comprises 32 institutes from 27 different countries/regions. The collaboration has improved the accuracy of the integrated systems and has created new forms of integration. U-STAR is always open and welcomes new participants.
Second, we introduce two projects related to the translation of Asian languages: the Workshop on Asian Translation (WAT) and the Asian Language Treebank (ALT). WAT is an open evaluation campaign focusing on translation among Asian languages. We will outline the workshops conducted in the past two years and touch on our plan for next year. ALT is currently a start-up project that will undertake the task of building a treebank of Asian languages. This will be a valuable language resource, not only for building a parser for each language but also for building an accurate translation system from one language to another.
Third, we discuss the Global Communication Program (GCP), a Japanese government project announced in April 2014 to develop a multi-lingual speech translation system to bridge the language barrier during the Olympic Games in 2020. Using the translation technology of the National Institute of Information and Communications Technology (NICT), it aims to provide real-time machine translation services in day-to-day situations to help foreigners who may feel hesitant about coming to Japan. It will cover 10 languages, including Asian ones, e.g., Thai, Vietnamese, Indonesian, and Myanmar. At NICT, public and private entities have already begun working together as part of a nationwide collaboration. This talk will explain the current status and future vision.
Finally, we touch on NICT's recent research topics, including an approach to high-quality patent translation and new ideas on neural translation.


Zhou Guodong
(Soochow University)

Talk Title: Building Chinese Discourse Corpus with Connective-driven Dependency Tree Structure

Abstract: It is well known that interpretation of a text requires understanding of its rhetorical relation hierarchy, since discourse units rarely exist in isolation. Such discourse structure is fundamental to discourse understanding and many text-based applications. In this talk, we propose a Connective-driven Dependency Tree (CDT) scheme to represent the discourse rhetorical structure in the Chinese language, with elementary discourse units as leaf nodes and connectives as non-leaf nodes, largely motivated by the Penn Discourse Treebank and the Rhetorical Structure Theory. In particular, connectives are employed to directly represent the hierarchy of the tree structure and the rhetorical relation of a discourse, while the nuclei of discourse units are globally determined with reference to the dependency theory. Guided by the CDT scheme, we manually annotate a Chinese Discourse Treebank (CDTB) of 500 documents. Preliminary evaluation justifies the appropriateness of the CDT scheme for Chinese discourse analysis and the usefulness of our manually annotated CDTB corpus.

Guodong Zhou is a distinguished professor (Grade II) and a member of the university academic committee at Soochow University, China. He obtained his Ph.D. degree from the National University of Singapore in 1999. He joined the Institute for Infocomm Research, Singapore in 1999 and Soochow University in 2006. His research interests include natural language processing and artificial intelligence, with an increasing focus on fundamental language issues.

Prof Zhou has published over 100 papers in leading NLP and AI conferences and journals such as ACL/EMNLP/COLING/AAAI/IJCAI, with over 4000 citations (Google Scholar). He has served on the editorial boards of several international journals, such as Computational Linguistics, ACM TALIP and Chinese Journal of Software, and is a regular PC member of the major conferences in NLP and AI.

Since 2006, Prof Zhou has built up the Suda NLP lab, which now has 16 staff members, including 7 full professors and 7 associate professors.


Philippe Blache
(CNRS)

Talk Title: New approaches to sentence processing: a cognitive perspective

Abstract: Sentence processing is usually considered an incremental mechanism: each new word is integrated into a structure under construction that can be interpreted compositionally. In this architecture, understanding a sentence amounts to building the meaning step by step. In this talk, I will present different elements that challenge this approach. Starting from work in linguistics, psycholinguistics and natural language processing, we will see that language processing by humans can be, depending on the situation, very superficial and incomplete. A more realistic language processing architecture would therefore have to integrate different levels of processing into a single model: one that is superficial, relying on the recognition of large units with strong cohesion, and another consisting of classical incremental word-by-word integration. This organization corresponds to a two-level shallow-and-deep parsing process.

Invited Talk


Wang Renqiang
(Sichuan International Studies University)
Talk Title: Two-level Word Class Categorization Model in Analytic Languages and Its Implications for POS Tagging in Modern Chinese Corpora

Abstract: Categorization is a fundamental task in linguistics, and the study of linguistic categories like word classes or parts of speech was described as the study of "god particles" in language at the 36th Annual Conference of the German Linguistic Society, held at the University of Marburg, Germany, in March 2014. The study of word classes has a history of over 4000 years, and the word class problem in over one thousand analytic languages like Modern Chinese, Modern English and Tongan can be seen as the Goldbach Conjecture of linguistics, which has witnessed several upsurges over the last century. Since analytic languages have few or no inflections, a theory of word class categorization in analytic languages like Modern Chinese can contribute to general linguistic theory and to POS tagging in corpora. Based on the perspectives of language as a complex adaptive system and the nature of major parts of speech as propositional speech act functions proposed by Croft (1991, 2000, 2001, 2007) and Croft & van Lier (2012) on the basis of Searle (1969), Wang Renqiang (2014a) argues in his Two-level Word Class Categorization Model that just as there are two states of existence of a word at the two levels of langue (i.e. word type or lexeme in the lexicon of a communal language) and parole (i.e. word token in syntax), word class categorization also happens at the two levels: the word token categorization in syntax at parole is the speaker's expression of propositional speech act functions like reference, predication and modification, whereas the word type categorization in the lexicon at langue is the conventionalized propositional speech act functions of a word type resulting from self-organization or the collective unconscious; the class membership of a word type does not have a priori existence, nor is it precategorial, but is liable to change through recurrent use in various propositional speech act constructions in syntax at parole; the multifunctionality or multiple class membership of word types in synchrony derives from diachronic change and is closely related to frequency of use, which reveals the competing motivations of economy and iconicity in communication; the class membership (either single or multiple class membership) of a word type is its meaning potential(s) at langue, which is to be discovered by descriptive linguists through corpus-based usage pattern surveys, as is done by dictionary compilers in word class labeling. Empirical studies by Wang Renqiang (2013, 2014b) have shown that multiple class membership in the lexicon at langue is characteristic of analytic languages like Modern Chinese and Modern English, and that the types of multiple class membership in Modern Chinese are similar to those of Modern English, though The Contemporary Chinese Dictionary (5th ed.) minimized the number of multi-category lexemes by following the principle of parsimony/simplicity, creating a false impression that the percentage of multi-category lexemes in Modern Chinese is far lower than that in Modern English.
It is found that this false impression results to some degree from the ban on multiple class membership for self-reference lexemes advocated by leading scholars like Zhu Dexi (1985), Guo Rui (2002), and Shen Jiaxuan (2009), who argue for multifunctionality of Chinese word classes rather than Chinese lexemes. However, this has obviously led to indeterminacy of Chinese word classes. Wang Renqiang & Zhou Yu (2015) have verified the positive correlation between multiple class membership and frequency, and proved that the principle of parsimony is theoretically invalid and practically misleading. They suggest that in the studies of Chinese word classes we respect the truth of language use in the Chinese community, and that we stow away Occam's razor, casting off the constraints of the principle of parsimony. Finally, after examining the current status of POS tagging in Modern Chinese corpora like the Modern Chinese Corpus of the National Language Commission of China, including its achievements and problems, we will explore the implications of the Two-level Word Class Categorization Model for POS tagging in Modern Chinese corpora. According to Bakeoff (2008), among the 5 POS tagged corpora in his survey, 3 are based on the word class information in dictionaries while 2 are token-based. Huang Changning (2014) pointed out that the machine learnability of the latter 2 corpora is 2-4 percent higher than that of the former 3, which indicates that the accuracy of automatic POS tagging can be improved dramatically if we tag the class membership of word tokens in syntax. This is also in accord with the implications of the Two-level Word Class Categorization Model for POS tagging in Modern Chinese corpora.

Key words: Two-level Word Class Categorization Model; implications; POS tagging in Chinese corpora

Dr. Renqiang Wang is Professor and Dean of the Graduate School of Sichuan International Studies University, and executive director of the ChinaLex Bilingual Committee. His research interests include language typology, lexicography, cognitive linguistics, corpus linguistics and translation studies. He has undertaken two national-level research projects funded by the National Social Sciences Foundation of China: "An Empirical Study of Word Class Labeling in Dictionaries of Chinese as a Second Language" (08XYY009) and "A Trans-disciplinary Study of the Word Class Problem in Analytic Languages" (15BYY169). He has published 7 books and over 30 articles in peer-reviewed journals like Foreign Language Teaching and Research, Journal of Foreign Languages, and Modern Foreign Languages. He has been a visiting scholar at the University of Edinburgh and the University of New Mexico, where he collaborated with Professor Joan Bybee, former president of the Linguistic Society of America.


Yao Yao
(The Hong Kong Polytechnic University)
Talk Title: How to model language variation and what does the model tell us?

Abstract: Variation is ubiquitous in language. The same word can be pronounced slightly differently by different speakers, or even by the same speaker in different contexts; the same message may be encoded in different syntactic forms with more or less the same meaning. What is interesting is to find out (1) whether the variation is (at least partially) predictable, and if so, what the predictors are, and (2) what the predicting patterns can tell us about how people process language. In this talk, I will introduce a series of studies that use corpus data to build statistical models of language variation. The variation phenomena range from phonetic variation to syntactic (word order) variation. The results of these models have significant implications for theories of word and sentence processing, and may also offer insights for computational linguistic research.