HLTC was founded as one of the Emerging High Impact Area research centers at HKUST, led by seven faculty members from the Electrical and Electronic Engineering (EEE) and Computer Science (CS) departments specializing in speech and signal processing, statistical and corpus-based natural language processing, machine translation, text mining, information extraction, Chinese language processing, knowledge management, and related fields.
As computing and communications systems grow more sophisticated, there is increasing demand for intelligent multimedia interfaces to them. Speech is by far the most direct and natural means for human beings to communicate. With the rapid growth of the Internet, the emergence of computer telephony integration, and the expanding deployment of wireless communication networks, we anticipate speech interfaces in the user's own language, through wireless terminals, to intelligent agents providing interactive problem-solving capabilities over worldwide communication and computer networks.
Because a human is part of the communication chain, spoken language processing has emerged as an exciting new research field. In the last two decades, advances in automatic speech recognition and natural language processing have triggered the development of a range of spoken language applications, from small-vocabulary keyword recognition over dial-up telephone lines, to medium-vocabulary voice-interactive systems on personal computers, to large-vocabulary speech dictation, spontaneous speech understanding, speech translation, and spoken dialogue systems. These advances have been built upon contributions from researchers in a number of distinctly different areas, including acoustics and transducers; signal processing; communication systems; speech coding, recognition and synthesis; natural language understanding and generation; language translation; heuristic search and problem solving; multimedia presentation; database management and design; human factors; and others.
Much of the advance in spoken language technology has been made by a collaborative community, in which responsibilities such as collecting large speech and text corpora, defining common tasks, developing research tools, building research infrastructure, establishing common evaluation metrics, and training engineers are shared among participating groups. Recognizing this, as well as the need to bring Chinese spoken language technology to a level comparable to that of many European languages, a group of faculty whose research interests span the wide range of issues encountered in designing human-machine communication interfaces has banded together to form the Human Language Technology Center at the Hong Kong University of Science and Technology.
The primary objective of the HLTC is to explore new research directions and to develop new applications in language and speech technology. Relevant applications include automated language translation for the Internet, speech recognition/synthesis for computer I/O, and spoken language understanding over the telephone. A second focus of this program is to advance the state of the art in machine processing of Chinese language and Chinese information, and to lead Chinese language technology development in this region (i.e., Hong Kong, Taiwan, Singapore, and China), where the need is obvious and the transfer of such technology to various industries is imminent.
A team of faculty from the EEE and the CS departments including Oscar Au (EEE), Roland Chin (CS), Pascale Fung (EEE), Brian Mak (CS), Bertram Shi (EEE), Manhung Siu (EEE), and Dekai Wu (CS) forms the core group of this multidisciplinary program.
Overview of research plan
The following sections describe the core language technologies under development at the HLTC.
These core technologies apply to an entire spectrum of important applications for the coming decade. Of these, we focus on a set of critical, representative tasks in which the key research issues surface and which serve as the test-bed for our core technologies. In increasing order of sophistication, these are:
- Small-vocabulary isolated-word English/Chinese speech recognition for command situations.
- Medium-vocabulary robust text translation between English and Chinese for unrestricted input.
- Dynamic-vocabulary continuous speech recognition for command situations. (Recognition where the vocabulary space is relatively large, but the set of possible candidates at any particular time is small.)
- Medium-vocabulary clean continuous speech translation between English and Chinese for cooperative, grammatical speech input under laboratory recording conditions.
- Large-vocabulary robust cellular telephone spontaneous speech-to-speech translation between English and Chinese.
The final task is a "stress test" requiring major advances in all our core technologies. A representative scenario is a gentleman standing on a busy street corner in the central business district in Hong Kong, making arrangements to meet a prospective business client on his cellular telephone, saying: well uh, sounds good, what aah, which p- place didja wanna meet at? The client on the other end hears a Chinese translation, and responds in Chinese, which the gentleman on the corner hears in English.
Developing new techniques for the core technologies that address various aspects of these tasks is a significant part of our research effort. Our focus upon them also allows for rapid deployment of the technologies we develop in practical applications. In the following sections, we describe briefly our major targeted directions:
- Cantonese corpus collection.
- Robust recognition of telephone speech.
- Acoustic modeling.
- Language modeling.
- Translation and understanding.
Cantonese corpus collection
Although Mandarin and Cantonese are often considered two dialects of the same language, linguists classify them as distinct languages on criteria including mutual unintelligibility, lexical differences, and grammatical differences. Despite their common origins, Cantonese is unintelligible to the average Mandarin speaker (just as French is unintelligible to Spanish speakers). Spoken Cantonese is a very sparsely researched area, and one that we in Hong Kong have a comparative advantage in investigating.
While English speech databases have been available for some time, and Mandarin speech databases are becoming available, Cantonese speech databases are still lacking. To develop speech recognition algorithms and deploy Cantonese-based enhanced network services, we will collect several large Cantonese spoken databases. Our goals are: (1) to define the types of databases according to application areas; (2) to design each text database based on its type and use such text databases as scripts for speech data collection; (3) to collect, for each type of database, Cantonese utterances either in a high-quality microphone environment or over a telephone channel, again according to the type of the database; and (4) to organize the databases into machine-readable CD-ROMs and tapes for future research.
Robust recognition of telephone speech
For telecommunication applications, advanced speech technologies have long been used to enhance network services. However, distortions and variations caused by differences in handsets, local loops, PBX and network equipment, local and long-distance channels, speakers, and speaking environments make enhancing recognition performance over the public switched telephone network a major challenge for speech researchers and engineers. Cellular telephone services, fast becoming one of the most important telecommunication sectors in Asia, present additional technological challenges for speech recognition. Developing techniques which enable the speech recognizer to perform robustly over telephone and cellular telephone channels is a significant part of our research effort.
In our example, we would expect several types of distortion. First, background noise may be present in the speaker's environment due to traffic or other competing speakers. Second, the characteristics of the transmission channel, e.g., fading in cellular telephone transmissions, will also distort the signal. Third, variations in the speech signals produced by different speakers can be difficult to model; even for the same speaker, articulation can change due to environmental influences (the Lombard effect).
Our research seeks to use adaptive probabilistic techniques to extend existing methods such as cepstral mean subtraction, compensating for nonlinear and/or highly non-stationary channel distortions, additive noise, and articulation effects that current techniques do not handle effectively. One advantage of our approach is that it does not require a priori knowledge of the exact utterance in order to adapt the recognizer. The adaptation is unsupervised, based on feedback from the models used for speech recognition. This is appealing because the end goal of the processing is to improve recognition performance. In addition, these techniques can always be combined with others, such as robust signal parameterizations, to further improve performance.
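As a point of reference for the compensation methods described above, cepstral mean subtraction itself can be sketched in a few lines. This is a minimal illustration (NumPy, with invented array shapes), not our adaptive extension: a stationary linear channel appears as an approximately constant offset in the cepstral domain, so subtracting the per-utterance mean removes it.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral dimension.

    cepstra: (frames, dims) array of cepstral feature vectors.
    A stationary linear channel adds a roughly constant offset to the
    cepstra, so subtracting the utterance mean compensates for it.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Simulate a stationary channel as a constant cepstral offset.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))      # 200 frames, 13 cepstral coeffs
channel = np.full(13, 0.7)              # constant channel bias
observed = clean + channel
compensated = cepstral_mean_subtraction(observed)
# The channel offset is gone (along with the mean of the clean speech).
```

Note that this simple form fails exactly where our research aims: when the channel is nonlinear or varies within the utterance, the mean no longer captures the distortion.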
One task common to many speech recognition systems is the extraction, from the raw speech waveform, of acoustic features useful for recognition. The feature set must be rich enough to provide sufficient statistics for good speech recognition while suppressing irrelevant information. The importance of different features is also application dependent. For example, pitch may be useful in speaker-dependent speech recognition but is generally considered irrelevant in speaker-independent recognition of English; in tonal languages such as Chinese, however, pitch is very important. Part of our research consists of evaluating the use of pitch and other novel features to improve the recognition performance of our Chinese language recognizers.
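To make the pitch feature concrete, a classical baseline estimator can be sketched via the autocorrelation method: pick the lag with the strongest self-similarity within a plausible pitch range. This is an illustrative sketch only (the frame length, sample rate, and search range below are our assumptions), not the feature extractor used in our recognizers.

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced speech frame.

    Searches the autocorrelation for its strongest peak at lags
    corresponding to plausible pitch frequencies (fmin..fmax Hz).
    """
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)        # shortest lag considered
    hi = int(sample_rate / fmin)        # longest lag considered
    lag = lo + np.argmax(corr[lo:hi])
    return sample_rate / lag

# A synthetic 150 Hz voiced frame (40 ms at 16 kHz).
sr = 16000
t = np.arange(int(0.04 * sr)) / sr
frame = np.sin(2 * np.pi * 150 * t)
print(estimate_pitch(frame, sr))        # a value near 150 Hz
```

Real tonal-language features would track such estimates across frames and normalize for speaker range; robust voicing detection is a separate problem.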
Another interesting problem is speaker accent and dialect adaptation. In Hong Kong, most people speak a mixture of English and Cantonese: Hong Kong Cantonese has a basic Cantonese syntax with some English words, while Hong Kong English is spoken with a distinctive local accent and differs in syntax and vocabulary from American or British English. An ASR system trained on the latter is unlikely to perform well for local users. For Mandarin speech input, there is also a large difference between regional accents and the standard Mandarin found in existing speech databases. In fact, this training/testing discrepancy exists even across regional accents of American English.

Accent and dialect differences have traditionally been treated as part of the speaker variability problem, and most systems rely on a different training set for each accent. However, collecting large databases of regional accents and dialects requires a great deal of human effort. Hence, we are investigating how to adapt a system trained on American or British English to Hong Kong English, and a standard Mandarin system to regional accents and dialects. We treat accent adaptation as a problem separate from speaker adaptation: the latter covers variability in gender, vocal tract length, age, and so on, whereas the former has a single source of variability, the accent caused by the speaker's native language. We are investigating whether articulatory rules of the speaker's native language can be used to modify the standard model for adaptation; alternatively, the parameter set of the HMMs can be adapted to account for accent variations. We will compare and contrast these two approaches.
Acoustic modeling

Our recognition systems are built using hidden Markov models (HMMs), a stochastic framework which integrates smoothly with our approaches to language modeling, as discussed in the next section. A fundamental choice in building HMM-based speech recognition systems is the basic unit of recognition. For small-vocabulary systems, it is often sufficient to use words as the basic unit. As the vocabulary increases, however, so does the number of models which must be trained. To overcome this, most large-vocabulary systems are based on sub-word models such as phonemes.
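The stochastic framework can be illustrated with a minimal sketch of the forward algorithm, which computes the likelihood an HMM assigns to an observation sequence. For brevity this uses discrete emissions and invented probabilities; real recognizers use continuous-density emissions over acoustic feature vectors.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Log-likelihood of a discrete observation sequence under an HMM.

    pi:  (S,)   initial state probabilities
    A:   (S, S) transition probabilities, A[i, j] = P(state j | state i)
    B:   (S, V) emission probabilities over a discrete symbol alphabet
    obs: sequence of symbol indices
    Per-frame rescaling of alpha avoids numerical underflow.
    """
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_lik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return log_lik

# Invented two-state model over a three-symbol alphabet.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 2, 1, 2]))
```

The choice of recognition unit discussed above determines what each HMM state sequence represents: a word, a phoneme in context, or, for Chinese, a sub-syllabic unit still to be determined.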
While the choice of sub-word units and word-juncture models has been studied extensively for English and many European languages, an appropriate choice for many tonal Asian languages, in particular Mandarin and Cantonese, is still an open question. Using our Chinese language speech corpus, we will work with linguists to define appropriate sub-word units for recognition. Moreover, since Hong Kong Cantonese is really a mixture of two languages, it might not be efficient for the speech recognizer to switch between two separate acoustic models and two separate language models in mid-sentence. An interesting question is whether it is possible to select appropriate sub-word units for multilingual recognition, e.g., units covering both Cantonese and Mandarin, or Cantonese, Mandarin, and English.
Other topics under investigation include discriminative training and reducing the computational complexity of HMM-based speech recognizers. Discriminative training has been found to outperform traditional maximum likelihood (ML) training, and we are developing discriminative training schemes on our application test-beds to improve recognition performance. In general, the computational requirements of HMM-based recognizers are very high, especially for training, and the use of context-dependent sub-word units to model inter-word articulation further increases this complexity, though some methods have been proposed to lower it. We are also studying ways to reduce the computational complexity of the recognizer without sacrificing performance.
Language modeling

There is currently a gap between language modeling research in the speech recognition community and grammar research in the natural language processing community. The former relies primarily on simple approaches such as n-grams, which empirically yield better performance for prediction (entropy) and for discrimination (recognition). The latter, besides possessing far greater intuitive appeal, can represent long-distance dependencies. For example, when the gentleman in Central says which place didja wanna meet at, a traditional n-gram model cannot predict well that meet at should end the sentence, because it has already lost track of the preceding which place.
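The limitation is easy to see in a toy bigram model. In the invented two-sentence corpus below, the probability of ending the sentence after "at" is the same whether or not the utterance began with "which place", because a bigram conditions only on the single preceding word.

```python
from collections import Counter

# Two toy training sentences (invented for illustration).
sents = [
    "which place did you want to meet at </s>",
    "he is good at math </s>",
]
tokens = [s.split() for s in sents]
bigrams = Counter(b for t in tokens for b in zip(t, t[1:]))
unigrams = Counter(w for t in tokens for w in t[:-1])

def bigram_prob(prev, word):
    """Maximum-likelihood bigram probability P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# The model sees only "at"; everything earlier is forgotten, so the
# sentence-end probability is 0.5 regardless of the "which place" cue.
print(bigram_prob("at", "</s>"))
```

A structured model that tracked the fronted wh-phrase could in principle assign nearly all the probability mass to the sentence ending here.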
Approaches to adding structure to a stochastic language model include finite-state structures, stochastic context-free grammar (SCFG) models, stochastic versions of lexicalized context-free formalisms, and others. However, although intuitively well motivated, none of these models has surpassed the n-gram approach in terms of either perplexity or recognition accuracy.
To investigate the appropriateness of the general approach, we are performing a systematic analysis of the information available in long-distance dependencies. We believe it is essential to first establish upper bounds on the amount of information available from long-distance dependencies, before constructing new ad hoc variations on the structured-language-model theme. This analysis determines what performance gain can be expected in principle from any stochastic language model incorporating long-distance dependencies.
We use a parsed corpus for the analysis and information-theoretic quantities like mutual information to measure the amount of extra information available from long-distance relations as compared to a baseline n-gram model. We are evaluating the informativeness of entities such as dependencies, specific types of relations, and specific lexemes joined by specific relations.
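One ingredient of such an analysis can be sketched simply: pointwise mutual information scores how much more often two items co-occur in a given relation than chance predicts. The head-dependent pairs below are invented for illustration; in our analysis the events come from a parsed corpus and are compared against an n-gram baseline.

```python
import math
from collections import Counter

def pointwise_mi(pair_counts, x_counts, y_counts, total):
    """Pointwise mutual information, log2 P(x,y) / (P(x) P(y)), per pair."""
    return {
        (x, y): math.log2(c * total / (x_counts[x] * y_counts[y]))
        for (x, y), c in pair_counts.items()
    }

# Toy (head word, dependent word) events, as might be read off a parser.
pairs = [("meet", "place"), ("meet", "place"), ("meet", "client"),
         ("see", "place"), ("see", "film")]
pair_counts = Counter(pairs)
heads = Counter(h for h, _ in pairs)
deps = Counter(d for _, d in pairs)

pmi = pointwise_mi(pair_counts, heads, deps, len(pairs))
# Positive PMI means the pair co-occurs more often than chance.
print(pmi[("meet", "place")])
```

Averaging such quantities over relation types and lexemes gives the kind of informativeness estimates described above.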
Based on the analysis, we will design, construct, and incrementally refine new language models for written and spoken English and Chinese that incorporate varying levels of linguistic structure. These models will aim to capture regularities that arise from long-distance dependencies, which n-gram models cannot represent. At the same time, we will retain as many of the n-gram parameters as needed to capture important lexical dependencies.
Translation and understanding
Speech translation is one of the major hotbeds of research in language processing, due to the obvious practical utility of the application. The closely linked task of understanding can also be seen as translation of the input utterance into an internal symbolic representation. For both translation and understanding, our approach emphasizes robustness over deep semantics: we begin with relatively impoverished semantic representations and incrementally deepen the level of semantic analysis.
On the translation side, we have developed a new paradigm, bilingual language modeling, in which the stochastic source generates transductions rather than strings. We have introduced the stochastic inversion transduction grammar (SITG) formalism and applied it to automatic bracketing and word matching in bilingual texts, to improving Chinese segmentation, and to extracting Chinese-English phrasal translation examples. We are continuing to refine and develop these translation models, including a state-of-the-art statistical translation engine targeted at Chinese-English translation. Our SITG-based translation algorithm can replace the exponential A* search in current IBM models with a polynomial-time O(n^7) algorithm, thus rendering statistical translation models more efficient.
For language understanding in general, we are introducing new language-independent statistical methods for automatically extracting and learning lexical and grammatical patterns from large corpora. We are also extending and developing new models constructed through extensive manual linguistic analysis of Chinese and English. We are currently analyzing the dependency relations in Chinese with the goal of building a domain-specific Chinese dependency parser. These models will be combined so as to retain the advantages of each: statistical methods for their robustness and scalability, linguistic methods where they prove more accurate.
We are also designing new methods to cope with the unique problems encountered in translating and understanding spoken language, as opposed to written language. Some of the problems under investigation include: (1) Many words in spoken language do not exist in the written form (didja wanna). (2) Many constructions are unique to spoken language and are not found in formal written text (well uh, sounds good). (3) Spoken language includes frequent discontinuities (what ahh), stuttering (p- place), repetitions, restarts, and ungrammatical utterances. (4) The real-time demands are more pressing for spoken language translation than for written.
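A first-pass normalizer for phenomena (1)-(3) can be sketched as a few pattern rules; the rules below are purely illustrative (a small invented filler list and naive regexes), not the methods under development, which must preserve meaning rather than simply delete material.

```python
import re

# Invented, non-exhaustive filler list for illustration.
FILLERS = r"\b(?:uh|um|ahh|aah|well)\b"

def normalize_utterance(text):
    """Lightweight cleanup of spoken-language phenomena before translation.

    Removes filler words, stutter fragments like "p-", and collapses
    immediate word repetitions.
    """
    text = re.sub(FILLERS, " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\b\w+-\s+", "", text)              # stutters: "p- place"
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text)   # repetitions
    return re.sub(r"\s+", " ", text).strip(" ,")

print(normalize_utterance(
    "well uh, sounds good, what aah, which p- place didja wanna meet at?"))
```

Such surface rules handle none of problem (4), real-time operation, and can delete meaningful words ("well" as an adverb), which is precisely why deeper models are needed.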
These are some of the major directions being pursued by the HKUST Human Language Technology Center, which focuses on the development of language and speech technology with particular emphasis on Chinese language and information. Using a test-bed of representative tasks at varying levels of sophistication as both a guide and a means to evaluate our research, we are conducting research in areas including Cantonese corpus collection, robust recognition of telephone speech, acoustic modeling, language modeling, and translation and understanding. Having assembled a critical mass of researchers with complementary areas of expertise, we are working on integrating the latest research in speech recognition and natural language processing to achieve our goals.
The Hong Kong University of Science & Technology
HKUST, Clear Water Bay, Hong Kong
http://www.cs.ust.hk/~hltc
Last updated: 2003.09.08