Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling

March 9, 2006

  • Date: Thursday, March 9th, 2006 
  • Time: 11:00 am — 12:15 pm
  • Place: Woodward 149

Shaojun Wang (UNM Faculty Candidate), Alberta Ingenuity Center for Machine Learning, Department of Computer Science, University of Alberta

Language modeling, the task of accurately calculating the probability of naturally occurring word sequences in natural language, lies at the heart of some of the most exciting developments in computer science, such as speech recognition, machine translation, information retrieval and bioinformatics. In this talk, I present two approaches to statistical language modeling that simultaneously incorporate various aspects of natural language, such as local word interaction, syntactic structure and semantic document information.
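For concreteness, the probability a language model assigns to a word sequence can be written with the chain rule, and a trigram model approximates each factor by conditioning on only the two preceding words. The notation below (words $w_1, \dots, w_n$, with start-of-sentence padding for $i \le 2$) is an assumption introduced here for illustration and is not taken from the talk.

```latex
P(w_1, \dots, w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
```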

The first approach is based on a new machine learning technique we have proposed—the latent maximum entropy principle—which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which only allow observed features to be modeled. The ability to conveniently incorporate hidden variables allows us to extend the expressiveness of language models while alleviating the necessity of pre-processing the data to obtain explicitly observed features. We then use these techniques to combine two standard forms of language models: local lexical models (trigram models) and global document-level semantic models (probabilistic latent semantic analysis, PLSA).

The second approach aims to encode syntactic structure into a semantic trigram language model while keeping parameter estimation tractable. We propose a directed Markov random field (MRF) model that combines trigram models, probabilistic context-free grammars (PCFGs) and PLSA. The added context sensitivity due to the trigram and PLSA components, together with the violation of tree structure in the topology of the underlying random field, makes the inference and estimation problems appear intractable. However, an analysis of the behavior of the composite directed MRF model leads to a generalized inside-outside algorithm and thus to rigorous, exact EM-type re-estimation of the model parameters.

Our experimental results on the Wall Street Journal corpus show that both approaches yield significant reductions in perplexity over current state-of-the-art techniques.
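Perplexity, the evaluation measure mentioned above, is the exponential of the average negative log-probability a model assigns to each word of held-out text; lower is better. The following is a minimal sketch for a plain trigram model, not the models presented in the talk. The add-one smoothing, the `<s>` and `</s>` padding tokens, and all function names are assumptions chosen for illustration, and the sketch assumes every test word also occurs in the training data.

```python
import math
from collections import Counter

def pad(words):
    # Two start tokens give the first real word a full trigram context.
    return ["<s>", "<s>"] + list(words) + ["</s>"]

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for words in sentences:
        p = pad(words)
        vocab.update(p[2:])  # predicted tokens: real words plus </s>
        for i in range(2, len(p)):
            tri[(p[i - 2], p[i - 1], p[i])] += 1
            ctx[(p[i - 2], p[i - 1])] += 1
    return tri, ctx, vocab

def trigram_prob(word, context, tri, ctx, vocab):
    """Add-one smoothed estimate of P(word | context)."""
    return (tri[context + (word,)] + 1) / (ctx[context] + len(vocab))

def perplexity(sentences, tri, ctx, vocab):
    """exp of the average negative log-probability per predicted token."""
    log_prob, n = 0.0, 0
    for words in sentences:
        p = pad(words)
        for i in range(2, len(p)):
            prob = trigram_prob(p[i], (p[i - 2], p[i - 1]), tri, ctx, vocab)
            log_prob += math.log(prob)
            n += 1
    return math.exp(-log_prob / n)

# Toy data, purely illustrative.
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
tri, ctx, vocab = train_trigram(train)
pp_seen = perplexity([["the", "cat", "sat"]], tri, ctx, vocab)
pp_unseen = perplexity([["sat", "the", "cat"]], tri, ctx, vocab)
```

On this toy data, a sentence whose trigrams were seen in training receives a lower perplexity than a reordering of the same words, which is the sense in which a reduction in perplexity indicates a better model of the text.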