Ending-based Strategies for Part-of-speech Tagging
Greg Adams, Beth Millar, Eric Neufeld, Tim Philip
Probabilistic approaches to part-of-speech tagging rely primarily on whole-word statistics about word/tag combinations as well as contextual information. However, experience shows that about 4 percent of the tokens encountered in test sets are unknown even when the training set is as large as a million words. Unseen words are typically tagged using secondary strategies that exploit word features such as endings, capitalization, and punctuation. In this work, word-ending statistics are primary and whole-word statistics are secondary. First, a tagger was trained and tested on word endings only. Subsequent experiments added back whole-word statistics for the words occurring most frequently in the training set. As the number of such words grew larger, performance was expected to improve, in the limit performing the same as word-based taggers. Surprisingly, the ending-based tagger initially performed nearly as well as the word-based tagger; in the best case, its performance significantly exceeded that of the word-based tagger. Lastly, and unexpectedly, an effect of negative returns was observed: as the number of whole-word statistics grew larger, performance generally improved and then declined. By varying factors such as ending length and tag-list strategy, we achieved a success rate of 97.5 percent.
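The strategy the abstract describes can be sketched in miniature: count tag frequencies per word ending as the primary statistic, keep whole-word tag counts only for the most frequent training words, and prefer the whole-word counts when available. This is an illustrative sketch only, not the authors' implementation; the toy corpus, the suffix length, and the cutoff `TOP_N` are assumptions, and the real experiments used a million-word training set and a full hidden-Markov-model tagger.

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs -- purely illustrative.
corpus = [
    ("running", "VBG"), ("jumping", "VBG"), ("morning", "NN"),
    ("quickly", "RB"), ("happily", "RB"),
    ("the", "DT"), ("the", "DT"), ("the", "DT"), ("cat", "NN"),
]

ENDING_LEN = 3   # suffix length (one of the factors the paper varies)
TOP_N = 2        # keep whole-word statistics only for the N most frequent words

word_freq = Counter(word for word, _ in corpus)
frequent = {word for word, _ in word_freq.most_common(TOP_N)}

ending_tags = defaultdict(Counter)   # suffix -> tag counts (primary statistic)
word_tags = defaultdict(Counter)     # word -> tag counts (secondary statistic)

for word, tag in corpus:
    ending_tags[word[-ENDING_LEN:]][tag] += 1
    if word in frequent:
        word_tags[word][tag] += 1

def best_tag(word):
    """Use whole-word counts for frequent words; otherwise fall back to the ending."""
    if word in word_tags:
        return word_tags[word].most_common(1)[0][0]
    counts = ending_tags.get(word[-ENDING_LEN:])
    return counts.most_common(1)[0][0] if counts else "NN"  # arbitrary default

print(best_tag("the"))      # frequent word: tagged from whole-word statistics
print(best_tag("walking"))  # unseen word: tagged from its "-ing" ending
```

A real tagger would combine these lexical probabilities with contextual tag-transition probabilities (e.g. via Viterbi decoding); the sketch isolates only the ending-versus-whole-word lookup that the experiments manipulate.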
Keywords: Probabilistic reasoning, natural language processing, hidden Markov models.
PS Link: ftp://skorpio.usask.ca/pub/eric/paper.ps.Z
PDF Link: /papers/94/p1-adams.pdf
AUTHOR = "Greg Adams and Beth Millar and Eric Neufeld and Tim Philip",
TITLE = "Ending-based Strategies for Part-of-speech Tagging",
BOOKTITLE = "Proceedings of the Tenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-94)",
PUBLISHER = "Morgan Kaufmann",
ADDRESS = "San Francisco, CA",
YEAR = "1994",
PAGES = "1--7"