Introduction to Arabic Computational Linguistics: Modern Standard Arabic and Arabic Dialects

Mona Diab and Nizar Habash
Columbia University

The tutorial introduces the challenges of automatic language processing that arise when working with Arabic and its dialects. The first half of the tutorial focuses on Modern Standard Arabic, and the second half discusses the challenges posed by Arabic dialects. In both parts we present an introduction to issues in phonology, morphology and syntax and their computational implications. The tutorial endnotes include an extensive bibliography and a guide to resources in Arabic NLP.

This tutorial is designed for computer scientists and linguists alike. It will provide NLP system developers and researchers with the background information necessary for working with Arabic and its dialects, which have recently become the focus of an increasing number of projects in computational linguistics.

Attendees are NOT expected to know Arabic. Previous versions of the tutorial were given at AMTA 2004, ACL 2005, AMTA 2006, LREC 2006, NAACL 2007, AMTA 2008 and MEDAR 2009; see http://www1.ccls.columbia.edu/~cadim/presentations.html

Short Bios of Speakers

Mona Diab received her PhD in 2003 from the Linguistics Department and UMIACS, University of Maryland College Park. Her PhD work focused on lexical semantic issues and was titled Word Sense Disambiguation within a Multilingual Framework. Mona is currently an associate research scientist at the Center for Computational Learning Systems, Columbia University. Her research includes work on word sense disambiguation, automatic acquisition of natural language resources such as dictionaries and taxonomies, unsupervised learning methods, lexical semantics, cross-language knowledge induction from both parallel and comparable corpora, Arabic NLP in general, tools for processing Arabic(s), computational modeling of Arabic dialects, and Arabic syntactic and semantic parsing.

Dr. Diab served as co-chair, together with Kareem Darwish and Nizar Habash, of the Workshop on Computational Approaches to Semitic Languages (ACL 2005). She was also a senior member of the 2005 JHU summer workshop on Parsing Arabic Dialects. In 2005, she co-founded the Columbia Arabic Dialect Modeling (CADIM) group together with Nizar Habash and Owen Rambow. She has published over 45 articles in conferences, journals and workshops. Mona has presented her work in numerous lectures and tutorials for both academic and industrial audiences.

Mona’s website: http://www.cs.columbia.edu/~mdiab

Nizar Habash received his PhD in 2003 from the Computer Science Department, University of Maryland College Park. His PhD thesis is titled Generation-Heavy Hybrid Machine Translation. He is currently an associate research scientist at the Center for Computational Learning Systems, Columbia University. His research includes work on machine translation, natural language generation, lexical semantics, morphological analysis, generation and disambiguation, computational modeling of Arabic dialects, and Arabic dialect parsing.

Dr. Habash served as co-chair of the Workshop on Computational Approaches to Semitic Languages (ACL 2005) and of the Workshop on Machine Translation for Semitic Languages (MT Summit 2003). In 2005, he co-founded the Columbia Arabic Dialect Modeling (CADIM) group. He is the vice-president of the Semitic Language Special Interest Group in the Association for Computational Linguistics. He also served as research program co-chair for AMTA 2006.

Dr. Habash has published numerous papers in international conferences and journals and has given many lectures and tutorials for academic and industrial audiences.

Nizar’s website: http://www.nizarhabash.com/

CADIM website: http://www1.ccls.columbia.edu/~cadim/