There is a marked difference in focus between the standard practice of theoretical linguistics and the practical needs of the developers of applied Natural Language Processing (NLP) systems. On the one hand, theoretical linguists are primarily interested in a fundamental understanding of language, in particular how it functions and how humans acquire it. From this perspective, they usually carry out in-depth studies of specific linguistic data, sometimes fairly marginal or belonging to dialectal varieties of major languages. On the other hand, NLP system developers need comprehensive descriptions, in terms of both syntax and vocabulary, of the languages to be processed. Their main linguistic task is therefore to select a particular linguistic model and to complete, within that model, the description of the language(s) needed by the targeted applications.
As speech is one of the most natural channels humans use to communicate with each other, automated computer-based language processing cannot be dissociated from the analysis and efficient processing of speech signals, a very specific type of signal produced by our vocal organs and processed by our auditory system.
Linguistic models for parsing and generation
There are several criteria one can use to select an appropriate linguistic model for the development of natural language parsers or generators. Leaving aside small-scale and/or domain-specific applications, for which developers often use homemade linguistic models, the vast majority of sizeable parsers and generators use variants of generative grammar, such as HPSG (Head-driven Phrase Structure Grammar), LFG (Lexical Functional Grammar) or GB (Government and Binding), or variants of DG (Dependency Grammar). All these formalisms are powerful enough to describe the syntactic features of natural languages and sufficiently formalized to be used (usually with some adaptations) as models in computational developments. In practice, therefore, the choice of a particular model is often a matter of personal preference, or of acquaintance with one particular formalism, rather than of a strict evaluation of the comparative merits of competing models.
Theoretical linguistics vs. applied natural language processing (NLP)
It is important to stress the difference in focus between theoretical linguistics on the one hand and the more practical linguistic needs of NLP on the other. As we pointed out in the introduction to Chapter 1, the primary interest of theoretical linguistics ultimately lies in understanding what language is, how it functions and how humans acquire it. The pursuit of these goals requires in-depth investigations of specific linguistic data, which are sometimes fairly marginal or belong to dialectal varieties of major languages. Over the last 20 years, comparative studies of specific linguistic data have also been quite popular among theoretical linguists. By contrast, developers of NLP systems (parsers and/or generators) need comprehensive descriptions of a particular language, both in terms of its syntax (descriptions of all the major constructions, of all the language-specific constructions, etc.) and of its lexicon (detailed descriptions of all the function words – prepositions, conjunctions, determiners, etc. – as well as the selectional properties of verbs, adjectives and nouns, etc.). The fact that modern theoretical linguistics does not provide such broad and consistent descriptions of languages is a source of frustration for computational linguists, and may explain, at least in part, why so many of them have designed their own linguistic model.
While it is true that modern theoretical linguistics books do not offer comprehensive descriptions of languages, they nevertheless offer indispensable tools, insights and partial descriptions that computational linguists cannot ignore if they intend to develop well-defined, comprehensive systems.
To summarize, there is a clash between the standard practice of theoretical linguistics and the needs of applied NLP (depth vs. breadth). The task of a computational linguist is to select a particular linguistic model and to complete, within that model, the description of the language(s) needed by the application. This is certainly a labor-intensive task, but a necessary one for everyone who wants to develop a large-scale parser or generator.
The role of a syntactic parser
In the previous section, we pointed out
(i) that there is no off-the-shelf comprehensive description of a language that a computational linguist could directly use for a large-scale application, and
(ii) that the task of producing such a description is quite labor-intensive (and costly).
To justify such an enterprise, we must show its relevance. In what follows, we assume as background context the development of a large-scale parser, for instance one to be used in a translation system. Note that very similar arguments could be made for the generation task.
The role of a syntactic parser in a large-scale NLP application can be described as follows:
identification of lexical units (including part-of-speech disambiguation);
identification of phrases;
determination of syntactic relations between phrases…
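The three stages listed above can be illustrated with a deliberately minimal sketch. The lexicon, disambiguation rule, noun-phrase pattern and example sentence below are all invented for illustration; a real large-scale parser would of course rely on a broad-coverage grammar and lexicon of the kind discussed in the previous section.

```python
# Toy illustration of the three parser stages: lexical identification
# (with POS disambiguation), phrase identification, and determination
# of syntactic relations. All rules and data are invented for this sketch.

# Stage 1: identification of lexical units, with naive POS disambiguation.
LEXICON = {
    "the": ["DET"],
    "loud": ["ADJ"],
    "dog": ["NOUN"],
    "barks": ["VERB", "NOUN"],  # ambiguous: 3sg verb or plural noun
}

def tag(tokens):
    """Assign one POS per token; a word that can be VERB or NOUN is
    taken as a VERB when the preceding token was tagged NOUN."""
    tagged = []
    for tok in tokens:
        options = LEXICON[tok]
        if len(options) > 1 and tagged and tagged[-1][1] == "NOUN":
            tagged.append((tok, "VERB"))
        else:
            tagged.append((tok, options[0]))
    return tagged

# Stage 2: identification of phrases (a minimal NP chunker: DET (ADJ)* NOUN).
def chunk_nps(tagged):
    chunks, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "DET":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "ADJ":
                j += 1
            if j < len(tagged) and tagged[j][1] == "NOUN":
                chunks.append(("NP", tagged[i:j + 1]))
                i = j + 1
                continue
        chunks.append((tagged[i][1], [tagged[i]]))
        i += 1
    return chunks

# Stage 3: determination of syntactic relations (naively: the NP
# immediately preceding a verb is taken as its subject).
def relations(chunks):
    rels = []
    for k, (label, toks) in enumerate(chunks):
        if label == "VERB" and k > 0 and chunks[k - 1][0] == "NP":
            rels.append(("subject", chunks[k - 1][1], toks[0]))
    return rels

tokens = ["the", "loud", "dog", "barks"]
tagged = tag(tokens)        # "barks" is disambiguated as a VERB
chunks = chunk_nps(tagged)  # [the loud dog] grouped into one NP
rels = relations(chunks)    # the NP is the subject of "barks"
```

The point of the sketch is not the (hopelessly crude) rules themselves, but the division of labor: each stage consumes the output of the previous one, which is exactly why gaps in lexical or syntactic coverage propagate through the whole pipeline.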