In the long run, high-level language technologies include the development of methods of syntactic and semantic computer processing (parsing as well as generation) as a necessary prerequisite for development of industrial systems covering:
However, at the current stage of development, only grammar checkers constitute a feasible application domain for the commercial use of high-level language technology. This observation is based on the following five reasons:
As is well known, two main approaches may be identified in the development of high-level word-processing applications such as grammar and style checkers: a grammar-based approach and a pattern-matching approach. Although at the current stage of development no significant differences in performance have been observed between applications based on one approach or the other, we can foresee that, in the long run, applications developed from a grammar-based perspective will offer a much more reliable and efficient service to the user.
There are several reasons which support this view. First, grammar-based systems, usually conceived with a modular architecture, are much more easily extendible as linguistic research proceeds. Extendibility in this case may be carried out along two different dimensions:
Of course, this is a somewhat ideal description of the actual situation where some extra tuning of the different levels of linguistic knowledge will always be required, but this should be compared with the pattern matching approach which, in all likelihood, is committed to a complete redesign of the system at the moment of extending it in one or the other dimension.
Second, it is not at all obvious that the performance of a system which does not perform a minimum of grammatical analysis is really satisfactory. Consider the following examples, extracted from an actual grammar checking session over a technical document written in English by a non-native speaker. The software used was Microsoft's Word 5.0 (TM) word processor and the Correct Grammar (TM) grammar checker. As the reader will observe, most examples concern agreement.
For the sentence:
Our hypothesis will be supported by both diachronic and synchronic evidence from agreement phenomena
The grammar checker displayed the following error message:
The word both does not agree with evidence OR the word evidence does not agree with both
which indicates the failure to identify the word "both" as a member of the coordinated phrase that modifies the noun "evidence".
An interesting case that does not seem to have been triggered by coordination is the following:
I will take these considerations, and further evidence that we will discuss presently, to be sufficient for keeping the two agreement phenomena separate ...
which triggered the message:
The word evidence does not agree with separate OR Consider separates instead of separate
Note, first, that "evidence" and "separate" are not related at all, since the former is part of the parenthetical clause, while the latter belongs to the infinitival complement of the verb "take". However, the grammar checker takes "evidence" to be the subject of "separate", which it identifies as a verb rather than as an adjective, presumably because it considers all the material that follows "evidence" to be a relative clause. Note, in passing, that the putative relative clause is almost gibberish in English: "(evidence) that we will discuss presently, to be sufficient for keeping the two agreement phenomena". The only correct analysis would have been to take "separate" to be an adjective that modifies the NP "the two agreement phenomena", and a sensible stylistic suggestion would have been to reverse the order of NP and adjective, since the former is heavier than the latter (i.e. "for keeping separate the two agreement phenomena").
Relative clauses are another source of problems. Consider the sentence:
This view is, [...], against the widely accepted idea that agreement involves some covariation in the form of both the target and the controller of an agreement relation, which is, precisely, what makes it different from government.
which was followed by the message:
Consider are instead of is OR consider make instead of makes
Unless coordination is again responsible for the confusion, the motivation for this message is not clear. The antecedent of "which" is the (rather heavy) NP "the widely accepted idea that ...", which is third person singular; on the other hand, "what" has no antecedent, since it is part of the free relative clause that is the complement of "is" and has to be interpreted as the subject of "makes". Note that this is the only possible interpretation, because "what" in English cannot be used as a relative pronoun unless it is antecedentless.
Finally, let us give an example involving discourse/text structure:
Second, we have the problem of assigning a definiteness value to those verbal forms that are not ambiguous ...
Here, the error indication was:
Consider Seconds instead of Second OR consider is instead of are
In this case, the grammar checker fails to identify the word "second" as part of an enumeration, something very common in written text.
Thus, even for English, a language in which a fairly large number of relatively fixed agreement and word order patterns can be identified, a pattern-matching system fares rather poorly for the average word-processor user.
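The first example above can be reproduced with a minimal sketch of a window-based agreement rule. This is only an illustration under stated assumptions: the toy lexicons, the window size, and the function names (`tokenize`, `opens_coordination`, `check_agreement`) are all hypothetical, and nothing here is claimed to reflect how Correct Grammar actually works internally. The sketch contrasts a purely surface-based rule with the same rule given one small piece of structural knowledge, namely that "both" may open a coordination of adjectives.

```python
import re

# Toy lexicons (assumptions made for this illustration only).
PLURAL_DETERMINERS = {"both", "these", "those"}
SINGULAR_NOUNS = {"evidence", "idea", "problem"}
ADJECTIVES = {"diachronic", "synchronic"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def opens_coordination(tokens, i):
    """True if tokens[i] is 'both' followed by ADJ 'and' ADJ."""
    return (
        tokens[i] == "both"
        and i + 3 < len(tokens)
        and tokens[i + 1] in ADJECTIVES
        and tokens[i + 2] == "and"
        and tokens[i + 3] in ADJECTIVES
    )


def check_agreement(tokens, window=4, use_structure=False):
    """Flag a plural determiner followed within `window` tokens by a singular noun.

    With use_structure=True, 'both' is skipped when it opens a coordination,
    since it is then a coordination marker rather than a determiner.
    """
    messages = []
    for i, tok in enumerate(tokens):
        if tok not in PLURAL_DETERMINERS:
            continue
        if use_structure and opens_coordination(tokens, i):
            continue
        for nxt in tokens[i + 1 : i + 1 + window]:
            if nxt in SINGULAR_NOUNS:
                messages.append(f"The word {tok} does not agree with {nxt}")
                break
    return messages


sentence = ("Our hypothesis will be supported by both diachronic and "
            "synchronic evidence from agreement phenomena")
tokens = tokenize(sentence)
print(check_agreement(tokens))                      # surface rule: false alarm
print(check_agreement(tokens, use_structure=True))  # structure-aware: no message
```

The surface rule reports a spurious disagreement between "both" and "evidence", whereas the structure-aware variant stays silent on this sentence and still flags a genuine error such as "these evidence". A real grammar-based system would of course derive this from a parse rather than from a hand-written special case.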
This situation may become much more complex if one wishes to apply the same approach to other languages, such as those of the Germanic, Romance or Slavic families, which have less fixed word order patterns and more complex agreement systems. Consequently, a grammar checker for these languages requires a notable amount of applied linguistic research in order to supply the implementors with the appropriate descriptions and formalizations of the linguistic knowledge needed to attain minimally acceptable performance in the final product. This research may be subdivided into two main areas: descriptive and formal research. The former is aimed at providing accurate descriptions of the specific linguistic problems involved, whereas the latter is concerned with the expression of this descriptive knowledge within an appropriate grammar formalism, capable of capturing the relevant generalizations in a computationally tractable way.
Similarly, the specific nature of the product, which is supposed to work on real text, requires research in several areas of parsing technology aimed at the development of robust processing systems. There exists a rather wide range of possible techniques that can be applied here and that deserve consideration; among them, the precompilation of frequently encountered structures facilitates the detection of errors and, in turn, controls the application of further techniques such as constraint relaxation or specialized rules for ill-formed input.
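As a rough illustration of constraint relaxation, the following sketch parses a two-word noun phrase (determiner plus noun) strictly first and, if that fails, retries with one relaxable constraint switched off at a time; the constraint whose relaxation rescues the parse is reported as the diagnosis. The lexicon, the constraint inventory, and the names `parse_np` and `diagnose` are assumptions made for this example, not a description of any particular system.

```python
# Toy lexicon (assumption): word -> (category, number).
LEXICON = {
    "this": ("Det", "sg"),
    "these": ("Det", "pl"),
    "phenomenon": ("N", "sg"),
    "phenomena": ("N", "pl"),
}


def category_ok(det, noun):
    """Hard constraint: the phrase must be a determiner followed by a noun."""
    return det[0] == "Det" and noun[0] == "N"


def number_agrees(det, noun):
    """Soft constraint: determiner and noun must share the same number."""
    return det[1] == noun[1]


# (name, test, relaxable) triples; only relaxable constraints may be dropped.
CONSTRAINTS = [
    ("category", category_ok, False),
    ("number agreement", number_agrees, True),
]


def parse_np(words, relaxed=frozenset()):
    """True if `words` is a Det N phrase satisfying every non-relaxed constraint."""
    if len(words) != 2 or any(w not in LEXICON for w in words):
        return False
    det, noun = LEXICON[words[0]], LEXICON[words[1]]
    return all(test(det, noun)
               for name, test, _ in CONSTRAINTS
               if name not in relaxed)


def diagnose(words):
    """Try a strict parse; on failure, relax one soft constraint at a time."""
    if parse_np(words):
        return "ok"
    for name, _, relaxable in CONSTRAINTS:
        if relaxable and parse_np(words, relaxed=frozenset({name})):
            return f"error: {name} violated"
    return "no analysis"


print(diagnose(["these", "phenomena"]))  # well-formed input
print(diagnose(["this", "phenomena"]))   # agreement error found by relaxation
print(diagnose(["phenomena", "this"]))   # hard constraint fails: no analysis
```

The design choice worth noting is that the error message is derived from the grammar itself: the same constraint that licenses well-formed input identifies the error once it is relaxed, so no separate error pattern has to be written for each kind of mistake.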