University of Saarland
Fachschaft Computerlinguistik
Proceedings 5. TaCoS

Grammar-Based Grammar Checker
A feasible practical implementation
of high-level language technologies

Karel Oliva · University of Saarland
FB 8.7, Computational Linguistics · D66041 Saarbrücken
E-mail: olivaNOSPAM@coli.uni-sb.de (remove NOSPAM to obtain the correct address)

In the long run, high-level language technologies include the development of methods of syntactic and semantic computer processing (parsing as well as generation) as a necessary prerequisite for development of industrial systems covering:

Grammar-checkers
Grammar-based word-processing: a grammar-based editor capable of, e.g., replacing all occurrences of a given noun or nominal phrase in all its morphological forms (cases) with another nominal phrase in the appropriate form, including any necessary changes in grammatical agreement
Means for style checking (e.g., occurrence of excessively long phrases, lexical repetitions) and testing text coherence (e.g., fluency of theme/rheme articulation between any two following sentences) of documents written in a natural language
Natural language interfaces to different types of software systems
Natural language based information retrieval
Automatic abstracting
Possibly other aids for application areas such as computer-driven or computer-assisted machine translation, computer-assisted language teaching and learning, etc.
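The grammar-based replacement mentioned in the list above can be illustrated with a minimal sketch. It assumes a hand-written toy paradigm table for two German nouns (Hund, Katze) with their definite articles in the singular; the table, the function name replace_noun_phrase and the restriction to "article + noun" phrases are illustrative assumptions, not part of any existing system. A real editor would rely on a full morphological lexicon and a parser to determine case and agreement.

```python
# Toy illustration only: a hand-written paradigm table, not a real lexicon.
# Each entry maps a grammatical case to (definite article, noun form), singular.
PARADIGMS = {
    "Hund": {  # masculine
        "nom": ("der", "Hund"),
        "gen": ("des", "Hundes"),
        "dat": ("dem", "Hund"),
        "acc": ("den", "Hund"),
    },
    "Katze": {  # feminine
        "nom": ("die", "Katze"),
        "gen": ("der", "Katze"),
        "dat": ("der", "Katze"),
        "acc": ("die", "Katze"),
    },
}


def replace_noun_phrase(tokens, old, new):
    """Replace each 'article + noun' occurrence of `old` by `new` in the same case."""
    # Reverse index for the old lemma: (article, noun form) -> case.
    case_of = {form: case for case, form in PARADIGMS[old].items()}
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        key = (pair[0].lower(), pair[1]) if len(pair) == 2 else None
        if key in case_of:
            article, noun = PARADIGMS[new][case_of[key]]
            # Keep sentence-initial capitalisation of the article.
            if pair[0][0].isupper():
                article = article.capitalize()
            out.extend([article, noun])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


sentence = "Der Besitzer des Hundes gibt dem Hund Wasser".split()
print(" ".join(replace_noun_phrase(sentence, "Hund", "Katze")))
# prints: Der Besitzer der Katze gibt der Katze Wasser
```

Note that the article changes together with the noun (des Hundes becomes der Katze), which is the kind of agreement adjustment a plain string replacement cannot provide.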

However, at the current stage of development, only grammar checkers constitute a feasible application domain for the commercial use of high-level language technology. This observation is based on the following five reasons:

A useful application does not require a degree of reliability that cannot be guaranteed at the current state of the art. There is no program today that can derive a complete and correct syntactic analysis for every sentence it may be confronted with. As long as a grammar checker is robust in the sense that it does not break down if a full analysis cannot be obtained, it can be useful, in particular if it is able to detect potential errors in partially analyzed sentences.
A grammar checker does not require linguistic competence in semantics and world knowledge in order to be useful. Many other prospective linguistic applications such as text understanding or abstracting systems have not yet made it into useful products because they would need better semantic capabilities and processing of world knowledge before they could satisfy users' needs. The limitations of current language technology in the areas of semantics and world knowledge turn out to be a true bottleneck for many application domains. Clearly, grammar checker technology will also benefit immensely from progress in computational semantics, but even the detection of errors on purely syntactic grounds is a very useful functionality.
Since the usefulness of a grammar checker grows as the number of detected error types increases and the number of "false alarms" decreases, this product type can be improved gradually from version to version.
The application does not require language generation capabilities. Parsing technology is much better developed than generation.
Researchers do not need to convince industry of the feasibility and market chances of the application, since there are already some well-selling products. Altogether, word processing technology is the software application domain with the most immediate growth potential.

As is well-known, two main approaches may be identified in connection with the development of high-level word-processing applications such as grammar and style checkers, namely, a grammar-based approach and a pattern-matching-based approach. Although in the current state of development no significant differences with respect to performance of the applications based on one or the other approach have been observed, we can foresee that, in the long run, applications developed with a grammar-based perspective will offer a much more reliable and efficient service to the user.

There are several reasons which support this view. First, grammar-based systems, usually conceived with a modular architecture, are much more easily extendible as linguistic research proceeds. Extendibility in this case may be carried out along two different dimensions:

An intralanguage dimension, understood as extension and refinement of the coverage of an already existing system for some specific language to more complex grammatical phenomena and even extragrammatical phenomena (e. g., discourse/text structure).
An interlanguage dimension. Accepting one of the major assumptions of modern theoretical linguistics, namely that all human languages share a common basis of universal rules and principles, extending a general grammar-based system designed for some specific natural language to other natural languages should involve just the addition of (previously described and formalized) language-particular rules and principles to the original common basis.

Of course, this is a somewhat ideal description of the actual situation where some extra tuning of the different levels of linguistic knowledge will always be required, but this should be compared with the pattern matching approach which, in all likelihood, is committed to a complete redesign of the system at the moment of extending it in one or the other dimension.

Second, it is not at all obvious that the performance of a system which does not perform a minimum of grammatical analysis is really satisfactory. Consider the following examples, extracted from an actual grammar checking session over a technical document written in English by a non-native speaker. The software used was Microsoft's Word 5.0 (TM) word processor and the Correct Grammar (TM) grammar checker. As the reader will observe, most examples concern agreement.

For the sentence:

Our hypothesis will be supported by both diachronic and synchronic evidence from agreement phenomena

The grammar checker displayed the following error message:

The word both does not agree with evidence OR the word evidence does not agree with both

which indicates the failure to identify the word both as the member of the coordinated phrase that modifies the noun evidence.
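The failure mode just described can be reproduced with a small sketch. The internals of the commercial checker are not known; the code below merely assumes a surface rule of the form "both must be followed by a plural noun" that looks for the nearest noun, and contrasts it with a variant that first recognizes the correlative coordination both ... and. The toy lexicon NOUN_NUMBER and all function names are hypothetical, and the three-token window is a crude stand-in for a real syntactic analysis.

```python
# Hypothetical toy lexicon; a real checker would use a full dictionary.
NOUN_NUMBER = {
    "evidence": "sg",
    "hypothesis": "sg",
    "book": "sg",
    "phenomena": "pl",
}


def nearest_noun_alarm(tokens, i):
    """Return an alarm if the first known noun after position i is not plural."""
    for later in tokens[i + 1:]:
        number = NOUN_NUMBER.get(later.lower())
        if number is not None:
            if number == "pl":
                return None
            return f"The word both does not agree with {later}"
    return None


def naive_check(tokens):
    """Surface pattern: every 'both' is checked against the nearest noun."""
    alarms = []
    for i, tok in enumerate(tokens):
        if tok.lower() == "both":
            alarm = nearest_noun_alarm(tokens, i)
            if alarm:
                alarms.append(alarm)
    return alarms


def coordination_aware_check(tokens):
    """Same rule, but skipped when 'both' opens a 'both X and Y' coordination."""
    alarms = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "both":
            continue
        # Crude stand-in for a parse: 'and' within the next three tokens.
        if "and" in (w.lower() for w in tokens[i + 1:i + 4]):
            continue
        alarm = nearest_noun_alarm(tokens, i)
        if alarm:
            alarms.append(alarm)
    return alarms


sentence = ("Our hypothesis will be supported by both diachronic and "
            "synchronic evidence from agreement phenomena").split()
print(naive_check(sentence))               # one false alarm
print(coordination_aware_check(sentence))  # no alarm
```

The structure-aware variant still reports a genuine error such as "We read both book", since no coordination follows both there.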

An interesting case that does not seem to have been triggered by coordination is the following:

I will take these considerations, and further evidence that we will discuss presently, to be sufficient for keeping the two agreement phenomena separate ...

which triggered the message:

The word evidence does not agree with separate OR Consider separates instead of separate

Note, first, that evidence and separate are not related at all, since the former is part of the parenthetical clause, while the latter belongs to the infinitival complement of the verb take. However, the grammar checker takes evidence to be the subject of separate, which is identified as a verb instead of as an adjective, presumably because it considers all the material that follows evidence to be a relative clause. Note, in passing, that the putative relative clause is almost gibberish in English: (evidence) that we will discuss presently, to be sufficient for keeping the two agreement phenomena. The only correct analysis would have been to take separate to be an adjective that modifies the NP the two agreement phenomena, and a sensible stylistic suggestion would have been to reverse the order of NP and adjective, since the former is heavier than the latter (i.e., for keeping separate the two agreement phenomena).

Relative clauses are another source of problems. Consider the sentence:

This view is, [...], against the widely accepted idea that agreement involves some covariation in the form of both the target and the controller of an agreement relation, which is, precisely, what makes it different from government.

which was followed by the message:

Consider are instead of is OR consider make instead of makes

Unless coordination is again responsible for the confusion, the motivation for this message is not clear. The antecedent of which is the (rather heavy) NP the widely accepted idea that ..., which is third person singular; on the other hand, what has no antecedent, since it is part of the free relative clause that is the complement of is and has to be interpreted as the subject of makes. Note that this is the only possible interpretation, because what in English cannot be used as a relative pronoun unless it is antecedentless.

Finally, let us give an example involving discourse/text structure:

Second, we have the problem of assigning a definiteness value to those verbal forms that are not ambiguous ...

Here, the error indication was:

Consider Seconds instead of Second OR consider is instead of are

Here the grammar checker fails to identify the word second as part of an enumeration, something very common in written text.

Thus, even for English, a language where a fairly large amount of relatively fixed agreement and word order patterns can be identified, a pattern matching system fares rather poorly for the average word-processor user.

This situation may become much more complex if one wishes to apply the same approach to other languages, such as those of the Germanic, Romance or Slavic families, with less fixed word order patterns and more complex agreement systems. Consequently, for a grammar checker for these languages, a notable amount of applied linguistic research is necessary in order to supply the implementors with the appropriate descriptions and formalizations of the linguistic knowledge required to attain a minimum of acceptable results in the performance of the final product. This research may be subdivided into two main areas: descriptive and formal research. The former is aimed at providing accurate descriptions of the specific linguistic problems involved, whereas the latter is concerned with the expression of this descriptive knowledge within an appropriate grammar formalism, capable of capturing the relevant generalizations in a computationally tractable way.

Similarly, the specific nature of the product, which is supposed to work on real text, requires research in several areas of parsing technology aimed at the development of robust processing systems. There exists a rather wide range of possible techniques that can be applied here and that deserve consideration; among them, the precompilation of frequently encountered structures facilitates the detection of errors and, in turn, controls the application of further techniques such as constraint relaxation or specialized rules for ill-formed input.
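Constraint relaxation, one of the techniques named above, can be sketched as follows. This is a minimal illustration under strong assumptions: a single sentence pattern (determiner, noun, verb), a hypothetical toy lexicon, and two named agreement constraints. The input is first analyzed with all constraints active; if that fails, each constraint is relaxed in turn, and a constraint whose relaxation makes the analysis succeed is reported as the likely error. All names (LEXICON, CONSTRAINTS, parse, diagnose) are invented for this example.

```python
# Hypothetical toy lexicon: word -> (category, number); None = unmarked.
LEXICON = {
    "the": ("Det", None),
    "this": ("Det", "sg"),
    "these": ("Det", "pl"),
    "checker": ("N", "sg"),
    "checkers": ("N", "pl"),
    "fails": ("V", "sg"),
    "fail": ("V", "pl"),
}


def agree(a, b):
    """Two number values agree if they are equal or one is unmarked."""
    return a is None or b is None or a == b


# Named constraints over a (determiner, noun, verb) analysis.
CONSTRAINTS = {
    "determiner-noun agreement": lambda det, noun, verb: agree(det[1], noun[1]),
    "subject-verb agreement": lambda det, noun, verb: agree(noun[1], verb[1]),
}


def parse(tokens, relaxed=frozenset()):
    """True if tokens form Det N V and satisfy all non-relaxed constraints."""
    entries = [LEXICON.get(t.lower()) for t in tokens]
    if None in entries or [e[0] for e in entries] != ["Det", "N", "V"]:
        return False
    return all(check(*entries)
               for name, check in CONSTRAINTS.items()
               if name not in relaxed)


def diagnose(tokens):
    """Return [] for well-formed input, else the constraints found violated."""
    if parse(tokens):
        return []
    # Relax one constraint at a time; success pinpoints the violated one.
    violated = [name for name in CONSTRAINTS if parse(tokens, relaxed={name})]
    return violated or ["no analysis, even with relaxation"]


print(diagnose("the checkers fails".split()))
print(diagnose("this checkers fail".split()))
print(diagnose("the checker fails".split()))
```

Relaxing only one constraint at a time keeps the sketch short; an input that violates two constraints at once (e.g. "this checkers fails") is reported as having no analysis, which a fuller system would handle by relaxing combinations of constraints or by falling back on specialized rules for ill-formed input.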
