Treebank - What Is The Purpose of A Treebank?

Treebanks can be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for training or testing parsers. Diachronic corpora can be used to study the time course of syntactic change.

The value of parsed corpora is increasingly widely understood. Introspective data has been crucial to syntactic research because introspection provides evidence not only of what is possible in a given language but also of what is not possible. Such negative evidence is, of course, not available in corpora of actual writing or speech. On the other hand, introspection about grammar is itself inevitably partial, as linguists have found when attempting to parse actual speech and writing, and it tells us relatively little about the information structure of sentences, that is, the discourse contexts in which given syntactic constructions are licensed.

Once parsed, a corpus will contain evidence of both frequency (how common different grammatical structures are in use) and coverage (the discovery of new, unanticipated, grammatical phenomena).

An automatically parsed corpus that has not been corrected by human linguists is still useful. It can provide evidence of rule frequency for a parser: the parser may be improved by applying it to large amounts of text and gathering rule frequencies. However, only by correcting and completing a corpus by hand is it possible to identify rules absent from the parser's knowledge base. (As a bonus, the frequencies are likely to be more accurate.)
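
The two kinds of evidence described above, rule frequency and coverage, can be illustrated with a minimal sketch. It assumes trees are stored as Penn-style bracketed strings; the three toy sentences and the `parser_rules` set (standing in for a parser's knowledge base) are invented for illustration, and the small reader below is a hand-written helper, not the API of any particular toolkit.

```python
from collections import Counter


def parse_tree(text):
    """Read one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def productions(tree):
    """Yield (parent, child labels) for every node that has constituent children."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:  # nodes dominating only a word (e.g. DT -> the) are skipped
        yield (label, tuple(k[0] for k in kids))
        for k in kids:
            yield from productions(k)


def rule_counts(trees):
    """Count how often each rule occurs across a list of bracketed trees."""
    counts = Counter()
    for text in trees:
        counts.update(productions(parse_tree(text)))
    return counts


# Hypothetical hand-corrected treebank (toy data).
corrected = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD saw) (NP (DT a) (NN bird))))",
    "(S (NP (NNP Kim)) (VP (VBD gave) (NP (NNP Lee)) (NP (DT a) (NN book))))",
]

# Hypothetical rule set known to the parser before correction.
parser_rules = {
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("NP", ("NNP",)),
    ("VP", ("VBD",)),
    ("VP", ("VBD", "NP")),
}

counts = rule_counts(corrected)                 # frequency evidence
missing_rules = set(counts) - parser_rules      # coverage evidence

for (parent, kids), n in counts.most_common():
    print(f"{n:3d}  {parent} -> {' '.join(kids)}")
print("Rules absent from the parser:", missing_rules)
```

In this toy example the ditransitive rule `VP -> VBD NP NP` appears only because the third tree was supplied by hand; a parser restricted to `parser_rules` could not have produced it, which is the sense in which manual correction exposes gaps in coverage.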

Potentially, however, by far the most interesting use of parsed corpora for theoretical linguists and psycholinguists is as a source of interaction evidence. A completed treebank lets linguists carry out experiments on how the decision to use one grammatical construction tends to influence the decision to form others. The idea here is not to improve parsing algorithms but to go to the heart of the question of linguistic choice: to try to understand how speakers and writers make decisions as they form sentences.
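
One simple way to look for interaction evidence of this kind is to cross-tabulate two construction choices and compare conditional rates. The sketch below assumes each clause in a treebank has already been reduced to a pair of flags, (uses construction A, uses construction B); the observations are invented for illustration, and the Pearson chi-square statistic is only one of several measures an analyst might choose.

```python
from collections import Counter

# Hypothetical per-clause observations extracted from a treebank:
# (clause uses construction A, clause uses construction B)
observations = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]


def contingency(obs):
    """Return the 2x2 cell counts (a, b, c, d) for A&B, A&~B, ~A&B, ~A&~B."""
    table = Counter(obs)
    return (table[(True, True)], table[(True, False)],
            table[(False, True)], table[(False, False)])


def conditional_rates(obs):
    """Return (P(B | A), P(B | not A)); a rate is None if its condition never occurs."""
    a, b, c, d = contingency(obs)
    p_b_given_a = a / (a + b) if (a + b) else None
    p_b_given_not_a = c / (c + d) if (c + d) else None
    return p_b_given_a, p_b_given_not_a


def chi_square_2x2(obs):
    """Pearson chi-square statistic for a 2x2 table, without continuity correction."""
    a, b, c, d = contingency(obs)
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return None  # a whole row or column is empty, so the statistic is undefined
    return n * (a * d - b * c) ** 2 / denominator


rates = conditional_rates(observations)
chi2 = chi_square_2x2(observations)
print("P(B | A) =", rates[0], " P(B | not A) =", rates[1])
print("chi-square =", chi2)
```

A large gap between the two conditional rates suggests that choosing A is associated with choosing B, but with counts this small the statistic is far from reliable; real studies would use many more clauses and check that each observation is independent.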

Interaction research is particularly fruitful as further layers of annotation (e.g. semantic or pragmatic) are added to a corpus. It is then possible to evaluate the impact of 'non-syntactic' phenomena on grammatical choices.

The parsing and exploitation of parsed corpora have become an important subdiscipline of corpus linguistics since the first large-scale treebank, the Penn Treebank, was published. Many of the theoretical criticisms of lexical corpora do not apply to parsed corpora. Results from a parsed corpus are more closely commensurate with linguistic theories. However, a new epistemological problem arises: a parsed corpus necessarily requires a particular analysis, and this analysis, and the theory behind it, may be incorrect or deficient.
