Corpora

www.americancorpus.org (Corpus of American English)

Definitions

First computerized in 1971

[http://iteslj.org/Articles/Krieger-Corpus.html]

A corpus consists of a databank of natural texts, compiled from writing and/or transcriptions of recorded speech. A concordancer is a software program which analyzes corpora and lists the results. The main focus of corpus linguistics is to discover patterns of authentic language use through analysis of actual usage. Corpus linguistics’ only concern is the usage patterns of the empirical data and what those patterns reveal about language behavior.
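The core operation of a concordancer can be sketched in a few lines: a keyword-in-context (KWIC) lister that shows every occurrence of a word with its surrounding context. The toy corpus and window size here are illustrative, not taken from any tool mentioned in these notes:

```python
import re

def kwic(text, keyword, width=30):
    """List every occurrence of `keyword` with `width` characters
    of context on each side (a minimal KWIC concordance)."""
    lines = []
    for m in re.finditer(r'\b%s\b' % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append('%s [%s] %s' % (left.rjust(width), m.group(), right.ljust(width)))
    return lines

corpus = "Bread and butter. She asked for bread or butter, and got bread with butter."
for line in kwic(corpus, "bread"):
    print(line)
```

Real concordancers add sorting on the left or right context, frequency counts, and register metadata, but the aligned keyword-in-context display above is the heart of the tool.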

One frequently overlooked aspect of language use which is difficult to keep track of without corpus analysis is register… Corpus analysis reveals that language often behaves differently according to the register, each with some unique patterns and rules.

Corpus linguistics provides a more objective view of language than that of introspection, intuition and anecdotes… an investigator can discover not only the patterns of language use, but the extent to which they are used, and the contextual factors that influence variability.

Syllabus Design: By conducting an analysis of a corpus which is relevant to the purpose of a particular class, the teacher can determine what language items are linked to the target register.

Materials Development: With the help of a corpus, a materials developer could create exercises based on real examples which provide students with an opportunity to discover features of language use.

Classroom Activities: The teacher can guide a predetermined investigation which will lead to predictable results or can have the students do it on their own, leading to less predictable findings. This exemplifies data driven learning, which encourages learner autonomy by training students to draw their own conclusions about language use.

[I forget this source]

Written Language Corpora, collections of text in electronic form, are being collected for research and commercial applications in natural language processing (NLP). Written Language Corpora have been used to improve spelling correctors, hyphenation routines and grammar checkers, which are being integrated into commercial word-processing packages. Lexicographers have used corpora to study word use and to associate uses with meanings. Statistical methods have been used to find interesting associations among words (collocations). Language teachers are now using on-line corpora in the classroom to help learners distinguish central and typical uses of words from mannered, poetic, and erroneous uses. Terminologists are using corpora to build glossaries to assure consistent and correct translations.

Tony McEnery & Andrew Wilson

In principle, any collection of more than one text can be called a corpus (corpus being Latin for “body”; hence a corpus is any body of text). But the term “corpus”, when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

  • Sampling and representativeness
  • Finite size
  • Machine-readable form
  • A standard reference

Main uses of corpora

  • Reference Book Publishing: dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it’s important to know how well their corpora are planned and constructed.
  • Linguistic Research: raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics…
  • Artificial Intelligence: an extensive test bed of data for program development.
  • Natural Language Processing: taggers, parsers, natural language understanding programs, spell-checking word lists…
  • English Language Teaching: syllabus and materials design, classroom reference, independent learner research.

How many? Same or different?

  • ICAME is an international organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions. http://icame.uib.no
  • London-Lund Corpus (couldn’t get on it): The goal of the Survey of English Usage is to provide the resources for accurate descriptions of the grammar of adult educated speakers of English. For that purpose the major activity of the Survey has been the assembly and analysis of a corpus comprising samples of different types of spoken and written British English. The original target for the corpus of one million words has now been reached, and the corpus is therefore complete. As the name implies, the London-Lund Corpus of Spoken English (LLC) derives from two projects. The first is the Survey of English Usage (SEU) at University College London, launched in 1959 by Randolph Quirk, who was succeeded as Director in 1983 by Sidney Greenbaum. The second project is the Survey of Spoken English (SSE), which was started by Jan Svartvik at Lund University in 1975 as a sister project of the London Survey.
  • The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development. When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an “opportunistic” component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.
  • TIME Magazine Corpus: This website allows you to quickly and easily search more than 100 million words of American English text from 1923 to the present, as found in TIME magazine. You can see how words, phrases and grammatical constructions have increased or decreased in frequency, and how words have changed meaning over time.
  • The SEU corpus contains 200 samples or ‘texts’, each consisting of 5000 words, for a total of one million words. The texts were collected over the last 30 years, half taken from spoken English and half from written English. The spoken English texts comprise both dialogue and monologue. The written English texts include not only printed and manuscript material but also examples of English read aloud, as in broadcast news and scripted speeches. The range of varieties assembled in the whole corpus is displayed in Figure 1:1.)
  • ICE-GB – i.e. the British Component of the International Corpus of English.
  • British National Corpus (a huge corpus of 100 million words; http://www.natcorp.ox.ac.uk/). The BNC is a collection of samples of real-life language, chosen to be as varied as possible in its coverage. It includes speech as well as a wide variety of different kinds of written language, all chosen from the same time. The written part makes up 90% of the BNC, the spoken part 10%. Work on building the corpus began in 1991 and was completed in 1994.

Monolingual: It deals with modern British English, not other languages used in Britain. However, non-British English and foreign-language words do occur in the corpus.

Synchronic: It covers British English of the late twentieth century, rather than the historical development which produced it.

It includes many different styles and varieties, and is not limited to any particular subject field, genre or register. In particular, it contains examples of both spoken and written language.

The BNC Simple Search is a quick and simple way to search the full BNC for a word or a phrase. More complex searches can also be performed.

The result of a search is displayed as a list of up to 50 randomly selected instances headed by a note of the total frequency of the search string. A new search for the same string will generate a different set of randomly selected examples. The source of each example can be checked by clicking on the text code preceding each line.

In addition to just finding a word or phrase, the Simple Search service can also be used for more complex queries. Use the _ character to match any single word, for example bread _ butter finds bread and butter, bread or butter, bread with butter, etc. Use the = character to restrict searches by part of speech, for example house=VVB finds only verbal uses of house. Use braces { and } to enclose a regular expression, for example {s[iau]ng} finds sing, sang or sung.
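The brace notation above encloses an ordinary regular expression, so the same pattern can be checked locally before spending queries on it. This is plain regex behavior demonstrated on a hand-picked word list, not an actual query to the BNC:

```python
import re

# The BNC Simple Search pattern {s[iau]ng} is a regular expression:
# the character class [iau] matches exactly one of i, a, or u.
# Anchors (^ and $) make it a whole-word match here.
pattern = re.compile(r'^s[iau]ng$')

for word in ['sing', 'sang', 'sung', 'song', 'singing']:
    print(word, bool(pattern.match(word)))
```

As expected, sing, sang, and sung match, while song fails (o is not in the class) and singing fails (trailing material past the anchor).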

  • The Louvain Centre for English Corpus Linguistics has played a pioneering role in promoting computer learner corpora (CLC) and was among the first, if not the first, to compile one. Its best-known corpus, the International Corpus of Learner English (ICLE), is the result of over ten years of collaborative activity between a number of universities internationally and currently contains over 3 million words of writing by learners of English from 21 different mother-tongue backgrounds.
  • French corpus Frantext of Institut National de la Langue Francaise
  • German Institut für deutsche Sprache
  • Dutch Instituut voor Nederlandse Lexicologie
  • Danish Dansk Korpus
  • Italian Istituto di Linguistica Computazionale
  • Spanish Reference Corpus Project of Sociedad Estatal del V Centenario
  • Norwegian corpora of Norsk Tekstarkiv
  • Swedish Stockholm-Umea Corpus
  • Corpora at Sprakdata
  • International Computer Archive of Modern English
  • Helsinki Language Corpus (diachronic corpus consisting of a selection of texts covering the Old, Middle, and Early Modern English periods. Available through ICAME)
  • ICLE – International Corpus of Learner English
  • Freiburg-LOB Corpus (FLOB) – a 1990s counterpart to the LOB Corpus (corpus research database: http://www.helsinki.fi/varieng/CoRD/corpora/index.html; I could get to an index page but not to a database)
  • Freiburg-Brown (Frown)
  • Kolhapur Corpus (India)
  • Cambridge International Corpus [couldn’t get to a DB]
  • Australian Corpus of English (ACE)
  • Wellington Corpus (New Zealand)
  • International Corpus of English – East African component
  • Lancaster/IBM Spoken English Corpus (SEC)
  • Michigan Corpus of Academic Spoken English (http://micase.elicorpora.info/)
  • Brown Corpus (A corpus of written American English from 1961. Compiled at Brown University. The very first machine-readable corpus. Available from the Text Laboratory; couldn’t figure out how to get on it)
  • Oxford Text Archive (http://ota.ahds.ac.uk/)
  • Association for Computational Linguistics’ Data Collection Initiative
  • Linguistic Data Consortium
  • Lancaster-Oslo-Bergen Corpus (A corpus of written British English from 1961. Same size and composition as the Brown Corpus. This was the second corpus in existence, following Brown, and the first of British English.)
  • Consortium for Lexical Research
  • The COLT Corpus (Corpus of London Teenage Speech)
  • Collins Cobuild (http://titania.cobuild.collins.co.uk/form.html – Allows you to search in the Bank of English, but displays only a limited amount of data.)
  • Japanese Electronic Dictionary Research
  • European Corpus Initiative

Questions

  • It is me (10) versus It is I (27)
  • Corpora (13.5) versus Corpuses (.07) [BNC’s simple search]
  • Syllabi (87) versus syllabuses (.01) [In Academic > Education context]
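The figures above read like frequencies per million words, which is how corpus interfaces normalize raw hit counts so corpora of different sizes can be compared. The raw counts below are invented to reproduce the corpora/corpuses figures under the assumption of a 100-million-word corpus such as the BNC:

```python
def per_million(count, corpus_size):
    """Normalize a raw hit count to occurrences per million words,
    so frequencies from corpora of different sizes are comparable."""
    return count / corpus_size * 1_000_000

# Hypothetical raw counts over an assumed 100-million-word corpus:
hits = {'corpora': 1350, 'corpuses': 7}
for word, count in hits.items():
    print(word, per_million(count, 100_000_000))
```

With these invented counts, 1350 hits works out to 13.5 per million and 7 hits to 0.07 per million, matching the shape of the numbers listed above.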

Fave

  • The only one I could begin to get to work online was http://corpus.byu.edu/. I created an account.