2.4 Catalogue of corpora and software quoted



This section describes only the corpora and software mentioned in the chapters. For additional resources, see Useful internet resources.



CORPORA


The British National Corpus (BNC) is a 100-million-word corpus of written (90%) and spoken (10%) British English, compiled in the 1990s (data collection ended in 1993). The 4,124 corpus texts, which vary in length, date from after 1960, mainly from the late 1980s and 1990s. The written texts comprise 75% informative prose and 25% imaginative prose. The spoken part includes 2,000 hours of transcribed recordings made by 124 volunteers from 38 different parts of the UK (with balanced coverage of male and female speakers of different ages). The corpus is automatically part-of-speech tagged (cf. Baker 2006: 24; Meyer 2002: 143; Anderson and Corbett 2009: 182).


BNC editions:

For a comparative review of three full BNC editions, see Section 3.1.1.


The official BNC website, www.natcorp.ox.ac.uk/, provides full information about the corpus and the official BNC-XML edition with XAIRA software. Simple searches, including searches by part-of-speech tag, can also be carried out through this site; the result page gives the total number of hits for a search and displays a random sample of fifty retrieved examples in context (cf. Anderson and Corbett 2009: 183).


BYU-BNC: the online version of the BNC developed at Brigham Young University (BYU)

This interface can identify collocates, compare words across registers, and display all hits for search terms in the corpus, but its query options are limited and full corpus texts are not available (cf. Anderson and Corbett 2009: 183-184).

http://corpus.byu.edu/bnc/


BNCweb: the web interface of the BNC, developed at the University of Zurich

BNCweb is a web-based tool for searching and retrieving lexical, grammatical and textual data from the BNC. It enables users to carry out searches, view, sort and thin concordances, calculate collocations using a range of statistical measures, specify POS-tag-based searches, carry out distribution analyses and create sub-corpora (cf. Baker 2006: 21; Hoffman 2008), as well as export and reimport database results. The latest edition of BNCweb includes the CQP query mode (see CQP under CONCORDANCE PROGRAMS).

http://www.bncweb.info/
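
As an illustration of what "calculating collocations using a statistical measure" involves, here is a minimal Python sketch. It is not BNCweb's implementation: it assumes a plain list of tokens, a symmetric window around the node word, and one common formulation of the mutual information (MI) score, log2(observed / expected). The function name `collocates_mi` and the sample sentence are hypothetical.

```python
import math
from collections import Counter

def collocates_mi(tokens, node, window=2):
    """Score each word found within `window` tokens of `node` by MI.

    Assumed formulation: MI = log2(observed / expected), where
    expected = f(node) * f(collocate) * span / N and span = 2 * window.
    """
    n = len(tokens)
    freq = Counter(tokens)          # corpus frequency of every word
    observed = Counter()            # co-occurrence counts with the node
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                observed[tokens[j]] += 1
    scores = {}
    for word, obs in observed.items():
        expected = freq[node] * freq[word] * 2 * window / n
        scores[word] = math.log2(obs / expected)
    return scores

# Hypothetical toy data; a real corpus would supply millions of tokens.
tokens = "strong tea and strong coffee but weak tea".split()
scores = collocates_mi(tokens, "strong", window=1)
```

A positive score means the word occurs near the node more often than chance would predict under this model; scores from very small samples such as the one above are not meaningful and serve only to show the arithmetic.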


BNC Baby: a compilation of four one-million-word genre-based BNC subsets (academic, fiction, newspaper and conversation) with added lemma information and POS tagging

http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products#baby


BNC Sampler: a two-million-word BNC subcorpus, 50% spoken, sampled to be similar to the full corpus, with detailed part-of-speech tagging that has been manually checked and edited.

http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products#sampler


FLOB (the Freiburg-LOB Corpus of British English): a one-million-word corpus of written British English texts published in 1991, compiled at the University of Freiburg as an update of the LOB corpus (the Lancaster-Oslo-Bergen Corpus of British English). It is divided into two-thousand-word samples from a range of genres and is intended to replicate the sampling frame of the LOB corpus (cf. Baker 2006: 74; Meyer 2002: 145).

http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM



CONCORDANCE PROGRAMS


Wordsmith - a concordance tool for analyzing the lexis of texts and corpora. It can be used to produce frequency lists, run concordance searches, calculate collocations for particular words and identify keywords in a text according to frequency. It works equally well with plain texts and texts with markup tags. Wordsmith does not require the corpus to be indexed in advance (Baker 2006: 169-170; Oakes 1998: 193-194). The latest version of Wordsmith is 5.0. In their chapter, Smith and Seoane use Wordsmith 3.0.

http://www.lexically.net/downloads/version5/HTML/index.html?concord_see_do_proc.htm
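
The two most basic operations named above, a frequency list and a concordance search, can be sketched in a few lines of Python. This is only an illustration of the general technique, not of how Wordsmith works internally; the function names, the simple regular-expression tokenizer and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (simplistic)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, node, width=3):
    """Return keyword-in-context lines as (left, node, right) tuples,
    with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text standing in for a corpus file.
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)
freqs = frequency_list(tokens)
kwic = concordance(tokens, "cat", width=2)
```

Because the text is scanned directly each time, no index has to be built beforehand; the cost is that every search reads the whole token list again.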


Monoconc - a Windows-based concordance package designed for use with monolingual corpus material. Monoconc is published by Athelstan, a company that publishes books and software on second language learning and corpus linguistics (Baker 2006: 116).

http://athel.com/mono.html


CQP ("Corpus Query Processor") - a retrieval tool of the IMS Corpus Workbench software package, developed at the University of Stuttgart. Developed specifically for large corpora, CQP is presented with the BROWN and BNC corpora. It has a powerful query builder based on its own query syntax and offers all the standard features of corpus software: lexical and POS-tag queries, concordancing and saving sub-corpora. CQP is integrated in the latest BNCweb edition.

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
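
The idea behind a combined lexical and POS-tag query can be sketched in Python as matching a sequence of attribute constraints against tagged tokens. In CQP's own syntax a query of this kind is written roughly as `[pos="AJ0"] [pos="NN."]`; the code below only imitates that idea and does not call CQP. The function `match_sequence`, the dictionary representation of tokens and the small tagged sample (using BNC-style tags) are assumptions made for the example.

```python
import re

def match_sequence(tokens, pattern):
    """Return the start positions where `pattern` matches.

    tokens:  list of dicts such as {"word": "house", "pos": "NN1"}
    pattern: list of dicts mapping an attribute name to a regular
             expression that must match the whole attribute value
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, tok[attr])
            for tok, constraint in zip(window, pattern)
            for attr, regex in constraint.items()
        ):
            hits.append(start)
    return hits

# Hypothetical tagged sample: (word, part-of-speech tag) pairs.
tagged = [
    ("the", "AT0"), ("old", "AJ0"), ("house", "NN1"), ("and", "CJC"),
    ("the", "AT0"), ("new", "AJ0"), ("houses", "NN2"),
]
tokens = [{"word": w, "pos": p} for w, p in tagged]

# An adjective followed by any singular or plural common noun.
adj_noun = match_sequence(tokens, [{"pos": "AJ0"}, {"pos": "NN."}])
# A lexical constraint combined with a tag constraint.
new_houses = match_sequence(tokens, [{"word": "new"}, {"word": "house.*", "pos": "NN2"}])
```

The matching positions could then be saved as a sub-corpus or passed to a concordancer, which is the workflow the description above refers to.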



DATABASE PROGRAMS


FileMaker - a popular database program, appropriate for lexical databases and other kinds of highly structured linguistic resources, including audio or visual multimedia records.

http://www.filemaker.com/


Microsoft Access - a relational database management system published by Microsoft

http://office.microsoft.com/en-us/access/default.aspx


Microsoft Excel - a spreadsheet application written and distributed by Microsoft

http://office.microsoft.com/en-us/excel/default.aspx



STATISTICAL ANALYSIS PACKAGES


SPSS - a statistical analysis and data management system, widely used in the social sciences. It can take data from almost any type of file and use it either to conduct complex statistical analyses or to generate tabulated reports, charts, plots of distributions and trends, and descriptive statistics.

www.spss.com


Minitab - a statistical analysis package that has proved useful in computational linguistics, text analysis, and stylistics

www.minitab.com

