4. Further reading



This section lists a few books, articles and online publications that offer further information on using the Internet as a corpus and as a source of general information.



http://www.linguistik-online.com/28_06/bickel.html (12.01.2011).



Bickel’s article is very short and deals only superficially with the Internet as a linguistic corpus. After a brief introduction, the author discusses search engines and the question of whether the Internet is suitable as a basis for lexicographic and linguistic research. He conducts a test with AltaVista to find out whether the Internet can be trusted or whether Internet search results are likely to be significantly affected by distorting factors. Unlike many other authors, Bickel supports the use of the Internet for linguistic research because of its many advantages, which he lists at the end of the article.



http://www.bubenhofer.com/korpuslinguistik/kurs/index.php?id=web_grundlagen.html (12.01.2011).



Noah Bubenhofer works as a linguist for the Institut für Deutsche Sprache in Mannheim and offers a theoretical and practical introduction to corpus linguistics on his German website, which is actively used by various universities and institutes. He begins with a definition of ‘corpus’ and introduces different types of corpora. He then explains how a corpus is compiled and annotated and how it can be used.

Especially helpful is the chapter ‘Web als Korpus’, as it briefly explains how the Internet is structured and how search engines work. Even people with little experience of working with the Internet can follow Bubenhofer’s short, crisp explanations. In this chapter, Bubenhofer also offers various exercises on search engines and their features, which can help readers familiarize themselves with Google, AltaVista and similar search engines. Additionally, he points out various problems of the Internet and gives examples of how the Web can be used for linguistic research.

Although the website deals mainly with German corpus linguistics, it provides helpful information on the use of the Internet for linguistic research and is definitely worth a click.



This book gives a comprehensive account of search engines. It was edited not by academics alone, but also by two people from the search engine business, who work for Yahoo! and Google respectively. The book consists of eleven chapters:

  1. Search Engines and Information Retrieval: short introduction to the topic.
  2. Architecture of a Search Engine: explains briefly how search engines work.
  3. Crawls and Feeds: answers the questions of what can be searched and how search engines find it.
  4. Processing Text: about the representation of text in a search engine.
  5. Ranking with Indexes: about how websites are indexed and ranked.
  6. Queries and Interfaces: on how queries are processed and results presented.
  7. Retrieval Models: gives an overview of retrieval models.
  8. Evaluating Search Engines: answers the questions of what a good search engine is and how to recognize one.
  9. Classification and Clustering: about categorizing the material found on the Internet and grouping related items.
  10. Social Search: about how search engines don’t treat every user the same way, but build up profiles and form social groups who may then get different results for the same query.
  11. Beyond Bag of Words: new means of retrieval and representation.

This book provides a lot of background information on search engines. It is, however, very technical and contains many mathematical formulas. For readers with an interest in the technology and a good technical understanding, it will help in understanding how search engines work and in refining the formulation of queries. For everyone else it will be a difficult and frustrating read.



Online version:

http://scidok.sulb.uni-saarland.de/volltexte/2009/2148/pdf/Diemer_29_57.pdf (12.01.2011)


Diemer’s article can be downloaded as a PDF file.

It deals with the Internet as a corpus and is therefore a helpful additional source to Hundt’s chapter. After a general introduction to corpus linguistics and theoretical linguistics, Diemer traces developments in corpus linguistics over recent decades, starting with the Brown Corpus in the 1960s. In a second chapter, he outlines questions, methods and possible applications of modern corpus linguistics according to Charles F. Meyer. He discusses whether the use of the Internet is the future of corpus linguistics and presents several examples of the linguistic use of Google. At the end of the article, Diemer gives an outlook on future developments in corpus linguistics.



This book is very useful if one wants to look more deeply into the use of the Internet as a corpus or for corpus building. It consists of four sections that look into every aspect of using the web for corpus linguistics. It provides further information and explanations on topics that were only briefly addressed in the article.

The four sections are:

  1. Accessing the web as corpus: information on WebCorp and KWiCFinder
  2. Compiling corpora from the Internet: three articles describing the compilation of corpora from different online sources.
  3. Critical Voices: articles in defence of standard corpora.
  4. Language variation and change: articles on studies in which the Internet was used as a source of data.

This book is definitely worth a look if one is interested in using the Internet as a source of linguistic data. It gives a good insight into programs like WebCorp or KWiCFinder, which make searching the Web easier. Also, there is a short abstract at the beginning of every article, so it is not necessary to read each article in full to judge whether it is useful.



Kilgarriff’s article offers a further short introduction to the topic of using the Internet as a corpus. It gives a brief summary of both the advantages and disadvantages of the approach.

Kilgarriff also provides a short outlook on how the Internet could be used as a more reliable source of information and linguistic data.



Ó Dochartaigh’s book consists of ten chapters dealing with the Internet as a source for scientific research. It was written as an introduction to the Internet and its possibilities for research for students and researchers in the social sciences and offers extremely helpful information on the following topics:

  1. Research on the Internet: e.g. Why should the Internet be used for research? What is it good for? How can the Internet be understood?
  2. Research tools: e.g. basic computer skills, shareware and freeware, basic functions of email, and basic navigation such as hyperlinks, bookmarks and find.
  3. Searching for books and articles: e.g. Understanding databases, news and news archives.
  4. Making contact: e.g. discussion groups, making contact with other researchers and collaboration.
  5. The web: e.g. understanding web addresses, web sites, web browsers.
  6. Searching by Subject
  7. Searching the keyword search engines: e.g. search engines, understanding how the search engines work, profiling the search engines. This chapter is especially helpful for the use of the Internet and the ‘web as corpus’ approach, as it offers detailed information which might help to explain the retrieved hits of a sample search.
  8. Classification, evaluation and citation
  9. Archives and statistics: e.g. understanding and using online archives, statistics websites and data archives. Among other things, this chapter introduces several websites of data archives which contain data sets which might be used for one’s own sample searches.
  10. Publishing on the Internet: e.g. what to publish, writing web documents and copyright.

Although Ó Dochartaigh’s book does not specialise in linguistic research and corpus linguistics, it provides helpful basic information for getting acquainted with the scientific use of the Internet.



This book deals with corpus linguistics in general and, among other things, addresses the topic “The Web as Corpus”. It is organized into six chapters:

  1. Corpus Creation: the articles in this chapter explain the creation of a corpus based on more traditional material than the Internet.
  2. Diachronic Corpus Study – from past to present
  3. Diachronic Corpus Study – present day
  4. The Web as a Corpus: this section contains articles on WebCorp, a helpful online tool for linguistic web searches.
  5. Corpus Linguistics and Grammatical Theory
  6. Grammar Discussion Panel

The fourth chapter of this book is highly recommended if one is interested in using WebCorp for an online search. The book was edited by Antoinette Renouf and Andrew Kehoe from the Research and Development Unit for English Studies at the University of Central England in Birmingham, which developed WebCorp. The information given is therefore very useful. This book also gives a short abstract before each article.



This web page provides information about the KWiCFinder program. It gives step-by-step instructions on how to use the program and offers it as a free download. Furthermore, it lists various articles by the author of KWiCFinder, which can be downloaded free of charge as PDF files.



This web page is the site of WebCorp. Apart from the simple and advanced search itself, the site also contains a detailed and useful user guide and additional publications on WebCorp that can be downloaded as PDF files.












