3.1.6 Critical evaluation


Many of the problems involved in using the Internet as a corpus have been addressed repeatedly and demonstrated by means of example searches. The Internet presents many obstacles for the linguist that commercial search engines such as Google cannot overcome. Such engines are not primarily designed to meet the needs of linguistic or academic users, which often makes it difficult to gather information on specifically targeted facets of a language. Yet the Web holds an abundance of valuable linguistic information and offers a great opportunity to gain access to aspects of language that are not available in standard corpora. The resulting challenge, which many linguists and software engineers have already taken up, is to create the means of accessing Web text.

As mentioned above, WebCorp is a stand-alone Web concordance tool that runs on top of commercial search engines and is designed to provide the user with well-processed and analysed output suited to the linguist's interests. One of the program's major advantages, and one that makes it extremely user-friendly, is that its development is based on user feedback. WebCorp is constantly being revised, with users' impressions and proposals being taken into account.

The program operates essentially as follows. The user submits a search request to WebCorp, which translates it and passes it to the selected search engine. The search engine locates the relevant hits and returns them to WebCorp. Finally, WebCorp performs a kind of backup validation by accessing the URLs directly, and then returns the concordance results to the user interface. As illustrated above, WebCorp's user interface enables the linguist to manipulate the search results to a much greater extent than, for example, Google does. The program offers various filtering and post-editing options, which allow the user to retrieve valuable information on linguistic aspects such as “neologisms and coinages; newly-vogueish terms; rare or possibly obsolete terms; rare or possibly obsolete constructions; phrasal variability and creativity; basic statistical information and basic key phrase analysis” (Renouf, Kehoe & Banerjee 2005: 3). For specific examples of potential linguistic evidence retrieved by WebCorp, please see Renouf, Kehoe & Banerjee (2005).
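The workflow described above can be illustrated with a minimal Python sketch. It is not WebCorp's actual implementation: the search engine is replaced by a hypothetical stub (search_engine), the Web by a small in-memory dictionary (FAKE_WEB), and all function names and the example URLs are assumptions made purely for illustration. The sketch only shows the general idea of asking an engine for candidate URLs, re-reading each page directly, and producing keyword-in-context lines from the pages that really contain the term.

```python
import re

# Hypothetical stand-in for the Web: URL -> page text (illustrative data only).
FAKE_WEB = {
    "http://example.org/a": "The corpus is large. A web corpus changes daily.",
    "http://example.org/b": "No match here.",
    "http://example.org/c": "Linguists build a corpus from Web text.",
}


def search_engine(query):
    """Hypothetical search engine stub: returns candidate URLs for a query.

    A real engine would consult its own index. Here every known URL is
    returned, so one hit (page b) is deliberately irrelevant.
    """
    return sorted(FAKE_WEB)


def fetch(url):
    """Read the page text directly (here: from the in-memory dictionary)."""
    return FAKE_WEB.get(url, "")


def concordance(term, text, width=20):
    """Return (left context, keyword, right context) for each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


def concordance_search(term, width=20):
    """Query the engine, re-check each URL directly, and collect concordance lines."""
    results = []
    for url in search_engine(term):
        text = fetch(url)  # direct access to the page, independent of the engine
        for left, keyword, right in concordance(term, text, width):
            results.append((url, left, keyword, right))
    return results


if __name__ == "__main__":
    for url, left, keyword, right in concordance_search("corpus"):
        print(f"{left:>20} [{keyword}] {right:<20} {url}")
```

Because every returned URL is read again before a concordance line is produced, a page that the engine lists but that does not contain the search term simply contributes no lines. This is one plausible reading of the "backup validation" step, stated here as an assumption rather than as a description of WebCorp's internals.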

Aside from these valuable advantages, there are also a number of drawbacks to the program. Some are problems intrinsic to Web text, such as unreliable punctuation and a lack of information on language variety, date and author. In addition, the ever-changing state of the Internet, with pages and data continually being updated, often makes it impossible to access earlier data and thus prevents the user from retrieving identical results on repeated searches.

The biggest disadvantage of WebCorp, however, is its reliance on commercial search engines. Some of the problems that accompany Google, such as unreliable word count statistics, the lack of consistent support for wildcard searches, the listing of linguistically irrelevant pages as “top ranked” hits and an underlying commercial bias that is difficult to mask out, hold to some extent for WebCorp as well. For this reason, the software engineers at the Research and Development Unit for English Studies at the University of Central England in Birmingham are currently developing an independent, linguistically tailored search engine, which is intended to offset a number of these problems.

For the time being, however, the current WebCorp architecture will have to suffice. For the most part it does, as the extensive functions developed so far outweigh the program's temporary disadvantages and provide linguists and academics with a helpful tool for efficiently extracting linguistic data from the Web (cf. Renouf, Kehoe & Banerjee 2005).
