Plenary lectures

Raising teachers’ awareness of corpora

Instituto Superior de Línguas e Administração, Lisboa - Portugal


Keywords: corpora, data-driven learning, user behaviour, teacher training, consciousness-raising

The last decade and a half has seen a dramatic increase in corpus availability and a steady growth in the number of supporters of the use of corpora in language teaching. The very fact that TaLC is in its seventh edition confirms that the applied use of corpora in the language classroom is here to stay. Yet surveys such as those by Tribble (2001) and Mukherjee (2004) suggest that there is still a long way to go before corpora can be understood and used by language teachers in general. This paper examines some of the problems inexperienced corpus users encounter on their first hands-on contact with corpora and proposes a task-based, consciousness-raising approach to help teachers (who are not corpus linguists) understand the basics of corpora.

With a limited number of language teachers using corpora, it comes as no surprise that there do not seem to be any studies of this kind of user behaviour. Some of the difficulties novice corpus users encounter are however described in Bernardini (2000), Kennedy and Miceli (2001), Frankenberg-Garcia (2005) and Santos & Frankenberg-Garcia (submitted 2005). Although these studies differ quite substantially among themselves, they all converge to suggest that corpus skills which come as second nature to experts are not obvious at all to the untrained. Apart from corpus-specific difficulties in handling different search interfaces and CQLs – and the human-computer interaction issue should not be overlooked – these studies bring to light a number of very basic problems that novice users encounter no matter which corpus they use.

Findings such as the above suggest that language teachers who are new to corpora may find it difficult to grasp that corpora do not work in the same way as the familiar language learning resources – such as dictionaries, grammar books and textbooks – that they are accustomed to using. I therefore propose a series of consciousness-raising exercises aimed at helping language teachers gauge different types of corpora and discern which ones are best suited to their purposes, develop basic corpus-searching strategies, and get used to interpreting corpus data. The exercises are task-based and, unlike most corpus tutorials available, they are not corpus-specific. The overall idea is not to train corpus linguists, but simply to encourage teachers to become more confident about using corpora in the classroom.


Machines for teaching languages

Laboratoire LIDILEM, Université Stendhal de Grenoble, France


Although Thorndike was already imagining the contribution and use of mechanised textbooks in 1912, it was a long road before the first language-teaching software appeared in the 1970s. This software became established in the 1980s, and computer-assisted language learning (CALL; in French ALAO, Apprentissage des Langues Assisté par Ordinateur) took shape as a field. The development of microcomputing in the 1980s was decisive in democratising this software, which is now offered and used at every level of education.
Most often, these language-teaching machines, being software products, take a reductive approach to language, limiting it to a sequence of characters devoid of any semantics. This approach cannot account for many facets of language and can lead to erroneous learning.
The first aim of this talk is to show that computing by itself cannot account for the characteristics of language, and that natural language processing (NLP; in French TAL, traitement automatique de la langue) procedures therefore need to be considered and used. This approach, which emerged in the 1980s, makes it possible to correct many of the shortcomings of CALL software.
The second aim of this talk is to present the issues raised by integrating NLP into CALL. Current work concerns the evaluation of the pedagogical added value of NLP as well as system architecture, the integration and exploitation of corpora, and the pedagogical indexing of resources. We will present a few existing systems and illustrate the integration of NLP into CALL using the MIRTO platform, developed at Université Stendhal de Grenoble.
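
As an illustration of the reductive, character-sequence view of language criticised in this abstract, the following minimal sketch shows a hypothetical answer checker that compares learner input with an expected string. The function name and the example sentences are assumptions made for illustration only; they are not taken from any system discussed in the talk.

```python
def exact_match(expected: str, answer: str) -> bool:
    """Compare two answers purely as character sequences (no linguistic knowledge)."""
    return expected == answer


expected = "I do not know"
answers = [
    "I do not know",   # identical string
    "I don't know",    # acceptable contracted variant
    "I do not no",     # genuine spelling error
]

# The checker accepts only the identical string. It rejects the valid
# contraction and the real error alike, and cannot tell them apart.
results = [exact_match(expected, a) for a in answers]
for answer, ok in zip(answers, results):
    print(f"{answer!r}: {'accepted' if ok else 'rejected'}")
```

Telling the acceptable variant from the error requires some linguistic analysis, which is the kind of gap NLP procedures are meant to address.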


The WebCorp Search Engine: the next order of magnitude

University of Central England, Birmingham


The web has unique potential among corpora to yield large-volume data on up-to-date language use, obvious shortcomings notwithstanding. Since 1998, we have been developing a tool, WebCorp, to allow corpus linguists to retrieve raw and analysed linguistic output from the web. Based on internal trials and user feedback gleaned from our site, we have established a working system which supports thousands of regular users world-wide. Many of the problems associated with the nature of web text have been accommodated, but problems remain, some due to the non-implementation of standards on the Internet, and others to reliance on commercial search engines, whose mediation slows down average WebCorp response time and thus places constraints on linguistic search.

To improve WebCorp performance, we are in the process of creating a tailored search engine. This will be integrated with a range of language-analysis and output-formatting tools to create a resource that improves significantly on the current research situation in terms of both performance and usability. That is to say, this will be a regular search engine, but it will be linguistically tailored in the following ways: firstly, targeted subsets of the web will be downloaded; secondly, the data will be available as a series of texts and lines of context, but will also be processed into secondary linguistic databases containing information such as new words and typical word patterns; and thirdly, the search results will be offered in a range of familiar formats specifiable by the linguist, which make both study and publication more convenient.
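
The "lines of context" mentioned above are commonly presented as keyword-in-context (KWIC) concordance lines. The following is a minimal sketch of that output format only, assuming plain text that has already been downloaded; the function `kwic` and its parameters are hypothetical and do not reflect WebCorp's actual implementation.

```python
import re


def kwic(text: str, node: str, width: int = 30) -> list[str]:
    """Return one concordance line per whole-word occurrence of `node`.

    Each line shows up to `width` characters of left and right context,
    padded so that the node word is vertically aligned across lines.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


sample = "The web is a corpus. A corpus can be searched."
for line in kwic(sample, "corpus", width=12):
    print(line)
```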

This paper will outline the improvements in WebCorp performance that will be possible once new linguistic knowledge has been integrated into the system and greater storage and processing capacity has been provided through the installation of the new search engine.


Key Words and Key Sections

University of Liverpool


This presentation explores the distribution of keywords (KWs) in text. Although Scott & Tribble (2006) explain the notion of KWs and some of their characteristics in texts, the notion is still fairly new and much remains to be done to pin down the elusive quality of keyness.

In particular, we shall be looking at the relationship between KWs and the section of the text in which they are found. A starting point is Katz (1996), who identifies “bursts” of certain terms in text; Scott (2000) takes this further to distinguish between global and local KWs. The present paper tries to relate these systematically to the text divisions identified in BNC and other corpus texts. Thus, in terms of the scope discussed in Scott & Tribble (2006), we shall be operating at both the “whole text” level and the “section” level. The aim is to identify any linkage between the two in terms of key lexis and to evaluate the implications of the findings for KW theory and the nature of text.
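
To make the two levels of scope concrete, the following minimal sketch computes KWs for a whole text and for each of its sections against the same reference corpus. It assumes log-likelihood as the keyness statistic and a cut-off of 3.84 (p < 0.05, one degree of freedom); the abstract does not say which statistic or threshold is used, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood keyness for a word occurring `a` times in a study text
    of `c` tokens and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return {word: score} for positive keywords of the study text."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    keys = {}
    for word, a in study.items():
        b = ref[word]
        # Keep only words relatively more frequent in the study text.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                keys[word] = score
    return keys


def keywords_by_scope(sections, ref_tokens):
    """Return (whole-text KW set, {section name: KW set})."""
    whole = [tok for toks in sections.values() for tok in toks]
    whole_keys = set(keywords(whole, ref_tokens))
    per_section = {name: set(keywords(toks, ref_tokens))
                   for name, toks in sections.items()}
    return whole_keys, per_section


# Toy data: each section has its own frequent content word.
sections = {
    "intro": ["corpus"] * 5 + ["the"] * 45,
    "method": ["keyness"] * 10 + ["the"] * 40,
}
reference = ["the"] * 100
whole_keys, per_section = keywords_by_scope(sections, reference)
print(sorted(whole_keys), per_section)
```

In this toy case each word is key at the whole-text level but key in only one section. This is the kind of linkage between the two levels that the sketch is meant to illustrate.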

The presentation will be illustrated using WordSmith Tools and outputs from that software suite.

