Plenary speakers

Ana Frankenberg-Garcia (Instituto Superior de Línguas e Administração (ISLA), Lisbon)
Georges Antoniadis (Université de Grenoble)
Antoinette Renouf (University of Central England)
Mike Scott (University of Liverpool)
Jean-Pierre Cendron (Delegate for Strategy, Bibliothèque Nationale de France)

Raising teachers’ awareness of corpora

Ana Frankenberg-Garcia
Instituto Superior de Línguas e Administração, Lisbon, Portugal

Abstract

Keywords: corpora, data-driven learning, user behaviour, teacher training, consciousness-raising

The last decade and a half has seen a dramatic increase in corpus availability and a steady growth in the number of supporters of the use of corpora in language teaching. The very fact that TaLC is at its seventh edition only confirms that the applied use of corpora in the language classroom is here to stay. Yet surveys such as those by Tribble (2001) and Mukherjee (2004) suggest that there is still a long way to go before corpora can be understood and used by language teachers in general. This paper examines some of the problems inexperienced corpus users encounter on their first hands-on contact with corpora and proposes a task-based, consciousness-raising approach to help teachers (who are not corpus linguists) understand the basics of corpora.

With only a limited number of language teachers using corpora, it comes as no surprise that there do not seem to be any studies of this kind of user behaviour. Some of the difficulties novice corpus users encounter are, however, described in Bernardini (2000), Kennedy and Miceli (2001), Frankenberg-Garcia (2005) and Santos & Frankenberg-Garcia (submitted 2005). Although these studies differ quite substantially from one another, they all converge to suggest that corpus skills which are second nature to experts are not at all obvious to the untrained. Apart from corpus-specific difficulties in handling different search interfaces and corpus query languages (CQLs), where the human-computer interaction issue should not be overlooked, these studies bring to light a number of very basic problems that novice users encounter no matter which corpus they use.

Findings such as the above suggest that language teachers who are new to corpora may find it difficult to grasp that corpora do not work in the same way as the familiar language learning resources (such as dictionaries, grammar books and textbooks) that they are accustomed to using. I therefore propose a series of consciousness-raising exercises aimed at helping language teachers gauge different types of corpora and discern which ones are best suited to their purposes, develop basic corpus-searching strategies, and get used to interpreting corpus data. The exercises are task-based and, unlike most corpus tutorials available, they are not corpus-specific. The overall idea is not to train corpus linguists, but simply to encourage teachers to become more confident about using corpora in the classroom.

References

Bernardini, S. (2000) Systematising serendipity: Proposals for concordancing large corpora with language learners. In L. Burnard & T. McEnery (eds) Rethinking language pedagogy from a corpus perspective. Frankfurt am Main: Peter Lang, 225-234.
Frankenberg-Garcia, A. (2005) A Peek into What Today's Language Learners as Researchers Actually Do. International Journal of Lexicography, 18/3, 335-355.
Kennedy, C. & T. Miceli (2001) An evaluation of intermediate students' approaches to corpus investigation. Language Learning & Technology, 5/3, 77-90.
Mukherjee, J. (2004) Bridging the gap between applied corpus linguistics and the reality of English language teaching in Germany. In U. Connor & T. Upton (eds) Applied Corpus Linguistics: A Multidimensional Perspective. Amsterdam: Rodopi, 239-250.
Santos, D. & A. Frankenberg-Garcia (submitted 2005) The corpus, its users and their needs: a user-oriented evaluation of COMPARA.
Tribble, C. (2001) Corpora and teaching: adjusting the gaze. Paper presented at the ICAME 2001 conference, Louvain, Belgium.

Georges Antoniadis

Machines for teaching languages (Des machines pour enseigner les langues)

Georges Antoniadis
Laboratoire LIDILEM, Université Stendhal de Grenoble, France
Georges.Antoniadis@u-grenoble3.fr

Abstract

While Thorndike was already envisaging the contribution and use of mechanised textbooks in 1912, it was a long road before the first language-teaching software appeared in the 1970s. Such software became established in the 1980s, and CALL (Computer-Assisted Language Learning; in French ALAO, Apprentissage des Langues Assisté par Ordinateur) took shape as a field. The development of microcomputing in the 1980s was decisive in making this software widely accessible, and it is now offered and used at every level of education.
Most often, these language-teaching machines, being computer products, take a reductive approach to language, treating it as no more than a sequence of characters devoid of any semantics. This approach cannot take many facets of language into account and may lead to erroneous learning.
The first aim of this talk is to show that computing alone cannot account for the characteristics of language, and that the procedures of natural language processing (NLP; in French TAL, traitement automatique de la langue) need to be considered and used. This approach, which emerged in the 1980s, makes it possible to correct many of the shortcomings of CALL software.
The second aim of this talk is to present the issues involved in integrating NLP into CALL. Current work concerns the evaluation of the pedagogical added value of NLP as well as system architecture, the integration and exploitation of corpora, and the pedagogical indexing of resources. We will present some existing systems and illustrate the integration of NLP into CALL using the MIRTO platform, developed at Université Stendhal de Grenoble.

Bibliography

Antoniadis G. (2004). "Les logiciels d’apprentissage des langues peuvent-ils ignorer le TAL ?". Les cahiers de l’APLIUT, n° XXIII vol. 2, June 2004. pp. 81-97.
Antoniadis G., Chanier T. (eds.) (2005). Special issue « TAL et apprentissage des langues », ALSIC, vol. 8, n° 2.
Antoniadis G., Echinard S., Kraif O., Lebarbé T., Loiseau M., Ponton C. (2004) "NLP-based scripting for CALL activities", eLearning for Computational Linguistics and Computational Linguistics for eLearning, International Workshop in Association with COLING 2004, Geneva, August 28th, 2004
Borin L. (2002). What have you done for me lately? The fickle alignment of NLP and CALL. EuroCALL 2002 pre-conference workshop on NLP and CALL.
Bruillard E. (1997). Les machines à enseigner. Hermès, Paris.
Brun C., Parmentier T., Sandor A., Segond F. (2002). "Les outils de TAL au service de la e-formation en langues". Multilinguisme et traitement de l’information (ed. Segond F.). Paris: Hermès. pp. 223-250.
Chanier T. (1998). "Relations entre le TAL et l’ALAO ou l’ALAO un simple domaine d’application du TAL ?". International conference on natural language processing and industrial application (NLP+IA’98). August 1998, Moncton, Canada. http://lifc.univ-fcomte.fr/RECHERCHE/P7/pub/Moncton/index.htm
Granger S. (2002) A bird’s eye view of learner corpus research. In S. Granger, J. Hung & S. Petch-Tyson (eds.) Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Language Learning and Language Teaching 6. Benjamins: Amsterdam & Philadelphia, 3-33.
Granger S., Vandeventer A., Hamel M.J. (2001). Analyse des corpus d’apprenants pour l’ELAO basé sur le TAL. Traitement Automatique des Langues, Vol. 42, n°2, 609-621.
Heift T., Schulze M. (2003). Special issue of CALICO on Error Analysis and Error Correction in Computer-Assisted Language Learning, 20(3).
Jung U.O.H. (2005). "CALL: past, present and future - a bibliometric approach". ReCALL, vol. 17, 1. pp. 4-17.
Loiseau M. (2005). « Vers une utilisation du TAL dans la description pédagogique de textes dans l'enseignement des langues », Rencontre des Etudiants Chercheurs en Informatique et Traitement Automatique des Langues - RECITAL 2005, 6-10 June 2005, Dourdan.
Perez D. (2004). Automatic Evaluation of User’s Short Essays by Using Statistical and Shallow Natural Language Processing Techniques. Advanced Studies Diploma Work. Universidad Autonoma de Madrid.
Selva T. (2002). Génération Automatique d’Exercices Contextuels de Vocabulaire. TALN 2002 pp 185-194.
Thorndike E.L. (1913). Educational Psychology: Vol. 2. The Psychology of Learning. Teachers College, New York.
Tribble C., Barlow M. (2001). Using Corpora in Language Teaching and Learning. Special issue of Language Learning and Technology, Volume 5, Number 3.
Wible D., Kuo C-H., Chien F-Y., Liu A., Tsao N-L. (2001). A web-based EFL writing environment: integrating information for learners, teachers, and researchers. Computers and Education 37. pp 297-315.

Antoinette Renouf

WebCorp Linguist’s Search Engine – the next order of magnitude

Antoinette Renouf
University of Central England, Birmingham

Abstract

The Web has the potential, unique among corpora, to yield large volumes of data on up-to-date language use, despite its obvious imperfections. Since 1998, we have been developing a tool, WebCorp, to allow corpus linguists to retrieve raw and analysed linguistic output from the web. Based on internal trials and on user feedback gleaned from our site (http://www.webcorp.org.uk/), we have established a working system which supports thousands of regular users worldwide. Many of the problems associated with the nature of web texts have been resolved, but problems remain, some due to the non-implementation of standards on the Internet, and others to our dependence on commercial search engines, whose mediation slows WebCorp's average response time and thus places constraints on linguistic research.

To improve WebCorp's performance, we are in the process of creating a tailor-made search engine. This will be integrated with a range of tools for linguistic analysis and output formatting to create a resource which improves significantly on the current search situation in terms of performance and cost-effectiveness. That is, it will be a regular search engine, but it will be linguistically tailored in the following ways: firstly, targeted subsets of the web will be downloaded; secondly, the data will be available as a series of texts and context lines, but will also be processed into secondary linguistic databases containing information such as new words and typical word patterns; and thirdly, search results will be offered in a range of familiar formats specifiable by the linguist, which make study and publication more convenient.

This paper will describe the improved performance of WebCorp that will be made possible by the integration of new linguistic knowledge into the system, as well as by the greater storage and processing capacity provided by the installation of the new search engine.

Bibliography

Baroni, M. and S. Bernardini, (2004) ‘BootCaT: Bootstrapping corpora and terms from the web’, in Proceedings of LREC 2004, Lisbon: ELDA, 1313-1316.
Collier, A. & A. Renouf, (1995) 'A system of automatic textual abridgement', in Proceedings of AI'95, 15th International Conference, Language Engineering '95, Montpellier, June 27-30, 1995, pp. 395-407.
Fairon, C. (2000) GlossaNet: Parsing a web site as a corpus, Linguisticae Investigationes, October 2000, vol. 22, no. 2, pp. 327-340(14). (Amsterdam: John Benjamins).
Fletcher, W. (2001) ‘Concordancing the Web with KWiCFinder’, in Proceedings of The American Association for Applied Corpus Linguistics Third North American Symposium on Corpus Linguistics and Language Teaching. Available online from http://www.kwicfinder.com.
Ghani, R., Jones, R., Mladenic, D. (2001) ‘Mining the web to create minority language corpora’. CIKM 2001, 279-286.
Heritrix: http://crawler.archive.org/
Kehoe, A. (2006) ‘Diachronic linguistic analysis on the web with WebCorp’, in A. Renouf and A. Kehoe (eds.) The Changing Face of Corpus Linguistics. Amsterdam & Atlanta: Rodopi.
Kehoe, A. and Renouf, A. (2002) ‘WebCorp: Applying the Web to Linguistics and Linguistics to the Web’. In Online Proceedings of World Wide Web 2002 Conference, Honolulu, Hawaii, 7-11 May 2002. http://www2002.org/CDROM/poster/67/
Kilgarriff, A. (2003) ‘Linguistic Search Engine’. In Proceedings of The Shallow Processing of Large Corpora Workshop (SProLaC 2003) Corpus Linguistics 2003, Lancaster University.
Larbin: http://larbin.sourceforge.net/
Macmillan: http://www.macmillandictionary.com/MED-Magazine/may2005/30-Corpora-Tips.htm
Nutch: http://incubator.apache.org/nutch/
Plucene: http://search.cpan.org/dist/Plucene/
Renouf, A. (1996) ‘The ACRONYM Project: Discovering the Textual Thesaurus’, in I. Lancashire, C. Meyer and C. Percy (eds.) Papers from English Language Research on Computerized Corpora (ICAME 16) Amsterdam & Atlanta: Rodopi. 171-187.
Renouf, A. (2002) ‘WebCorp: providing a renewable data source for corpus linguists’, in Granger, S. & S. Petch-Tyson (eds.) Extending the scope of corpus-based research. Amsterdam/Atlanta:Rodopi. 39-58.
Renouf, A., Morley, B. and A. Kehoe (2003) ‘Linguistic Research with the XML/RDF aware WebCorp Tool’. In Online Proceedings of WWW2003, Budapest. http://www2003.org/cdrom/papers/poster/p005/p5-morley.html
Renouf, A., Kehoe, A. and D. Mezquiriz (2004) ‘The Accidental Corpus: issues involved in extracting linguistic information from the Web’, in Aijmer, K. & B. Altenberg (eds.) Proceedings of 21st ICAME Conference, University of Gothenburg, May 22-26 2002, Amsterdam/Atlanta GA: Rodopi. 404-419.
Resnik, P. and A. Elkiss (2003). ‘The Linguist's Search Engine: Getting Started Guide’. Technical Report: LAMP-TR-108/CS-TR-4541/UMIACS-TR-2003-109, University of Maryland, College Park, November 2003.

WaCky Project (2005) http://wacky.sslmit.unibo.it/

Mike Scott

Key Words and Key Sections

M. Scott
University of Liverpool

Abstract

This presentation explores the distribution of keywords (KWs) in text. Although Scott & Tribble (2006) explain the notion of KWs and some of their characteristics in texts, the notion is still fairly new and much remains to be done to pin down the elusive quality of keyness.

In particular, we shall be looking at the relationship between KWs and the section of the text in which they are found. A starting point is Katz (1996), who identifies “bursts” of certain terms in text; Scott (2000) takes this further to distinguish between global and local KWs. The present paper tries systematically to relate these to the text divisions as identified in BNC and other corpus texts. Thus, in terms of the scope discussed in Scott & Tribble (2006), we shall be operating both at the “whole text” level and at the “section” level. The aim is to identify any linkage between the two in terms of key lexis and to evaluate the implications of the findings for KW theory and the nature of text.
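The contrast between the “whole text” and “section” levels can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the abstract: keyness is measured with the log-likelihood statistic (one of several measures in common use), a word counts as key when its score reaches 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom), and a word is treated as “local” when it is key in a section but not in the whole text. The function names are hypothetical, and this is not a description of how WordSmith Tools computes keyness.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for a word seen a times in a study text of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return {word: score} for words that are unusually frequent in the
    study text relative to the reference corpus (both assumed non-empty)."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    c, d = len(study_tokens), len(ref_tokens)
    result = {}
    for word, a in study.items():
        b = ref[word]
        # Positive keyness only: relatively more frequent in the study text.
        if a / c > b / d:
            score = log_likelihood(a, b, c, d)
            if score >= threshold:
                result[word] = score
    return result


def split_by_scope(sections, ref_tokens, threshold=3.84):
    """Assumed operationalisation: 'global' KWs are key for the whole text;
    'local' KWs are key in one section but not in the whole text."""
    whole = [tok for sec in sections for tok in sec]
    global_kws = set(keywords(whole, ref_tokens, threshold))
    local_kws = {
        i: set(keywords(sec, ref_tokens, threshold)) - global_kws
        for i, sec in enumerate(sections)
    }
    return global_kws, local_kws
```

Called as `split_by_scope(sections, reference)`, with `sections` a list of token lists and `reference` a token list, the function returns the set of whole-text keywords and, for each section index, the keywords that surface only at section level.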

The presentation will be illustrated using WordSmith Tools and outputs from that software suite.

References

Katz, S. (1996) Distribution of Common Words and Phrases in Text and Language Modelling. Natural Language Engineering, 2 (1), 15-59.

Scott, M. (2000) Reverberations of an Echo. In B. Lewandowska-Tomaszczyk & P.J. Melia (eds) PALC'99: Practical Applications in Language Corpora. Lodz Studies in Language, Volume 1. Frankfurt: Peter Lang, 49-68.

Scott, M. & C. Tribble (2006) Textual Patterns: keyword and corpus analysis in language education. Amsterdam: Benjamins.

Bio-data: Mike Scott (publications at http://www.lexically.net/publications/publications.htm) has been a teacher of English as a Foreign Language and of ESP for more years than he wishes to remember, since 1990 at the University of Liverpool and before that at universities and language schools in Brazil and Mexico.

He is the author of WordSmith Tools (http://www.lexically.net/wordsmith). His latest book (written with Chris Tribble) is Textual Patterns: keyword and corpus analysis in language education.