Nesi, Hilary; Thompson, Paul


Encoding format: TEI XML

Linguistic corpora

Linguistic analysis (Linguistics)

Size: 206 files, ca. 18.5 MB
The original recordings were made between 1999 and 2005. During this period, recordings were stored on computer and converted to MP3 format. The transcription work was conducted between 2000 and 2006, and the mark-up was added between 2003 and 2006.

The BASE corpus consists of 160 lectures and 39 seminars recorded in a variety of university departments. Holdings are distributed across four broad disciplinary groups: Arts and Humanities, Social Studies and Sciences, Physical Sciences, and Life and Medical Sciences. Each group is represented by 40 lectures and about 10 seminars. The lectures and seminars have been transcribed and annotated using a system devised in accordance with the TEI Guidelines. A DTD file named 'base.dtd' must be kept in the same folder as the corpus files. The transcription and mark-up conventions are described in the 'BASE manual' document (PDF format), and the holdings are described in the Excel spreadsheet 'BASE corpus holdings.xls'. The entire corpus contains 1.6 million tokens, and the files hold the transcripts of nearly 200 hours of recording.
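Because the DTD has to sit alongside the transcripts, it can be worth checking the folder layout before parsing anything. The following is a minimal Python sketch of such a check, not part of the BASE distribution. The function name `check_corpus_folder` is hypothetical, and the '.xml' file extension is an assumption, since this record does not state how the transcript files are named.

```python
from pathlib import Path


def check_corpus_folder(folder, dtd_name="base.dtd", pattern="*.xml"):
    """Return (dtd_present, transcript_names) for a corpus folder.

    Assumption: transcripts use the '.xml' extension; pass a different
    `pattern` if the files are named otherwise.
    """
    folder = Path(folder)
    # The DTD must be in the same folder as the corpus files.
    dtd_present = (folder / dtd_name).is_file()
    # Collect transcript file names in a stable, sorted order.
    transcripts = sorted(p.name for p in folder.glob(pattern))
    return dtd_present, transcripts
```

If the first returned value is False, 'base.dtd' should be copied into the folder before the files are opened in a validating XML parser.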

Nesi, H. and H. Basturkmen (2006) 'Lexical bundles and discourse signalling in academic lectures'. International Journal of Corpus Linguistics 11(3), 147-168

Thompson, P. (2006) 'A corpus perspective on the lexis of lectures, with a focus on Economics lectures'. In K. Hyland and M. Bondi (eds) Academic Discourse Across Disciplines. Bern: Peter Lang, pp. 253-270

Nesi, H. (2002) 'An English spoken academic word list', in Braasch, A. and Povlsen, C. (eds) Proceedings of the Tenth EURALEX International Congress, Copenhagen: Center for Sprogteknologi

Nesi, H. (2001) 'A corpus-based analysis of academic lectures across disciplines', in Cotterill, J. and Ife, A. (eds) Language Across Boundaries, London: Continuum Press

