British National Corpus, Baby edition

  • BNC Baby

BNC Consortium


Distributed by the University of Oxford under the BNC User Licence. Clicking to download implies acceptance of the licence conditions.

Download: zip



Editorial Practice

Encoding format: TEI XML

OTA keywords

Linguistic corpora

LC keywords

Linguistic analysis (Linguistics)

  • designation: CollectionText
  • size: 182 files, ca. 195 MB
Creation Date


Source Description

Instant secure online access (currently only available for UK users with Shibboleth, and via EduGAIN)


The British National Corpus is a snapshot of British English in the early 1990s. It is:
  • a sample corpus: composed of text samples generally no longer than 45,000 words.
  • a synchronic corpus: it includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.
  • a general corpus: not specifically restricted to any particular subject field, register or genre.
  • a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
  • a mixed corpus: it contains examples of both spoken and written language.

BNC Baby consists of four one-million-word genre-based subsets (academic, fiction, newspaper and conversation), in XML with added lemma information and additional, simplified POS-tags for each word. The corpus is described in full at
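
As a rough illustration of how the per-word lemma and simplified POS information might be read, here is a minimal Python sketch. It assumes each word is a `w` element carrying a lemma in an `hw` attribute, a simplified tag in `pos`, and a detailed tag in `c5`; those attribute names, the sample sentence, and the helper `read_words` are assumptions made for illustration and should be checked against the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment written for this example. The element and attribute
# names (w, c, hw, pos, c5) are assumptions, not taken from the record above.
SAMPLE = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN2" hw="child" pos="SUBST">children </w>
  <w c5="VBD" hw="be" pos="VERB">were </w>
  <w c5="VVG" hw="sing" pos="VERB">singing</w>
  <c c5="PUN">.</c>
</s>
"""


def read_words(xml_text):
    """Return one dict per <w> element: surface form, lemma, simplified and detailed tag."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "form": (w.text or "").strip(),
            "lemma": w.get("hw"),
            "pos": w.get("pos"),
            "c5": w.get("c5"),
        }
        for w in root.iter("w")  # punctuation (<c>) is skipped on purpose
    ]


words = read_words(SAMPLE)

# Count how often each simplified POS tag occurs in the fragment.
pos_counts = Counter(w["pos"] for w in words)

for w in words:
    print(f'{w["form"]}\t{w["lemma"]}\t{w["pos"]}\t{w["c5"]}')
print(dict(pos_counts))
```

The same function could be applied to the text of a whole corpus file, since `root.iter("w")` walks every word element regardless of nesting depth; for files of this size a streaming parser such as `ET.iterparse` may be the more memory-friendly choice.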

Permanent URL