Digital language resources in Oxford
- British National Corpus
- IVIE Corpus of English dialects
- Oxford English Corpus, access available on demand to Oxford researchers - please apply via the website
- Corpora from the OTA (Oxford Text Archive), hosted in Oxford since 1976
- Literary and linguistic electronic resources on OXLIP, now with a category for Linguistics. For further information, contact Johanneke Sytsema.
The University of Oxford holds Linguistic Data Consortium (LDC) licences for 1997, 2008, 2009, 2010, 2013 and 2015. Thanks to OUP, who paid for the 2009 licence in full for the University; to the Department of Computer Science, who paid for the 2010 and 2015 licences; and to the Phonetics Laboratory, for 1997 and 2013.
The following resources have been downloaded from the LDC and are now available online from IT Services for Oxford users. Consult the LDC catalogue for the full list of what is available; if there is something there that you are interested in and you don't see it in the list below, please get in touch with Martin Wynne (martin.wynne at it.ox.ac.uk). Please note that you are bound by the terms and conditions of the user agreements associated with each of these resources, which can be found on the LDC website.
- LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
- LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
- LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
- LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
- LDC2009T10 Language Understanding Annotation Corpus
- LDC2009T12 2008 CoNLL Shared Task Data
- LDC2009T13 English Gigaword Fourth Edition: disk 1 (4 GB) and disk 2 (4.5 GB)
- LDC2009T22 Arabic Newswire English Translation Collection
- LDC2009T23 FactBank 1.0
- LDC2009T24 OntoNotes Release 3.0
- LDC2009T30 Arabic Gigaword Fourth Edition (2.5 GB)
- LDC2008T18 New York Times Annotated Corpus (3.3 GB)
The UK is a member of the CLARIN European Research Infrastructure Consortium, which offers easy access to language data and tools for research in the humanities and social sciences. The latest information on activities and resources can be found on the CLARIN website. The University of Oxford is home to the co-ordination of the CLARIN-UK Consortium.
The CLARIN protected resources include resources in Czech, Danish, Dutch, English, German and Norwegian, and online interfaces to a number of other languages via the Corpuscle archive at the University of Bergen, including Abkhazian, Bulgarian, Older Scots, Persian and Slovenian, among others. In most cases, you need to log in to the sites listed at CLARIN protected resources by following the link to 'Log in via your institution' or 'EduGAIN', or simply 'Log in', and you will be redirected to WebAuth.
These resources represent just a small fraction of the numerous resources discoverable or made available via CLARIN, many of which are openly available to any user, while others require further steps to obtain permission to use them. The full picture can be seen via the Virtual Language Observatory, where you can search for the language or resource type of your choice.
There are further corpora, copies of which may be available in Oxford, but under a variety of different licensing and access arrangements (often on optical disk). Please get in touch to add to the list. For these resources, contact Martin Wynne unless otherwise stated.
- BNC XML version, BNC Baby (sampler on one CD)
- Corpus of Spoken Dutch
- Corpus of Spoken Japanese
- IPI-PAN corpus of Polish
- COLT Corpus of London Teenagers' Speech
- Gesprochenes Jiddisch Textzeugen einer Europäisch-jüdischen Kultur
- ICAME corpus collection
- East meets West: a compendium of multilingual resources (the TELRI CD, parallel aligned corpora in many European languages)
- An informal interest group in corpus linguistics meets termly. Subscribe to the mailing list here
- A course in Corpus Linguistics is held at IT Services in Hilary Term each year, usually on Thursdays from 12:30 to 13:30 - more details here (look for session names starting with 'Corpus').
- Developing Linguistic Corpora: a free online guide to good practice in creating a corpus
- Discovering, creating and using digital resources
- Licensing issues relating to digital resources
- Connecting your project or resources with national and international infrastructures
- Writing the Technical Plan of an AHRC Research Grant application
- Planning a digital project
- Research data management
- Data and information visualization
- Text encoding
- Corpus linguistics
See the research support pages at IT Services for more information.