A Spoken Corpus of Cameroon Pidgin English: pilot study

Author

Dr. Melanie Green, University of Sussex; Dr. Miriam Ayafor, University of Yaoundé I; Dr. Gabriel Ozon, University of Sheffield

Availability

Distributed by the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Languages

Cameroon Pidgin

Editorial Practice

Encoding format: audio in .wav files (RIFF little-endian, Microsoft PCM, 44.1 kHz, 16-bit stereo); plain-text transcripts in .txt files; POS-tagged transcripts in .txt files
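The stated audio format can be checked against a downloaded file. The following is a minimal sketch using Python's standard `wave` module; the function names `describe_wav` and `matches_stated_format` are illustrative and not part of the corpus documentation.

```python
import wave

# Expected values taken from the encoding description above:
# stereo (2 channels), 16-bit samples (2 bytes), 44.1 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "frame_rate_hz": 44100}


def describe_wav(path):
    """Read the header of a PCM .wav file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_stated_format(info):
    """Return True if the file is 16-bit stereo at 44.1 kHz."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `matches_stated_format(describe_wav(path))` on one of the corpus sound files should return `True` if the file follows the description; the actual file names are listed in the accompanying documentation and are not assumed here.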

OTA keywords

Linguistic corpora
Corpus
Speech--Research

LC keywords

Linguistics
Linguistic analysis (Linguistics)
Speech--Synthesis
Pidgin English
Code switching (Linguistics)

Extent

  • designation: CollectionSound
  • size: 264 files, ca. 2.6 GB

Creation Date

The corpus was created between September 2014 and August 2016.

Source Description

This resource is a 240,000-word corpus of spoken Cameroon Pidgin English (CPE), a widely used yet stigmatised and largely uncodified pidgin/creole variety.

The corpus consists of transcriptions of private and public dialogues and monologues, with mark-up and POS-tagging, together with the accompanying sound files. The recordings were made in five locations in Cameroon (Bamenda, Buea, Douala, Kumba and Yaoundé), allowing some insight into regional variation. Text categories and the proportions of monologue and dialogue follow those of the International Corpus of English (ICE) project, which makes the corpus directly comparable with existing corpora of post-colonial varieties of English.

  • Spelling: since there is no standardised orthography for CPE, the orthography adopted for this project is based on that developed by Ayafor (2014), and was kept under review during the course of the project.
  • Annotation: mark-up was added to the transcriptions following ICE guidelines for the annotation of spoken texts. Standard mark-up symbols denote text units, speaker identification, overlapping speech, unclear words, uncertain transcriptions, anthropophonics, editorial comments, foreign words and indigenous language words.
  • Tagging: a tagset for CPE was devised based on CLAWS 5. Tagging was initially carried out manually, and then by means of TreeTagger. A third of the corpus has been post-checked, with an accuracy rate of 94%.
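The record does not state how the 94% accuracy figure was computed. A common reading, assumed here, is the proportion of tokens whose automatic tag agrees with the tag assigned during post-checking. The sketch below illustrates that calculation; the function name and the sample tags (drawn from the general CLAWS 5 tagset, not from the CPE tagset itself) are illustrative only.

```python
def tagging_accuracy(automatic_tags, checked_tags):
    """Proportion of tokens whose automatic tag equals the post-checked tag.

    Assumption: both lists are aligned token by token.
    """
    if len(automatic_tags) != len(checked_tags):
        raise ValueError("tag sequences must be the same length")
    if not checked_tags:
        raise ValueError("no tokens to compare")
    matches = sum(a == c for a, c in zip(automatic_tags, checked_tags))
    return matches / len(checked_tags)


# Illustrative tags only; the CPE tagset is described in the tagging guide.
automatic = ["PNP", "VVB", "AT0", "NN1", "VVB"]
checked = ["PNP", "VVB", "AT0", "NN1", "NN1"]
print(f"{tagging_accuracy(automatic, checked):.0%}")  # 80%
```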

The corpus is aimed at providing a resource for linguistic description and comparison. It allows linguists to identify and describe recurring grammatical patterns, as well as the phonology of the language (given the availability of sound files deposited with the text files). It also allows comparison of CPE with other pidgin/creole languages, other Cameroonian and West African languages, and other varieties of post-colonial English. Furthermore, the corpus provides an exceptional resource for the study of general/theoretical linguistics, creolistics, typology, language contact and change, sociolinguistics and discourse analysis.

The corpus contains 80 sound recordings of monologues (scripted and unscripted) and dialogues (public and private). Each sound file (in .wav format) is 10-15 minutes in length. Each recording has been transcribed (approximately 3,000 words per transcription) and annotated. Transcriptions are provided in two formats: (a) a plain transcription (with basic mark-up indicating speaker turns, overlaps, etc.), and (b) a POS-tagged version, which adds POS tags to the plain transcription.
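The figures above are consistent with the overall corpus size given earlier. As a quick check, using only the numbers stated in this record (the variable names are illustrative):

```python
recordings = 80
words_per_transcript = 3_000          # approximate, as stated above
minutes_low, minutes_high = 10, 15    # stated length of each sound file

total_words = recordings * words_per_transcript
hours_low = recordings * minutes_low / 60
hours_high = recordings * minutes_high / 60

print(total_words)                                    # 240000
print(f"{hours_low:.1f} to {hours_high:.1f} hours")   # 13.3 to 20.0 hours
```

That is, 80 transcriptions of roughly 3,000 words give the 240,000-word total, and the recordings amount to roughly 13 to 20 hours of speech.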

The language of the recordings is Cameroon Pidgin English, with codeswitching into English, French, and indigenous Cameroonian languages.

The accompanying documentation includes (i) a list of submitted files, (ii) a list of participant data, (iii) a tagging guide, and (iv) a word list and spelling guide.

Notes

The POS-tagging was carried out by Sarah Fitzgerald.

Permanent URL

http://purl.ox.ac.uk/ota/2563