Arabic Speech Corpus


Nawar Halabi


Distributed by the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.



Editorial Practice

Encoding format: audio .wav files; text utterances .lab files; alignments .textgrid files

OTA keywords

Linguistic corpora

LC keywords

Linguistics analysis (Linguistics)

  • designation: CollectionSound
  • size: 5,444 files: ca. 1.3 GB
Creation Date


Source Description

This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus has a high-quality, natural-sounding voice.

An updated version of the corpus replaced the original version in the Oxford Text Archive on 8 May 2017. The updated version adds 18 minutes of material, recorded in a separate session and used for evaluation, and corrects some errors in the phonetic transcription.

The corpus has a dedicated website from which it can be downloaded at


The resource is a speech corpus, with digital audio files, text transcripts, and files containing time stamps of the phoneme boundaries.

  • 1813 .wav files containing spoken utterances.
  • 1813 .lab files containing text utterances.
  • 1813 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. These files can be opened using Praat software.
  • phonetic-transcript.txt, in which every line has the form "[wav_filename]" "[Phoneme Sequence]".
  • orthographic-transcript.txt, in which every line has the form "[wav_filename]" "[Orthographic Transcript]". The orthography is in Buckwalter transliteration, which is easier to handle in software that cannot read Arabic script and can easily be converted back to Arabic.
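
As an illustration of the two transcript files described above, the following Python sketch reads lines of the form "[wav_filename]" "[text]" and converts Buckwalter transliteration back to Arabic script. It is a minimal sketch under stated assumptions, not code shipped with the corpus: the function names are hypothetical, the files are assumed to be UTF-8 text with one quoted pair per line, and the mapping is the standard Buckwalter table, which may differ in detail from the exact variant used in the corpus.

```python
import re

# Standard Buckwalter symbols, listed in Unicode order so the table can be
# built from contiguous code point ranges (assumption: the corpus follows
# the standard scheme without local modifications).
_BW_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 .. U+063A
_BW_REST = "fqklmnhwYyFNKaui~o"             # U+0641 .. U+0652

BW_TO_ARABIC = {c: chr(0x0621 + i) for i, c in enumerate(_BW_LETTERS)}
BW_TO_ARABIC.update({c: chr(0x0641 + i) for i, c in enumerate(_BW_REST)})
BW_TO_ARABIC.update({
    "_": "\u0640",  # tatweel
    "`": "\u0670",  # dagger alif
    "{": "\u0671",  # alif wasla
})

# One transcript line: "[wav_filename]" "[text]"
_LINE_RE = re.compile(r'^"([^"]*)"\s+"(.*)"$')


def buckwalter_to_arabic(text):
    """Map each Buckwalter symbol to Arabic; leave other characters as-is."""
    return "".join(BW_TO_ARABIC.get(ch, ch) for ch in text)


def parse_transcript_line(line):
    """Split one transcript line into (wav_filename, text)."""
    match = _LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected transcript line: {line!r}")
    return match.group(1), match.group(2)


def read_transcript(path):
    """Return a dict mapping wav filename to transcript text."""
    with open(path, encoding="utf-8") as handle:
        return dict(parse_transcript_line(l) for l in handle if l.strip())


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    name, text = parse_transcript_line('"sample.wav" "kitAb"')
    print(name, buckwalter_to_arabic(text))
```

Characters outside the table (spaces, digits, punctuation) pass through unchanged, so the same helper can be applied to whole utterances. Only the orthographic transcript should be passed through buckwalter_to_arabic; the phoneme sequences use their own label set.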

Permanent URL