The University of Washington/Northwestern University (UW/NU) Corpus 2.0

This page contains information about the second release of the UW/NU corpus, UW/NU 2.0. All information found here is also contained in the README file included with the corpus. You can download the entire corpus (in compressed .tar.gz format, 2.9 GB) here. Note that this corpus does NOT contain recordings from UW/NU v.1.0, which was released as the PN/NC corpus[1].

Citation

Please cite as:
Panfili, L. M., Haywood, J., McCloy, D. R., Souza, P. E., and Wright, R. A. (2017). The UW/NU Corpus, Version 2.0. http://depts.washington.edu/phonlab/resources/pnnc/pnnc2

Audio files

The corpus includes 22,460 audio files in WAV format, sampled at 44.1 kHz with 16-bit depth, and high-pass filtered from 60 to 22,000 Hz and smoothed at 100 Hz. Files are readings of the IEEE “Harvard” sentences by 33 talkers from two dialect regions of American English: the Pacific Northwest (11 males, 9 females) and the Northern Cities (7 males, 6 females). Pacific Northwest speakers read the full set of 720 sentences, while Northern Cities speakers read a subset of 620 sentences. Unlike the original PN/NC corpus, this set of audio files has not been RMS-normalized.
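The stated file count follows from the talker and sentence counts above, which gives a quick way to check a download for completeness. The sketch below only restates that arithmetic; the constant and function names are illustrative and are not part of the corpus.

```python
# Talker and sentence counts as stated in the corpus description.
TALKERS = {
    "PN": {"male": 11, "female": 9},  # Pacific Northwest
    "NC": {"male": 7, "female": 6},   # Northern Cities
}
SENTENCES_READ = {"PN": 720, "NC": 620}


def total_talkers(talkers=TALKERS):
    """Return the number of talkers across both dialect regions."""
    return sum(sum(by_gender.values()) for by_gender in talkers.values())


def expected_file_count(talkers=TALKERS, sentences=SENTENCES_READ):
    """Return the number of recordings implied by the counts above."""
    return sum(
        sum(by_gender.values()) * sentences[region]
        for region, by_gender in talkers.items()
    )


if __name__ == "__main__":
    print(total_talkers())        # 20 + 13 talkers
    print(expected_file_count())  # 20 * 720 + 13 * 620 recordings
```

Comparing the second value with the number of .wav files actually found on disk is one simple sanity check after unpacking the archive.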

TextGrids

A set of 22,460 time-aligned transcriptions is included in the corpus. These are TextGrids for use with the Praat software[2]. They were generated automatically by the Penn Phonetics Lab Forced Aligner[3] and are known to contain misalignments. They have NOT been checked or corrected by humans (much less by well-trained phoneticians or speech scientists). Use at your own risk.

Sentences

The sentence texts are drawn from the IEEE “Harvard” set.[4] Transcripts of the 720 sentences (along with their identification numbers) are included in the corpus in tab-delimited format. Individual transcript files for each sentence are also included. Sentence identification numbers are derived from the “list-sentence” notation of the original IEEE sentence lists: for example, sentence 01-07 corresponds to sentence #7 from list #1 of the original numbering scheme.
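As a rough illustration of the “list-sentence” notation, the following Python sketch splits a sentence ID into its list and sentence numbers and reads tab-delimited transcript lines into a dictionary. The column order (ID first, then text) and the function names are assumptions made for this example; check the transcript file shipped with the corpus before relying on them.

```python
import csv
import io


def parse_sentence_id(sentence_id):
    """Split an ID such as '01-07' into (list_number, sentence_number)."""
    list_part, sentence_part = sentence_id.split("-")
    return int(list_part), int(sentence_part)


def read_transcripts(tsv_text):
    """Map sentence IDs to sentence text from tab-delimited lines.

    Assumption: each line holds the ID in the first column and the
    sentence text in the second. Lines with fewer columns are skipped.
    """
    reader = csv.reader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # keep apostrophes and quote marks literal
    )
    return {row[0]: row[1] for row in reader if len(row) >= 2}
```

To use it, read the contents of the transcript file into a string, pass that string to read_transcripts, and look up entries by ID; parse_sentence_id("01-07") then yields list 1, sentence 7.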

Filename conventions

The first two characters in the filenames reflect the dialect region of the talker (PN = Pacific Northwest, NC = Northern Cities). The third character indicates talker gender, and the fourth through sixth characters are meaningless digits, serially assigned to talkers during corpus creation. (Note that due to increased subject numbers, these digits have increased from two characters in UW/NU v.1.0 to three.) After an underscore, the sentence identification number comprises the remainder of the filename. For example, file PNM002_01-07.wav is a recording of Pacific Northwest Male #002 reading sentence number 01-07.
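The naming convention can be unpacked with a regular expression. The sketch below is one possible reading of it, not an official tool: the gender codes M and F are assumed (only M appears in the example above), the pattern tolerates two or three talker digits, and the names FILENAME_PATTERN, Recording, and parse_filename are hypothetical.

```python
import re
from typing import NamedTuple

FILENAME_PATTERN = re.compile(
    r"^(?P<region>PN|NC)"          # dialect region
    r"(?P<gender>[MF])"            # talker gender (M/F codes are assumed)
    r"(?P<talker>\d{2,3})"         # serially assigned talker digits
    r"_(?P<sentence>\d{2}-\d{2})"  # sentence ID in list-sentence notation
    r"\.(?:wav|TextGrid)$"         # audio file or its TextGrid
)

REGIONS = {"PN": "Pacific Northwest", "NC": "Northern Cities"}


class Recording(NamedTuple):
    region: str
    gender: str
    talker: int
    sentence: str


def parse_filename(name):
    """Return the fields encoded in a corpus filename, or raise ValueError."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized corpus filename: {name!r}")
    return Recording(
        region=REGIONS[match["region"]],
        gender=match["gender"],
        talker=int(match["talker"]),
        sentence=match["sentence"],
    )
```

Because the talker digits carry no meaning beyond identification, the parser converts them to an integer only for convenience; keeping them as a string would work equally well.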

Speech Errors

Some of the recordings contain speech errors, such as non-standard pronunciations, unnatural delivery, or pauses. These items (.wav and .TextGrid files) are stored in a separate subdirectory for each speaker, called [speakername]-err. If this directory is not present for a given speaker, none of that speaker's recordings contain speech errors.
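When an analysis should exclude recordings with speech errors, the -err directories can be used to separate them out. The sketch below assumes only that flagged files sit somewhere beneath a directory whose name ends in -err; the rest of the directory layout and the function names are hypothetical.

```python
from pathlib import Path


def is_error_item(relative_path):
    """Return True if any directory above the file has a name ending in '-err'."""
    return any(part.endswith("-err") for part in Path(relative_path).parts[:-1])


def split_by_error_status(corpus_root):
    """Return (clean, flagged) lists of .wav paths found under corpus_root."""
    root = Path(corpus_root)
    clean, flagged = [], []
    for wav in sorted(root.rglob("*.wav")):
        # Test the path relative to the root so that the location of the
        # corpus on disk cannot trigger a false match.
        relative = wav.relative_to(root)
        (flagged if is_error_item(relative) else clean).append(wav)
    return clean, flagged
```

The same approach applies to the .TextGrid files by changing the glob pattern.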

References

[1] McCloy, D. R., Souza, P. E., Wright, R. A., Haywood, J., Gehani, N., & Rudolph, S. (2013). The PN/NC corpus. Version 1.0. http://depts.washington.edu/phonlab/resources/pnnc/pnnc1

[2] Boersma, P., & Weenink, D. (2013). Praat: Doing phonetics by computer. http://www.praat.org/

[3] Yuan, J., & Liberman, M. (2008). The Penn Phonetics Lab forced aligner. http://www.ling.upenn.edu/phonetics/p2fa/

[4] Rothauser, E. H., Chapman, W. D., Guttman, N., Hecker, M. H. L., Nordby, K. S., Silbiger, H. R., Urbanek, G. E., & Weinstock, M. (1969). IEEE recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, 17, 225–246. DOI: 10.1109/TAU.1969.1162058