CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment

Si-Ioi Ng, Cymie Wing-Yee Ng, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong


This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged 3 to 6 years. The speech materials comprise 130 words of 1 to 4 syllables in length. The speakers include both typically developing (TD) children and children with speech disorders. The corpus is intended to support scientific and clinical research, as well as technology development related to child speech assessment. The design of the corpus, including word selection, participant recruitment, the data acquisition process, and data pre-processing, is described in detail. The results of acoustic analysis are presented to illustrate the properties of child speech. Potential applications of the corpus in automatic speech recognition, phonological error detection, and speaker diarization are also discussed.


DOI: 10.21437/Interspeech.2020-2148

Cite as: Ng, S., Ng, C.W., Wang, J., Lee, T., Lee, K.Y., Tong, M.C. (2020) CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment. Proc. Interspeech 2020, 424-428, DOI: 10.21437/Interspeech.2020-2148.


@inproceedings{Ng2020,
  author={Si-Ioi Ng and Cymie Wing-Yee Ng and Jiarui Wang and Tan Lee and Kathy Yuet-Sheung Lee and Michael Chi-Fai Tong},
  title={{CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={424--428},
  doi={10.21437/Interspeech.2020-2148},
  url={http://dx.doi.org/10.21437/Interspeech.2020-2148}
}