Second International Conference on Spoken Language Processing (ICSLP'92)
Banff, Alberta, Canada
Attempts to improve the performance of a connectionist network trained to detect selected phonetic features in multispeaker connected speech indicate some of the limitations on the information available at a peripheral level of speech analysis. The three-layer feedforward network has 12 detector outputs and is trained over large subsets of sentences from the MIT Ice Cream database. Its input consists of smoothed spectral vectors sampled at 15 msec intervals. Little contextual information is available to the detectors, since each vector has an effective window of about 30 msec. Overall, the detector network generalizes very well to new speakers and new sentences: average a' drops only from .94 on training data to .93 on test data. Frequently occurring features like sonorance are better discriminated than infrequent ones like rhotic, mainly because the learning algorithm gives greater weight to the many negative training instances than to the few positive ones. Discrimination of infrequent features is improved by weighting the learning rate for positive and negative vectors in inverse proportion to their frequency of occurrence. Performance was also improved modestly by adding preceding and following vectors to the input. Several other modifications yielded little or no performance improvement.
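The two modifications reported as helpful, inverse-frequency weighting of the learning rate and the addition of neighbouring vectors to the input, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the base rate of 0.1, the normalization that leaves a balanced class at the base rate, and the repetition of edge frames as padding are all assumptions made for illustration; the abstract specifies none of them.

```python
from collections import Counter


def inverse_frequency_rates(labels, base_rate=0.1):
    """Per-class learning rates inversely proportional to class frequency.

    Hypothetical helper: the normalization (a perfectly balanced class
    keeps base_rate) is an assumption, not taken from the paper.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {
        cls: base_rate * n_total / (n_classes * count)
        for cls, count in counts.items()
    }


def stack_context(frames, left=1, right=1):
    """Concatenate each spectral vector with its preceding and following vectors.

    Hypothetical helper: edge frames are repeated as padding, which is an
    assumption; the paper does not say how utterance boundaries were handled.
    """
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)  # clamp to valid frame indices
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy data: 9 negative vectors and 1 positive vector for a rare feature.
labels = [0] * 9 + [1]
rates = inverse_frequency_rates(labels)

# Three one-dimensional "spectral vectors" with one frame of context each side.
context = stack_context([[1], [2], [3]])
```

In this toy case the single positive instance receives a learning rate nine times that of each negative instance, so the two classes contribute equally to the weight updates in aggregate. Each stacked input is three times the length of the original vector.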
Bibliographic reference. Bradshaw, Gary / Bell, Alan (1992): "Towards the performance limits of connectionist feature detectors", In ICSLP-1992, 467-470.