Multimodal speech data acquisition with the use of EMA, fast-speed video cameras and a dedicated microphone array

Conference

Electromagnetic Articulography (EMA) is a technology created over two decades ago. EMA acquires spatiotemporal data from sensors placed on the tongue, providing information about the position, shape and dynamics of the tongue during the articulation of speech sounds. The articulograph is often supported by an audio recorder and a vision system.

In this paper, a novel system integrating EMA, audio and visual data recording is presented. The articulatory data were obtained with a Carstens AG500 articulograph. The vision system was built from three high-speed cameras (Gazelle GZL-CL-22C5M-C) manufactured by Point Grey, which registered the movements of markers attached to the speakers’ faces. The audio recorder consisted of a 16-channel microphone array and an electronic device that registered and processed the microphone signals. The microphone array made it possible to map sound sources on the speaker’s face. The simultaneous recording of signals from EMA, the video system and the audio recorder is controlled by a host program running on a computer and is supported by a synchronizer.

The electromagnetic articulograph registers signals from the EMA sensors, which return their spatiotemporal positions at a sampling frequency of 200 Hz. The spatial positions of the sensors attached to the tongue provide information about its shape and movements over time. The three cameras register the movements of the external articulators and organs (e.g. lips, jaw and cheeks) from the front and side views at a frame rate of 200 FPS. The circular microphone array records 16-channel audio at a 96 kHz sampling rate with 16-bit resolution.

During the recording sessions, the participants read aloud words displayed on a screen. An application on the host computer sends commands to the AG500, which in turn generates TTL synchronization signals for the external devices. These signals activate the audio recorder and the synchronizer of the video system. Articulographic and simple acoustic analysis is performed with phonEMAtool, software created in MATLAB. The software supports fast and convenient extraction of features of tongue movement during speech.
The application simultaneously displays the speech waveform, the position and orientation of the EMA sensors, and the phonetic segmentation. Before analysis, the AG500 data are pre-processed twice with a Savitzky-Golay filter to remove noise. The paper presents an example phonEMAtool analysis of the articulatory gestures involved in producing [m] in the Polish word Tamara. A second analysis applies beamforming to the audio signals to obtain three-dimensional images of the acoustic field distribution; the paper shows an example for the nasal consonant in the word Tamara [tamara]. The analysis indicated that the acoustic field is most intense in the nose region during the consonant [m] and in the mouth region during the vowel [a]. Registering the movements of the facial markers allows the positions of the external articulators to be reconstructed. With additional triangulation of the face using the Delaunay algorithm, differences between positions of the external articulators can be easily tracked.
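
The double Savitzky-Golay pre-processing mentioned above can be illustrated with a minimal Python sketch. The abstract does not state the window length or polynomial order used, so the 5-point quadratic kernel below is an assumption chosen only because its coefficients are a standard textbook case; the function names and the sample track are hypothetical, and the original analysis was carried out in MATLAB rather than Python.

```python
# 5-point quadratic Savitzky-Golay smoothing coefficients (a standard
# textbook kernel; the window and order used in the paper are not stated).
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(samples):
    """Smooth one coordinate track; the two samples at each end are copied."""
    out = list(samples)
    for i in range(2, len(samples) - 2):
        out[i] = sum(c * samples[i + k - 2] for k, c in enumerate(SG5))
    return out


def smooth_twice(samples):
    """Apply the filter in two passes, mirroring the double pre-processing."""
    return savgol5(savgol5(samples))


# Hypothetical x-coordinate of one tongue sensor (mm), sampled at 200 Hz.
track = [0.0, 1.0, 4.2, 9.0, 15.7, 25.0, 36.1, 49.0]
smoothed = smooth_twice(track)
```

A quadratic kernel leaves any locally quadratic trajectory unchanged, so slow articulatory movement is preserved while sample-to-sample jitter is reduced. A real pipeline would also need a deliberate choice for the end samples, which this sketch simply copies.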

The measurement system described in this paper is effective and allows the vocal tract to be examined in three ways: tongue movements, acoustic field intensity distribution and external articulator movements. A particularly useful tool is the dedicated acoustic camera based on a multi-channel audio recorder and a microphone array. The results obtained with this equipment are unique and show great research and application potential.
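
How a microphone array can act as an acoustic camera may be sketched with delay-and-sum beamforming, the simplest beamforming scheme. The abstract does not say which beamforming algorithm was used, so this is an illustration under stated assumptions and not the authors' method: the array radius and speed of sound are assumed values, and all function names are hypothetical. Only the 16 microphones, the circular layout and the 96 kHz sampling rate come from the text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value
SAMPLE_RATE = 96_000    # Hz, from the text
NUM_MICS = 16           # from the text
ARRAY_RADIUS = 0.15     # m, hypothetical; not given in the text


def mic_positions(radius=ARRAY_RADIUS, count=NUM_MICS):
    """Microphones evenly spaced on a circle in the z = 0 plane."""
    return [(radius * math.cos(2 * math.pi * m / count),
             radius * math.sin(2 * math.pi * m / count),
             0.0) for m in range(count)]


def steering_delays(focus, mics):
    """Delay (in samples) per channel that aligns sound from the focus point."""
    dists = [math.dist(focus, mic) for mic in mics]
    farthest = max(dists)
    # Nearer microphones hear the source earlier, so they are delayed more.
    return [(farthest - d) / SPEED_OF_SOUND * SAMPLE_RATE for d in dists]


def delay_and_sum(channels, delays):
    """Average the channels after delaying each by its rounded delay."""
    shifts = [round(d) for d in delays]
    start = max(shifts)
    stop = min(len(ch) for ch in channels)
    return [sum(ch[n - s] for ch, s in zip(channels, shifts)) / len(channels)
            for n in range(start, stop)]


# Steer towards a hypothetical point 30 cm in front of the array.
mics = mic_positions()
delays = steering_delays((0.10, 0.0, 0.30), mics)
```

Repeating the steering over a grid of candidate focus points and recording the output power at each one yields an intensity map of the kind described for [m] and [a]. Rounding delays to whole samples is a simplification; fractional-delay interpolation would give finer spatial resolution.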

Reference Swiecinski, R. J. (2016). Multimodal speech data acquisition with the use of EMA, fast-speed video cameras and a dedicated microphone array. In Proceedings of the 23rd International Conference “Mixed Design of Integrated Circuits and Systems” (MIXDES 2016) (pp. 415-418).

Publication date

Jan 2016
