This study was conducted according to the ethical principles based on the WMA Declaration of Helsinki and to the current legal provisions. It was approved by the ethics committee of the TU Dresden, and informed consent was obtained from the subjects.

Subjects and speech sounds

Vocal tract shapes of sustained speech sounds were acquired from two native German speakers, one male and one female. The male subject (s1) was 39 years old, 1.85 m tall, and grew up in the state Mecklenburg-Vorpommern (Mecklenburg-Western Pomerania) in Germany. He was a professional phonetician and speech scientist and lecturer at the university.

The female subject (s2) was 32 years old, 1.64 m tall, and grew up in the state Sachsen (Saxony) in Germany. She studied speech science, which included professional speech training. Furthermore, she is a trained singer and has been singing in several semi-professional choirs since her childhood.

Each subject produced 22 sustained speech sounds while a volumetric MRI scan of their vocal tract was performed. The data were processed and analyzed as discussed below. The list of speech sounds is given in Table 1 and contains 8 tense vowels, 8 lax vowels, and 6 consonants. The subjects were asked to pronounce each sound as in the word given in the table. The two rightmost columns contain the unique labels for the vocal tract shapes used in the dataset and in the remainder of this paper. In the following, they will be referred to by the placeholder XX.

Table 1 List of vocal tract shapes (sustained phonemes) and their labels (XX) used in the file names and folder names.

Acquisition of MRI and reference audio data

The MR images of the vocal tract were acquired on a Siemens 3 T TIM Trio with a 12-channel head coil combined with additional neck elements. The sequence was a sagittal 3D volume interpolated gradient echo sequence (VIBE – fl3d-vibe) with 1.2 mm × 1.2 mm × 1.8 mm resolution, 44 sequential slices, matrix size 192 × 192, field of view = 230 mm × 230 mm, repetition time TR = 5.53 ms, echo time TE = 2.01 ms, flip angle 9°, Q-fatsat, 22 lines per shot, 7/8 phase partial Fourier, 6/8 slice partial Fourier, ipat factor 2 (PE only), 24 reference lines and a bandwidth of 220 Hz/pixel. The acquisition time for one volume was 14 s, during which the speaker produced and sustained the corresponding speech sound. All 22 sounds per speaker were acquired in one session. After each scan, the image quality was carefully checked with respect to blurry parts or motion artifacts due to involuntary movements of the articulators during the 14 s scan time. Each scan was repeated as often as necessary to obtain a clean image.

Before the MRI sessions, the two subjects practiced sustaining the speech sounds for the required duration and with a High German quality. Particular attention was paid to producing the lax vowels with the correct vowel quality, as they are normally produced as short vowels in German.

In addition to the MRI data, audio recordings of the speech sounds were obtained from both subjects. These recordings were not made during the MRI scans of the vocal tract because of the high noise level in the scanner. Instead, they were made in a separate session in a soundproofed audio studio using a studio condenser microphone (M930 by Microtech Gefell) connected to a mixing desk (Behringer Eurorack MX 1602) for power supply and preamplification. The signals were digitized at 44,100 Hz with 16 bit resolution using the audio interface 896HD by MOTU and recorded with the software Audacity 2.2.0 on a standard desktop computer.

The subjects were asked to produce the sounds as similarly as possible to the situation in the MRI scanner and to sustain the sounds for at least 10 s. The recordings were then symmetrically cropped around their center to a length of 1,000 ms for the tense vowels (which normally occur as long vowels in German) and the fricatives, and to a length of 200 ms for the lax vowels. They were then peak-normalized and windowed with a Tukey (tapered cosine) window for a fade-in and fade-out of 20 ms each. Finally, the audio signals were padded with 200 ms of silence at the beginning and end. The resulting audio signals are contained in the dataset as the files XX-reference-sound.wav.
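The post-processing chain described above (center crop, peak normalization, 20 ms fades, 200 ms padding) can be illustrated with a minimal Python sketch. It is not the code used for the dataset: the function name `postprocess` and its parameters are hypothetical, and the fades are implemented as plain raised-cosine ramps, which is assumed here to be equivalent to the tapered part of a Tukey window.

```python
import math

def postprocess(samples, fs, keep_ms, fade_ms=20.0, pad_ms=200.0):
    """Crop around the center, peak-normalize, fade in/out, and pad with silence.

    samples: list of floats (mono signal), fs: sampling rate in Hz.
    All names are illustrative; they do not come from the dataset tools.
    """
    n_keep = int(round(fs * keep_ms / 1000.0))
    n_fade = int(round(fs * fade_ms / 1000.0))
    n_pad = int(round(fs * pad_ms / 1000.0))

    # Symmetric crop around the center of the recording.
    start = (len(samples) - n_keep) // 2
    seg = list(samples[start:start + n_keep])

    # Peak normalization to a maximum absolute value of 1.0.
    peak = max(abs(v) for v in seg)
    if peak > 0.0:
        seg = [v / peak for v in seg]

    # Raised-cosine fade-in and fade-out (tapered-cosine edges).
    for i in range(n_fade):
        w = 0.5 * (1.0 - math.cos(math.pi * i / n_fade))
        seg[i] *= w
        seg[-1 - i] *= w

    # Silence before and after the stimulus.
    return [0.0] * n_pad + seg + [0.0] * n_pad
```

For a tense vowel, a call such as `postprocess(x, 44100, 1000.0)` would return a signal of 1,400 ms; for a lax vowel, `keep_ms=200.0` would give 600 ms.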

Measurement of maxilla and mandible shapes

For each subject, plaster models of the maxilla and mandible were created by means of alginate impressions according to the standard procedure used in dentistry39. The plaster models were 3D-scanned to obtain 3D boundary models of the objects (see top row of Fig. 1). Scanning was performed with a NextEngine 3D laser scanner and the corresponding NextEngine ScanStudio software. Each plaster model was scanned both from a horizontal (as in Fig. 1) and a vertical view (standing on the posterior side). In each position, the model was scanned from 7 angles in steps of 51.4° on the turntable with the following settings: Points per Square Inch: 4400 (SD); Target: Normal; Range: Macro. The individual scans per object were then aligned using corresponding points on the surface, fused into a single boundary model, and exported as a binary STL file. These are contained as the files s1-mandible.stl, s1-maxilla.stl, s2-mandible.stl, and s2-maxilla.stl in the dataset.

Segmentation of the vocal tract

To obtain the inner surface representations of the vocal tracts from the MRI data, each vocal tract was processed according to the steps below. All required software tools were free and open source.

  1. The boundary models of the maxilla and mandible were merged with the MRI data of the vocal tract shape using the software 3D Slicer40. The MRI voxel data were first upsampled to obtain smaller voxels with a uniform edge length of 0.25 mm. Then the triangle meshes of the maxilla and mandible were carefully positioned with respect to the MRI data using affine transforms. Finally, all voxels contained within the closed surfaces of the maxilla and mandible were set to a constant mid-level gray value.

  2. The high-resolution voxel data from step 1 were used to segment the vocal tract with the software ITK-SNAP41. The 3D segmentation was performed semi-automatically based on the implemented active contour method42. The nasal cavity was excluded from the segmentation, even when there was a slight velo-pharyngeal opening for some vowels. The segmentation result was a boundary model of the air-filled pharyngeal and oral cavities that extended slightly into the free space in front of the open mouth.

  3. The closed boundary model obtained in step 2 was opened at the glottal end and the mouth using the software Blender. The glottal end was opened with a cutting plane through the vocal folds, while the mouth was opened with curved cutting planes that were fitted to the shape of the lips.

  4. The surface model opened at the glottis and the mouth was manually smoothed with a sculpting tool in Blender and with a Laplacian filter in the software Meshlab43. Care was taken that important details like the teeth, the uvula, and the epiglottis were not accidentally removed.

The triangle meshes of the inner vocal tract surfaces are provided as the files XX-inner-surface.stl in the dataset.

Creation of 3D-printable models and 3D-printing

To obtain 3D-printable models of the vocal tract, the inner surface meshes were converted into closed solids by giving the vocal tract walls a finite thickness. For each model, we first created an offset mesh as the exterior shell for the solid using the software Meshlab. The offset mesh was created at a distance of 4 mm outwards from the inner surface mesh for a wall thickness of 4 mm, and then trimmed using Blender. The outer shell was then smoothed and fused with the inner shell using Blender. The gaps between the meshes were closed and a uniform adapter (socket) was added to the glottal end of the model. The adapter was designed as a disk-shaped ring with a thickness of 4 mm and inner and outer diameters of 10 mm and 30 mm, respectively. The upper side of the ring was positioned flush with the glottal plane (inlet of the vocal tract). Hence, the glottal opening of all models consisted of a hole with 10 mm diameter. The complete set of volume models including the adapter is supplied as the files XX-printable-model.stl in the dataset. For easier 3D-printing, the models have also been halved through the midsagittal plane, and the two halves are represented by the files XX-printable-left-half.stl and XX-printable-right-half.stl.

Each vocal tract half was 3D-printed on an Ultimaker 3 printer, which uses fused deposition modeling and has two extruders. The vocal tract walls were printed with the material PLA (polylactic acid, brand “innofil”) from one extruder, and support structures were printed with the water-soluble material PVA (polyvinyl alcohol, using the material sold by Ultimaker) from the other extruder. The layer thickness was 0.1 mm and the infill ratio was 100% for PLA (i.e., the walls were “solid” inside) and 20% for PVA. Both extruders had a nozzle diameter of 0.4 mm. The vocal tract halves were oriented with the cutting plane, i.e., the midsagittal plane, on the build plate. The build plate was heated to 60 °C with a heating bed for better adhesion. The print time was about 20 h per half. The material consumption was about 50 g PLA and 10 g PVA per half, i.e., the mass of a complete vocal tract model was about 100 g. After printing all objects and dissolving their support structures, the two halves of each vocal tract model were carefully sanded at the side that adhered to the build plate and glued together with cyanoacrylate adhesive (“superglue”).

Due to the PLA material used for 3D printing, the walls of the vocal tract models were essentially hard compared to the soft walls of a human vocal tract. For the sake of reproducibility, we made no attempt here to create more realistic soft walls, because suitable methods to achieve this for detailed vocal tract geometries have not been explored yet. However, future studies could readily use the models in the dataset to create and examine soft-walled models.

Measurement of the volume velocity transfer functions

For each of the 44 physical vocal tract models, the volume velocity transfer function (VVTF) was measured. The VVTF \(H(\omega)\) is often used to characterize vocal tract acoustics12,44,45 and is usually defined as the complex ratio of the volume velocity \(U_{2}(\omega)\) through the lips to the volume velocity \(U_{1}(\omega)\) through the glottis, i.e.,

$$H(\omega)=U_{2}(\omega)/U_{1}(\omega). \tag{1}$$

Here, the transfer functions were determined for the case of an infinite glottal impedance, i.e., a closed glottal end of the tubes. The determination of the VVTF based on Eq. (1) is technically very challenging46, because it would require a broadband volume velocity source \(U_{1}(\omega)\) at the glottis and a broadband volume or particle velocity sensor at the mouth. A simpler yet precise approach to determine \(H(\omega)\) was presented by Fleischer et al.47, which was also adopted in the present study. The method of Fleischer et al. does not require a volume velocity source or sensor, but can determine \(H(\omega)\) solely from two sound pressure measurements \(P_{1}(\omega)\) and \(P_{2}(\omega)\) at the glottis and the lips, respectively, as described below. This method is based on the principle of reciprocity and is theoretically well-founded47.

The experimental setup for the measurements is shown in Fig. 2a. The vocal tract model was placed at a fixed distance of about 30 cm in front of a loudspeaker. A 1/4-inch measurement microphone (MK301E capsule with MV301 preamplifier by Microtech Gefell) was inserted into the hole at the glottal end of the model so that its membrane was flush with the glottal plane. A measurement consisted of two steps. In the first step, the loudspeaker emitted a broadband excitation signal (sine sweep) into the open mouth of the model while the sweep response \(P_{1}(\omega)\) was measured with the glottis microphone. In the second step, the mouth of the model was tightly closed with a plug made of modeling clay (about 1 cm thick) and another 1/4-inch measurement microphone (G.R.A.S. 46BL) was centrally positioned about 2 mm in front of the closed mouth. This microphone recorded the response \(P_{2}(\omega)\) for the same excitation signal as in step 1. The VVTF was finally calculated as \(H(\omega)=P_{1}(\omega)/P_{2}(\omega)\) (which is the same as \(U_{2}(\omega)/U_{1}(\omega)\)). Both microphones were connected to an audio interface (Terratec Aureon XFire 8.0 HD), which in turn was connected to a laptop computer (MSI GT72-2QE) with the operating system Windows 8.1, 64 Bit.

Fig. 2 (a) Setup for measuring the (closed-glottis) volume velocity transfer function between the glottis and the lips. First, the reference sound pressure in front of the closed mouth was measured as \(P_{2}(\omega)\). Then the clay and the mouth microphone were removed (gray items) and the sound pressure \(P_{1}(\omega)\) was measured inside the vocal tract at the level of the glottis. (b) Setup for aeroacoustic measurements.

The measurements were performed with the open-source software MeasureTransferFunction48, which implements the method by Farina49. The excitation signal used in this software was a logarithmic sine sweep with a power band from 100 Hz to 10,000 Hz (fade-in and fade-out from 50–100 Hz and 10,000–11,000 Hz, respectively) and a duration of 10.4 s. The source signal amplitude was set to 0.5, i.e. to 50% of the value range. The output level and input level of the audio interface were set to 100% and 50%, respectively. The audio signals were sampled with 96,000 Hz and quantized with 24 bit. A major benefit of using logarithmic sweeps to characterize acoustic systems is that the linear impulse response can be separated from signal components generated by harmonic distortions49. Accordingly, the linear response was manually extracted in all recorded signals before further processing. The different sensitivities of the microphones used at the glottis (3.2 mV/Pa) and the mouth (18 mV/Pa) were compensated by adding 15 dB to the calculated VVTF.
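The 15 dB correction follows directly from the ratio of the two microphone sensitivities. The short Python sketch below reproduces this arithmetic and applies it to a single pair of raw spectral values; the function names are hypothetical and the code is only an illustration, not part of MeasureTransferFunction.

```python
import math

def sensitivity_correction_db(sens_glottis_mv_pa, sens_mouth_mv_pa):
    """Level offset that compensates the different microphone sensitivities.

    The glottis microphone (numerator P1) is less sensitive than the mouth
    microphone (denominator P2), so the raw ratio is too small by this amount.
    """
    return 20.0 * math.log10(sens_mouth_mv_pa / sens_glottis_mv_pa)

def vvtf_level_db(v1, v2, correction_db):
    """Level of H = P1/P2 in dB from uncalibrated complex spectral values."""
    return 20.0 * math.log10(abs(v1) / abs(v2)) + correction_db

# Sensitivities from the text: 3.2 mV/Pa (glottis) and 18 mV/Pa (mouth).
correction = sensitivity_correction_db(3.2, 18.0)
print(round(correction, 1))
```

With the sensitivities given in the text, the correction evaluates to about 15.0 dB, which matches the value added to the calculated VVTF.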

Due to small variations of the latency of the audio system, there was usually a small time lag \(\tau\) between the sweep responses, from which \(P_{1}(\omega)\) and \(P_{2}(\omega)\) were calculated. According to the time-shift property of the Fourier transform, the shift of a time signal by \(\tau\) causes its spectrum to be changed by the factor \(e^{j\omega\tau}\), where \(j=\sqrt{-1}\). This means that a phase response \(\varphi(\omega)=\arg H(\omega)\) is the sum of the “true” phase response and a linear function \(\omega\tau\), where the slope \(\tau\) may vary across models. Therefore, to explore the phase responses of the models, it might be more convenient to do it on the basis of the group delay \(-d\varphi(\omega)/d\omega\), where the linear function translates into a constant offset.
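The effect of the latency on the group delay can be checked numerically. The following Python sketch estimates the group delay by finite differences of the unwrapped phase; the helper names `unwrap` and `group_delay` are hypothetical, and the example spectrum is a synthetic pure time shift rather than measured data.

```python
import cmath
import math

def unwrap(phases):
    """Remove 2*pi jumps from a sequence of phase values in radians."""
    out = [phases[0]]
    for p in phases[1:]:
        d = p - out[-1]
        d -= 2.0 * math.pi * round(d / (2.0 * math.pi))
        out.append(out[-1] + d)
    return out

def group_delay(freqs_hz, spectrum):
    """Group delay -d(phi)/d(omega) in seconds via finite differences."""
    phi = unwrap([cmath.phase(h) for h in spectrum])
    delays = []
    for i in range(len(phi) - 1):
        d_omega = 2.0 * math.pi * (freqs_hz[i + 1] - freqs_hz[i])
        delays.append(-(phi[i + 1] - phi[i]) / d_omega)
    return delays

# Synthetic example: a pure time shift tau contributes the factor exp(j*omega*tau).
tau = 1e-3
freqs = [float(i) for i in range(2001)]
shift_only = [cmath.exp(1j * 2.0 * math.pi * f * tau) for f in freqs]
delays = group_delay(freqs, shift_only)
```

For this synthetic spectrum, every entry of `delays` equals \(-\tau\), i.e., the linear phase term appears as a constant offset that can simply be subtracted.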

All measurements were performed in the large climate-controlled anechoic chamber at the TU Dresden at a temperature of 22 °C, an atmospheric pressure of 1007 hPa, and an air humidity of 46%. The anechoic chamber is a free-field room (all six sides covered with 1 m absorbing foam spikes) with a free volume of 1000 m³ and a degree of sound absorption of at least 99% for frequencies between 60 Hz and 16 kHz. Before the measurements, the vocal tract models were tightly wrapped in multiple layers of sound-absorbing fabric. This minimized the external excitation of the (plastic) vocal tract walls by the source signal during the measurement of \(P_{1}\). Wall vibrations due to the external excitation would otherwise interfere with the sound field in the models and create spectral artifacts. The two sweep responses \(P_{1}\) and \(P_{2}\) are contained in the files XX-sweep-primary.wav and XX-sweep-reference.wav in the dataset. The transfer functions \(H(\omega)\) are given in the files XX-vvtf-measured.txt.

Calculation of the volume velocity transfer functions

For comparison with the measurements of the physical models, the VVTFs were also determined numerically using the finite element method (FEM). The calculation was similar to that described by Fleischer et al.47 on the basis of the freely available software FEniCS50. To create the FE models, the inner surface meshes of the vocal tract (XX-inner-surface.stl) were first “closed” at the glottal end and the mouth, as in the files XX-inner-surface-closed.stl. These closed-surface meshes were then converted into volume meshes (XX-fem.msh) for the FE simulations with the free software Gmsh51. In the volume meshes, the regions of the glottis, the mouth opening, and the vocal tract walls were manually marked to define the boundary conditions for the acoustic simulation.

The FE models were discretized with linear shape functions and had a number of degrees of freedom between 99,688 (model s1-22-ehe-schwa) and 147,806 (model s2-08-guete-tense-y). Furthermore, the maximum mean element size was 2.99 mm (model s1-22-ehe-schwa). For a maximum analysis frequency of 10,000 Hz and a sound velocity of 345 m/s at 22 °C, there were on average 11 elements/wavelength.
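As a rough plausibility check of the spatial resolution, the number of elements per wavelength can be estimated from the sound velocity, the maximum analysis frequency, and the element size. The sketch below does this for the coarsest model only; the function name is hypothetical, and the estimate is based on the mean element size rather than on the actual mesh.

```python
def elements_per_wavelength(c_m_s, f_max_hz, element_size_mm):
    """Shortest wavelength divided by the element size (both in mm)."""
    wavelength_mm = 1000.0 * c_m_s / f_max_hz
    return wavelength_mm / element_size_mm

# Values from the text: c = 345 m/s, 10,000 Hz, mean element size 2.99 mm.
coarsest = elements_per_wavelength(345.0, 10000.0, 2.99)
print(round(coarsest, 1))
```

The shortest wavelength is 34.5 mm, so the coarsest model resolves it with about 11.5 elements, in line with the figure of roughly 11 elements per wavelength given above.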

The acoustic simulation was based on the numerical analysis of the Helmholtz equation

$$-(\kappa^{2}+\nabla^{2})P(\vec{x},\omega)=0,$$

where \(P\) is the complex-valued scalar acoustic pressure, \(\vec{x}\) is the position in \(\mathbb{R}^{3}\), \(\omega\) is the angular frequency, \(\kappa=\omega/c\) is the wave number, and \(c=345\) m/s is the speed of sound at 22 °C. The application of a frequency-independent particle velocity \(V_{0}\) at the glottis leads to the boundary condition

$$\nabla P_{\mathrm{glottis}}\cdot\vec{n}=-\,j\omega\varrho V_{0}$$

at the glottal surface, where \(\varrho=1.18\) kg/m³ is the density of air. At the vocal tract walls, the boundary condition

$$\nabla P_{\mathrm{wall}}\cdot\vec{n}=-\,j\kappa\frac{\varrho c}{Z_{\mathrm{wall}}}P_{\mathrm{wall}}$$

with the empirical value \(Z_{\mathrm{wall}}=500\cdot\varrho c\) was applied47. At the lip opening, the boundary condition

$$\nabla P_{\mathrm{lips}}\cdot\vec{n}=-\,j\kappa\frac{\varrho c}{Z_{\mathrm{r}}}P_{\mathrm{lips}}$$

was implemented. The radiation impedance was assumed to be

$$Z_{\mathrm{r}}=\varrho c\left(\frac{(\kappa r)^{2}}{1+(\kappa r)^{2}}+j\frac{\kappa r}{1+(\kappa r)}\right),$$

where \(r=\sqrt{A_{\mathrm{lips}}/(2\pi)}\) and \(A_{\mathrm{lips}}\) represents the lip opening area52,53. Based on the computed pressure \(P_{\mathrm{lips}}\) at the central point of the lip opening, the default value \(V_{0}\), and the geometrical measures \(A_{\mathrm{lips}}\) and \(A_{\mathrm{glottis}}\), the transfer function

$$H_{\mathrm{FEM}}(\omega)=A_{\mathrm{lips}}\cdot\frac{P_{\mathrm{lips}}(\omega)}{Z_{\mathrm{lips}}(\omega)}/(A_{\mathrm{glottis}}\cdot V_{0})$$

was calculated for frequencies between 0 and 10,000 Hz in steps of 0.961304 Hz. The computing time per model was up to 9 hours using 12 cores of the Intel Skylake Gold 6148 CPU available at the North-German Supercomputing Alliance (HLRN). The results are contained in the files XX-vvtf-calculated.txt in the dataset.
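The last step, converting the lip pressure into a volume velocity ratio, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the radiation impedance is coded exactly as printed above, and the impedance at the lips in the transfer function is assumed to be identical to the radiation impedance. The code is not the FEniCS implementation used for the dataset.

```python
import math

RHO = 1.18   # density of air in kg/m^3 (from the text)
C = 345.0    # speed of sound in m/s at 22 degrees Celsius (from the text)

def radiation_impedance(freq_hz, lip_area_m2):
    """Radiation impedance as printed in the text, r = sqrt(A_lips / (2*pi))."""
    kappa = 2.0 * math.pi * freq_hz / C
    kr = kappa * math.sqrt(lip_area_m2 / (2.0 * math.pi))
    real = kr ** 2 / (1.0 + kr ** 2)
    imag = kr / (1.0 + kr)  # denominator taken literally from the printed formula
    return RHO * C * complex(real, imag)

def vvtf_fem(p_lips, freq_hz, lip_area_m2, glottis_area_m2, v0):
    """Lip volume velocity divided by glottal volume velocity.

    Assumption: the lip impedance equals the radiation impedance above.
    Not defined at 0 Hz, where the impedance vanishes.
    """
    z_lips = radiation_impedance(freq_hz, lip_area_m2)
    u_lips = lip_area_m2 * p_lips / z_lips
    u_glottis = glottis_area_m2 * v0
    return u_lips / u_glottis
```

A call such as `vvtf_fem(p, 1000.0, 2e-4, 7.85e-5, 1.0)` would return the complex transfer function value at 1 kHz for a lip pressure `p` taken from a simulation; the areas in this call are made-up example values.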

Measurement of flow-induced noise for different fluid power levels

To characterize the 44 vocal tract models in aeroacoustic terms, the setup in Fig. 2b was used to create different levels of stationary airflow through the models. For each level, we recorded the volume velocity \(U_{\mathrm{sub}}\), the subglottal pressure \(P_{\mathrm{sub}}\), and the turbulence sound \(P_{\mathrm{rad}}\) radiated from the mouths of the models. The airflow was generated by a fan (type U71HL-024KM-43 by Micronel) and led into an air tank, which was connected to a “lung” via a 200 cm long rubber tube with an inner diameter of 19 mm. The air tank and the lung were boxes with inner volumes of 30 × 30 × 50 cm³ and 23 × 23 × 23 cm³, respectively. Both boxes were lined with sound absorbing foam (NOMA ACOUSTIC 25 mm by NMC) and meant to attenuate the noise from the fan. A short horn connected to a straight tube (18 mm inner diameter) was used to represent the bronchia and the trachea and led the airflow from the lung to the glottal end of the vocal tract models. The dimensions of the horn and the tube were chosen to approximate the cross-sectional area function of the human subglottal system54. Both the horn and the tube were 3D-printed with the material PLA and with a wall thickness of 3 mm (100% infill ratio). The upper 3 cm of the tracheal tube (corresponding to the conus elasticus) tapered from 18 mm diameter to 10 mm diameter to match the diameter of the glottal hole of the attached vocal tract model. The 3D-printable volume models for these parts are contained in the files trachea.stl and bronchial_horn.stl in the dataset.

A data acquisition device (DT9837C by Data Translation) connected to a laptop computer (MSI GT72-2QE running MS Windows 8.1) was used to simultaneously measure

  • the radiated sound pressure \(P_{\mathrm{rad}}\) using a measurement microphone (1/2 inch capsule MK 250 with preamplifier MV 210 by Microtech Gefell GmbH) positioned 30 cm in front and 30 cm sideways from and directed towards the mouth of the vocal tract model (to prevent the airstream from directly hitting the microphone membrane),

  • the subglottal pressure \(P_{\mathrm{sub}}\) using a pressure measuring device (DMU4 by Kalinsky Sensor Elektronik, Erfurt, Germany) attached to a pressure tap 12 cm below the glottis,

  • and the volume velocity \(U_{\mathrm{sub}}\) at the entrance of the lung using a flowmeter (type AWM720-P1 by Honeywell).

All three signals were digitized with a sampling rate of 48,000 Hz and quantized with 24 bits. A custom-made software was used to record and display the signals, and to control the fan power. The fan power could only be adjusted in (small) steps. For a more precise adjustment of the subglottal pressure and the flow, we used a servo valve attached to the air tank. A single-board computer (type Raspberry Pi 3 Model B+) with a custom-made Python script was used to translate the high-level commands of the software on the laptop computer into electrical control voltages for the fan and the valve. The air tank with the fan and the valve were located in a separate soundproofed chamber to prevent their noise from disturbing the measurements.

For a consistent aeroacoustic characterization of the vocal tract models, we decided to apply the same six levels of fluid power (which is the product of the subglottal pressure and the volume velocity) to each model, namely 500 mW, 1000 mW, 1500 mW, 2000 mW, 2500 mW, and 3000 mW. Using fixed power levels instead of fixed levels of subglottal pressure or flow allowed us to cope with the wide range of flow resistances across the models. According to the analysis by Stevens45, a fluid power level of 500 mW is roughly typical for “normal” speech production, while about 3000 mW is the maximum that humans can achieve.
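The relation between fluid power, subglottal pressure, and flow can be written as a short Python sketch. The function names and the numerical operating point are hypothetical and only illustrate the unit conversion; they are not measured values from the dataset.

```python
def fluid_power_mw(p_sub_pa, u_sub_m3_s):
    """Fluid power in mW as the product of pressure (Pa) and flow (m^3/s)."""
    return 1000.0 * p_sub_pa * u_sub_m3_s

def flow_for_power(power_mw, p_sub_pa):
    """Volume velocity in m^3/s needed to reach a power level at a given pressure."""
    return power_mw / 1000.0 / p_sub_pa

# Hypothetical operating point: 1000 Pa at 0.5 l/s (0.0005 m^3/s) gives 500 mW.
print(fluid_power_mw(1000.0, 0.0005))
```

A model with a high flow resistance reaches a given power level at a high pressure and a low flow, while a model with a low resistance reaches it at a low pressure and a high flow, which is why fixing the power is convenient across models.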

For each power level and model, the three signals described above (radiated sound, flow, subglottal pressure) were captured for a duration of 10 s. The audio files of the radiated sounds are included in the dataset as the files XX-noise-500mW.wav, …, XX-noise-3000mW.wav. The samples in these files are floating point values proportional to the sound pressure measured at the microphone, where the value 1.0 corresponds to a sound pressure of 12.62 Pa. For each of these audio files, the power spectral density (PSD) has been estimated using Welch’s method as implemented in the function pwelch() in the Signal Processing Toolbox of Matlab R2017b. We used overlapping Hamming windows of 1024 samples (which was also the FFT length) and a window overlap of 512 samples so that the spectral resolution was 46.9 Hz. The resulting PSDs (with the unit Pa²/Hz) for the six power levels were summarized in the files XX-noise-psd.txt. Finally, the average volume velocity, the average subglottal pressure, and the overall sound pressure level (SPL) of the radiated sound for each power level were tabulated in the files XX-noise-metadata.txt. The SPLs were calculated from the audio signal \(x(k)\) as

$$\mathrm{SPL}=20\cdot\log_{10}\left(\frac{1}{P_{\mathrm{ref}}}\sqrt{\frac{1}{N}\sum\limits_{k=0}^{N-1}x^{2}(k)}\right),$$

where \(k\) is the sample index, \(x(k)\) has the unit Pa, \(N=48000\cdot 10\) is the number of audio samples for 10 s, and \(P_{\mathrm{ref}}=2\cdot 10^{-5}\) Pa.
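The SPL formula and the scaling of the audio files can be combined in a minimal Python sketch. It assumes that the samples have already been read from a file such as XX-noise-500mW.wav into a list of floats; the function names are hypothetical, and the constant test signal is synthetic.

```python
import math

P_REF = 2e-5         # reference sound pressure in Pa
PA_PER_UNIT = 12.62  # sound pressure in Pa for a sample value of 1.0 (from the text)

def spl_db(samples):
    """Overall SPL in dB of a signal given as normalized floating point samples."""
    mean_square = sum((PA_PER_UNIT * s) ** 2 for s in samples) / len(samples)
    return 20.0 * math.log10(math.sqrt(mean_square) / P_REF)

def welch_resolution_hz(fs_hz, nfft):
    """Frequency spacing of a PSD estimate with the given FFT length."""
    return fs_hz / nfft

# A constant full-scale signal has an RMS value of 12.62 Pa.
print(round(spl_db([1.0] * 100), 1))
print(welch_resolution_hz(48000, 1024))
```

A full-scale signal corresponds to about 116 dB SPL, and the frequency spacing for 48,000 Hz and 1024 samples is 46.875 Hz, consistent with the rounded value of 46.9 Hz given above.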

Synthesis of speech sounds with the 3D-printed models

For each 3D-printed vocal tract model, the corresponding speech sound was (physically) synthesized. The tense and the lax vowels were synthesized with the setup in Fig. 2b, but with a vibrating reed source inserted between the upper end of the trachea and the glottal hole of the vocal tract models as in Birkholz et al.31. The vibrating reed source was developed by Arai33 and is an improved version of the design published previously32. The subglottal pressure was individually adjusted for each vowel as roughly the midpoint between the onset and the offset pressures of the source in combination with the respective supraglottal model. The generated sound was captured with a measurement microphone (1/2 inch capsule MK 250 with preamplifier MV 210 by Microtech Gefell GmbH) positioned 30 cm in front and 30 cm sideways from and directed towards the mouth of the vocal tract model. The sound generated by each model was recorded for 10 s. The recordings were symmetrically cropped around their center to a length of 1000 ms for the tense vowels and to a length of 200 ms for the lax vowels. They were then peak-normalized and windowed with a Tukey (tapered cosine) window for a fade-in and fade-out of 20 ms each. Finally, the stimuli were padded with 200 ms of silence at the beginning and end.

The voiceless fricatives were synthesized with the setup in Fig. 2b with a constant subglottal pressure of 800 Pa. Each of these sounds was also recorded for 10 s, cropped to 1000 ms around the center of the recording, and otherwise processed like the vowel recordings. The resulting audio signals are contained in the dataset as the files XX-model-sound.wav.
