Voice acoustic analysis in ENT practice

Part I: Theoretical foundations of voice acoustic analysis

Selected methods of analysis of periodic signals

Author: Ph.D. Marcin Just (DiagNova Technologies)

Periodic signals can be analyzed in many ways, but because of their nature two methods prove particularly useful: Fourier analysis and linear prediction. For speech signals, parameterization is also widely used; it is essentially an extension of both methods.

Fourier analysis

Fourier analysis is based on a simple idea: instead of examining a complicated signal directly, the signal is represented as a sum of simpler signals whose behavior is easier to predict and study. Such analysis is used in many fields of science, wherever the tested system behaves the same for simple component signals and for their sum (the system must be linear, i.e. the response to a sum of inputs is the sum of the individual responses, and scaling the input scales the output proportionally). In acoustics, this assumption holds in many cases. The important step is choosing the so-called basis, i.e. a set of simple (easy to analyze) signals that can add up to any signal found in the system. For periodic signals, a particularly convenient basis is the set of sinusoids (trigonometric functions of the form y = sin(ax + b), where the parameter a determines how often the waveform repeats and b shifts it along the x axis - Fig. 6) with progressively higher oscillation frequencies.

Fig. 6. Sinusoid: a) y = sin(x + 0); b) y = sin(3x + π/4)

Sinusoids are a good choice because they are periodic themselves. Moreover, no sinusoid can be represented as a sum of sinusoids of other frequencies, which ensures that the representation is unique (only one combination of sinusoids adds up to the tested signal).

The principle of Fourier analysis is simple - the smallest repeating element is selected from the analyzed waveform and presented as a sum of sinusoids, as shown in Fig. 7.

Fig. 7. Principle of Fourier analysis

The magnitude of the contribution of individual sinusoids (for specific oscillation frequencies) is usually visualized as the height of the bar in a bar graph (Fig. 8). Such a plot is called a signal spectrum.

Fig. 8. Fourier spectrum
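
The decomposition described above can be sketched in a few lines. This is only an illustration, not the method used by any particular analyzer: the test period, its two component amplitudes (1.0 and 0.5) and the helper name dft_magnitudes are assumptions made for the example, and a direct discrete Fourier transform is used instead of a fast algorithm to keep the code short.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Return the amplitude of each sinusoidal component (bins 0..N/2)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        # Scale so that a sinusoid of amplitude A yields a bar of height A.
        scale = 1 / n if k in (0, n // 2) else 2 / n
        mags.append(abs(acc) * scale)
    return mags

# One period (64 samples) built from two sinusoids: amplitude 1.0 at the
# 1st harmonic and amplitude 0.5 at the 3rd harmonic (illustrative values).
N = 64
period = [math.sin(2 * math.pi * t / N)
          + 0.5 * math.sin(2 * math.pi * 3 * t / N + math.pi / 4)
          for t in range(N)]

spectrum = dft_magnitudes(period)
for k in range(5):
    # Each printed line is one bar of the spectrum: component number, height.
    print(k, round(spectrum[k], 3))
```

The bars for components 1 and 3 recover the amplitudes used to build the period, while the remaining bars are practically zero; the phase shift of the third component does not affect the bar height.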

The characteristic "peaks" in the spectrum of a speech signal correspond to frequencies amplified in the resonance chambers, the so-called formants. This is actually the only useful piece of information that can be read from such a chart, so its usefulness is limited (especially since there are better methods for determining formants). In particular, a single spectrum by definition carries no information about the differences between successive basic periods of the speech signal, and this is precisely the information that interests the examiner most in speech analysis. However, a slight modification is enough to make Fourier analysis considerably more useful: instead of one basic period, a group of several consecutive periods is analyzed. The spectrum then becomes fundamentally different. In the ideal case, when the analyzed periods are identical and there is no noise, the spectrum consists of single isolated peaks (Fig. 10a). The peaks are separated by "empty" regions, which become longer the more periods are analyzed.

Because it is difficult to select groups of exactly several periods, a simpler method is used: the analyzed fragment has a constant length, e.g. 0.025 s, which is not a whole multiple of the period length (it covers, for example, about 6.3 periods, which introduces some minor errors). This is the conventional form of Fourier analysis, used e.g. in spectrograms. The spectrum obtained for a signal of this length is quite characteristic: it contains maxima (peaks) at multiples of the fundamental frequency (Fig. 9), the so-called harmonic frequencies.

Fig. 9. Classic Fourier spectrum

Importantly, regardless of whether consecutive periods are different or identical and whether noise is present or not, the bars between the peaks do not have zero height if the analyzed segment does not exactly match the length of a whole number of periods. The height of the "peaks" reflects the contribution of successive multiples of the fundamental frequency (the only components that an ideal speech signal should contain), while the height of the bars between the peaks is related to the presence of noise, but it is also affected by the mismatch between the segment length and a whole multiple of the basic period. This slightly disturbs and hinders the analysis. A much "cleaner" spectrum is obtained when the analyzed segment covers a precisely defined whole number of periods.

Fig. 10. Spectra for an analyzed segment matched to the period length: a) ideal case; b) average case
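
The effect of the segment length can be checked numerically. The sketch below is an illustration under assumed values: a noise-free test signal with a basic period of 40 samples and two harmonics, analyzed once over exactly 6 periods and once over about 6.3 periods. The helper leaked_fraction is a hypothetical measure introduced only for this example: the share of spectral energy lying outside the two tallest bars.

```python
import cmath
import math

PERIOD = 40  # samples per basic period (assumed)

def magnitudes(samples):
    """Magnitude of each DFT bin from 0 to N/2 (direct, unoptimized DFT)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def leaked_fraction(samples, n_peaks):
    """Share of spectral energy lying outside the n_peaks tallest bars."""
    energy = sorted((m * m for m in magnitudes(samples)), reverse=True)
    return sum(energy[n_peaks:]) / sum(energy)

def voiced(n):
    """Noise-free periodic test signal with two harmonics."""
    return [math.sin(2 * math.pi * t / PERIOD)
            + 0.5 * math.sin(4 * math.pi * t / PERIOD)
            for t in range(n)]

exact = leaked_fraction(voiced(6 * PERIOD), 2)  # exactly 6 periods
inexact = leaked_fraction(voiced(252), 2)       # about 6.3 periods

print(f"matched segment:   {exact:.2e}")
print(f"unmatched segment: {inexact:.2e}")
```

For the matched segment practically all energy sits in the two harmonic bars. For the unmatched segment a substantial part spreads into the bars between the peaks, although the signal contains no noise at all.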

The amplitude of any noise and distortion can now be determined accurately relative to the amplitude of the harmonic components by measuring the ratio of harmonic peaks to non-harmonic peaks (Fig. 10b). Peaks associated with distortions (differences in the length and shape of the periods) predominate in the lower part of the spectrum, whereas the upper part (above 4000 Hz) carries information about noise.
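
A minimal sketch of such a measurement follows. It assumes a segment of exactly 6 periods, so that harmonic components fall on every 6th bar of the spectrum, and it uses a synthetic signal with added pseudo-random noise. The function name, the energy-based ratio expressed in decibels and the noise levels are assumptions chosen for illustration; the article does not specify the exact formula.

```python
import cmath
import math
import random

PERIOD = 40     # samples per basic period (assumed)
N_PERIODS = 6   # the segment covers a whole number of periods

def magnitudes(samples):
    """Magnitude of each DFT bin from 0 to N/2 (direct, unoptimized DFT)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def harmonic_to_nonharmonic_db(samples, n_periods):
    """Energy in harmonic bars relative to energy in all other bars, in dB."""
    mags = magnitudes(samples)
    harmonic = sum(m * m for k, m in enumerate(mags)
                   if k > 0 and k % n_periods == 0)
    other = sum(m * m for k, m in enumerate(mags)
                if k > 0 and k % n_periods != 0)
    return 10 * math.log10(harmonic / other)

def noisy_voice(noise_level, seed=0):
    """Two-harmonic test signal plus Gaussian noise of the given level."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * t / PERIOD)
            + 0.5 * math.sin(4 * math.pi * t / PERIOD)
            + rng.gauss(0.0, noise_level)
            for t in range(N_PERIODS * PERIOD)]

quiet = harmonic_to_nonharmonic_db(noisy_voice(0.01), N_PERIODS)
noisy = harmonic_to_nonharmonic_db(noisy_voice(0.1), N_PERIODS)
print(f"low noise:  {quiet:.1f} dB")
print(f"high noise: {noisy:.1f} dB")
```

Raising the noise level lowers the ratio, which is the behavior expected of a measure read from a spectrum such as the one in Fig. 10b.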

Each spectrum is determined for a relatively short time interval, so many spectra can be calculated over the entire recorded sample. By placing successively calculated spectra next to each other and converting the height of the bars to the degree of darkness, an exceptionally useful diagram called a spectrogram is obtained. The method of generating it is shown in Fig. 11.

Fig. 11. Generating a spectrogram from a series of spectra
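
The procedure can be sketched as follows. The sampling rate, the step between spectra, the test tone and the character-based rendering are assumptions made to keep the example small and runnable without any plotting library; a real spectrogram would normally also apply a window function to each fragment.

```python
import cmath
import math

FS = 2000    # sampling rate in Hz (assumed; low so the plain DFT stays fast)
FRAME = 50   # 50 samples = 0.025 s, the fragment length used in the text
HOP = 25     # step between consecutive spectra in samples (assumed)

def magnitudes(samples):
    """Magnitude of each DFT bin from 0 to N/2 (direct, unoptimized DFT)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame=FRAME, hop=HOP):
    """One magnitude spectrum per fragment; result[i][k] = fragment i, bin k."""
    return [magnitudes(signal[start:start + frame])
            for start in range(0, len(signal) - frame + 1, hop)]

# Test tone: 200 Hz for the first 0.25 s, then 400 Hz (illustrative values).
signal = [math.sin(2 * math.pi * (200 if t < 500 else 400) * t / FS)
          for t in range(1000)]
spec = spectrogram(signal)

# Rendering: time runs left to right, frequency bottom to top,
# and a "darker" character stands for a taller bar.
shades = " .:*#"
peak = max(max(column) for column in spec)
for k in range(FRAME // 2, -1, -1):
    row = "".join(shades[min(4, int(5 * column[k] / peak))] for column in spec)
    print(f"{k * FS // FRAME:4d} Hz |{row}")
```

The printed picture shows a dark horizontal line at 200 Hz that jumps to 400 Hz halfway through the recording.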

Narrow and broadband spectrograms

The longer the signal segment used to create the spectrum, the better its frequency resolution (better separation of harmonic peaks). A spectrogram built from such spectra will also have excellent frequency resolution, but the long time interval covered by each spectrum makes its temporal resolution poor. Conversely, using very short segments gives excellent temporal resolution but worse frequency resolution. Unfortunately, these two requirements cannot be reconciled. Hence, there are two types of spectrograms:

  • narrowband (20 Hz bandwidth, good frequency resolution – Fig. 12a);
  • broadband (240 Hz bandwidth, good time domain resolution – Fig. 12b).

Fig. 12. Two types of spectrograms: a) narrowband; b) broadband

The names of spectrogram types come from the times when they were created by electronic analyzers characterized by a specific band (frequency range). Bandwidth can simply be interpreted as the resolution of a spectrogram in the frequency domain.
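
The trade-off can be illustrated with a rough calculation. It assumes the common rule of thumb that the analysis window lasts about 1/bandwidth seconds (the exact constant depends on the window shape) and a voice with a fundamental frequency of 100 Hz, which is an example value only.

```python
def window_ms(bandwidth_hz):
    """Approximate analysis window length in ms (rule of thumb: 1 / bandwidth)."""
    return 1000.0 / bandwidth_hz

F0 = 100.0  # example fundamental frequency in Hz (assumed)

for name, bandwidth in (("narrowband", 20.0), ("broadband", 240.0)):
    duration = window_ms(bandwidth)
    periods = duration * F0 / 1000.0
    print(f"{name}: {bandwidth:.0f} Hz band -> window of about {duration:.1f} ms, "
          f"about {periods:.1f} periods of a {F0:.0f} Hz voice")
```

Under these assumptions the narrowband window spans several basic periods, which is why individual harmonics are resolved, whereas the broadband window is shorter than one period, which is why changes within a single period remain visible.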

Fourier analysis and the fundamental frequency

In the above-mentioned case, when a fragment of the signal longer than the base period is taken for the Fourier analysis, the spectrum will show peaks related to successive multiples of the fundamental frequency. Of course, the first one corresponds to the fundamental frequency itself (Fig. 13). In the narrowband spectrogram, successive harmonics (multiples of F0) appear as horizontal lines, the lowest of which determines the course of the fundamental frequency (Fig. 14).

Fig. 13. Fundamental frequency in the Fourier spectrum

Fig. 14. The fundamental frequency on a narrowband spectrogram
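
Reading the fundamental frequency from a spectrum can be sketched as a search for the lowest distinct peak. The example is a simplification under assumed values (a noise-free 125 Hz test tone, a 4000 Hz sampling rate and a threshold of 20% of the tallest bar); practical F0 estimators are considerably more elaborate.

```python
import cmath
import math

FS = 4000   # sampling rate in Hz (assumed)
N = 320     # 80 ms segment, i.e. 10 periods of the 125 Hz test tone

def magnitudes(samples):
    """Magnitude of each DFT bin from 0 to N/2 (direct, unoptimized DFT)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def lowest_harmonic_hz(samples, fs, threshold=0.2):
    """Frequency of the lowest local peak reaching `threshold` of the tallest bar."""
    mags = magnitudes(samples)
    limit = threshold * max(mags)
    for k in range(1, len(mags) - 1):
        if mags[k] >= limit and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            return k * fs / len(samples)
    return None

# Test tone in which the 2nd harmonic is stronger than the fundamental.
tone = [0.6 * math.sin(2 * math.pi * 125 * t / FS)
        + 1.0 * math.sin(2 * math.pi * 250 * t / FS)
        + 0.4 * math.sin(2 * math.pi * 375 * t / FS)
        for t in range(N)]

f0 = lowest_harmonic_hz(tone, FS)
print(f0)
```

The test tone is deliberately built so that the tallest bar belongs to the second harmonic: the fundamental frequency is the lowest peak, not necessarily the strongest one.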

Linear prediction

Like Fourier analysis, linear prediction is based on a relatively simple idea. A speech signal arises when the signal generated directly by the vocal folds (in Poland this signal is referred to as "voice"; in foreign literature, as the primary signal or excitation) undergoes multiple reflections in the resonant cavities, so it can be represented as a sum of several differently delayed excitation signals. From here it is only a small step to expressing the speech signal at a given moment as a weighted sum of samples of the same signal at previous moments.
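
The idea can be sketched with the autocorrelation method and the Levinson-Durbin recursion, one standard way of finding the prediction weights (the article does not say which variant is used in practice). The test signal is an assumption: a single impulse ringing through a simple resonator whose two coefficients, 1.3 and -0.8, were chosen arbitrarily.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def lpc_coefficients(signal, order):
    """Weights a[1..order] such that s[n] is close to sum of a[k] * s[n - k]."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this stage of the recursion.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1 - k * k)
    return a[1:]

# Synthetic signal: each sample depends on the two previous ones.
A1, A2 = 1.3, -0.8   # resonator coefficients (illustrative values)
s = [1.0, A1]
for n in range(2, 400):
    s.append(A1 * s[n - 1] + A2 * s[n - 2])

coeffs = lpc_coefficients(s, 2)
# The estimated weights should be close to the values used above.
print([round(c, 4) for c in coeffs])
```

For real speech a higher prediction order is needed, and the analysis is repeated on short fragments of the recording.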

Due to its close relationship with the operation of the resonance path, linear prediction is particularly suitable for determining its resonance frequencies (formants).

Once the formants are determined, their influence on the speech signal can be eliminated, i.e. the original signal generated by the vocal folds can be recreated. This process is called inverse filtering.
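
A minimal sketch of this process, under strongly simplifying assumptions: the excitation is a train of impulses, the resonances are modeled by a two-coefficient filter, and the filter coefficients are treated as known (in practice they would be estimated from the recording, e.g. by linear prediction). All names and values are illustrative.

```python
A1, A2 = 1.3, -0.8   # resonator coefficients, assumed known in this sketch
PERIOD = 50          # samples between excitation impulses (assumed)

# Idealized excitation: one impulse at the start of every period.
excitation = [1.0 if n % PERIOD == 0 else 0.0 for n in range(300)]

def resonate(x, a1, a2):
    """Add the resonance: y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2]."""
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + a1 * y1 + a2 * y2)
    return y

def inverse_filter(y, a1, a2):
    """Remove the resonance: e[n] = y[n] - a1 * y[n-1] - a2 * y[n-2]."""
    out = []
    for n, yn in enumerate(y):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        out.append(yn - a1 * y1 - a2 * y2)
    return out

speech = resonate(excitation, A1, A2)
recovered = inverse_filter(speech, A1, A2)

max_error = max(abs(r - e) for r, e in zip(recovered, excitation))
print(f"largest difference from the original excitation: {max_error:.2e}")
```

When the coefficients match the resonances exactly, the recovered signal equals the excitation up to rounding errors. With coefficients estimated from real speech the reconstruction is only approximate.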

Formants

Formants (vocal resonant frequencies) are the basic feature that differentiates individual sounds. For words and sentences, formants change over time as the sounds change, and their course, as shown in Fig. 15, can be traced on spectrograms (broadband spectrograms are commonly recommended for this purpose, but narrowband spectrograms are equally effective). Linear prediction allows formants to be determined automatically and more precisely than from spectrograms.

Fig. 15. Formant tracks determined by the linear prediction method, shown on the spectrogram

Excitation signal (voice)

The function of the vocal folds is relatively well understood (much better than the function of the basilar membrane in the ear). The changes in air flow caused by the cyclic operation of the folds are described by simplified models built from simple curves (function graphs). One of the most popular is the LF (Liljencrants-Fant) model. Without going into the formulas that define the individual fragments of the curve, the course of the air flow changes can be shown in simplified form, as in the diagram in Fig. 16. The lower graph, showing the rate of change of the air flow, is especially important. It reaches its maximum when the vocal folds open fastest and its minimum at the moment of their fastest closing, just before complete closure (analogous to a door slammed by a draft, which moves fastest just before it shuts).

Fig. 16. Model of air flow between the vocal folds

Using the notation from the graphs in Fig. 16, one can define the open quotient (the ratio of the time the folds are open to the length of the basic period):

Qo = Tc/T0,

and the closed quotient (the ratio of the time the folds are closed to the length of the period):

Qz = (T0 - Tc)/T0.
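
A short worked example of both formulas, with illustrative values only (a basic period T0 of 8 ms and Tc of 5 ms; the function names are introduced for this example):

```python
def open_quotient(tc, t0):
    """Qo = Tc / T0."""
    return tc / t0

def closed_quotient(tc, t0):
    """Qz = (T0 - Tc) / T0."""
    return (t0 - tc) / t0

T0 = 8.0  # length of the basic period in ms (example value)
TC = 5.0  # Tc in ms (example value)

qo = open_quotient(TC, T0)
qz = closed_quotient(TC, T0)
print(qo, qz, qo + qz)
```

By definition the two values always add up to 1, so either one fully describes the division of the period into its two phases.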

Fig. 17 shows how the signal recorded by the microphone relates to the excitation signal and to the graphs in Fig. 16.

Fig. 17. The signal from the microphone and the excitation signal reconstructed from it using the inverse filtering method

Parametric acoustic analysis

Fourier analysis and linear prediction, although extremely useful, usually produce graphs, and interpreting graphs is always somewhat subjective. To ensure objective diagnostics, the speech signal is analyzed automatically, which yields a set of parameters. These parameters are often calculated using the results of both analyses.

Basic parameters

The most commonly used parameters are:

  • F0dev – standard deviation of the fundamental frequency, measuring the long-term frequency stability;
  • jitter - measuring irregularities in the length of basic periods, i.e. short-term (period to period) changes in F0;
  • shimmer - measuring irregularities in the amplitude of a signal period-to-period;
  • NHR - specifying the content of non-harmonic components in the range of higher frequencies in relation to the harmonic components of lower frequency.

From a mathematical point of view, the above parameters are calculated very simply (jitter: the sum of the relative differences in the length of adjacent periods over all successive pairs in the entire analyzed waveform; shimmer: analogously, but for amplitudes; NHR: the ratio of the non-harmonic components from Fig. 10b for frequencies above approx. 1200 Hz to the harmonic components for lower frequencies). However, they rely on values of the fundamental frequency (fundamental periods) or on spectral analyses that must first be determined in a much more complicated way.
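
The simplicity of these calculations can be shown in a short sketch. It assumes that the period lengths and amplitudes have already been measured, and it uses one common formulation: the mean absolute difference between adjacent values, divided by the mean value and expressed in percent. This normalization is an assumption; the exact formula used by a given analysis program may differ. The input numbers are invented for the example.

```python
import statistics

def relative_perturbation(values):
    """Mean absolute difference of adjacent values relative to their mean, in %."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(values) / len(values))

# Lengths (ms) and peak amplitudes of six consecutive basic periods.
periods_ms = [8.0, 8.1, 7.9, 8.0, 8.2, 7.8]
amplitudes = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48]

jitter = relative_perturbation(periods_ms)    # period-to-period length changes
shimmer = relative_perturbation(amplitudes)   # period-to-period amplitude changes

# F0dev: standard deviation of the fundamental frequency over the sample.
f0_hz = [1000.0 / p for p in periods_ms]
f0dev = statistics.pstdev(f0_hz)

print(f"jitter  = {jitter:.2f} %")
print(f"shimmer = {shimmer:.2f} %")
print(f"F0dev   = {f0dev:.2f} Hz")
```

The difficult part, which the sketch skips, is locating the boundaries of the individual periods in the recording reliably.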

Additional parameters

The basic parameters are universal, and therefore quite general. They are sensitive to most voice disorders but do not allow them to be differentiated more precisely. Therefore, sets of supplementary parameters are introduced that are more precisely "tuned" to individual disorders. The most important ones are presented in Table 1.

Table 1. Selected additional parameters

Parameter | Description
HPQ (harmonic perturbation quotient) | Parameter specifying the constancy of the shape of the basic periods. By design, insensitive to differences in the length of the basic periods
HPQh | As HPQ, but only for components above 1200 Hz
RHPQ (residual harmonic perturbation quotient) | Similar to HPQ, but the analysis is performed for the excitation signal restored from the microphone signal
RHPQh | As RHPQ, but only for components above 1200 Hz
R2H (residual to harmonic) | Parameter that determines the dynamics of vocal folds closure - sensitive to small organic changes
U2H (unharmonic to harmonic) | Parameter determining the ratio of the non-harmonic part to the harmonic - it defines both the level of disturbances and distortions
U2Hl | Similar to U2H, but for the lower (up to 4000 Hz) part of the spectrum - it defines the level of speech signal distortion
U2Hh | Similar to U2H, but for the upper (above 4000 Hz) part of the spectrum - it rather defines the interference level
S2H (subharmonic to harmonic) | Parameter specifying the ratio of the amplitude of subharmonics to harmonics for the lower (up to 4000 Hz) part of the spectrum - the level of distortions related to the different work of both folds
Q | Parameter that determines the frequency above which the harmonics do not significantly dominate the noise and distortion
Yg | Automatically determined Yanagihara coefficient

Parameter grouping

Most of the parameters exist in several variants. Knowing how they depend on one another, it is easy to arrange them into groups.

The following parameter groups can be distinguished:

  • F0;
  • jitter and derivatives (RAP, PPQ);
  • shimmer and derivatives (APQ);
  • HPQ and derivatives (HPQh, RHPQ, RHPQh);
  • R2H;
  • U2H and derivatives (U2Hl, U2Hh);
  • S2H (somewhat similar to U2H);
  • NHR;
  • Yg, Q;
  • voice field, F0 standard deviation, amplitude standard deviation.