A new approach to the evaluation of vocal effort by PSOLA method

PSOLA method

Definition
PSOLA is a method used in voice synthesis to create speech material while retaining a good level of naturalness. The acronym stands for Pitch Synchronous OverLap and Add: the speech material is created by concatenating ("overlapping and adding") elementary segments whose duration is proportional to the local pitch period. The method can be used to change the pitch and the duration of an utterance, simply by extracting such segments and recombining them in an arrangement different from the original. The procedure consists of two steps: analysis and synthesis.

Analysis
In the analysis phase, short-time (ST) signals are extracted by means of a weighting window from the original vocal signal. Windows are centered at some mark points which constitute the analysis time axis. The duration of the window (W_m) is proportional to the analysis local pitch period d_m(t).
In formulas:

x_m(t) = x(t) h_m(t - t_m),    m = 0, ..., M

W_m = μ d_m = μ (t_m - t_{m-1})

where x(t) is the original vocal signal, h_m(t) is the weighting window, x_m(t) is the analysis ST-signal, t_m is the sequence of pitch mark points on the analysis time axis (see fig. 1) and M is the total number of pitch periods of the vocal signal. μ is the proportionality factor; typically μ = 2 is used for a wide-band analysis, corresponding to an overlap of 50% between adjacent windows.
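The analysis step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: it assumes a Hann window as the weighting window, pitch marks given as integer sample indices, and the proportionality factor passed as `mu`. The function names are hypothetical.

```python
import math

def hann(length):
    """Symmetric Hann window of the given length (assumed window shape)."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def extract_st_signals(x, marks, mu=2):
    """Return one windowed ST-signal per pitch mark, starting at the second mark.

    x     : list of samples of the vocal signal
    marks : increasing pitch mark positions (sample indices)
    mu    : proportionality factor between window width and pitch period
    """
    segments = []
    for m in range(1, len(marks)):
        d_m = marks[m] - marks[m - 1]      # local pitch period in samples
        half = (mu * d_m) // 2             # half of the window width W_m
        win = hann(2 * half + 1)           # odd length, so it is centered on the mark
        seg = []
        for i, h in enumerate(win):
            n = marks[m] - half + i
            sample = x[n] if 0 <= n < len(x) else 0.0   # zero-pad outside the signal
            seg.append(sample * h)
        segments.append(seg)
    return segments
```

With `mu = 2` each ST-signal spans roughly two pitch periods, so neighbouring ST-signals share about half of their samples.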

Synthesis
In the synthesis phase, synthetic ST-signals x_q(t) are obtained from the analysis ST-signals by means of a transformation Y.

  Eq. 1:    x_q(n) = Y(x_m(n)),    with q = 0, ..., Q and m = 0, ..., M

where M and Q are the total numbers of pitch periods of the source and the target vocal signals, respectively. If no spectral modification is required, Y(·) is the identity function.
Finally, the synthetic ST-signals are weighted by a synthesis window h_q(t), whose width is twice the local synthesis pitch period d_q, and concatenated. They are centered on the mark points of the synthesis time axis (that is, on t_q, with q = 0, ..., Q). Adjacent ST-signals generally overlap, and overlapping samples are added. The resulting synthetic discrete-time signal is given by:

x_synth(n) = Σ_q x_q(n) h_q(t_q - n) / Σ_q h_q²(t_q - n)
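A minimal Python sketch of this weighted, normalized overlap-add follows. It assumes a Hann synthesis window of length 2 d_q + 1 samples, integer mark positions, and ST-signals stored as lists centered on their marks. These choices and the function names are illustrative, not taken from this work.

```python
import math

def hann(length):
    """Symmetric Hann window of the given length (assumed window shape)."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]

def overlap_add(st_signals, marks, periods, total_len, eps=1e-9):
    """Weighted overlap-add, normalized by the sum of squared windows.

    st_signals[q] holds 2*periods[q] + 1 samples centered on marks[q].
    """
    num = [0.0] * total_len    # running sum of x_q * h_q
    den = [0.0] * total_len    # running sum of h_q squared
    for seg, t_q, d_q in zip(st_signals, marks, periods):
        win = hann(2 * d_q + 1)
        for i, (s, h) in enumerate(zip(seg, win)):
            n = t_q - d_q + i              # absolute sample index
            if 0 <= n < total_len:
                num[n] += s * h
                den[n] += h * h
    # Samples not covered by any window are left at zero.
    return [a / b if b > eps else 0.0 for a, b in zip(num, den)]
```

Dividing by the sum of squared windows compensates for the amplitude modulation that a varying overlap would otherwise introduce.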

In this work, however, no synthesis window has been used, so that:

x_synth(n) = Σ_q x_q(n - θ_q)

where θ_q (q = 1, ..., Q) is a sequence of local delays for the synthetic sequence. TABLE I shows how the pitch modification factor relates to the overlap and to the duration of the signal (with respect to the original one) when the total number of ST-signals is unchanged, so that pitch and duration are modified by the same factor.

**`TABLE I`**
Pitch factor	Overlapping	Duration
m = m	No overlapping	The signal is longer than the original
-1 < m < m	< 50%	The signal is longer than the original
m = 1	50%	The signal is as long as the original
m > 1	> 50%	The signal is shorter than the original
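The overlap figures can be checked with a small calculation. The sketch below assumes analysis windows two pitch periods wide and expresses the modification as `period_ratio`, the ratio between the synthesis and the analysis pitch period. A pitch factor read as a frequency ratio would be its reciprocal. Both names are illustrative.

```python
def overlap_fraction(period_ratio, mu=2.0):
    """Fraction of a window shared with the next one.

    Assumes windows mu*d wide whose centers are period_ratio*d apart,
    where d is the analysis pitch period.
    """
    return max(0.0, 1.0 - period_ratio / mu)

# With an unchanged number of ST-signals, the duration scales by period_ratio.
for ratio in (2.0, 1.5, 1.0, 0.5):
    print(f"period ratio {ratio}: overlap {overlap_fraction(ratio):.0%}, "
          f"duration x{ratio}")
```

Under these assumptions a period ratio of 1 gives 50% overlap, a ratio equal to the window factor gives none, and ratios below 1 give more than 50%.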

Mapping
A delicate part of the algorithm is the mapping between the analysis and synthesis time axes. If duration and pitch period are both to be scaled by the same factor a, a simple "stretching" of the signal suffices: the synthetic pitch period becomes d_q(t) = a d_m(t). The total duration is then a times the original one and, since the number of ST-signals has not changed, pitch is automatically scaled by the same factor a. If duration and pitch must be altered by different factors, a different number of elementary segments is required. In this case the PSOLA method uses a very simple elimination-repetition technique. If only pitch is to be modified, the synthesis time axis keeps the same duration, but the local pitch period must be scaled, which changes the total number of ST-signals. If only duration is to be modified, ST-signals must be added (or suppressed) without altering the distance between adjacent pitch marks. If both are to be modified by different scale factors, the duration is scaled first and the pitch is scaled afterwards.
In figure 1 a mapping is shown, in which both pitch and duration are modified (by different factors), and in which the total number of ST-signals passes from M to Q.
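One possible form of such a mapping is sketched below. The rule used here, picking for each synthesis mark the analysis mark nearest to the corresponding time-warped position, is an assumption made for illustration, since the text does not specify the exact elimination-repetition rule. `duration_factor` and `period_factor` are hypothetical parameter names.

```python
def map_axes(analysis_marks, duration_factor, period_factor):
    """Build the synthesis time axis and choose an analysis ST-signal for each mark.

    duration_factor scales the total duration; period_factor scales the
    local pitch period. Returns (synthesis_marks, analysis_indices).
    """
    periods = [b - a for a, b in zip(analysis_marks, analysis_marks[1:])]
    start = analysis_marks[0]
    end = start + duration_factor * (analysis_marks[-1] - start)
    synth_marks, mapping = [], []
    t = float(start)
    while t <= end + 1e-9:
        # Position on the analysis axis that corresponds to synthesis time t.
        src_time = start + (t - start) / duration_factor
        # Nearest analysis mark: repeated when stretching, skipped when shrinking.
        m = min(range(len(analysis_marks)),
                key=lambda k: abs(analysis_marks[k] - src_time))
        synth_marks.append(t)
        mapping.append(m)
        # Advance by the scaled local pitch period.
        t += period_factor * periods[min(m, len(periods) - 1)]
    return synth_marks, mapping
```

For example, `map_axes([0, 10, 20, 30, 40], 2.0, 1.0)` doubles the duration while keeping the mark spacing, so most analysis ST-signals are used twice. With both factors equal to 2.0 each ST-signal is used exactly once.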

A pitch-synchronous method requires every vocal signal file to be accompanied by a pitch marks file, which specifies where and how to extract the ST-signals.
Pitch mark points must be placed at the maxima of the instantaneous energy of the speech signal, so as to best preserve the part of the ST-signal that is least influenced by the neighbouring periods. In this way, the distortions due to the overlap operation are minimized.
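As a rough illustration of energy-based marking, the sketch below places one mark per period at the maximum of a short-time energy estimate. It assumes a known, constant pitch period in samples, which is a strong simplification for real speech, and it is not the marking procedure used in this work. Function and parameter names are hypothetical.

```python
def short_time_energy(x, half_width):
    """Sum of squared samples in a window of 2*half_width + 1 samples."""
    energy = []
    for n in range(len(x)):
        lo = max(0, n - half_width)
        hi = min(len(x), n + half_width + 1)
        energy.append(sum(s * s for s in x[lo:hi]))
    return energy

def place_pitch_marks(x, period, half_width=1):
    """Return one mark per period-long frame, at the frame's energy maximum."""
    energy = short_time_energy(x, half_width)
    marks = []
    for start in range(0, len(x) - period + 1, period):
        frame = energy[start:start + period]
        # Index of the largest energy value inside this frame.
        marks.append(start + max(range(period), key=lambda i: frame[i]))
    return marks
```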