Definition
PSOLA is a method used in voice synthesis to create speech material
while retaining a good level of naturalness.
The acronym stands for Pitch Synchronous OverLap and Add. It refers
to the fact that the speech material is created by concatenating ("overlapping
and adding") elementary segments whose duration is proportional
to the local pitch period. The method can be used to change the pitch and the
duration of an utterance: the pitch periods are extracted and then recombined
with a spacing different from the original one. The procedure consists of two
steps: analysis and synthesis.
Analysis
In the analysis phase, short-time (ST) signals are extracted from the
original vocal signal by means of a weighting window. The windows are
centered at mark points, which constitute the analysis time axis.
The duration of each window, $W_m$, is proportional
to the local analysis pitch period $d_m$.
In formulas:

$$x_m(t) = x(t)\, h_m(t - t_m), \qquad m = 0, \ldots, M$$

$$W_m = \mu\, d_m = \mu\, (t_m - t_{m-1})$$

where $x(t)$ is the original vocal signal, $h_m(t)$ is the weighting window, $x_m(t)$ is the analysis ST-signal, $t_m$ is the sequence of pitch mark points on the analysis time axis (see fig. 1) and $M$ is the total number of pitch periods of the vocal signal. $\mu$ is the proportionality factor; typically $\mu = 2$ is used for a wide-band analysis, corresponding to an overlap factor of 50% between adjacent pitch periods.
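The analysis step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the Hann window shape, the function names `hann` and `extract_st_signals`, the use of integer sample indices for the pitch marks, and the choice to skip the first mark (which has no preceding mark from which to measure a period) are all assumptions made for the example.

```python
import math


def hann(length):
    """Symmetric Hann window with `length` samples (assumed window shape)."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def extract_st_signals(x, marks, mu=2):
    """Return one (start_index, samples) pair per pitch mark t_m, m >= 1.

    x     : list of signal samples
    marks : increasing sample indices of the pitch marks t_m
    mu    : proportionality factor, window length W_m = mu * d_m
    """
    st_signals = []
    for m in range(1, len(marks)):
        period = marks[m] - marks[m - 1]      # d_m = t_m - t_{m-1}
        half = (mu * period) // 2             # half of W_m = mu * d_m
        start = marks[m] - half               # window is centered on t_m
        window = hann(2 * half + 1)
        samples = []
        for i, w in enumerate(window):
            n = start + i
            value = x[n] if 0 <= n < len(x) else 0.0   # zero-pad outside x
            samples.append(value * w)
        st_signals.append((start, samples))
    return st_signals


# Hypothetical input: a constant signal with a pitch mark every 10 samples.
st = extract_st_signals([1.0] * 100, [10, 20, 30, 40])
print(len(st), st[0][0], len(st[0][1]))
```

With a factor of 2, each window spans two pitch periods, so neighbouring windows overlap by half their length. With the Hann shape assumed here, two windows placed one period apart sum to one wherever they overlap.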
Synthesis
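In the synthesis phase, synthetic ST-signals $x_q(t)$ are obtained from the
analysis ST-signals by means of a transformation $Y$:

$$x_q(n) = Y(x_m(n)), \qquad q = 0, \ldots, Q, \quad m = 0, \ldots, M \qquad \text{(Eq. 1)}$$

The synthetic signal is then obtained by overlapping and adding the synthetic ST-signals, weighted by a synthesis window $h_q$ and normalized:

$$x_{synth}(n) = \frac{\sum_q x_q(n)\, h_q(t_q - n)}{\sum_q h_q^2(t_q - n)}$$

In this work, however, no synthesis window has been used, so that:

$$x_{synth}(n) = \sum_q x_q(n - \theta_q)$$

where $\theta_q$ ($q = 1, \ldots, Q$) is a sequence of local delays for the synthetic sequence. Table I shows how the pitch modification factor relates to the overlap between ST-signals and to the duration of the signal (relative to the original) when the total number of ST-signals is not changed, so that pitch and duration are modified by the same factor: raising the pitch shortens the signal, and lowering it lengthens the signal.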
Pitch factor | Overlapping | Duration |
1/μ | No overlapping | The signal is longer than the original |
Between 1/μ and 1 | < 50% | The signal is longer than the original |
1 | 50% | The signal is as long as the original |
> 1 | > 50% | The signal is shorter than the original |
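The relations in the table can be checked with a small sketch of the windowless overlap-add. It rests on several simplifying assumptions that are not part of the method as described: a constant pitch period, local delays taken as equally spaced multiples of the new period, and placeholder ST-signals of all ones. The names `overlap_add` and `overlap_fraction` are hypothetical.

```python
def overlap_add(st_signals, spacing):
    """Sum ST-signals delayed by theta_q = q * spacing (no synthesis window)."""
    if not st_signals:
        return []
    length = max(q * spacing + len(s) for q, s in enumerate(st_signals))
    out = [0.0] * length
    for q, samples in enumerate(st_signals):
        delay = q * spacing                  # local delay theta_q
        for i, value in enumerate(samples):
            out[delay + i] += value
    return out


def overlap_fraction(pitch_factor, mu=2):
    """Share of a window of length mu * d covered by the next window
    when windows are placed d / pitch_factor apart."""
    return max(0.0, 1.0 - 1.0 / (mu * pitch_factor))


period = 10                                  # assumed constant pitch period
# Placeholder ST-signals, each two periods long (proportionality factor 2).
st_signals = [[1.0] * (2 * period + 1) for _ in range(5)]

for factor in (0.5, 1.0, 2.0):
    spacing = round(period / factor)         # new distance between marks
    y = overlap_add(st_signals, spacing)
    print(factor, overlap_fraction(factor), len(y))
```

For the three factors tried, the output lengths are 101, 61 and 41 samples, and the overlap is 0%, 50% and 75%. This matches the table: a factor below one lengthens the signal and reduces the overlap, a factor above one shortens it and increases the overlap.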
Using a pitch synchronous method requires
every vocal signal file to be accompanied by a pitch marks file,
which specifies where and how to extract the ST-signals.
Pitch mark points must be placed at
the maxima of the instantaneous energy of the speech signal, so as to
preserve as much as possible the part of the ST-signal that is least
influenced by the neighbouring periods. By doing so, the distortions due to
the overlap operation are minimized.
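As a rough illustration of this placement rule, the sketch below picks one mark per period at a local maximum of the squared signal. It is not the procedure used to produce the pitch marks files: it assumes a voiced signal whose pitch period is roughly constant and already known, and the function name `place_pitch_marks` and the `search` parameter are hypothetical.

```python
def place_pitch_marks(x, period, search=None):
    """Pick one mark per pitch period at a local maximum of x[n] ** 2.

    x      : list of samples (voiced speech assumed)
    period : rough pitch period in samples (assumed known, roughly constant)
    search : half-width of the search region around each predicted mark
    """
    if not x:
        return []
    if search is None:
        search = period // 4
    energy = [v * v for v in x]              # instantaneous energy
    # First mark: energy maximum within the first period.
    first = max(range(min(period, len(x))), key=lambda n: energy[n])
    marks = [first]
    while True:
        predicted = marks[-1] + period       # expected position of next mark
        if predicted >= len(x):
            break
        lo = max(predicted - search, marks[-1] + 1)
        hi = min(predicted + search, len(x) - 1)
        marks.append(max(range(lo, hi + 1), key=lambda n: energy[n]))
    return marks


# Hypothetical input: impulses roughly 20 samples apart.
x = [0.0] * 100
for n in (7, 27, 48, 67, 88):
    x[n] = 1.0
print(place_pitch_marks(x, period=20, search=5))
```

Searching only a small region around each predicted position keeps the marks one period apart even when the energy has secondary peaks inside a period.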