From:
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 5, JULY 2007
Adaptation of Bayesian Models for "Single-Channel"
Source Separation and its Application to Voice/Music
Separation in Popular Songs
An efficient model must describe a given source, or class of
sources, accurately, as a collection of spectral shapes
corresponding to the various behaviors that can be observed
in the source realizations. This requires GMMs with a large
number of Gaussian functions, which raises several problems:
• trainability issues linked to the difficulty in gathering and
handling a representative set of examples for the sources
or classes of sources involved in the mix;
• selectivity issues arising from the fact that the particular
sources in the mix may only span a small range of observations
within the overall possibilities covered by the general
models;
• sensor and channel variability, which may strongly affect
the acoustic observations in the mix and cause a mismatch
of varying severity with the training conditions;
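• computational complexity, which can become intractable
with large source models, since the separation process requires
factorial models [5], [6], in which the number of joint states is
the product of the numbers of states of the individual source models.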
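
The computational complexity issue can be made concrete with a minimal sketch. It assumes a setting that the excerpt above does not spell out: each source is modeled by a GMM over short-time spectra whose components are zero-mean complex Gaussians with diagonal covariances (the "spectral shapes"), and the two sources add up in the mix. Under these assumptions the mixture variance for a pair of components is the sum of the two source variances, and every pair must be evaluated for every frame. The function and variable names are illustrative only and are not taken from the paper.

```python
import numpy as np

def pair_log_posteriors(x_power, var1, w1, var2, w2):
    """Log posterior of every component pair (k1, k2) for one mixture frame.

    Assumed (hypothetical) inputs:
      x_power : (F,)    squared magnitudes |X(f)|^2 of the mixture frame
      var1    : (K1, F) spectral shapes (variances) of source 1
      w1      : (K1,)   component weights of source 1
      var2    : (K2, F) spectral shapes (variances) of source 2
      w2      : (K2,)   component weights of source 2
    """
    # Under pair (k1, k2), the mixture variance is the sum of source variances.
    var_mix = var1[:, None, :] + var2[None, :, :]            # (K1, K2, F)
    # Zero-mean circular complex Gaussian log-density, summed over frequency.
    log_lik = -np.sum(np.log(np.pi * var_mix) + x_power / var_mix, axis=-1)
    log_post = np.log(w1)[:, None] + np.log(w2)[None, :] + log_lik
    # Normalize over all K1 * K2 pairs with a stable log-sum-exp.
    m = log_post.max()
    log_post = log_post - (m + np.log(np.exp(log_post - m).sum()))
    return log_post                                           # (K1, K2)

# Toy usage with random, purely illustrative parameters.
rng = np.random.default_rng(0)
K1, K2, F = 3, 4, 5
var1 = rng.uniform(0.5, 2.0, size=(K1, F))
var2 = rng.uniform(0.5, 2.0, size=(K2, F))
w1 = np.full(K1, 1.0 / K1)
w2 = np.full(K2, 1.0 / K2)
x_power = rng.uniform(0.1, 3.0, size=F)
log_post = pair_log_posteriors(x_power, var1, w1, var2, w2)
```

The intermediate array has K1 × K2 × F entries per frame, so the cost grows with the product of the two model sizes: two models of 64 components each already give 4096 pairs to evaluate for every frame.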
A typical situation that illustrates these difficulties is the
separation of voice from music in popular songs. For such a
task, it is unrealistic to model the entire population of music
sounds accurately with a tractable and efficient GMM. The
problem is all the more acute
as the actual realizations of music sounds within a given song
cover much less acoustic diversity than the general population
of music sounds.