WO2014153922A1 - Human voice extracting method and system, and audio playing method and device for human voice - Google Patents

Human voice extracting method and system, and audio playing method and device for human voice

Info

Publication number
WO2014153922A1
WO2014153922A1 (PCT/CN2013/082328)
Authority
WO
WIPO (PCT)
Prior art keywords
human voice
main pitch
sound
frequency
sample
Prior art date
Application number
PCT/CN2013/082328
Other languages
French (fr)
Chinese (zh)
Inventor
佘海波
王进军
刘书昌
张欣
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2014153922A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 - Voice signal separating
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00, specially adapted for particular use, for comparison or discrimination

Definitions

  • the present invention relates to the field of mixed audio separation and extraction, and in particular, to a vocal extraction method and system, and a vocal audio playing method and apparatus.
  • in Auditory Scene Analysis (ASA), the auditory system uses various characteristics of sound (time domain, frequency domain, spatial position, etc.) to decompose one mixed sound signal into multiple signals, each belonging to a different physical sound source.
  • the Computational Auditory Scene Analysis (CASA) technique uses computer technology to simulate the human auditory system, ultimately giving the computer a sound-discrimination ability similar to that of the human ear.
  • a conventional CASA system first divides the sound into a part where the human voice and the background sound occur together and a part with only background sound; the signals of the part where the two occur together are then decomposed by a multi-channel filter; finally, the signal of each channel is classified to judge whether it belongs to the human voice or the background sound.
  • the invention provides a human voice extraction method and system, and a human voice audio playing method and device, to solve the technical problem of how to extract the human voice from mixed audio simply and conveniently.
  • the present invention provides a human voice extraction method, the method comprising: extracting, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample; detecting a main pitch from the sample; and, taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
  • each frame of the sound signal is passed through a multi-channel filter to obtain multiple time-frequency units, and adjacent time-frequency units belonging to the same sound source are merged into one segment;
  • if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
  • the method further includes:
  • after judging whether all segments of each frame are human voice segments, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged as human voice segments or not.
  • taking the changed main pitch as the reference frequency includes: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; only if the main pitch of several consecutive subsequent frames equals the changed value is the changed main pitch adopted as the reference frequency.
  • the present invention also provides a human voice audio playing method, the method comprising:
  • the human voice signal is linearly combined with the original sound signal and played.
  • the present invention also provides a human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein
  • the sample extraction unit is configured to: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;
  • the human voice detection unit is configured to: take the main pitch as a reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
  • the human voice detection unit being configured to make the above comparison includes:
  • the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
  • the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.
  • the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes includes:
  • when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.
  • the present invention also provides a human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein: the human voice extraction system extracts a human voice signal from the original sound signal using the system described above, and sends the human voice signal to the playing system;
  • the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
  • the above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions; moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together,
  • and does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.
  • FIG. 1 is a flow chart of a voice extraction method according to an embodiment of the present invention.
  • S101: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample;
  • for example, a segment of about 10 s may be read from the beginning of the original sound signal and separated into the part where the human voice and the background sound coexist and the part with only background sound; if no part where the human voice and the background sound coexist is found within these 10 s, the next 10 s may be read, and so on until the human voice is found;
  • main pitch detection is also called fundamental frequency (pitch) detection
  • Specific detection steps may include:
  • where $l$ is the filter order, $b$ is the filter bandwidth, and $f$ is the filter center frequency;
  • the data of each channel obtained after the frame passes through the Gammatone filterbank constitutes a basic time-frequency (T-F) unit; according to the auditory characteristics of the human ear, each T-F unit belongs to a single sound source (either the background sound or the human voice);
  • the autocorrelation of each channel is computed to obtain a correlogram; on the correlogram, the highest-intensity peak information of the low-frequency channels and the envelope information of the high-frequency channels are used to determine the fundamental frequency of the frame;
  • the autocorrelation calculation formula is:
  • $A_H(c, m, t) = \frac{1}{N_c} \sum_{n=0}^{N_c - 1} h(c, mT - n)\, h(c, mT - n - t)$
  • where $N_c$ is the frame period (the autocorrelation window size); $h(c, n)$, $n \in [0, N_c)$, is the value of the signal output at channel $c$ and time $n$; $c$ indexes the channel, $m$ the frame, and $t$ the delay, determined by the signal frequency corresponding to the maximum delay of the window; $t$ takes values in 0-12.5 ms, and $T$ is the number of samples corresponding to the frame shift;
  • the multi-channel filter may be a Gammatone filter
  • when merging adjacent time-frequency units belonging to the same sound source, the cross-correlation of adjacent time-frequency units is computed first; if the cross-correlation value of two adjacent time-frequency units is greater than a preset threshold, the adjacent units belong to the same sound source;
  • the cross-correlation calculation formula is: $C_H(c, m) = \sum_{t} \hat{A}_H(c, m, t)\, \hat{A}_H(c + 1, m, t)$, where $\hat{A}_H(c, m, t)$ denotes the normalized $A_H(c, m, t)$
  • because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch must be corrected continually: after all segments of each frame have been judged as human voice segments or not, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged. Further, to avoid transient jumps of the main pitch, when checking whether the main pitch of subsequent frames equals the changed value, the changed main pitch is adopted as the reference frequency only if the main pitch of several consecutive subsequent frames equals it. If, after all segments of a frame have been judged, no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), a sound signal in which the human voice and the background sound occur together is re-extracted as a sample, starting from the current frame.
  • this embodiment also provides a vocal audio playing method.
  • the human voice signal is extracted from the original sound signal by the human voice extraction method described above, and the human voice signal is then linearly combined with the original sound signal and played.
  • superimposing the separated human voice on the original sound achieves a speech-enhancement effect.
  • FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.
  • the system includes a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein: the sample extraction unit is configured to extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and to send the sample to the main pitch detection unit;
  • the main pitch detecting unit is configured to detect a main pitch from the sample, and send the main pitch to the vocal detecting unit;
  • the human voice detection unit is configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice;
  • the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames, for example one frame every 28 ms; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
  • because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the above-mentioned main pitch detection unit is further configured to continue detecting the main pitch from subsequent adjacent frames after the human voice detection unit finishes a frame, and, if the main pitch changes, to send the changed main pitch to the human voice detection unit as the reference frequency; to avoid transient jumps of the main pitch, when the main pitch detection unit detects in subsequent adjacent frames that the main pitch has changed, it continues to check whether the main pitch of further frames equals the changed value, and only if the main pitch of several consecutive subsequent frames equals the changed value does it send the changed main pitch to the human voice detection unit as the reference frequency.
  • the above-mentioned main pitch detection unit is further configured to, when no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), re-trigger the sample extraction unit to re-extract, starting from the current frame, a sound signal in which the human voice and the background sound occur together, as a sample.
  • this embodiment also provides a human voice audio playback device.
  • the device comprises the above-mentioned vocal sound extraction system and a playback system;
  • a vocal sound extraction system configured to extract a vocal signal from the original sound signal, and send the vocal signal to the playing system
  • the playing system is configured to linearly combine the vocal signal and the original sound signal to play.
  • the device superimposes the separated human voice on the original sound to achieve a speech-enhancement effect.
  • each module/unit in the foregoing embodiments may be implemented in the form of hardware or in the form of software function modules. The invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

A human voice extracting method and system, and an audio playing method and device for a human voice. The method comprises: extracting, from the beginning of an original sound signal, a sound signal in which both a human voice and background sound appear, as a sample (S101); detecting a main pitch from the sample (S102); and, using the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to a same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether the sound source is a human voice (S103). With this method, a human voice can be extracted from mixed audio simply and conveniently.

Description

Human Voice Extraction Method and System, and Human Voice Audio Playing Method and Device

Technical Field

The present invention relates to the field of mixed-audio separation and extraction, and in particular to a human voice extraction method and system, and a human voice audio playing method and device.
Background

To extract the human voice from audio such as two-channel stereo and enhance it, so as to make speech clearer and effectively reduce noise, a sound-separation technique capable of extracting a single audio source from mixed audio is needed. The technology currently able to meet this requirement is mainly audio separation based on Computational Auditory Scene Analysis (CASA).

In Auditory Scene Analysis (ASA), the auditory system uses various characteristics of sound (time domain, frequency domain, spatial position, etc.) to decompose one mixed sound signal into multiple signals, each belonging to a different physical sound source. Computational Auditory Scene Analysis (CASA) uses computer technology to simulate the human auditory system, ultimately giving the computer a sound-discrimination ability similar to that of the human ear. A conventional CASA system first divides the sound into a part where the human voice and the background sound occur together and a part with only background sound; the signals of the part where the two occur together are then decomposed by a multi-channel filter; finally, the signal of each channel is classified to judge whether it belongs to the human voice or the background sound.

However, current methods that use CASA to classify each channel's signal and extract the human voice must jointly consider many features of the audio signal, such as the main pitch, higher harmonics, energy, amplitude modulation, onsets, and offsets; the extraction algorithm is complex and the computation load is large.
Summary of the Invention

The present invention provides a human voice extraction method and system, and a human voice audio playing method and device, to solve the technical problem of how to extract the human voice from mixed audio simply and conveniently.

To solve the above technical problem, the present invention provides a human voice extraction method, the method comprising: extracting, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample; detecting a main pitch from the sample; and, taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
Preferably,

taking the main pitch as the reference frequency and comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice includes:

dividing the portion of the original sound signal other than the sample into multiple frames;

passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment;

if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.

Preferably, the method further includes:

after judging whether all segments of each frame are human voice segments, continuing to detect the main pitch from subsequent adjacent frames; if the main pitch changes, taking the changed main pitch as the reference frequency and continuing to judge whether the segments in the frames are human voice segments.

Preferably,

taking the changed main pitch as the reference frequency if the main pitch changes includes: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, taking the changed main pitch as the reference frequency.
To solve the above technical problem, the present invention also provides a human voice audio playing method, the method comprising:

extracting a human voice signal from an original sound signal using the method described above;

linearly combining the human voice signal with the original sound signal and playing the result.
To solve the above technical problem, the present invention also provides a human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein,

the sample extraction unit is configured to: extract, from the beginning of an original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;

the human voice detection unit is configured to: take the main pitch as the reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
Preferably,

the human voice detection unit being configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice includes:

the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.

Preferably,

the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.

Preferably,

the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes includes:

when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.

To solve the above technical problem, the present invention also provides a human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein: the human voice extraction system extracts a human voice signal from the original sound signal using the system described above, and sends the human voice signal to the playing system;

the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
The above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions. Moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together; it does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.

Brief Description of the Drawings
FIG. 1 is a flowchart of the human voice extraction method of this embodiment;

FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.

Preferred Embodiments of the Invention
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another arbitrarily.

FIG. 1 is a flowchart of the human voice extraction method of this embodiment.

S101: extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample.

For example, a segment of about 10 s may be read from the beginning of the original sound signal and separated into the part where the human voice and the background sound coexist and the part with only background sound; if no part where the human voice and the background sound coexist is found within these 10 s, the next 10 s may be read, and so on until the human voice is found.
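As a rough illustration of this scanning loop, the sketch below assumes a caller-supplied `has_voice` predicate, since the patent does not specify how the coexistence of voice and background sound within a chunk is detected; any voice-activity detector could be substituted.

```python
import numpy as np

def find_vocal_sample(signal, sr, chunk_seconds=10.0, has_voice=None):
    """Scan ~10 s chunks from the start of the signal until one containing
    both human voice and background sound is found (step S101).

    `has_voice(chunk, sr)` is a hypothetical predicate, not defined by the
    patent, that reports whether a chunk contains voice mixed with
    background sound.
    """
    chunk = int(chunk_seconds * sr)
    for start in range(0, len(signal), chunk):
        candidate = signal[start:start + chunk]
        if has_voice(candidate, sr):
            return candidate, start  # the sample and its offset in samples
    raise ValueError("no segment with voice and background sound was found")
```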
S102: detect the main pitch from the sample.

Main pitch detection is also called fundamental frequency (pitch) detection.

The specific detection steps may include:

1) dividing the sample into frames in the time domain, for example with a 20 ms frame length and a 10 ms frame shift;

2) for each frame:
First, auditory peripheral processing is performed: the frame is filtered with a Gammatone filterbank with N = 128 channels, whose impulse response has the time-domain form

$g(t) = t^{\,l-1}\, e^{-2\pi b t} \cos(2\pi f t), \quad t \ge 0,$

where $l$ is the filter order, $b$ is the filter bandwidth, and $f$ is the filter center frequency. The data of each channel obtained after the frame passes through the Gammatone filterbank constitutes a basic time-frequency (T-F) unit; according to the auditory characteristics of the human ear, each T-F unit belongs to a single sound source (either the background sound or the human voice).
Next, the autocorrelation of each channel is computed to obtain a correlogram; on the correlogram, the highest-intensity peak information of the low-frequency channels and the envelope information of the high-frequency channels are used to determine the fundamental frequency of the frame.

The autocorrelation calculation formula is:

$A_H(c, m, t) = \frac{1}{N_c} \sum_{n=0}^{N_c - 1} h(c, mT - n)\, h(c, mT - n - t)$

where $N_c$ is the frame period (the autocorrelation window size); $h(c, n)$, $n \in [0, N_c)$, is the value of the signal output at channel $c$ and time $n$; $c$ indexes the channel and $m$ the frame; $t$ is the delay, determined by the signal frequency corresponding to the maximum delay of the window, and takes values in 0-12.5 ms; $T$ is the number of samples corresponding to the frame shift.

3) After the fundamental frequency of each frame is obtained, fundamental frequencies with large deviations are excluded, and the average of the remaining fundamental frequencies is taken as the main pitch.
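A minimal NumPy sketch of S102 follows, under stated assumptions: a 4th-order Gammatone with ERB-scale bandwidths (the patent fixes only N = 128 and the variable names l, b, f), simple peak-picking on the channel-summed correlogram in place of the low/high-frequency channel analysis, and a median-based outlier rule for step 3.

```python
import numpy as np

def gammatone_ir(f, sr, l=4, duration=0.04):
    """g(t) = t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*f*t); the ERB-scale
    bandwidth b and the order l = 4 are common choices, assumed here."""
    t = np.arange(int(duration * sr)) / sr
    b = 1.019 * (24.7 + 0.108 * f)
    g = t ** (l - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
    return g / (np.abs(g).sum() + 1e-12)

def filterbank(frame, sr, n_channels=128, fmin=80.0, fmax=5000.0):
    """Pass one frame through the N = 128 Gammatone channels; each
    channel's output is one basic time-frequency (T-F) unit."""
    freqs = np.geomspace(fmin, fmax, n_channels)
    return np.stack([np.convolve(frame, gammatone_ir(f, sr))[:len(frame)]
                     for f in freqs])

def autocorrelation(h, max_lag):
    """A_H(c, t) = (1/Nc) * sum_n h(c, n) * h(c, n - t), per channel."""
    n_ch, n = h.shape
    A = np.zeros((n_ch, max_lag))
    for t in range(max_lag):
        A[:, t] = (h[:, t:] * h[:, :n - t]).sum(axis=1) / n
    return A

def main_pitch(sample, sr, frame_ms=20, shift_ms=10, f0_max=500.0):
    """Per-frame F0 from the summed correlogram, then the average of the
    frames' F0s after discarding outliers (step 3)."""
    flen = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    max_lag = int(0.0125 * sr)            # delay t ranges over 0..12.5 ms
    f0s = []
    for start in range(0, len(sample) - flen, shift):
        units = filterbank(sample[start:start + flen], sr)
        A = autocorrelation(units, max_lag)
        summary = A.sum(axis=0)           # pool channels into one correlogram
        lag_min = int(sr / f0_max)        # ignore lags above f0_max
        lag = lag_min + np.argmax(summary[lag_min:])
        f0s.append(sr / lag)
    f0s = np.asarray(f0s)
    med = np.median(f0s)
    return f0s[np.abs(f0s - med) < 0.2 * med].mean()  # drop large deviations
```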
S103: taking the main pitch as the reference frequency, the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample is compared with the reference frequency to determine whether that sound source is a human voice. This includes:

1) dividing the portion of the original sound signal other than the sample into multiple frames. On the Android platform, sound is treated as a "stream" for input and output: the sound stream is read into a buffer, handed to the relevant functions for processing, and the processed stream is then played; roughly 28 ms elapse from reading the stream into the buffer until it is played, so the portion of the original sound signal other than the sample can be divided into frames of 28 ms each;

2) passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment; in this way, through the merging of time-frequency units, one frame of the signal may contain several segments, and this process is called segmentation;

the multi-channel filter may be a Gammatone filter;

when merging adjacent time-frequency units belonging to the same sound source, the cross-correlation of adjacent time-frequency units is computed first; if the cross-correlation value of two adjacent time-frequency units is greater than a preset threshold, the adjacent units belong to the same sound source;

the cross-correlation calculation formula is:
$C_H(c, m) = \sum_{t} \hat{A}_H(c, m, t)\, \hat{A}_H(c + 1, m, t)$

where $\hat{A}_H(c, m, t)$ denotes the normalized $A_H(c, m, t)$.

3) If the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
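Continuing the NumPy sketch above, the following illustrates steps 2) and 3). Several choices are assumptions: the merge threshold of 0.95 (the patent leaves the preset threshold unspecified), merging across adjacent frequency channels within one frame only (the patent's merging also spans adjacent frames), and a small tolerance band in place of exact equality with the reference frequency, since exact equality rarely holds for sampled signals.

```python
import numpy as np

def merge_into_segments(A, threshold=0.95):
    """Merge adjacent channels whose normalized autocorrelations are
    strongly cross-correlated: C_H(c) = sum_t Ahat(c,t) * Ahat(c+1,t)."""
    Ahat = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    segments, current = [], [0]
    for c in range(A.shape[0] - 1):
        if float((Ahat[c] * Ahat[c + 1]).sum()) > threshold:
            current.append(c + 1)        # same sound source: extend segment
        else:
            segments.append(current)     # start a new segment
            current = [c + 1]
    segments.append(current)
    return segments

def is_vocal_segment(A, channels, sr, ref_freq, tol=0.05, f0_max=500.0):
    """Step 3: vocal if more than half of the segment's T-F units have a
    fundamental frequency matching the reference frequency."""
    lag_min = int(sr / f0_max)
    hits = 0
    for c in channels:
        lag = lag_min + np.argmax(A[c, lag_min:])
        if abs(sr / lag - ref_freq) <= tol * ref_freq:
            hits += 1
    return hits > len(channels) / 2
```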
Because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch must be corrected continually: after all segments of each frame have been judged as human voice segments or not, the main pitch continues to be detected from subsequent adjacent frames; if the main pitch changes, the changed main pitch is taken as the reference frequency, and the segments in those frames continue to be judged. Further, to avoid transient jumps of the main pitch, when checking whether the main pitch of subsequent frames equals the changed value, the changed main pitch is adopted as the reference frequency only if the main pitch of several consecutive subsequent frames equals it. If, after all segments of a frame have been judged, no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), a sound signal in which the human voice and the background sound occur together is re-extracted as a sample, starting from the current frame.

This iterative correction of the main pitch can meet real-time processing requirements while keeping the algorithm complexity low.
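A sketch of this iterative update follows; the confirmation count of 3 frames and the 5% change tolerance are assumptions, since the patent requires only that "several consecutive subsequent frames" show the changed value.

```python
class PitchTracker:
    """Track the reference frequency, accepting a changed main pitch only
    after it persists for `confirm` consecutive frames."""

    def __init__(self, ref_freq, confirm=3, tol=0.05):
        self.ref = ref_freq          # current reference frequency
        self.confirm = confirm       # frames needed to confirm a change
        self.tol = tol               # relative tolerance for "equal"
        self._pending = None         # candidate new pitch
        self._count = 0

    def update(self, frame_pitch):
        """Feed the main pitch detected in the next frame. Returns the
        reference frequency to use; None signals that the voice has
        disappeared and a new sample must be extracted from this frame."""
        if frame_pitch is None:      # no main pitch detected
            return None
        if abs(frame_pitch - self.ref) <= self.tol * self.ref:
            self._pending, self._count = None, 0   # no change
        elif (self._pending is not None
              and abs(frame_pitch - self._pending) <= self.tol * self._pending):
            self._count += 1
            if self._count >= self.confirm:        # change confirmed
                self.ref, self._pending, self._count = frame_pitch, None, 0
        else:
            self._pending, self._count = frame_pitch, 1  # new candidate
        return self.ref
```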
Based on the above human voice extraction method, this embodiment also provides a human voice audio playing method. In this method, the human voice signal is first extracted from the original sound signal using the human voice extraction method described above, and the human voice signal is then linearly combined with the original sound signal and played. Superimposing the separated human voice on the original sound achieves a speech-enhancement effect.
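The linear combination itself is a simple weighted mix; the gains below are illustrative assumptions, as the patent does not fix the combination weights.

```python
import numpy as np

def enhance(original, voice, voice_gain=1.0, original_gain=1.0):
    """Play-out signal = a linear combination of the extracted human voice
    and the original sound signal."""
    n = min(len(original), len(voice))
    mixed = original_gain * original[:n] + voice_gain * voice[:n]
    return mixed / max(1.0, float(np.abs(mixed).max()))  # guard against clipping
```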
FIG. 2 is a composition diagram of the human voice extraction system of this embodiment.

The system includes a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein: the sample extraction unit is configured to extract, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together, as a sample, and to send the sample to the main pitch detection unit;

the main pitch detection unit is configured to detect the main pitch from the sample and send it to the human voice detection unit;

the human voice detection unit is configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.

The human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames, for example one frame every 28 ms to suit the sound-processing mechanism of the Android platform; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.

Because the main pitch changes constantly while a person sings, to ensure that the main pitch used as the reference frequency accurately reflects the human voice, the main pitch detection unit is further configured to continue detecting the main pitch from subsequent adjacent frames after the human voice detection unit finishes a frame, and, if the main pitch changes, to send the changed main pitch to the human voice detection unit as the reference frequency. To avoid transient jumps of the main pitch, when the main pitch detection unit detects in subsequent adjacent frames that the main pitch has changed, it continues to check whether the main pitch of further frames equals the changed value, and only if the main pitch of several consecutive subsequent frames equals the changed value does it send the changed main pitch to the human voice detection unit as the reference frequency.

The main pitch detection unit is further configured to, when no main pitch can be detected in subsequent adjacent frames (for example, the human voice has disappeared), re-trigger the sample extraction unit to re-extract, starting from the current frame, a sound signal in which the human voice and the background sound occur together, as a sample.
Based on the above human voice extraction system, this embodiment also provides a human voice audio playing device. The device includes the above human voice extraction system and a playing system;

the human voice extraction system is configured to extract a human voice signal from the original sound signal and send the human voice signal to the playing system;

the playing system is configured to linearly combine the human voice signal with the original sound signal and play the result. By superimposing the separated human voice on the original sound, the device achieves a speech-enhancement effect.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above methods may be carried out by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits; accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of software function modules. The present invention is not limited to any specific combination of hardware and software.

It should be noted that the present invention may have various other embodiments, and corresponding changes and variations may be made without departing from the spirit and essence of the present invention; all such changes and variations shall fall within the protection scope of the claims appended to the present invention.
Industrial Applicability

The above technical solution determines whether a sound is a human voice by taking the main pitch of the sound signal as the reference frequency, which is simpler to implement than existing human voice extraction solutions. Moreover, it only needs to find, from the beginning of the original sound signal, a sound signal in which the human voice and the background sound occur together; it does not need to divide the whole original sound signal into a part where both occur simultaneously and a part with only background sound, which reduces the amount of sound data to be preprocessed.

Claims

1. A human voice extraction method, the method comprising:

extracting, from the beginning of an original sound signal, a sound signal in which a human voice and a background sound occur together, as a sample; detecting a main pitch from the sample;

taking the main pitch as a reference frequency, comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
2. The method according to claim 1, wherein,

taking the main pitch as the reference frequency and comparing the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice comprises:

dividing the portion of the original sound signal other than the sample into multiple frames;

passing each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units, and merging adjacent time-frequency units belonging to the same sound source into one segment;

if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, the segment is a human voice segment.
3. The method according to claim 2, wherein the method further comprises:

after judging whether all segments of each frame are human voice segments, continuing to detect the main pitch from subsequent adjacent frames; if the main pitch changes, taking the changed main pitch as the reference frequency and continuing to judge whether the segments in the frames are human voice segments.

4. The method according to claim 3, wherein,

taking the changed main pitch as the reference frequency if the main pitch changes comprises: if the main pitch changes, continuing to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, taking the changed main pitch as the reference frequency.
5. A human voice audio playing method, the method comprising:

extracting a human voice signal from an original sound signal using the method according to any one of claims 1 to 4; linearly combining the human voice signal with the original sound signal and playing the result.
6. A human voice extraction system, the system comprising a sample extraction unit, a main pitch detection unit, and a human voice detection unit, wherein,

the sample extraction unit is configured to: extract, from the beginning of an original sound signal, a sound signal in which a human voice and a background sound occur together, as a sample, and send the sample to the main pitch detection unit; the main pitch detection unit is configured to: detect a main pitch from the sample, and send the main pitch to the human voice detection unit;

the human voice detection unit is configured to: take the main pitch as a reference frequency, and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency, to determine whether that sound source is a human voice.
7. The system according to claim 6, wherein,

the human voice detection unit being configured to take the main pitch as the reference frequency and compare the fundamental frequency of sound belonging to the same sound source in the portion of the original sound signal other than the sample with the reference frequency to determine whether that sound source is a human voice comprises:

the human voice detection unit divides the portion of the original sound signal other than the sample into multiple frames; passes each frame of the sound signal through a multi-channel filter to obtain multiple time-frequency units; merges adjacent time-frequency units belonging to the same sound source into one segment; and, if the fundamental frequency of more than half of the time-frequency units within a segment equals the reference frequency, judges the segment to be a human voice segment.
8. The system according to claim 7, wherein,

the main pitch detection unit is further configured to: after the human voice detection unit finishes a frame, continue to detect the main pitch from subsequent adjacent frames, and, if the main pitch changes, send the changed main pitch to the human voice detection unit as the reference frequency.

9. The system according to claim 8, wherein,

the main pitch detection unit being configured to take the changed main pitch as the reference frequency when the main pitch changes comprises:

when the main pitch changes, the main pitch detection unit continues to check whether the main pitch of subsequent frames equals the changed value; if the main pitch of several consecutive subsequent frames equals the changed value, it takes the changed main pitch as the reference frequency.
10. A human voice audio playing device, the device comprising a human voice extraction system and a playing system, wherein:

the human voice extraction system extracts a human voice signal from an original sound signal using the system according to any one of claims 6 to 9, and sends the human voice signal to the playing system;

the playing system is configured to: linearly combine the human voice signal with the original sound signal and play the result.
PCT/CN2013/082328 2013-03-29 2013-08-27 Human voice extracting method and system, and audio playing method and device for human voice WO2014153922A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310108032.9A CN104078051B (en) 2013-03-29 2013-03-29 A kind of voice extracting method, system and voice audio frequency playing method and device
CN201310108032.9 2013-03-29

Publications (1)

Publication Number Publication Date
WO2014153922A1 true WO2014153922A1 (en) 2014-10-02

Family

ID=51599272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/082328 WO2014153922A1 (en) 2013-03-29 2013-08-27 Human voice extracting method and system, and audio playing method and device for human voice

Country Status (2)

Country Link
CN (1) CN104078051B (en)
WO (1) WO2014153922A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105321526B (en) * 2015-09-23 2020-07-24 联想(北京)有限公司 Audio processing method and electronic equipment
CN106571150B (en) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 Method and system for recognizing human voice in music
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN105719657A (en) * 2016-02-23 2016-06-29 惠州市德赛西威汽车电子股份有限公司 Human voice extracting method and device based on microphone
CN105810212B (en) * 2016-03-07 2019-04-23 合肥工业大学 A kind of train under complicated noise is blown a whistle recognition methods
CN108962277A (en) * 2018-07-20 2018-12-07 广州酷狗计算机科技有限公司 Speech signal separation method, apparatus, computer equipment and storage medium
CN109036455B (en) * 2018-09-17 2020-11-06 中科上声(苏州)电子有限公司 Direct sound and background sound extraction method, loudspeaker system and sound reproduction method thereof
CN109524016B (en) * 2018-10-16 2022-06-28 广州酷狗计算机科技有限公司 Audio processing method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05210397A (en) * 1992-01-30 1993-08-20 Fujitsu Ltd Voice recognizing device
JP2003058186A (en) * 2001-08-13 2003-02-28 Yrp Kokino Idotai Tsushin Kenkyusho:Kk Method and device for suppressing noise
CN1808571A (en) * 2005-01-19 2006-07-26 松下电器产业株式会社 Acoustical signal separation system and method
CN101601088A (en) * 2007-09-11 2009-12-09 松下电器产业株式会社 Sound judgment means, sound detection device and sound determination methods
CN102054480A (en) * 2009-10-29 2011-05-11 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1271593C (en) * 2004-12-24 2006-08-23 北京中星微电子有限公司 Voice signal detection method
CN1945689B (en) * 2006-10-24 2011-04-27 北京中星微电子有限公司 Method and its device for extracting accompanying music from songs
CN101193460B (en) * 2006-11-20 2011-09-28 松下电器产业株式会社 Sound detection device and method
KR101459766B1 (en) * 2008-02-12 2014-11-10 삼성전자주식회사 Method for recognizing a music score image with automatic accompaniment in a mobile device
CN101577117B (en) * 2009-03-12 2012-04-11 无锡中星微电子有限公司 Extracting method of accompaniment music and device
CN102945675A (en) * 2012-11-26 2013-02-27 江苏物联网研究发展中心 Intelligent sensing network system for detecting outdoor sound of calling for help

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05210397A (en) * 1992-01-30 1993-08-20 Fujitsu Ltd Voice recognizing device
JP2003058186A (en) * 2001-08-13 2003-02-28 Yrp Kokino Idotai Tsushin Kenkyusho:Kk Method and device for suppressing noise
CN1808571A (en) * 2005-01-19 2006-07-26 松下电器产业株式会社 Acoustical signal separation system and method
CN101601088A (en) * 2007-09-11 2009-12-09 松下电器产业株式会社 Sound judgment means, sound detection device and sound determination methods
CN102054480A (en) * 2009-10-29 2011-05-11 北京理工大学 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)
CN102402977A (en) * 2010-09-14 2012-04-04 无锡中星微电子有限公司 Method for extracting accompaniment and human voice from stereo music and device of method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, YIPENG ET AL.: "Separation of Singing Voice From Music Accompaniment for Monaural Recordings", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, vol. 15, no. 4, May 2007 (2007-05-01), pages 1475 - 1487 *

Also Published As

Publication number Publication date
CN104078051B (en) 2018-09-25
CN104078051A (en) 2014-10-01

Similar Documents

Publication Publication Date Title
WO2014153922A1 (en) Human voice extracting method and system, and audio playing method and device for human voice
KR101726208B1 (en) Volume leveler controller and controlling method
JP4906230B2 (en) A method for time adjustment of audio signals using characterization based on auditory events
US7974838B1 (en) System and method for pitch adjusting vocals
CN105518778B (en) Wobble buffer controller, audio decoder, method and computer readable storage medium
JP6212567B2 (en) System, computer-readable storage medium and method for recovering compressed audio signals
TW202004736A (en) Systems and methods for intelligent voice activation for auto mixing
CN112400325A (en) Data-driven audio enhancement
JP5737808B2 (en) Sound processing apparatus and program thereof
US8489404B2 (en) Method for detecting audio signal transient and time-scale modification based on same
JP2009511954A (en) Neural network discriminator for separating audio sources from mono audio signals
US9502047B2 (en) Talker collisions in an auditory scene
CN103155030A (en) Method and apparatus for processing a multi-channel audio signal
Vestergaard et al. The mutual roles of temporal glimpsing and vocal characteristics in cocktail-party listening
Hummersone A psychoacoustic engineering approach to machine sound source separation in reverberant environments
Buyens et al. A stereo music preprocessing scheme for cochlear implant users
CN108965904B (en) Volume adjusting method and client of live broadcast room
CN112053669B (en) Method, device, equipment and medium for eliminating human voice
RU2009116279A (en) METHODS AND DEVICES FOR CODING AND DECODING OF OBJECT-ORIENTED AUDIO SIGNALS
JP2008102551A (en) Apparatus for processing voice signal and processing method thereof
CN106328159B (en) Audio stream processing method and device
JP2011013383A (en) Audio signal correction device and audio signal correction method
JP4826814B2 (en) Audio signal processing device
JP6313619B2 (en) Audio signal processing apparatus and program
CN110475144A (en) The extracting method of 16 channel audios in a kind of 12G-SDI data flow based on FPGA

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13879827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13879827

Country of ref document: EP

Kind code of ref document: A1