CN107799126B - Voice endpoint detection method and device based on supervised machine learning - Google Patents

Voice endpoint detection method and device based on supervised machine learning

Info

Publication number
CN107799126B
CN107799126B (application CN201710957669.3A)
Authority
CN
China
Prior art keywords
voice
segment
speaker
audio
speech
Application number
CN201710957669.3A
Other languages
Chinese (zh)
Other versions
CN107799126A (en)
Inventor
宋亚楠
邱楠
王昊奋
Original Assignee
苏州狗尾草智能科技有限公司
上海瓦歌智能科技有限公司
Application filed by 苏州狗尾草智能科技有限公司 and 上海瓦歌智能科技有限公司
Priority to CN201710957669.3A
Publication of CN107799126A
Application granted
Publication of CN107799126B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection

Abstract

The invention relates to a voice endpoint detection method and device based on supervised machine learning. The method comprises the following steps: detecting a mute segment, a transition segment and an end segment from the acquired audio; inputting the mute segment and the end segment into a pre-constructed background noise model to identify the current scene to which the audio belongs; representing the speech segment to be recognized as a vector, wherein the speech segment to be recognized is the audio excluding the mute segment, the transition segment and the end segment; and inputting the identified current scene and the vectorized speech segment to be recognized into a pre-constructed RNN model to recognize the speech endpoints, wherein the speech endpoints comprise the starting point and the end point of the speech segment. By judging the current scene and using it as one of the input parameters of the RNN model, the method and device improve the accuracy of the RNN model's decisions and thereby the accuracy and efficiency of voice endpoint detection.

Description

Voice endpoint detection method and device based on supervised machine learning

Technical Field

The invention relates to the technical field of voice recognition, in particular to a voice endpoint detection method and device based on supervised machine learning.

Background

Voice Activity Detection (VAD) refers to detecting valid voice segments from a continuous audio stream. It includes two aspects: detecting the front endpoint, which is the starting point of valid voice, and detecting the rear endpoint, which is the end point of valid voice. Endpoint detection is necessary in voice applications. First, when voice is stored or transmitted, separating the valid voice from the continuous stream reduces the amount of data to be stored or transmitted. Second, in specific application scenarios such as human-computer interaction, endpoint detection simplifies operation; for example, in a recording scenario, rear endpoint detection makes an explicit operation to end the recording unnecessary. Accurate voice endpoint detection therefore improves channel utilization and reduces the amount of data that voice processing must handle.

Fig. 5 shows a segment of speech containing two words. As can be seen intuitively from fig. 5, the amplitude of the sound wave in the silent parts at the beginning and end of the audio is small, while the amplitude in the valid speech parts is large. The amplitude of a signal reflects its energy: the energy of the silent parts is small and the energy of the valid speech parts is large. The speech signal is a one-dimensional continuous function of time, and the speech data processed by a computer is a time-ordered sequence of samples of that signal; the magnitude of each sample likewise reflects the energy of the speech signal at that sample point.

Early endpoint detection algorithms were based on short-time energy and zero-crossing rate, cepstral distance, spectral entropy and similar features, but these methods find it difficult to balance recognition performance against processing speed.

Other endpoint detection methods include time-domain parameter methods, transform-domain parameter methods and statistical model methods. Time-domain parameter methods are only suitable for detecting stationary noise and are not robust to different noise backgrounds. Transform-domain parameter methods only support noisy speech with SNR > 0 dB, and they fail when the noise and the speech signal have similar transform-domain characteristics. Statistical model methods are computationally expensive and may require different statistical models to be built for different noise backgrounds.

With the continuous development and application of intelligent robots, an accurate and efficient voice endpoint detection method is urgently needed.

Disclosure of Invention

Aiming at the defects of the prior art, the voice endpoint detection method and device based on supervised machine learning provided by the invention judge the current scene and use it as one of the input parameters of the RNN model, thereby improving the accuracy of the RNN model's decisions and, in turn, the accuracy and efficiency of voice endpoint detection.

In a first aspect, the present invention provides a method for detecting a voice endpoint based on supervised machine learning, including:

step S1, detecting a mute section, a transition section and an end section from the acquired audio;

step S2, inputting the mute section and the end section into a pre-constructed background noise model, and identifying the current scene to which the audio belongs;

step S3, representing the speech segment to be recognized by a vector, wherein the speech segment to be recognized is the audio without the mute segment, the transition segment and the end segment;

step S4, inputting the recognized current scene and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing a speech endpoint, where the speech endpoint includes a start point of the speech segment and an end point of the speech segment.

Preferably, detecting the mute segment, the transition segment and the end segment from the acquired audio includes: detecting the transition segment, the mute segment and the end segment by a short-time energy and zero-crossing rate method.

Preferably, the method for constructing the background noise model includes:

analyzing the audio in each specific scene to obtain the characteristics of the background noise in that scene, and modeling the background noise of each specific scene to obtain the background noise model.

Preferably, the RNN model construction method includes:

collecting a large number of speech recordings, marking the mute segment, the transition segment, the starting point, the end point and the end segment of each recording, and marking the scene in which the speech was produced;

performing segmented sampling on the speech, and converting each speech segment obtained by the segmentation into a vector representation of the same dimensionality;

combining all the segment vectors into one vector through a linear regression transformation to obtain the vector representation of the speech;

training the RNN with the vector representation of the speech segment, the starting point and end point of the valid speech segment, and the scene in which the speech was produced as inputs, to obtain an RNN model capable of outputting the speech starting point and end point for input speech.

Preferably, before the step S4, the method further includes: inputting the voice segment to be recognized into the user characteristic model to obtain the characteristics of a speaker;

the step S4 includes: and inputting the recognized current scene, the characteristics of the speaker and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing speech end points, wherein the speech end points comprise a starting point of the speech segment and an end point of the speech segment.

Preferably, before the step S1, the method further includes:

detecting instruction words from the acquired audio;

inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;

inputting newly acquired audio into the user characteristic model to obtain the characteristics of multiple persons, identifying the audio most similar to the stored characteristics of the speaker, and acquiring the subsequent audio of that speaker by a sound source localization technique.

Preferably, before step S1, the method further includes:

detecting instruction words from the acquired audio;

inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;

acquiring a real-time image through a camera, identifying facial features of multiple persons in the real-time image, finding out a current speaker according to the characteristics of the speaker, and recording the facial features of the current speaker;

acquiring a new image through the camera, identifying a face most similar to the recorded facial features of the speaker from the image, and positioning the spatial position of the most similar face;

and acquiring the audio frequency of the speaker by combining a sound source positioning technology according to the spatial position.

Preferably, the method for constructing the user feature model includes:

collecting a large number of effective voice sections, and marking the characteristics of a speaker in each effective voice section; wherein the features include: the age and gender of the user;

classifying the effective voice sections according to the marked characteristics;

and carrying out frequency domain and time domain statistics on the effective voice sections in each class to obtain voice characteristics contained in user voices with different characteristics, and establishing a user characteristic model capable of judging the characteristics of the speaker according to the effective voice sections.

In a second aspect, the present invention provides a speech endpoint detection apparatus based on supervised machine learning, including:

the segmentation detection module is used for detecting a mute segment, a transition segment and an end segment from the acquired audio;

the scene identification module is used for inputting the mute section and the end section into a pre-constructed background noise model and identifying the current scene to which the audio belongs;

the vectorization module is used for representing the voice section to be recognized by a vector, wherein the voice section to be recognized is the audio without the mute section, the transition section and the ending section;

and the voice endpoint identification module is used for inputting the identified current scene and the vectorized segment to be identified into a pre-constructed RNN model and identifying a voice endpoint, wherein the voice endpoint comprises a starting point of the voice segment and an end point of the voice segment.

Preferably, the segment detection module is specifically configured to: detect the transition segment, the mute segment and the end segment by a short-time energy and zero-crossing rate method.

Preferably, the method for constructing the background noise model includes:

analyzing the audio in each specific scene to obtain the characteristics of the background noise in that scene, and modeling the background noise of each specific scene to obtain the background noise model.

Preferably, the RNN model construction method includes:

collecting a large number of speech recordings, marking the mute segment, the transition segment, the starting point, the end point and the end segment of each recording, and marking the scene in which the speech was produced;

performing segmented sampling on the speech, and converting each speech segment obtained by the segmentation into a vector representation of the same dimensionality;

combining all the segment vectors into one vector through a linear regression transformation to obtain the vector representation of the speech;

training the RNN with the vector representation of the speech segment, the starting point and end point of the valid speech segment, and the scene in which the speech was produced as inputs, to obtain an RNN model capable of outputting the speech starting point and end point for input speech.

Preferably, the system further comprises a multi-person scene recognition module, configured to:

detecting instruction words from the acquired multi-person audio;

inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain the characteristic of the speaker corresponding to the instruction word;

and acquiring a real-time image through a camera, judging the current speaker by combining the characteristics of the speaker, and acquiring the audio corresponding to the current speaker from the multi-person audio.

Preferably, the method for constructing the user feature model includes:

collecting a large number of effective voice sections, and marking the characteristics of a speaker in each effective voice section; wherein the features include: the age and gender of the user;

classifying the effective voice sections according to the marked characteristics;

and carrying out frequency domain and time domain statistics on the effective voice sections in each class to obtain voice characteristics contained in user voices with different characteristics, and establishing a user characteristic model capable of judging the characteristics of the speaker according to the effective voice sections.

In a third aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of any of the first aspects.

Drawings

Fig. 1 is a flowchart of a voice endpoint detection method based on supervised machine learning according to an embodiment of the present invention;

fig. 2 is a preferred flowchart of a voice endpoint detection method based on supervised machine learning according to an embodiment of the present invention;

fig. 3 is a block diagram illustrating a voice endpoint detection apparatus based on supervised machine learning according to a fourth embodiment of the present invention;

fig. 4 is a block diagram of a preferred structure of a voice endpoint detection apparatus based on supervised machine learning according to a fourth embodiment of the present invention;

FIG. 5 is a diagram of a segment of a speech signal containing two words.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.

It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.

Example one

As shown in fig. 1, the present embodiment provides a voice endpoint detection method based on supervised machine learning, including:

step S1, detecting a mute section, a transition section and an end section from the acquired audio;

and step S2, inputting the mute section and the end section into a pre-constructed background noise model, and identifying the current scene to which the audio belongs.

The scenes can be determined from the usage environment and usage situations of a specific product. For example, for a product positioned as a smart home service, the usage scenarios can be roughly divided into a rest scenario, a reading scenario, a viewing scenario, a game scenario, a party scenario, and the like.

The background noise model is constructed as follows: the audio in each specific scene is analyzed to obtain the characteristics of the background noise in that scene, and the background noise of each specific scene is modeled to obtain the background noise model. Because the amplitude of the audio in the mute segment and the end segment represents the energy contained in the background noise, the characteristics of the background noise in a specific scene can be obtained by analyzing the mute segment and the end segment, and the background noise of that scene can then be modeled. The modeling may specifically use a probability statistics method, a K-means clustering method, or a CNN. When K-means clustering is used, the value of K can be effectively preset by estimating from experience the number of scenes in which speech may occur.

When the background noise model constructed in this way is used for actual voice endpoint detection, the specific scene in which the current speech is located can be obtained by feeding the mute segment and the end segment into the background noise model.
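
As a minimal illustration of the K-means option mentioned above, the sketch below clusters noise segments from different recordings into K scene clusters and then predicts the scene of a new mute or end segment. The choice of frame energy and zero-crossing rate as noise features, and the use of scikit-learn, are illustrative assumptions rather than details fixed by this embodiment.

```python
import numpy as np
from sklearn.cluster import KMeans

def noise_features(segment):
    """Coarse background-noise descriptors of a mute or end segment."""
    segment = segment.astype(np.float64)
    energy = np.mean(segment ** 2)                          # average short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(segment))) > 0)    # zero-crossing rate
    return [energy, zcr]

def build_noise_model(noise_segments, k=5):
    """Cluster mute/end segments collected in the specific scenes into k scene clusters."""
    features = np.array([noise_features(s) for s in noise_segments])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

def identify_scene(noise_model, segment):
    """Step S2: map the mute or end segment of the current audio to a scene cluster."""
    return int(noise_model.predict([noise_features(segment)])[0])
```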

And step S3, representing the speech segment to be recognized by a vector, wherein the speech segment to be recognized is the audio without the mute segment, the transition segment and the end segment.

The speech segment to be recognized is represented as a vector as follows: the speech is sampled in segments, each segment is converted into a vector representation of the same dimensionality, and the segment vectors are then combined into one vector through a linear regression transformation; the combined vector is the vector representation of the speech.
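
A minimal sketch of this vectorization is given below. The segment length, the fixed per-segment dimensionality, and the use of a pre-fitted linear projection matrix as the "linear regression transformation" that merges the segment vectors are assumptions made for illustration; the embodiment does not fix these details.

```python
import numpy as np

def segment_vectors(samples, seg_len=400, dim=32):
    """Segmented sampling: split the speech and map each segment to a fixed-dimension vector."""
    vecs = []
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        seg = samples[start:start + seg_len].astype(np.float64)
        grid = np.linspace(0, seg_len - 1, dim)              # resample to a common dimensionality
        vecs.append(np.interp(grid, np.arange(seg_len), seg))
    return np.stack(vecs)                                    # (num_segments, dim)

def speech_vector(samples, projection):
    """Combine the segment vectors into one vector with a linear transformation.
    `projection` (dim x out_dim) is assumed to have been fitted beforehand by linear regression."""
    return segment_vectors(samples).mean(axis=0) @ projection
```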

Step S4, inputting the recognized current scene and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing a speech endpoint, where the speech endpoint includes a start point of the speech segment and an end point of the speech segment.

The method for constructing the RNN (recurrent neural network) model comprises the following steps:

collecting a large number of speech recordings, marking the mute segment, the transition segment, the starting point, the end point and the end segment of each recording, and marking the scene in which the speech was produced;

performing segmented sampling on the speech, and converting each speech segment obtained by the segmentation into a vector representation of the same dimensionality;

combining all the segment vectors into one vector through a linear regression transformation to obtain the vector representation of the speech;

training the RNN with the vector representation of the speech segment, the starting point and end point of the valid speech segment, and the scene in which the speech was produced as inputs, to obtain an RNN model capable of outputting the speech starting point and end point for input speech (a minimal training sketch follows these steps).
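
The construction just described can be sketched minimally as follows. PyTorch is an assumed framework, the scene is fed in as an embedded id appended to every timestep of the segment-vector sequence, and the start and end points are trained as per-frame targets; these are plausible readings of the steps above, not the prescribed implementation.

```python
import torch
import torch.nn as nn

class EndpointRNN(nn.Module):
    """Assumed architecture: an RNN over segment vectors plus a scene-id embedding."""
    def __init__(self, feat_dim=64, num_scenes=5, hidden=128):
        super().__init__()
        self.scene_emb = nn.Embedding(num_scenes, 8)
        self.rnn = nn.RNN(feat_dim + 8, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # logits for (is_start, is_end) per frame

    def forward(self, seg_vectors, scene_id):
        # seg_vectors: (batch, time, feat_dim); scene_id: (batch,)
        scene = self.scene_emb(scene_id)                                 # (batch, 8)
        scene = scene.unsqueeze(1).expand(-1, seg_vectors.size(1), -1)   # repeat per timestep
        out, _ = self.rnn(torch.cat([seg_vectors, scene], dim=-1))
        return self.head(out)                                            # (batch, time, 2)

model = EndpointRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy stand-ins for the annotated corpus: segment vectors, scene labels, per-frame targets.
x, scene, y = torch.randn(8, 100, 64), torch.randint(0, 5, (8,)), torch.zeros(8, 100, 2)
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(model(x, scene), y)
    loss.backward()
    opt.step()
```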

A preferred implementation of step S1 is: detecting the transition segment, the mute segment and the end segment by a short-time energy and zero-crossing rate method.

The short-time energy and zero crossing rate method comprises the following steps:

step S101, setting two threshold values of energy _ low, energy _ high, zcr _ low and zcr _ high for short-time energy and zero crossing rate respectively; wherein energy _ high > energy _ low, zcr _ high > zcr _ low.

Step S102, computing the short-time energy and the zero-crossing rate zcr of a frame of speech. If energy > energy_low and zcr > zcr_low, the speech is considered to be entering the transition segment; if energy > energy_high and zcr > zcr_high, the start of speech cannot yet be determined, and the short-time energy and zero-crossing rate continue to be computed for several further frames.
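
A minimal per-frame computation of these two features and the dual-threshold comparison is sketched below; the frame length, hop size and threshold values are illustrative assumptions, not values given in the embodiment.

```python
import numpy as np

def frame_energy_zcr(frame):
    """Short-time energy and zero-crossing rate of one frame of samples."""
    frame = frame.astype(np.float64)
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return energy, zcr

def label_frames(signal, frame_len=400, hop=160,
                 energy_low=1e5, energy_high=1e6, zcr_low=0.05, zcr_high=0.15):
    """Tag each frame as 'silence', 'transition' or 'candidate speech'
    using the dual thresholds of steps S101-S102."""
    labels = []
    for start in range(0, len(signal) - frame_len, hop):
        energy, zcr = frame_energy_zcr(signal[start:start + frame_len])
        if energy > energy_high and zcr > zcr_high:
            labels.append("candidate speech")   # still to be confirmed over several frames
        elif energy > energy_low and zcr > zcr_low:
            labels.append("transition")
        else:
            labels.append("silence")
    return labels
```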

In the voice endpoint detection method based on supervised machine learning provided by this embodiment, the current scene is judged and used as one of the input parameters of the RNN model, which improves the accuracy of the RNN model's decisions and thereby the accuracy and efficiency of voice endpoint detection.

In order to improve the detection accuracy of the speech endpoint, the embodiment may further identify the speaker characteristics in the audio, and use the speaker characteristics as one of the input parameters of the RNN model, thereby improving the accuracy of the judgment of the RNN model. The characteristics of the speaker may be the voice characteristics such as the tone and intonation of the speaker, or may be the characteristics such as the gender and age of the speaker further deduced from the voice.

Specifically, the user feature model is established by the following method, including:

collecting a large number of effective speech segments, and labeling the characteristics of the speaker in each effective speech segment, wherein the characteristics comprise: voice characteristics such as tone and intonation, age and sex of the user, etc.;

classifying the effective voice sections according to the marked characteristics;

performing frequency-domain and time-domain statistics on the valid speech segments in each class to obtain the voice characteristics contained in the voices of users with different characteristics, and establishing a user characteristic model that can judge the characteristics of a speaker from a valid speech segment.

When the method is applied, the voice collected naturally is input into the user characteristic model, and then the characteristics of the speaker can be obtained.
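
A minimal sketch of such a user characteristic model is shown below. The specific statistics (energy, zero-crossing rate, spectral centroid) and the nearest-centroid classifier are assumptions chosen for illustration; any statistics and classifier that distinguish the labelled classes could be substituted.

```python
import numpy as np

def segment_features(seg, sr=16000):
    """Simple time-domain and frequency-domain statistics for one valid speech segment."""
    seg = seg.astype(np.float64)
    spectrum = np.abs(np.fft.rfft(seg))
    freqs = np.fft.rfftfreq(len(seg), 1.0 / sr)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    energy = np.mean(seg ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(seg))) > 0)
    return np.array([energy, zcr, centroid])

class UserFeatureModel:
    """Nearest-centroid model over labelled classes such as 'female_adult'."""
    def fit(self, segments, labels):
        feats = np.stack([segment_features(s) for s in segments])
        self.classes = sorted(set(labels))
        self.centroids = np.stack([feats[[l == c for l in labels]].mean(axis=0)
                                   for c in self.classes])
        return self

    def predict(self, seg):
        dists = np.linalg.norm(self.centroids - segment_features(seg), axis=1)
        return self.classes[int(np.argmin(dists))]
```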

When the RNN model is trained, the training speech is processed and fed into the user characteristic model to obtain the characteristics of the speaker; these characteristics are annotated onto the training speech and used as one of the input parameters of the RNN model. The trained RNN model can then output the speech starting point and end point from the vector representation of the input speech segment, the scene in which the speech is located, and the characteristics of the speaker.

As shown in fig. 2, before step S4, the method further includes: and step S5, inputting the speech segment to be recognized into the user characteristic model to obtain the characteristics of the speaker.

Accordingly, step S4 includes: and inputting the recognized current scene, the characteristics of the speaker and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing speech end points, wherein the speech end points comprise a starting point of the speech segment and an end point of the speech segment.

By taking the characteristics of the speaker as one of the input parameters of the RNN model, the external interference can be effectively reduced, the real voice segment of the speaker can be identified, and the accuracy of voice endpoint detection is improved. And tracking and positioning of the speaker can be realized in a multi-person scene according to the sound characteristics of the speaker, so that the voice recognition precision of the multi-person scene is improved.

Example two

In human-computer interaction in complex scenes, a robot may capture the voices of multiple speakers. In order to identify the speaker when several persons are speaking at the same time, the present embodiment provides, on the basis of the first embodiment, another voice endpoint detection method applicable to multi-person scenes, comprising:

in step S10, an instruction word is detected from the acquired audio.

The acquired audio may include sounds of multiple persons.

The instruction words need to be preset for the specific situation. For example, in a smart home scene the instruction words may include: turn on the air conditioner, turn down the volume, turn off the desk lamp, start the show, and the like. In a specific implementation, once an instruction word is recognized from the user's voice, the speaker's responses and feedback are tracked by analyzing the characteristics of that speaker.

And step S20, inputting the audio segment corresponding to the instruction word into a user characteristic model, obtaining the characteristics of the speaker corresponding to the instruction word and storing the characteristics.

For the method of constructing the user characteristic model, refer to the first embodiment.

Step S30, acquiring a real-time image through a camera, recognizing facial features of multiple persons in the real-time image, finding out a current speaker according to the characteristics of the speaker, and recording the facial features of the current speaker.

Here the characteristics of the speaker are the user's age and gender. Facial feature recognition is implemented with general face recognition technology, which can also estimate the user's age and gender.

And step S40, acquiring a new image through the camera, identifying the face most similar to the recorded facial features of the speaker from the image, and positioning the spatial position of the most similar face.

In this way, tracking and localization of the speaker can be achieved based on the recorded facial features of the speaker.

Locating the spatial position from an image is prior art and is not described here again.
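
The sketch below illustrates steps S30-S40 in a minimal form: faces are detected in a new camera frame and the one closest to the recorded speaker is selected. The Haar-cascade detector and the toy histogram "embedding" are stand-ins chosen for illustration; a real system would use a proper face feature extractor.

```python
import cv2
import numpy as np

def face_embedding(face_img):
    """Toy embedding: a normalized grayscale histogram of the face crop (illustrative only)."""
    hist = cv2.calcHist([face_img], [0], None, [64], [0, 256]).flatten()
    return hist / (np.linalg.norm(hist) + 1e-12)

def most_similar_face(gray_frame, stored_embedding):
    """Detect faces in a new camera frame and return the bounding box of the
    face closest to the recorded speaker embedding (step S40)."""
    cascade = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade)
    faces = detector.detectMultiScale(gray_frame, scaleFactor=1.1, minNeighbors=5)
    best, best_sim = None, -1.0
    for (x, y, w, h) in faces:
        emb = face_embedding(gray_frame[y:y + h, x:x + w])
        sim = float(np.dot(emb, stored_embedding))
        if sim > best_sim:
            best, best_sim = (x, y, w, h), sim
    return best
```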

Step S50, acquiring the audio of the speaker according to the spatial position in combination with a sound source localization technique. Voice endpoint recognition can then be performed on the acquired audio.

The sound source localization technique is implemented with existing methods, which fall into single-microphone and multi-microphone approaches. In the multi-microphone approach, several sound pickup devices placed at different positions receive the sound; by comparing the strength and arrival order of the collected signals, the sound source can be localized, that is, the position of the user (speaker) can be obtained, and the sounds produced by users at different positions can be distinguished and collected separately. In the single-microphone approach, a single pickup device captures the multiple sounds in the space, which are then separated by subsequent algorithms; these processing methods are prior art and are not described here again.
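
As a minimal illustration of the multi-microphone idea of comparing the strength and arrival order of the signals, the sketch below estimates the inter-microphone delay of the dominant source by cross-correlation and converts it into a rough direction; the two-microphone geometry and the sample rate are assumptions, not details given in this embodiment.

```python
import numpy as np

def estimate_delay(mic_a, mic_b, sr=16000):
    """Delay (in seconds) of the dominant source between two microphone signals,
    estimated from the peak of their cross-correlation."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    lag = np.argmax(corr) - (len(mic_b) - 1)
    return lag / sr

def direction_from_delay(delay, mic_distance=0.1, speed_of_sound=343.0):
    """Rough azimuth (degrees) of the source for an assumed two-microphone array."""
    value = np.clip(delay * speed_of_sound / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(value))
```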

In step S60, a silence period, a transition period, and an end period are detected from the audio of the speaker.

And step S70, inputting the mute section and the end section into a pre-constructed background noise model, and identifying the current scene to which the audio belongs.

For the method of constructing the background noise model, refer to the first embodiment.

And step S80, representing the speech segment to be recognized by a vector, wherein the speech segment to be recognized is the audio without the mute segment, the transition segment and the end segment.

For the method of representing the speech segment to be recognized as a vector, refer to the first embodiment.

Step S90, inputting the recognized current scene and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing a speech endpoint, where the speech endpoint includes a start point of the speech segment and an end point of the speech segment.

For the method of constructing the recurrent neural network (RNN) model, refer to the first embodiment.

For example, in human-computer interaction, after a user wakes the robot with a wake-up word, the robot can obtain the position of the user (speaker) by sound source localization; but when several people are talking at that position at the same time, the robot cannot always pick up the sound correctly. With the method of this embodiment, the robot obtains the current scene while processing the wake-up word. If the scene is one in which several people are talking at once, the robot inputs the speech segment corresponding to the wake-up word into the model to estimate the speaker's age and gender, so that it can identify the speaker (and exclude non-speakers) in the image acquired by the camera. After this processing the robot correctly recognizes the current scene and the speaker, and after sound pickup the picked-up speech and the current scene are used as inputs to the RNN model to obtain the speech endpoints. This improves the accuracy of speech recognition when several people are speaking, allows the speaker to be tracked and located by face recognition technology, and enables multi-round human-computer interaction in a multi-person environment.

In order to improve the detection accuracy of the speech endpoints, this embodiment may further identify the speaker characteristics in the audio and use them as one of the input parameters of the RNN model, thereby improving the accuracy of the RNN model's decisions. For details, refer to the relevant part of the first embodiment.

EXAMPLE III

In human-computer interaction in complex scenes, a robot may capture the voices of multiple speakers. In order to identify the speaker when several persons are speaking at the same time, the present embodiment provides, on the basis of the first embodiment, another voice endpoint detection method applicable to multi-person scenes, comprising:

step S100, detecting an instruction word from the acquired audio.

The acquired audio may include sounds of multiple persons.

The instruction words need to be preset for the specific situation. For example, in a smart home scene the instruction words may include: turn on the air conditioner, turn down the volume, turn off the desk lamp, start the show, and the like. In a specific implementation, once an instruction word is recognized from the user's voice, the speaker's responses and feedback are tracked by analyzing the characteristics of that speaker.

And step S200, inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word.

Step S300, inputting newly acquired audio into the user characteristic model to obtain the characteristics of multiple persons, identifying the audio most similar to the stored characteristics of the speaker, and acquiring the subsequent audio of that speaker by a sound source localization technique.

For the method of constructing the user characteristic model, refer to the first embodiment.

The sound source localization technique is implemented with existing methods, which fall into single-microphone and multi-microphone approaches. In the multi-microphone approach, several sound pickup devices placed at different positions receive the sound; by comparing the strength and arrival order of the collected signals, the sound source can be localized, that is, the position of the user (speaker) can be obtained, and the sounds produced by users at different positions can be distinguished and collected separately. In the single-microphone approach, a single pickup device captures the multiple sounds in the space, which are then separated by subsequent algorithms; these processing methods are prior art and are not described here again.

In step S400, a silence segment, a transition segment, and an end segment are detected from the audio of the speaker.

Step S500, inputting the mute section and the end section into a pre-constructed background noise model, and identifying the current scene to which the audio belongs.

For the method of constructing the background noise model, refer to the first embodiment.

And step S600, representing the voice section to be recognized by using a vector, wherein the voice section to be recognized is the audio without the mute section, the transition section and the ending section.

For the method of representing the speech segment to be recognized as a vector, refer to the first embodiment.

And S700, inputting the recognized current scene and the vectorized voice section to be recognized into a pre-constructed RNN model, and recognizing voice end points, wherein the voice end points comprise a start point of the voice section and an end point of the voice section.

For the method of constructing the recurrent neural network (RNN) model, refer to the first embodiment.

For example, in human-computer interaction, after a user wakes the robot with a wake-up word, the robot can obtain the position of the user (speaker) by sound source localization; but when several people are talking at that position at the same time, the robot cannot always pick up the sound correctly. With the method of this embodiment, the robot obtains the current scene while processing the wake-up word. If the scene is one in which several people are talking at once, the robot inputs the speech segment corresponding to the wake-up word into the model to estimate the speaker's age and gender, so that it can identify the speaker (and exclude non-speakers) in the image acquired by the camera. After this processing the robot correctly recognizes the current scene and the speaker, and after sound pickup the picked-up speech and the current scene are used as inputs to the RNN model to obtain the speech endpoints. This improves the accuracy of speech recognition when several people are speaking, allows the speaker to be tracked and located by the characteristics of the speaker's voice, and enables multi-round human-computer interaction in a multi-person environment.

In order to improve the detection accuracy of the speech endpoints, this embodiment may further identify the speaker characteristics in the audio and use them as one of the input parameters of the RNN model, thereby improving the accuracy of the RNN model's decisions. For details, refer to the relevant part of the first embodiment.

Example four

Based on the same inventive concept as the above embodiment, the present embodiment provides a voice endpoint detection apparatus based on supervised machine learning, as shown in fig. 3, including:

the segmentation detection module is used for detecting a mute segment, a transition segment and an end segment from the acquired audio;

the scene identification module is used for inputting the mute section and the end section into a pre-constructed background noise model and identifying the current scene to which the audio belongs;

the vectorization module is used for representing the voice section to be recognized by a vector, wherein the voice section to be recognized is the audio without the mute section, the transition section and the ending section;

and the voice endpoint identification module is used for inputting the identified current scene and the vectorized segment to be identified into a pre-constructed RNN model and identifying a voice endpoint, wherein the voice endpoint comprises a starting point of the voice segment and an end point of the voice segment.

Wherein the segment detection module is specifically configured to: detect the transition segment, the mute segment and the end segment by a short-time energy and zero-crossing rate method.

The construction method of the background noise model comprises: analyzing the audio in each specific scene to obtain the characteristics of the background noise in that scene, and modeling the background noise of each specific scene to obtain the background noise model.

The construction method of the RNN model comprises the following steps:

collecting a large number of speech recordings, marking the mute segment, the transition segment, the starting point, the end point and the end segment of each recording, and marking the scene in which the speech was produced;

performing segmented sampling on the speech, and converting each speech segment obtained by the segmentation into a vector representation of the same dimensionality;

combining all the segment vectors into one vector through a linear regression transformation to obtain the vector representation of the speech;

training the RNN with the vector representation of the speech segment, the starting point and end point of the valid speech segment, and the scene in which the speech was produced as inputs, to obtain an RNN model capable of outputting the speech starting point and end point for input speech.

Preferably, as shown in fig. 4, the system further includes a user feature identification module, configured to: and inputting the voice section to be recognized into the user characteristic model to obtain the characteristics of the speaker. The output end of the user characteristic identification module is connected with the input end of the voice endpoint identification module.

Correspondingly, the voice endpoint recognition module is configured to: and inputting the recognized current scene, the characteristics of the speaker and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing speech end points, wherein the speech end points comprise a starting point of the speech segment and an end point of the speech segment.

Preferably, the system further comprises a first multi-person identification module for:

detecting instruction words from the acquired audio;

inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;

inputting newly acquired audio into the user characteristic model to obtain the characteristics of multiple persons, identifying the audio most similar to the stored characteristics of the speaker, and acquiring the subsequent audio of that speaker by a sound source localization technique.

The output end of the first multi-person identification module is connected with the input end of the segmentation detection module.

Preferably, the system further comprises a second multi-person identification module for:

detecting instruction words from the acquired audio;

inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;

acquiring a real-time image through a camera, identifying facial features of multiple persons in the real-time image, finding out a current speaker according to the characteristics of the speaker, and recording the facial features of the current speaker;

acquiring a new image through the camera, identifying a face most similar to the recorded facial features of the speaker from the image, and positioning the spatial position of the most similar face;

and acquiring the audio frequency of the speaker by combining a sound source positioning technology according to the spatial position.

And the output end of the second multi-person identification module is connected with the input end of the segmentation detection module.

The construction method of the user feature model comprises the following steps:

collecting a large number of effective voice sections, and marking the characteristics of a speaker in each effective voice section; wherein the features include: the age and gender of the user;

classifying the effective voice sections according to the marked characteristics;

and carrying out frequency domain and time domain statistics on the effective voice sections in each class to obtain voice characteristics contained in user voices with different characteristics, and establishing a user characteristic model capable of judging the characteristics of the speaker according to the effective voice sections.

The voice endpoint detection apparatus based on supervised machine learning provided by this embodiment and the voice endpoint detection method based on supervised machine learning are based on the same inventive concept, and have the same beneficial effects, and are not described herein again.

EXAMPLE five

Based on the same inventive concept as the first and second embodiments, the present embodiment provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method of any of the first and second embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (9)

1. A voice endpoint detection method based on supervised machine learning is characterized by comprising the following steps:
step S1, detecting a mute section, a transition section and an end section from the obtained audio by adopting a short-time energy and zero crossing rate method;
step S2, inputting the mute section and the end section into a pre-constructed background noise model, and identifying the current scene to which the audio belongs;
step S3, representing the speech segment to be recognized by a vector, wherein the speech segment to be recognized is the audio without the mute segment, the transition segment and the end segment;
step S4, inputting the recognized current scene and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing speech end points, wherein the speech end points comprise a starting point of the speech segment and an end point of the speech segment; wherein
The construction method of the RNN model comprises the following steps:
collecting a large amount of voices, marking a mute section, a transition section, a starting point, an end point and an ending section of each voice, and marking a scene where the voices are located;
carrying out segmentation sampling on the voice segments, and converting each voice segment obtained by segmentation into vector representation with the same dimensionality;
synthesizing all the voice segments into a vector through linear regression transformation to obtain vector representation of the voice segments;
and training by taking the vector representation of the voice segment, the starting point and the end point of the effective voice segment and the scene where the voice is positioned as input of the RNN to obtain an RNN model capable of outputting the voice starting point and the voice end point according to the input voice.
2. The method of claim 1, wherein the background noise model is constructed by:
and analyzing the audio frequency in each specific scene to obtain the characteristics of the background noise in each specific scene, and modeling the background noise in the specific scene to obtain a background noise model.
3. The method according to claim 1, further comprising, before the step S4: inputting the voice segment to be recognized into a user characteristic model to obtain the characteristics of a speaker;
the step S4 includes: and inputting the recognized current scene, the characteristics of the speaker and the vectorized speech segment to be recognized into a pre-constructed RNN model, and recognizing speech end points, wherein the speech end points comprise a starting point of the speech segment and an end point of the speech segment.
4. The method according to claim 1, further comprising, before the step S1:
detecting instruction words from the acquired audio;
inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;
and acquiring a new audio input user characteristic model to obtain the characteristics of a plurality of persons, identifying the audio which is most similar to the stored characteristics of the speaker, and acquiring the subsequent audio of the speaker by a sound source positioning technology.
5. The method according to claim 1, further comprising, before the step S1:
detecting instruction words from the acquired audio;
inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain and store the characteristics of the speaker corresponding to the instruction word;
acquiring a real-time image through a camera, identifying facial features of multiple persons in the real-time image, finding out a current speaker according to the characteristics of the speaker, and recording the facial features of the current speaker;
acquiring a new image through the camera, identifying a face most similar to the recorded facial features of the speaker from the image, and positioning the spatial position of the most similar face;
and acquiring the audio frequency of the speaker by combining a sound source positioning technology according to the spatial position.
6. The method according to any one of claims 3-5, wherein the method for constructing the user feature model comprises:
collecting a large number of effective voice sections, and marking the characteristics of a speaker in each effective voice section; wherein the features include: the age and gender of the user;
classifying the effective voice sections according to the marked characteristics;
and carrying out frequency domain and time domain statistics on the effective voice sections in each class to obtain voice characteristics contained in user voices with different characteristics, and establishing a user characteristic model capable of judging the characteristics of the speaker according to the effective voice sections.
7. A voice endpoint detection apparatus based on supervised machine learning, comprising:
the segmentation detection module is used for detecting a mute section, a transition section and an end section from the acquired audio frequency by adopting a short-time energy and zero crossing rate method;
the scene identification module is used for inputting the mute section and the end section into a pre-constructed background noise model and identifying the current scene to which the audio belongs;
the vectorization module is used for representing the voice section to be recognized by a vector, wherein the voice section to be recognized is the audio without the mute section, the transition section and the ending section;
the speech endpoint recognition module is used for inputting the recognized current scene and the vectorized segment to be recognized into a pre-constructed RNN model and recognizing a speech endpoint, wherein the speech endpoint comprises a starting point of the speech segment and an end point of the speech segment; wherein
The construction method of the RNN model comprises the following steps:
collecting a large amount of voices, marking a mute section, a transition section, a starting point, an end point and an ending section of each voice, and marking a scene where the voices are located;
carrying out segmentation sampling on the voice segments, and converting each voice segment obtained by segmentation into vector representation with the same dimensionality;
synthesizing all the voice segments into a vector through linear regression transformation to obtain vector representation of the voice segments;
and training by taking the vector representation of the voice segment, the starting point and the end point of the effective voice segment and the scene where the voice is positioned as input of the RNN to obtain an RNN model capable of outputting the voice starting point and the voice end point according to the input voice.
8. The apparatus of claim 7, further comprising a multi-person scene recognition module configured to:
detecting instruction words from the acquired multi-person audio;
inputting the audio segment corresponding to the instruction word into a user characteristic model to obtain the characteristic of the speaker corresponding to the instruction word;
and acquiring a real-time image through a camera, judging the current speaker by combining the characteristics of the speaker, and acquiring the audio corresponding to the current speaker from the multi-person audio.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of one of claims 1 to 6.
CN201710957669.3A 2017-10-16 2017-10-16 Voice endpoint detection method and device based on supervised machine learning CN107799126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710957669.3A CN107799126B (en) 2017-10-16 2017-10-16 Voice endpoint detection method and device based on supervised machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710957669.3A CN107799126B (en) 2017-10-16 2017-10-16 Voice endpoint detection method and device based on supervised machine learning

Publications (2)

Publication Number Publication Date
CN107799126A CN107799126A (en) 2018-03-13
CN107799126B (en) 2020-10-16

Family

ID=61533110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710957669.3A CN107799126B (en) 2017-10-16 2017-10-16 Voice endpoint detection method and device based on supervised machine learning

Country Status (1)

Country Link
CN (1) CN107799126B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764304B (en) * 2018-05-11 2020-03-06 Oppo广东移动通信有限公司 Scene recognition method and device, storage medium and electronic equipment
CN108922513A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109065027A (en) * 2018-06-04 2018-12-21 平安科技(深圳)有限公司 Speech differentiation model training method, device, computer equipment and storage medium
CN108920639A (en) * 2018-07-02 2018-11-30 北京百度网讯科技有限公司 Context acquisition methods and equipment based on interactive voice
CN108986825A (en) * 2018-07-02 2018-12-11 北京百度网讯科技有限公司 Context acquisition methods and equipment based on interactive voice
CN108920640B (en) * 2018-07-02 2020-12-22 北京百度网讯科技有限公司 Context obtaining method and device based on voice interaction
CN108962226B (en) * 2018-07-18 2019-12-20 百度在线网络技术(北京)有限公司 Method and apparatus for detecting end point of voice
CN109036371B (en) * 2018-07-19 2020-12-18 北京光年无限科技有限公司 Audio data generation method and system for speech synthesis
CN108986844B (en) * 2018-08-06 2020-08-28 东北大学 Speech endpoint detection method based on speaker speech characteristics
CN109448705B (en) * 2018-10-17 2021-01-29 珠海格力电器股份有限公司 Voice segmentation method and device, computer device and readable storage medium
CN109658920B (en) * 2018-12-18 2020-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating a model
CN110289016A (en) * 2019-06-20 2019-09-27 深圳追一科技有限公司 A kind of voice quality detecting method, device and electronic equipment based on actual conversation
CN110660385A (en) * 2019-09-30 2020-01-07 出门问问信息科技有限公司 Command word detection method and electronic equipment
CN112101046A (en) * 2020-11-02 2020-12-18 北京淇瑀信息科技有限公司 Conversation analysis method, device and system based on conversation behavior

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5794195A (en) * 1994-06-28 1998-08-11 Alcatel N.V. Start/end point detection for word recognition
US20020198704A1 (en) * 2001-06-07 2002-12-26 Canon Kabushiki Kaisha Speech processing system
US7610199B2 (en) * 2004-09-01 2009-10-27 Sri International Method and apparatus for obtaining complete speech signals for speech recognition applications
CN1588535A (en) * 2004-09-29 2005-03-02 上海交通大学 Automatic sound identifying treating method for embedded sound identifying system
CN1912993A (en) * 2005-08-08 2007-02-14 中国科学院声学研究所 Voice end detection method based on energy and harmonic
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
US20120089393A1 (en) * 2009-06-04 2012-04-12 Naoya Tanaka Acoustic signal processing device and method
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103077728A (en) * 2012-12-31 2013-05-01 上海师范大学 Patient weak voice endpoint detection method
CN103489454A (en) * 2013-09-22 2014-01-01 浙江大学 Voice endpoint detection method based on waveform morphological characteristic clustering
CN103854662A (en) * 2014-03-04 2014-06-11 中国人民解放军总参谋部第六十三研究所 Self-adaptation voice detection method based on multi-domain joint estimation
US20160180510A1 (en) * 2014-12-23 2016-06-23 Oliver Grau Method and system of geometric camera self-calibration quality assessment
CN105869658A (en) * 2016-04-01 2016-08-17 金陵科技学院 Voice endpoint detection method employing nonlinear feature
CN106462804A (en) * 2016-06-29 2017-02-22 深圳狗尾草智能科技有限公司 Method and system for generating robot interaction content, and robot
CN107039035A (en) * 2017-01-10 2017-08-11 上海优同科技有限公司 A kind of detection method of voice starting point and ending point
CN107240398A (en) * 2017-07-04 2017-10-10 科大讯飞股份有限公司 Intelligent sound exchange method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kyoung-Ho Woo et al., "Robust voice activity detection algorithm for estimating noise spectrum", Electronics Letters, vol. 36, no. 2, pp. 180-181, 20 January 2000 *
Wu Bian et al., "Research on algorithms for speech endpoint detection under strong background noise" (强背景噪声下语音端点检测的算法研究), Computer Engineering and Applications (计算机工程与应用), 21 November 2011, pp. 137-139 *

Also Published As

Publication number Publication date
CN107799126A (en) 2018-03-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 301, Building 39, 239 Renmin Road, Gusu District, Suzhou City, Jiangsu Province, 215000

Applicant after: Suzhou Dogweed Intelligent Technology Co., Ltd.

Applicant after: Shanghai Wage Intelligent Technology Co Ltd

Address before: 518000 Dongfang Science and Technology Building 1307-09, 16 Keyuan Road, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen green bristlegrass intelligence Science and Technology Ltd.

Applicant before: Shanghai Wage Intelligent Technology Co Ltd


GR01 Patent grant