CN112420079B - Voice endpoint detection method and device, storage medium and electronic equipment - Google Patents

Voice endpoint detection method and device, storage medium and electronic equipment

Info

Publication number
CN112420079B
CN112420079B CN202011296495.9A
Authority
CN
China
Prior art keywords
voice
frame
energy value
speech
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011296495.9A
Other languages
Chinese (zh)
Other versions
CN112420079A (en)
Inventor
张晓萌
马路
赵培
苏腾荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Original Assignee
Qingdao Haier Technology Co Ltd
Haier Smart Home Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co Ltd, Haier Smart Home Co Ltd filed Critical Qingdao Haier Technology Co Ltd
Priority to CN202011296495.9A priority Critical patent/CN112420079B/en
Publication of CN112420079A publication Critical patent/CN112420079A/en
Application granted granted Critical
Publication of CN112420079B publication Critical patent/CN112420079B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a voice endpoint detection method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring target audio data to be identified; inputting the target audio data into a first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing the voice frames contained in the target audio data; under the condition that a plurality of candidate voice frames are identified from the target audio data, acquiring energy information corresponding to each of the candidate voice frames; and determining a voice endpoint containing a valid voice frame in the candidate voice frames under the condition that the energy information meets a judgment condition. The invention solves the technical problem of low accuracy of voice endpoint detection results.

Description

Voice endpoint detection method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and an apparatus for detecting a speech endpoint, a storage medium, and an electronic device.
Background
Voice endpoint detection is the detection of effective voice segments in a continuous voice stream, where each effective voice segment has a front endpoint, the starting point of the effective voice, and a rear endpoint, its end point. In speech recognition and signal processing, effective speech is identified through voice endpoint detection so that it can be separated from the continuous voice stream; in voice storage or transmission scenarios this reduces the amount of data to be stored or transmitted and simplifies the workload and complexity of man-machine interaction. Voice endpoint detection is therefore a necessary front-end processing link in voice communication, speech recognition and speech coding technologies, and plays an important role in the performance of subsequent speech processing.
In the related art, voice endpoint detection generally adopts a method based on a Gaussian mixture model: for each input audio frame, the probability of speech and the probability of noise are calculated separately. However, the modeling capability of the Gaussian mixture model is limited and it cannot model speech accurately; in particular, in complex speech environments the endpoint detection performance based on the Gaussian mixture model degrades severely, so the accuracy of voice endpoint detection is low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a voice endpoint detection method and device, a storage medium and electronic equipment, and aims to at least solve the technical problem of low accuracy of a voice endpoint detection result.
According to an aspect of the embodiments of the present invention, there is provided a method for detecting a voice endpoint, including: acquiring target audio data to be identified; inputting the target audio data into a first speech recognition model to obtain a recognition result, wherein the first speech recognition model is a deep neural network model for recognizing a speech frame contained in the target audio data; under the condition that a plurality of candidate voice frames are identified from the target audio data, acquiring energy information corresponding to each candidate voice frame in the plurality of candidate voice frames; and determining a voice endpoint including an effective voice frame in the candidate voice frames when the energy information meets a judgment condition.
According to another aspect of the embodiments of the present invention, there is also provided a voice endpoint detection apparatus, including: the first acquisition module is used for acquiring target audio data to be identified; the recognition module is used for inputting the target audio data into a first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing a voice frame contained in the target audio data; a second obtaining module, configured to obtain energy information corresponding to each candidate speech frame in the multiple candidate speech frames when multiple candidate speech frames are identified from the target audio data; and a determining module, configured to determine a speech endpoint including an effective speech frame in the candidate speech frames when the energy information satisfies a determination condition.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the above-mentioned voice endpoint detection method when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the voice endpoint detection method through the computer program.
In the embodiments of the invention, the voice data is input into the first voice recognition model to obtain the recognition result for the voice frames contained in the voice data; when a voice frame is recognized, it is judged again through its energy information, and only when the judgment condition is met is the voice frame determined to be a valid voice frame endpoint, which improves the accuracy of the voice endpoint detection result.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the flow of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a flow of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a flow of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a flow of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a flow of an alternative voice endpoint detection method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a flow of yet another alternative voice endpoint detection method according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an alternative voice endpoint detection apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of another alternative voice endpoint detection apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a voice endpoint detection method is provided, and optionally, as an optional implementation manner, the voice endpoint detection method may be applied, but not limited, to the environment as shown in fig. 1. Terminal device 102 interacts with server 112 through network 110.
Alternatively, the terminal apparatus 102 acquires the audio data and sends the audio data to the server 112 through the network 110, and the server 112 receives the audio data through the network 110 and inputs the audio data into the first speech recognition model for recognizing a speech frame included in the target audio data to obtain a recognition result. And under the condition that the voice frame is identified in the audio data, acquiring energy information corresponding to the voice frame. And judging whether the energy information meets the judgment condition or not, and determining a voice endpoint containing a valid voice frame in the corresponding voice frame under the condition that the energy information meets the judgment condition. The server 112 transmits the final recognition result to the terminal apparatus 102 through the network 110 so that the terminal apparatus 102 receives the recognition result of the audio data.
Optionally, in this embodiment, the terminal device 102 may be a device configured to collect and store audio data, and may include, but is not limited to, at least one of the following: mobile phones (such as Android Mobile phones, iOS Mobile phones, etc.), notebook computers, tablet computers, palm computers, MID (Mobile Internet Devices), PAD, desktop computers, smart televisions, etc. The network 110 may include, but is not limited to: a wired network, a wireless network, wherein the wired network comprises: a local area network, a metropolitan area network, and a wide area network, the wireless network comprising: bluetooth, WIFI, and other networks that enable wireless communication. The server 112 may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and this is not limited in this embodiment.
As an alternative implementation, as shown in fig. 2, the voice endpoint detection method includes:
s202, target audio data to be identified are obtained;
s204, inputting the target audio data into a first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing a voice frame contained in the target audio data;
s206, under the condition that a plurality of candidate voice frames are identified from the target audio data, acquiring energy information corresponding to each candidate voice frame in the plurality of candidate voice frames;
and S208, under the condition that the energy information meets the judgment condition, determining a voice endpoint containing a valid voice frame in the candidate voice frames.
Alternatively, the target audio data may be, but is not limited to, the raw audio data of the speech endpoint to be recognized. The target audio data may be an audio clip obtained by the terminal device using a client with an audio acquisition function or a voice acquisition function of the terminal device. The format and the data amount of the target audio data are not limited at all.
Alternatively, the first speech recognition model may be, but is not limited to being, used for recognizing the valid speech frames included in an input audio segment. Valid speech frames may be, but are not limited to, audio frames containing valid content rather than noise or ambient sound. Optionally, the first speech recognition model performs a speech-frame determination for each frame of the input audio, and the determination result is either a speech frame or a non-speech frame. Optionally, the first speech recognition model sets the recognition result of an audio frame determined to be a speech frame to one label and the recognition result of an audio frame determined to be a non-speech frame to another label (for example, the "1" and "0" labels described below).
Alternatively, the first speech recognition model may be, but is not limited to, a feedforward Deep Neural Network (DNN) model.
Alternatively, the candidate speech frame may be, but is not limited to, an audio frame in which the output recognition result of the first speech recognition model in the target audio data is a speech frame.
Alternatively, the energy information corresponding to the speech frame may be, but is not limited to, an amplitude value of the speech frame.
In the embodiment of the application, audio data is input into the first voice recognition model to obtain a recognition result, and under the condition that a voice frame is recognized, the voice frame is judged again through energy information, and only when the judgment condition is met, the voice frame is determined to be an effective voice frame end point.
As an optional implementation manner, before the obtaining of the target audio data to be identified, the method further includes:
the method comprises the steps of obtaining a plurality of sample audio data to train so as to obtain a first voice recognition model, wherein the first voice recognition model is a deep neural network model constructed based on a Keras framework.
Alternatively, the sample audio data may include, but is not limited to: a noise set, voice sets of different speakers, voice sets of different content, and voice sets of different energy. The noise set may include, but is not limited to: outdoor environmental noise and household environmental noise. The voice sets of different speakers may include, but are not limited to: male voice, female voice, child voice, and elderly voice. The voice sets of different content may include, but are not limited to: voice instructions, voice wakeup, and voice interaction.
As an alternative implementation, as shown in fig. 3, the inputting the target audio data into the first speech recognition model to obtain the recognition result includes:
s302, performing framing processing on target audio data in a first voice recognition model to obtain a plurality of audio frames;
s304, respectively preprocessing the plurality of audio frames to obtain respective corresponding audio features;
s306, calculating to obtain the voice recognition probability corresponding to each of the plurality of audio frames based on the audio features, wherein the audio frames are determined to be voice frames under the condition that the voice recognition probability is greater than a first threshold value; and determining the audio frame as a non-speech frame under the condition that the speech recognition probability is less than a first threshold value.
Optionally, the first speech recognition model includes a speech tagging tool for tagging the audio frame.
Label annotation during training of the first speech recognition model is explained by taking as an example input audio data consisting of smart-home voice instructions recorded by both male and female speakers. The total duration of the audio data is not limited, and the data contain various kinds of indoor environmental noise. The first speech recognition model first performs framing processing on the audio data, dividing it frame by frame into a plurality of audio frame segments, where the frame length is 10 milliseconds. The voice labeling tool generates a "1" label and annotates the current frame with it when the first speech recognition model judges the current audio frame to be a speech frame, and generates a "0" label and annotates the current frame with it when the first speech recognition model judges the current frame to be a non-speech frame. In this way, the voice labeling tool in the first speech recognition model labels all audio frames in the sample audio data.
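By way of a non-authoritative sketch only, the 10-millisecond framing and the "1"/"0" label annotation described above could look as follows; the 16 kHz sampling rate and the is_speech_frame judgment function are assumptions introduced here for illustration and are not specified by the disclosure.

```python
import numpy as np

def frame_audio(samples, sample_rate=16000, frame_ms=10):
    """Split a mono waveform into consecutive frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

def label_frames(frames, is_speech_frame):
    """Annotate each frame with a '1' (speech frame) or '0' (non-speech frame) label.
    is_speech_frame stands in for the model's per-frame speech/non-speech judgment."""
    return ["1" if is_speech_frame(f) else "0" for f in frames]
```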
Alternatively, the first speech recognition model may be, but is not limited to being, set up with two layers, both of which are fully connected layers and are referred to here as the first fully-connected layer and the second fully-connected layer. The first fully-connected layer receives the signal features of the input audio data; the second fully-connected layer completes the calculation of the probabilities that an audio frame is a speech frame or a non-speech frame through a sigmoid activation function, models the probability distribution over speech frames and non-speech frames through a Softmax function, and outputs the probability result. The output consists of two probability values, each between 0 and 1 and summing to 1, which represent respectively the probability that the audio frame is a speech frame and the probability that it is a non-speech frame.
Optionally, when the probability that the audio frame is a speech frame is greater than the probability that the audio frame is a non-speech frame, the audio frame is determined to be a speech frame, and the tag of the audio frame is set to "1", otherwise, the audio frame is determined to be a non-speech frame, and the tag of the audio frame is set to "0".
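For illustration only, a minimal Keras sketch of such a two-layer model is given below. The disclosure only specifies two fully-connected layers, a sigmoid activation and a Softmax output over the speech/non-speech probabilities; the input feature dimension, hidden width, optimizer and loss are assumptions made to keep the sketch runnable.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_first_speech_recognition_model(feature_dim=39, hidden_units=128):
    """Two fully-connected layers: the first receives the per-frame audio features,
    the second outputs [P(speech frame), P(non-speech frame)] via Softmax."""
    model = keras.Sequential([
        keras.Input(shape=(feature_dim,)),
        layers.Dense(hidden_units, activation="sigmoid"),  # first fully-connected layer
        layers.Dense(2, activation="softmax"),             # second fully-connected layer
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```

A frame would then be tagged "1" when the first output probability exceeds the second, and "0" otherwise.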
As an optional implementation manner, the preprocessing the multiple audio frames respectively to obtain the audio features corresponding to the multiple audio frames respectively includes:
carrying out high-frequency signal enhancement processing on the audio frame to obtain an audio frame after signal enhancement;
performing time-frequency conversion processing on the audio frame subjected to signal enhancement to obtain an audio frame converted into a frequency domain;
filtering the audio frame converted into the frequency domain to obtain a filtered audio frame;
carrying out zooming processing on the filtered audio frame to obtain a zoomed audio frame;
and carrying out dynamic differential processing on the zoomed audio frame to obtain audio characteristics.
Alternatively, the high frequency signal enhancement processing may be accomplished, but is not limited to, using a filter.
The high-frequency signal enhancement processing inputs the audio frame obtained by framing into a high-pass filter, and enhances the high-frequency part in the audio frame so as to simplify the process of solving the frequency spectrum according to the signal-to-noise ratio, so that the frequency spectrum can be obtained by using the same signal-to-noise ratio in the whole frequency band from low frequency to high frequency. In particular, the signal enhancement processing may, but is not limited to, use the following formula:
s(n)=s(n)-k×s(n-1) (1)
where k is the enhancement coefficient and n indexes the sample points within each audio frame.
Alternatively, the time-frequency conversion process may be, but is not limited to, converting a time-domain signal into a frequency-domain signal, and in particular, is not limited to converting a time-domain signal of an audio frame into a frequency-domain signal using a fourier transform.
Alternatively, the filtering process may be performed by, but not limited to, a triangular filter bank.
The audio frame after the time-frequency conversion processing is passed through a set of Mel-scale triangular filter banks: a filter bank of M filters is defined, where the number of filters is close to the number of critical bands, and the filters used are triangular filters with center frequencies f(m), m = 1, 2, ..., M. The spacing between adjacent center frequencies f(m) decreases as m decreases and increases as m increases.
Alternatively, the scaling process may be, but is not limited to, performing a logarithmic function calculation on the operation result obtained by the filtering process, and scaling the vertical axis by the calculation of the logarithmic function.
Optionally, the dynamic difference processing may be, but not limited to, extracting dynamic difference parameters of the audio frame after the scaling processing, and taking the dynamic difference parameters as the audio features.
Alternatively, the processing flow of the first speech recognition model for the input target audio data may be, but is not limited to, as shown in fig. 4. The first speech recognition model first performs step S402 on the input audio data, framing, which divides the target audio data into audio frames. After the audio frames are obtained, step S404 is executed to enhance the high-frequency band of the signal in each audio frame. After the enhancement is completed, step S406, time-frequency conversion, converts the audio frame from a time-domain signal to a frequency-domain signal. Step S408 is then performed on the frequency-domain signal of the converted audio frame: filtering, which eliminates the influence of harmonics and reduces the amount of computation. After the filtering is completed, step S410, scaling, amplifies the signal energy differences of the audio frame. After the scaling is completed, step S412, dynamic difference, extracts the dynamic differential parameters from the audio frame signal.
In the embodiments of the application, the signal enhancement processing removes the effect of the vocal cords and lips during sound production, compensates the high-frequency part of the audio frame that is suppressed by the vocal system, and highlights the high-frequency formants; the time-frequency conversion processing converts the time-domain signal of the audio frame into a frequency-domain signal to facilitate subsequent feature extraction; the filtering processing smooths the spectrum of the frequency-domain signal, eliminates the effect of harmonics, highlights the formants of the audio data and reduces the amount of computation; and the logarithmic operation amplifies the energy differences so that the dynamic differential parameters can be extracted.
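The preprocessing chain of fig. 4 can be sketched roughly as follows. This is an illustrative approximation rather than the patented implementation: the enhancement coefficient, FFT size, number of Mel filters and the first-order difference used for the dynamic differential parameters are all assumptions, and librosa is used only as a convenient stand-in for building the triangular Mel filter bank.

```python
import numpy as np
import librosa

def extract_frame_features(frame, sample_rate=16000, k=0.97, n_fft=256, n_mels=26):
    """Pre-emphasis -> FFT -> Mel triangular filters -> log scaling -> dynamic difference."""
    # S404: high-frequency signal enhancement, s(n) = s(n) - k * s(n-1)
    emphasized = np.append(frame[0], frame[1:] - k * frame[:-1])
    # S406: time-frequency conversion (power spectrum of the frame)
    spectrum = np.abs(np.fft.rfft(emphasized, n=n_fft)) ** 2
    # S408: Mel-scale triangular filter bank
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ spectrum
    # S410: scaling with a logarithmic function to amplify energy differences
    log_mel = np.log(mel_energies + 1e-10)
    # S412: dynamic differential parameters (a simple first-order difference here)
    delta = np.diff(log_mel, prepend=log_mel[0])
    return np.concatenate([log_mel, delta])
```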
As an optional implementation manner, as shown in fig. 5, the obtaining energy information corresponding to each candidate speech frame in a plurality of candidate speech frames includes:
s502, acquiring a first energy value corresponding to each continuous voice segment in a plurality of candidate voice frames, wherein the first energy value is used for representing the average energy value of the voice signals of the voice segments;
s504, determining a second energy value according to the first energy value, wherein the second energy value is used for representing the maximum energy value in the plurality of first energy values;
s506, the first energy value and the second energy value are used as energy information of the candidate voice frame.
It should be noted that, each continuous speech segment in the candidate speech frame refers to a speech segment for energy calculation obtained by subdividing the valid speech frame identified by the first speech recognition model, and the calculation of the first energy value and the second energy value is performed in units of speech segments. The numbering of speech segments in a candidate speech frame is naturally formed in chronological order.
Alternatively, the first energy value of the nth speech segment may be calculated, but is not limited to, using the following formula:
E1(n) = (1/M) × (|x(1)| + |x(2)| + ... + |x(M)|) (2)
wherein n represents the sequential number of the current speech segment in the candidate speech frame, M represents the number of energy points contained in the first n speech segments, and |x(m)| represents the absolute value of the energy value of the mth energy point.
The first energy value of a speech segment represents the average energy value of the speech segment cut off to the current, i.e. nth, speech segment in the candidate speech frame.
Alternatively, the second energy value of the nth speech segment may be calculated using, but is not limited to, the following formula:
E2(n) = MAX[E1(1), ..., E1(n)] (3)
wherein n is a positive integer greater than or equal to 1.
The second energy value of the speech segment is the maximum value of the first energy values up to the current speech segment, i.e., the 1 st to nth speech segments.
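A minimal sketch of formulas (2) and (3) is given below, representing each speech segment as an array of energy-point values (this representation is an assumption made for illustration); E2 is maintained as the running maximum of E1, which also corresponds to the iterative update described next.

```python
import numpy as np

def segment_energies(segments):
    """For speech segments 1..N of a candidate speech frame, return two lists:
    E1[n] = average of |x(m)| over all energy points of segments 1..n  (formula (2)),
    E2[n] = maximum of E1[1..n]                                        (formula (3))."""
    e1, e2 = [], []
    total, count = 0.0, 0
    for seg in segments:
        total += float(np.sum(np.abs(seg)))
        count += len(seg)
        e1_n = total / count
        e1.append(e1_n)
        e2.append(max(e2[-1], e1_n) if e2 else e1_n)
    return e1, e2
```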
As an optional implementation manner, the obtaining energy information corresponding to each candidate speech frame in the multiple candidate speech frames includes:
repeatedly executing the following steps until all voice segments of the candidate voice frame are traversed;
acquiring a first energy value corresponding to an ith voice fragment, wherein i is a positive integer;
comparing a first energy value corresponding to the ith voice fragment with a historical maximum energy value;
when the first energy value corresponding to the ith voice segment is larger than the historical maximum energy value, taking the first energy value corresponding to the ith voice segment as the updated maximum energy value, and taking the first energy value corresponding to the ith voice segment as the second energy value;
and taking the historical maximum energy value as the second energy value when the first energy value corresponding to the ith voice segment is smaller than the historical maximum energy value.
It should be noted that, the energy information is obtained for all the speech segments included in the candidate speech frame, and the first energy value and the second energy value of each speech segment are sequentially calculated. The historical maximum energy value refers to the energy value with the maximum value in the first energy values of the 1 st to the (i-1) th voice segments, and is also the second energy value of the (i-1) th voice segment.
Alternatively, the process of obtaining the first energy value and the second energy value corresponding to a speech segment may be, but is not limited to, as shown in fig. 6. Suppose the candidate speech frame containing the speech segments is divided into N speech segments, arranged in their original time order and numbered 1 to N, and the current speech segment is the nth segment of the candidate speech frame, where 1 ≤ n ≤ N. Step S602 is executed: all energy points contained in speech segments 1 to n and their corresponding energy values are determined, and E1(n) is calculated. After E1(n) is obtained, step S604 is executed to determine whether E1(n) is less than E2(n-1). If the result is yes, i.e., E1(n) < E2(n-1), step S606 is executed to take E2(n-1) as E2(n). If the result is no, i.e., E1(n) ≥ E2(n-1), step S608 is executed to take E1(n) as E2(n).
As an optional implementation manner, the obtaining energy information corresponding to each candidate speech frame in the multiple candidate speech frames includes:
determining a first energy value of a candidate voice frame according to the first energy value of the voice segment;
determining a second energy value of the candidate voice frame according to the second energy value of the voice segment, and adjusting the second energy value of the candidate voice frame according to the scaling coefficient to obtain the adjusted second energy value of the candidate voice frame;
and under the condition that the first energy value of the candidate voice frame is greater than the adjusted second energy value of the candidate voice frame, determining that the energy information of the candidate voice frame meets the judgment condition.
It should be noted that the first energy value of the candidate speech frame is the average energy value over all speech segments included in the candidate speech frame, and the second energy value of the candidate speech frame is the maximum among the average energy values of all speech segments included in the candidate speech frame. Following the manner of obtaining the first energy value and the second energy value of a speech segment described above, the first energy value and the second energy value of the last speech segment of the candidate speech frame are used as the first energy value and the second energy value of the candidate speech frame. For example, if the candidate speech frame includes N speech segments, the first energy value and the second energy value of the Nth speech segment are used as the first energy value and the second energy value of the candidate speech frame.
Optionally, the value of the scaling factor is between 0-1.
Optionally, the process of obtaining the energy information of a candidate speech frame from the energy information of its speech segments, and of obtaining the judgment result for the candidate speech frame through that energy information, may be, but is not limited to, as shown in fig. 7. Suppose the candidate speech frame is divided into N speech segments, arranged in their original time order and numbered 1 to N. Step S702 is executed to obtain the current speech segment determined by the candidate speech frame and its number n. With the current speech segment being the nth segment, step S704 is executed to determine whether n is less than N. If the result is yes, i.e., n < N, step S706 is executed to calculate E1(n) and E2(n). After the calculation is completed, step S708 is executed to set n = n + 1. After the assignment is completed, the process returns to step S704.
Since n is the number of the current speech segment and 1 ≤ n ≤ N, a negative result in step S704 means n = N. In that case, i.e., n = N, step S710 is executed to calculate E1(N) and E2(N). After E1(N) and E2(N) are obtained, step S712 is executed to set, for the candidate speech frame, E1 = E1(N) and E2 = E2(N)/10. After the assignment is completed, step S714 is executed to determine whether E1 is greater than E2. If the result is yes, i.e., E1 > E2, step S716 is executed to determine the candidate speech frame as a valid speech frame. If the result is no, i.e., E1 ≤ E2, step S718 is executed to determine the candidate speech frame as an invalid speech frame.
In the embodiments of the application, the energy information of a candidate speech frame is obtained by calculating the first energy value and the second energy value of its speech segments one by one, and the candidate speech frame output by the first recognition model is judged again according to whether this energy information meets the set condition, which improves the detection accuracy of valid speech frames and thereby the accuracy of the voice endpoint detection result.
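Putting the pieces together, the decision of fig. 7 over a candidate speech frame can be sketched as below, reusing the segment_energies helper from the earlier sketch; the scaling coefficient of 0.1 mirrors the E2(N)/10 example above and is assumed to be a tunable value between 0 and 1.

```python
def is_valid_speech_frame(segments, scaling=0.1):
    """Judge a candidate speech frame from its N speech segments: it is valid when
    E1 (average energy over all its segments) exceeds the adjusted E2 (maximum of
    the running averages, scaled by the scaling coefficient)."""
    e1, e2 = segment_energies(segments)   # E1(1..N), E2(1..N)
    frame_e1 = e1[-1]                     # first energy value of the candidate frame, E1(N)
    frame_e2 = e2[-1] * scaling           # adjusted second energy value, e.g. E2(N)/10
    return frame_e1 > frame_e2
```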
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art will also appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.
According to another aspect of the embodiment of the present invention, a voice endpoint detection apparatus for implementing the voice endpoint detection method is also provided. As shown in fig. 8, the apparatus includes:
a first obtaining module 802, configured to obtain target audio data to be identified;
the recognition module 804 is configured to input the target audio data into a first speech recognition model to obtain a recognition result, where the first speech recognition model is a deep neural network model for recognizing a speech frame included in the target audio data;
a second obtaining module 806, configured to obtain energy information corresponding to each candidate speech frame in the multiple candidate speech frames when the multiple candidate speech frames are identified from the target audio data;
a determining module 808, configured to determine a speech endpoint including a valid speech frame in the plurality of candidate speech frames if the energy information satisfies the determination condition.
In the embodiments of the invention, the voice data is input into the first voice recognition model to obtain the recognition result for the voice frames contained in the voice data; when a voice frame is recognized, it is judged again through its energy information, and only when the judgment condition is met is the voice frame determined to be a valid voice frame endpoint, which improves the accuracy of the voice endpoint detection result.
As an optional implementation manner, the voice endpoint detection apparatus further includes:
the training module is used for acquiring a plurality of sample audio data for training before acquiring target audio data to be recognized so as to obtain a first voice recognition model, wherein the first voice recognition model is a deep neural network model constructed based on a Keras framework.
Optionally, the training module is further configured to train a voice tagging tool for tagging the audio frame when training the first voice recognition model. Optionally, the voice labeling tool generates a "1" label when judging that the current audio frame is a voice frame, and performs label labeling on the current frame; and under the condition that the current frame is judged to be a non-voice frame, generating a '0' label, and labeling the current frame.
As an optional implementation manner, the identification module further includes:
the framing unit is used for framing the target audio data in the first voice recognition model to obtain a plurality of audio frames;
the preprocessing unit is used for respectively preprocessing the plurality of audio frames to obtain respective corresponding audio features;
the calculating unit is used for calculating and obtaining the voice recognition probability corresponding to each of the plurality of audio frames based on the audio characteristics, wherein the audio frames are determined to be voice frames under the condition that the voice recognition probability is greater than a first threshold value; and determining the audio frame as a non-speech frame under the condition that the speech recognition probability is less than a first threshold value.
As an optional implementation, the preprocessing unit includes:
the high-frequency signal enhancement unit is used for carrying out high-frequency signal enhancement processing on the audio frame to obtain the audio frame after signal enhancement;
the time-frequency conversion unit is used for carrying out time-frequency conversion processing on the audio frame after the signal enhancement to obtain an audio frame converted into a frequency domain;
the filtering unit is used for filtering the audio frame converted into the frequency domain to obtain a filtered audio frame;
the zooming unit is used for zooming the filtered audio frame to obtain a zoomed audio frame;
and the dynamic difference unit is used for carrying out dynamic difference processing on the zoomed audio frame to obtain the audio features.
Alternatively, the high frequency signal enhancement unit may be, but is not limited to being, accomplished with a filter.
The high-frequency signal enhancement unit inputs the audio frame obtained by framing into the high-pass filter, and enhances the high-frequency part in the audio frame so as to simplify the process of solving the frequency spectrum according to the signal-to-noise ratio, so that the frequency spectrum can be obtained by using the same signal-to-noise ratio in the whole frequency band from low frequency to high frequency. In particular, the signal enhancement processing may, but is not limited to, use the following formula:
s(n)=s(n)-k×s(n-1) (4)
where k is the enhancement coefficient and n indexes the sample points within each audio frame.
Alternatively, the time-frequency converting unit may be, but is not limited to, converting the time-domain signal into a frequency-domain signal, and in particular, is not limited to converting the time-domain signal of the audio frame into a frequency-domain signal by using fourier transform.
Alternatively, the filtering unit may be, but is not limited to being, implemented by a triangular filter bank. The audio frame after time-frequency conversion is passed through a set of Mel-scale triangular filter banks: a filter bank of M filters is defined, where the number of filters is close to the number of critical bands, and the filters used are triangular filters with center frequencies f(m), m = 1, 2, ..., M. The spacing between adjacent center frequencies f(m) decreases as m decreases and increases as m increases.
Alternatively, the scaling unit may, but is not limited to, perform a logarithmic function calculation on the operation result obtained by the filtering process.
Alternatively, the dynamic difference unit may, but is not limited to, extract dynamic difference parameters of the scaled audio frame, and use the dynamic difference parameters as the audio features.
In the embodiments of the application, the signal enhancement unit removes the effect of the vocal cords and lips during sound production, compensates the high-frequency part of the audio frame that is suppressed by the vocal system, and highlights the high-frequency formants; the time-frequency conversion unit converts the time-domain signal of the audio frame into a frequency-domain signal to facilitate subsequent feature extraction; the filtering unit smooths the spectrum of the frequency-domain signal, eliminates harmonics, highlights the formants of the audio data and reduces the amount of computation; the scaling unit amplifies the energy differences; and the dynamic difference unit extracts the dynamic differential parameters.
As an alternative implementation, as shown in fig. 9, the second obtaining module 806 further includes:
a first obtaining unit 902, configured to obtain a first energy value corresponding to each continuous speech segment in a plurality of candidate speech frames, where the first energy value is used to represent an average energy value of a speech signal of the speech segment;
a second obtaining unit 904, configured to determine a second energy value according to the first energy value, where the second energy value is used to represent a maximum energy value of the plurality of first energy values;
and a synthesizing unit 906, configured to use the first energy value and the second energy value as energy information of the candidate speech frame.
Alternatively, the first obtaining unit may, but is not limited to, calculate using the following formula when obtaining the first energy value of the nth speech segment:
E1(n) = (1/M) × (|x(1)| + |x(2)| + ... + |x(M)|) (5)
wherein n represents the sequential number of the current speech segment in the candidate speech frame, M represents the number of energy points contained in the first n speech segments, and |x(m)| represents the absolute value of the energy value of the mth energy point.
Alternatively, the second obtaining unit may, but is not limited to, calculate using the following formula when obtaining the second energy value of the nth speech segment:
E2(n) = MAX[E1(1), ..., E1(n)] (6)
wherein n is a positive integer of 1 or more.
Optionally, the second obtaining module further includes:
the execution unit repeatedly executes the following steps until all voice segments of the candidate voice frame are traversed;
the first subunit is used for acquiring a first energy value corresponding to the ith voice segment, wherein i is a positive integer;
the first comparison unit is used for comparing a first energy value corresponding to the ith voice fragment with a historical maximum energy value;
the second subunit is used for taking the first energy value corresponding to the ith voice fragment as the updated maximum energy value and taking the first energy value corresponding to the ith voice fragment as the second energy value when the first energy value corresponding to the ith voice fragment is larger than the historical maximum energy value;
and the third subunit is used for taking the historical maximum energy value as the second energy value when the first energy value corresponding to the ith voice segment is smaller than the historical maximum energy value.
Optionally, as shown in fig. 9, the second obtaining module 806 further includes:
a first determining unit 912, configured to determine a first energy value of the candidate speech frame according to the first energy value of the speech segment;
a second determining unit 914, configured to determine the second energy value of the candidate speech frame according to the second energy values of the speech segments, and to adjust the second energy value of the candidate speech frame according to the scaling coefficient to obtain the adjusted second energy value of the candidate speech frame;
a third determining unit 916, configured to determine that the energy information of the candidate speech frame satisfies the determination condition when the first energy value of the candidate speech frame is greater than the adjusted second energy value of the candidate speech frame.
In the embodiment of the application, the second obtaining module determines the energy information of the candidate voice frame by obtaining the first energy value and the second energy value of each voice segment of the candidate voice frame, and determines whether the candidate voice frame is an effective voice frame by comparing the first energy value and the second energy value of the candidate voice frame, so that the candidate voice frame is judged again by the energy information, the detection accuracy of the effective voice frame is improved, and the accuracy of the voice endpoint detection result is improved.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the voice endpoint detection method, where the electronic device may be a terminal device or a server shown in fig. 1. The present embodiment takes the electronic device as a server as an example for explanation. As shown in fig. 10, the electronic device comprises a memory 1002 and a processor 1004, the memory 1002 having stored therein a computer program, the processor 1004 being arranged to execute the steps of any of the method embodiments described above by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring target audio data to be identified;
s2, inputting the target audio data into a first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing a voice frame contained in the target audio data;
s3, under the condition that a plurality of candidate voice frames are identified from the target audio data, acquiring energy information corresponding to each candidate voice frame in the plurality of candidate voice frames;
and S4, determining a voice endpoint containing a valid voice frame in the candidate voice frames under the condition that the energy information meets the judgment condition.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 10 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 10 does not limit the structure of the electronic device; for example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 10, or have a configuration different from that shown in fig. 10.
The memory 1002 may be configured to store software programs and modules, such as program instructions/modules corresponding to the voice endpoint detection method and apparatus in the embodiment of the present invention, and the processor 1004 executes various functional applications and data processing by running the software programs and modules stored in the memory 1002, that is, the voice endpoint detection method is implemented. The memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1002 may further include memory located remotely from the processor 1004, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1002 may be specifically, but not limited to, used for storing information such as sample characteristics of an item and a target virtual resource account number. As an example, as shown in fig. 10, the memory 1002 may include, but is not limited to, a first obtaining module 802, a recognition module 804, a second obtaining module 806, and a determining module 808 of the voice endpoint detection apparatus. In addition, the device may further include, but is not limited to, other module units in the voice endpoint detection apparatus, which is not described in detail in this example.
Optionally, the above-mentioned transmission device 1006 is used for receiving or sending data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmission device 1006 includes a Network adapter (NIC) that can be connected to a router via a Network cable and other Network devices so as to communicate with the internet or a local area Network. In one example, the transmission device 1006 is a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In addition, the electronic device further includes: a display 1008 for displaying the information of the order to be processed; and a connection bus 1010 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through a network communication. The nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, and other electronic devices, may become a node in the blockchain system by joining the Peer-To-Peer network.
According to yet another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the voice endpoint detection method provided in the various alternative implementations of the voice endpoint detection aspect described above. Wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring target audio data to be identified;
s2, inputting the target audio data into a first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing a voice frame contained in the target audio data;
s3, under the condition that a plurality of candidate voice frames are identified from the target audio data, acquiring energy information corresponding to each candidate voice frame in the plurality of candidate voice frames;
and S4, determining a voice endpoint containing a valid voice frame in the candidate voice frames under the condition that the energy information meets the judgment condition.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only Memories (ROM), Random Access Memories (RAM), magnetic disks, optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be essentially or partially contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, or network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also fall within the protection scope of the present invention.

Claims (8)

1. A method for voice endpoint detection, comprising:
acquiring target audio data to be identified;
inputting the target audio data into a first voice recognition model for framing processing to obtain a plurality of audio frames, and labeling the audio frames through the first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model for recognizing voice frames contained in the target audio data;
under the condition that a plurality of candidate voice frames are identified from the target audio data, each candidate voice frame is divided again to obtain continuous voice segments for energy calculation, and first energy values corresponding to the voice segments are obtained, wherein the first energy value corresponding to the nth voice segment is obtained as:
E1(n) = (1/M) × Σ_{m=1}^{M} |x(m)|
wherein E1(n) represents the first energy value corresponding to the nth speech segment, n represents the sequence number of the current speech segment in the candidate speech frame, M represents the number of energy points contained in the first n speech segments, and |x(m)| represents the absolute value of the energy value of the mth energy point;
determining a first energy value corresponding to the candidate voice frame based on the first energy values corresponding to the voice segments, wherein the first energy value corresponding to the candidate voice frame is the average energy value of all the voice segments included in the candidate voice frame;
determining a second energy value corresponding to the candidate voice frame according to the first energy values corresponding to the voice segments, wherein the second energy value corresponding to the candidate voice frame is the maximum value of the first energy values of all the voice segments included in the candidate voice frame;
adjusting the second energy value corresponding to the candidate voice frame according to a scaling coefficient to obtain an adjusted second energy value corresponding to the candidate voice frame;
and determining a voice endpoint containing a valid voice frame in the plurality of candidate voice frames under the condition that the first energy value corresponding to the candidate voice frame is greater than the adjusted second energy value corresponding to the candidate voice frame.
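Purely as an illustration of the energy criterion recited in claim 1, the following sketch computes the cumulative segment energies E1(n), takes the frame's average and maximum E1 as its first and second energy values, and applies a scaling coefficient; the segment count of 4, the coefficient of 0.5 and the function names are assumptions, not details from the claim.

    import numpy as np

    def segment_first_energies(frame, n_seg=4):
        """E1(n) for n = 1..n_seg: the mean of |x(m)| over all M energy points
        contained in the first n segments of the candidate frame."""
        segments = np.array_split(np.abs(np.asarray(frame, dtype=float)), n_seg)
        e1, total, count = [], 0.0, 0
        for seg in segments:
            total += seg.sum()
            count += seg.size
            e1.append(total / count)      # (1/M) * sum of |x(m)| over the first n segments
        return np.array(e1)

    def frame_is_valid_speech(frame, scale=0.5, n_seg=4):
        """Decide a candidate frame as in claim 1: average E1 vs. scaled maximum E1."""
        e1 = segment_first_energies(frame, n_seg)
        first_energy = e1.mean()          # first energy value of the candidate frame
        second_energy = e1.max()          # second energy value of the candidate frame
        adjusted_second = scale * second_energy
        return first_energy > adjusted_second

Since the maximum of the E1 values can never be smaller than their average, the comparison is only attainable when the scaling coefficient is below 1.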
2. The method of claim 1, further comprising:
repeatedly executing the following steps until all the voice segments of the candidate voice frame have been traversed:
acquiring a first energy value corresponding to an ith voice segment, wherein i is a positive integer;
comparing the first energy value corresponding to the ith voice segment with a historical maximum energy value;
when the first energy value corresponding to the ith voice segment is greater than the historical maximum energy value, taking the first energy value corresponding to the ith voice segment as the updated historical maximum energy value and as the second energy value;
and when the first energy value corresponding to the ith voice segment is smaller than the historical maximum energy value, taking the historical maximum energy value as the second energy value.
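The traversal in claim 2 amounts to a running-maximum update; the short sketch below assumes the per-segment first energy values are already available, for instance from a helper like the hypothetical segment_first_energies above.

    def second_energy_value(segment_energies):
        """Traverse the segments' first energy values, tracking the historical maximum;
        the value held at the end of the traversal is the frame's second energy value."""
        historical_max = float("-inf")
        for e1_i in segment_energies:          # first energy value of the i-th segment
            if e1_i > historical_max:
                historical_max = e1_i          # update the historical maximum
        return historical_max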
3. The method of claim 1, wherein the inputting the target audio data into a first speech recognition model for framing to obtain a plurality of audio frames, and labeling the audio frames by the first speech recognition model to obtain a recognition result comprises:
performing framing processing on the target audio data in the first speech recognition model to obtain a plurality of audio frames;
respectively preprocessing the plurality of audio frames to obtain respective corresponding audio features;
calculating a voice recognition probability corresponding to each of the plurality of audio frames based on the audio features, wherein an audio frame is determined to be a voice frame under the condition that its voice recognition probability is greater than a first threshold value, and is determined to be a non-voice frame under the condition that its voice recognition probability is less than the first threshold value.
4. The method of claim 3, wherein the pre-processing the audio frames to obtain respective audio features comprises:
performing high-frequency signal enhancement processing on the audio frame to obtain the audio frame after signal enhancement;
performing time-frequency conversion processing on the audio frame after the signal enhancement to obtain the audio frame converted into a frequency domain;
filtering the audio frame converted into the frequency domain to obtain the filtered audio frame;
scaling the filtered audio frame to obtain a scaled audio frame;
and carrying out dynamic differential processing on the scaled audio frame to obtain the audio features.
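As one possible reading of the preprocessing chain in claims 3 and 4 (high-frequency enhancement, time-frequency conversion, filtering, scaling and dynamic differencing), the sketch below produces log-Mel-style features; the 0.97 pre-emphasis factor, the Mel filter bank, the use of NumPy and librosa, and the per-frame delta are all assumptions made for illustration rather than the disclosed implementation.

    import numpy as np
    import librosa

    def frame_features(frame, sr=16000, n_fft=512, n_mels=40):
        """Illustrative per-frame feature extraction following the claimed steps."""
        frame = np.asarray(frame, dtype=float)
        # High-frequency signal enhancement: first-order pre-emphasis.
        emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
        # Time-frequency conversion: magnitude spectrum of the frame.
        spectrum = np.abs(np.fft.rfft(emphasized, n=n_fft))
        # Filtering: a Mel filter bank applied to the power spectrum.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        mel_energies = mel_fb @ (spectrum ** 2)
        # Scaling: logarithmic compression of the dynamic range.
        log_mel = np.log(mel_energies + 1e-10)
        # Dynamic differential processing: first-order difference of the coefficients
        # (taken across coefficients because a single frame is processed in isolation;
        # delta features are more commonly computed across neighbouring frames).
        delta = np.diff(log_mel, prepend=log_mel[0])
        return np.concatenate([log_mel, delta])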
5. The method according to any one of claims 1 to 4, further comprising, before the obtaining target audio data to be identified:
obtaining a plurality of pieces of sample audio data for training the first voice recognition model, wherein the first voice recognition model is a deep neural network model constructed based on the Keras framework.
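Claim 5 only fixes that the first voice recognition model is a deep neural network built on the Keras framework; a minimal sketch of such a frame classifier might look as follows, where the architecture, layer sizes and training settings are assumptions, and the final threshold line corresponds to the probability comparison in claim 3.

    from tensorflow import keras

    def build_speech_frame_model(feature_dim=80):
        """Tiny fully connected speech / non-speech frame classifier (illustrative only)."""
        model = keras.Sequential([
            keras.Input(shape=(feature_dim,)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),   # per-frame speech probability
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Sketch of training on labelled sample audio features and labelling new frames:
    # model = build_speech_frame_model()
    # model.fit(train_features, train_labels, epochs=10, batch_size=64)
    # probs = model.predict(frame_feature_batch)
    # is_speech = probs[:, 0] > 0.5          # first-threshold comparison as in claim 3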
6. A voice endpoint detection apparatus, comprising:
the first acquisition module is used for acquiring target audio data to be identified;
the recognition module is used for inputting the target audio data into a first voice recognition model for framing processing to obtain a plurality of audio frames, and labeling the audio frames through the first voice recognition model to obtain a recognition result, wherein the first voice recognition model is a deep neural network model used for recognizing voice frames contained in the target audio data;
a second obtaining module, configured to, when multiple candidate speech frames are identified from the target audio data, divide each candidate speech frame again to obtain continuous speech segments used for energy calculation, and obtain first energy values corresponding to the speech segments, where the first energy value corresponding to the nth speech segment is obtained as:
E1(n) = (1/M) × Σ_{m=1}^{M} |x(m)|
where E1(n) represents the first energy value corresponding to the nth speech segment, n represents the sequence number of the current speech segment in the candidate speech frame, M represents the number of energy points contained in the first n speech segments, and |x(m)| represents the absolute value of the energy value of the mth energy point;
the apparatus is further configured to determine a first energy value corresponding to the candidate speech frame based on the first energy values corresponding to the speech segments, where the first energy value corresponding to the candidate speech frame is the average energy value of all the speech segments included in the candidate speech frame;
the apparatus is further configured to determine a second energy value corresponding to the candidate speech frame according to the first energy values corresponding to the speech segments, where the second energy value corresponding to the candidate speech frame is the maximum value of the first energy values of all the speech segments included in the candidate speech frame;
the apparatus is further configured to adjust the second energy value corresponding to the candidate speech frame according to a scaling coefficient, so as to obtain an adjusted second energy value corresponding to the candidate speech frame;
a determining module, configured to determine a speech endpoint containing a valid speech frame in the multiple candidate speech frames when the first energy value corresponding to the candidate speech frame is greater than the adjusted second energy value corresponding to the candidate speech frame.
7. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 4.
8. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 4 by means of the computer program.
CN202011296495.9A 2020-11-18 2020-11-18 Voice endpoint detection method and device, storage medium and electronic equipment Active CN112420079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296495.9A CN112420079B (en) 2020-11-18 2020-11-18 Voice endpoint detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011296495.9A CN112420079B (en) 2020-11-18 2020-11-18 Voice endpoint detection method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112420079A CN112420079A (en) 2021-02-26
CN112420079B true CN112420079B (en) 2022-12-06

Family

ID=74773942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296495.9A Active CN112420079B (en) 2020-11-18 2020-11-18 Voice endpoint detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112420079B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113225624A (en) * 2021-04-08 2021-08-06 腾讯科技(深圳)有限公司 Time-consuming determination method and device for voice recognition
CN113611330B (en) * 2021-07-29 2024-05-03 杭州网易云音乐科技有限公司 Audio detection method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078093A1 (en) * 2007-12-18 2009-06-25 Fujitsu Limited Non-speech section detecting method and non-speech section detecting device
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010014857A1 (en) * 1998-08-14 2001-08-16 Zifei Peter Wang A voice activity detector for packet voice network
CN106297795B (en) * 2015-05-25 2019-09-27 展讯通信(上海)有限公司 Audio recognition method and device
CN106328169B (en) * 2015-06-26 2018-12-11 中兴通讯股份有限公司 A kind of acquisition methods, activation sound detection method and the device of activation sound amendment frame number
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN106571138B (en) * 2015-10-09 2020-08-11 电信科学技术研究院 Signal endpoint detection method, detection device and detection equipment
KR101943381B1 (en) * 2016-08-22 2019-01-29 에스케이텔레콤 주식회사 Endpoint detection method of speech using deep neural network and apparatus thereof
CN108172242B (en) * 2018-01-08 2021-06-01 深圳市芯中芯科技有限公司 Improved Bluetooth intelligent cloud sound box voice interaction endpoint detection method
CN110310669A (en) * 2019-06-20 2019-10-08 厦门快商通信息咨询有限公司 A kind of method and device and readable storage medium storing program for executing detecting mute frame

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078093A1 (en) * 2007-12-18 2009-06-25 Fujitsu Limited Non-speech section detecting method and non-speech section detecting device
CN108648769A (en) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 Voice activity detection method, apparatus and equipment
CN108877776A (en) * 2018-06-06 2018-11-23 平安科技(深圳)有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109473092A (en) * 2018-12-03 2019-03-15 珠海格力电器股份有限公司 A kind of sound end detecting method and device
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device

Also Published As

Publication number Publication date
CN112420079A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109817246B (en) Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
US10373609B2 (en) Voice recognition method and apparatus
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
CN105632486B (en) Voice awakening method and device of intelligent hardware
JP2022531574A (en) Speech recognition methods and devices, neural network training methods and devices, and computer programs
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
Oh et al. Target speech feature extraction using non-parametric correlation coefficient
CN112420079B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
US20220328065A1 (en) Speech emotion recognition method and system based on fused population information
CN105321525A (en) System and method for reducing VOIP (voice over internet protocol) communication resource overhead
CN108597505A (en) Audio recognition method, device and terminal device
CN111667818A (en) Method and device for training awakening model
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN111583906A (en) Role recognition method, device and terminal for voice conversation
CN111341319A (en) Audio scene recognition method and system based on local texture features
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN111145726B (en) Deep learning-based sound scene classification method, system, device and storage medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN113113048B (en) Speech emotion recognition method and device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant