CN113744730B - Voice detection method and device

Publication number: CN113744730B
Authority: CN (China)
Prior art keywords: voice, data, frequency point, power, frame
Legal status: Active (granted)
Application number: CN202111067585.5A
Other languages: Chinese (zh)
Other versions: CN113744730A
Inventors: 佘积洪, 朱宸都
Current and original assignee: Beijing Eswin Computing Technology Co Ltd
Application filed by Beijing Eswin Computing Technology Co Ltd
Priority to CN202111067585.5A
Publication of CN113744730A
Application granted
Publication of CN113744730B

Classifications

    • G10L15/16 Speech classification or search using artificial neural networks (under G10L15/00 Speech recognition, G10L15/08 Speech classification or search)
    • G10L25/18 Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • Y02D30/70 Reducing energy consumption in wireless communication networks

Abstract

The application provides a sound detection method and device. The sound detection method comprises: acquiring audio data to be detected; determining the type of each frame of data in the audio data, the types including voice and silence; and inputting the voice data corresponding to the frames of the voice type in the audio data into a deep neural network to obtain the sound data belonging to the target. Because the mute data contains little sound, removing the mute data from the audio data in advance and inputting only the voice data into the deep neural network for target sound detection avoids invalid processing of the mute data by the deep neural network, reduces the amount of computation the deep neural network performs on the audio data, and improves the efficiency of target sound detection while preserving its accuracy.

Description

Voice detection method and device
Technical Field
The present application relates to the field of sound detection technologies, and in particular, to a sound detection method, a sound detection device, an electronic device, and a storage medium.
Background
Sound detection is the task of detecting the sound of a target from a piece of audio, and it has wide application prospects. For example, sound detection may serve as front-end pre-processing for speech recognition: a person's voice data is first detected from the audio data and only then passed to speech recognition, which improves recognition efficiency. As another example, sound detection may be used to form a meeting summary: the audio data of a particular speaker is detected from the conference audio and used to build the summary.
Two approaches are commonly used for sound detection. In the first, the target sound in a piece of audio is distinguished from non-target sound by a conventional algorithm (for example, a double-threshold algorithm or a Gaussian mixture model) to obtain the target's sound data. In the second, the target sound is distinguished from non-target sound by a deep neural network to obtain the target's sound data.
However, with the first approach, which detects the target sound through a conventional algorithm, the target sound is not well distinguished from transient sounds (such as a knock on a table or footsteps), so the detection effect is not ideal. With the second approach, which detects the target sound through a deep neural network, the network must evaluate every frame in the audio and output a label indicating whether that frame contains the target sound, so the amount of computation is large and the efficiency of sound detection suffers.
Disclosure of Invention
The embodiment of the application aims to provide a sound detection method, a sound detection device, electronic equipment and a storage medium, so as to improve the efficiency and accuracy of sound detection.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
the first aspect of the present application provides a sound detection method, the method comprising: acquiring audio data to be detected; determining a type of each frame of data in the audio data, the type comprising speech and silence; and inputting the voice data corresponding to the frames belonging to the voice type in the audio data into a deep neural network to obtain the voice data belonging to the target.
A second aspect of the present application provides a sound detection apparatus, the apparatus comprising: the acquisition module is used for acquiring the audio data to be detected; a determining module, configured to determine a type of each frame of data in the audio data, where the type includes speech and silence; and the prediction module is used for inputting the voice data corresponding to the frames belonging to the voice types in the audio data into the deep neural network to obtain the voice data belonging to the target.
Compared with the prior art, the sound detection method provided in the first aspect of the application determines, after acquiring the audio data to be detected, whether each frame of data in the audio data belongs to the voice type or the silence type, and then inputs only the voice data corresponding to the frames of the voice type into the deep neural network to obtain the sound data belonging to the target. Because the mute data contains little sound, removing the mute data in advance and feeding only the voice data to the deep neural network avoids invalid processing of the mute data, reduces the amount of computation performed on the audio data, and improves the efficiency of target sound detection while preserving its accuracy.
The sound detection apparatus provided in the second aspect of the present application, the electronic device provided in the third aspect, and the computer-readable storage medium provided in the fourth aspect have the same or similar advantageous effects as the sound detection method provided in the first aspect.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, wherein like or corresponding reference numerals indicate like or corresponding parts, there are shown by way of illustration, and not limitation, several embodiments of the application, in which:
fig. 1 is a schematic flow chart of a sound detection method according to an embodiment of the application;
FIG. 2 is a second flow chart of a sound detection method according to an embodiment of the application;
FIG. 3 is a probability distribution curve of normalized power of the 19 th frequency point in the first frequency spectrum according to the embodiment of the present application;
FIG. 4 is a schematic diagram of a voice detection architecture according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a voice data structure according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a target voice recognition process for audio data according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a sound detection device according to an embodiment of the present application;
FIG. 8 is a second schematic structural diagram of a sound detection device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
In the prior art, when target sound data is detected from a piece of audio data, a conventional algorithm cannot effectively distinguish the target sound from transient sounds when the non-target sound is transient, which lowers the accuracy of target sound detection. If a deep neural network is instead used to distinguish the target sound data from the non-target sound data, the network must perform target sound detection on every frame of the audio data, so the amount of computation is large and the efficiency of target sound detection is reduced.
The inventors have found through intensive study that the computation load of the deep neural network during sound detection is large because the network must perform one computation for every frame of data. However, not every frame of the audio data to be detected necessarily needs to be input into the deep neural network for target sound detection, because some frames may contain no sound at all, i.e., silence. Inputting this mute data into the deep neural network keeps the accuracy high, but it enlarges the amount of computation and lowers the efficiency of sound detection.
In view of this, an embodiment of the present application provides a sound detection method that, before the audio data to be detected is input into the deep neural network to detect the target's sound data, determines whether each frame of data in the audio data belongs to the voice type or the silence type. Only the voice data is then input into the deep neural network to detect the target's sound data. Because the mute data contains little sound, removing it in advance and feeding only the voice data to the deep neural network avoids invalid processing of the mute data, reduces the amount of computation performed on the audio data, and improves the efficiency of target sound detection while preserving its accuracy.
In practical applications, the target may be a person. Then, the voice of the human speaking is detected in the embodiment of the application. Further, the detected voice of the person can be subjected to semantic recognition or the like. Of course, the target may also be an animal. Then, the sound emitted by the animal is detected in the embodiment of the application. And the emotion, intention and the like of the animal can be known through the sound made by the animal. Of course, the target may also be an object. Then, the sound generated by the object is detected in the embodiment of the application. And the condition of the environment in which the object is located can be determined by sound emitted by the object. In the embodiment of the present application, the specific type of the target is not limited.
Next, a sound detection method provided by the embodiment of the present application will be described in detail.
Fig. 1 is a schematic flow chart of a sound detection method in an embodiment of the present application, and referring to fig. 1, the method may include:
s101: and acquiring audio data to be detected.
In order to realize the detection of the target sound, first, audio data containing the target sound, that is, audio data to be detected, needs to be acquired.
In the audio data to be detected, not only the target sound data but also the non-target sound data may be included. Non-target herein may refer to everything that is not related to the target. For example: when the target is a person, the non-target may be another person, various animals, various objects, environmental noise, or the like.
Of course, the audio data to be detected may contain only the target sound data. However, whoever performs the detection cannot know in advance that the audio data contains only target sound data and no non-target sound data, so target sound detection still needs to be performed on the audio data. The specific content of the audio data is not limited here.
S102: the type of each frame of data in the audio data is determined, the type including speech and silence.
Audio data is not necessarily continuously voiced along the time axis. For example, a person speaking at a meeting does not talk without interruption; there are brief pauses between sentences, or silence while the speaker finishes one passage and prepares the next. Thus, in the audio data some frames correspond to sound data, while others contain no sound data or only weak ambient sound.
The deep neural network would otherwise still process the frames that contain no voice data or only weak sound, which increases its computation load. To improve recognition efficiency, the type of each frame of data in the audio data can therefore be identified first, to judge whether the frame is voice data or mute data, so that the deep neural network processes only the voice data in the audio data and its computation is saved.
For example, assume that a person speaks content A continuously during the time corresponding to frames 1 to 100, then drinks some water and turns the presentation to the next page during the time corresponding to frames 101 to 200, and then speaks content B continuously during the time corresponding to frames 201 to 300. The whole speech, from frame 1 to frame 300, is recorded and forms the audio data. In this audio data, frames 1 to 100 are voice-type data, frames 101 to 200 are mute-type data, and frames 201 to 300 are voice-type data.
The specific manner of determining the type of each frame of data in the audio data may be, but is not limited to, spectral power, amplitude of the sound signal, and the like.
Moreover, voice-type data may include sounds made by the target as well as sounds made by non-targets. For example, the sound emitted by the target may be a human voice, while the sounds emitted by non-targets may come from animals such as cats and dogs, or from objects such as tables and chairs.
S103: and inputting voice data corresponding to frames belonging to the voice type in the audio data into a deep neural network to obtain voice data belonging to the target.
After the data of each frame in the audio data is divided into voice data and mute data, only the voice data in the audio data can be input into the deep neural network to detect target sound, and then the target sound data is obtained from the voice data of the audio data.
The deep neural network may be any deep neural network capable of detecting sound. The specific type of the deep neural network is not limited herein.
After the voice data is input into the deep neural network, the deep neural network can calculate each frame of data in the voice data, and output a probability value that each frame of data is the target voice data. According to the probability value of each frame of data output by the deep neural network, the target sound data can be extracted from the voice data.
As can be seen from the foregoing, in the sound detection method provided by the embodiment of the present application, after the audio data to be detected is obtained, it is determined whether each frame of data in the audio data belongs to the voice type or the silence type, and then the voice data corresponding to the frames of the voice type is input into the deep neural network to obtain the sound data belonging to the target. Because the mute data contains little sound, removing the mute data in advance and feeding only the voice data to the deep neural network for target sound detection avoids invalid processing of the mute data, reduces the amount of computation performed on the audio data, and improves the efficiency of target sound detection while preserving its accuracy.
Further, as a refinement and extension of the method shown in fig. 1, the embodiment of the application further provides a sound detection method. Fig. 2 is a second flow chart of a sound detection method in an embodiment of the present application, and referring to fig. 2, the method may include:
s201: and acquiring audio data to be detected.
Step S201 has the same or similar implementation as step S101, and will not be described here.
S202: first frame data in the audio data is acquired.
In essence, each frame of data in the audio data needs to be processed separately to determine whether each frame of data is of the speech type or of the silence type. Here, a specific manner of determining each frame of the audio data will be described by taking a certain frame of the audio data, that is, the first frame of the audio data as an example. Of course, the first frame data is not intended to limit the frame data to the data of the start frame in the audio data. The first frame data may be data of any one frame of audio data.
In practical application, the frame length and frame shift of one frame of audio data can be set according to practical requirements. For example: when the sampling frequency of the audio data is 16000Hz, the frame length may be 25ms and the frame shift may be 10ms.
S203: and carrying out Fourier transform on the first frame data to obtain a first frequency spectrum, wherein the first frequency spectrum comprises a plurality of frequency points.
In the time domain, the type of the first frame data is not easily determined. In the frequency domain, however, there is much information that is not known in the time domain. Accordingly, the first frame data may be converted to the frequency domain, so that the type of the first frame data is determined through information in the frequency domain.
Specifically, a Fourier transform is performed on the first frame data, which yields the corresponding spectrum of the first frame data, namely the first spectrum. The first spectrum includes a plurality of frequency points, each representing a frequency present in the time-domain waveform of the first frame data. The frequency points of the first spectrum thus clearly reveal the frequency composition of the first frame's time-domain waveform, and each frequency point can be used to assess how likely the first frame data is to be voice-type data.
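Purely for illustration (and not part of the claimed embodiment), the framing and transform of steps S202 to S203 could be sketched in Python as follows; the function names, the Hann window, and the 512-point FFT size are assumptions made for the sketch, while the 400-sample frame length and 160-sample frame shift follow from the 25 ms / 10 ms values at 16 kHz mentioned above.

```python
import numpy as np

def frame_signal(samples, frame_len=400, frame_shift=160):
    """Split a 1-D array of 16 kHz samples into overlapping frames
    (25 ms frame length = 400 samples, 10 ms frame shift = 160 samples)."""
    n_frames = 1 + max(0, (len(samples) - frame_len) // frame_shift)
    return np.stack([samples[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(n_frames)])

def frame_spectrum(frame, n_fft=512):
    """Windowed FFT of one frame: returns the complex half-spectrum Y(λ, k),
    i.e. 257 frequency points for a 512-point transform."""
    windowed = frame * np.hanning(len(frame))
    return np.fft.rfft(windowed, n=n_fft)
```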
S204: and calculating the original power of each frequency point in the first frequency spectrum.
The original power of each frequency point can be calculated through the first frequency spectrum. Specifically, the following formula (1) can be used for calculation.
P'_signal(λ, k) = |Y(λ, k)|²    Formula (1)
where λ denotes the λ-th frame, k denotes the k-th frequency point of the λ-th frame, Y(λ, k) denotes the spectrum at the k-th frequency point of the λ-th frame, and P'_signal(λ, k) denotes the original power of the k-th frequency point of the λ-th frame.
Of course, the power of each frequency point in the frequency spectrum can also be calculated by adopting other modes of calculating the power of the frequency point in the frequency spectrum. The specific calculation method is not limited herein.
It should be noted here that, due to the symmetry of the frequency spectrum, it is not necessary to perform power calculation once for all the frequency points in the first frequency spectrum, but only the first half of the frequency points in all the frequency points in the first frequency spectrum, that is, the power of the preset frequency point, may be calculated. In this way, the processing speed for the frequency bin can be improved. And then only the type corresponding to the preset frequency point is judged, so that the judging speed of the type can be improved, and the detection efficiency of sound is improved.
For example, assume that 150,000 audio files need to be processed. Each file is divided into frames with a frame length of 25 ms and a frame shift of 10 ms, giving multiple frames of data. For each frame of data, a 512-point short-time Fourier transform is performed to obtain its spectrum. Because of spectral symmetry, the spectrum of the first 257 frequency points is sufficient to represent each frame of data, so only the power of the first 257 frequency points in the spectrum is calculated.
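As a hypothetical illustration of step S204 (not taken from the patent itself), the raw power of the retained half-spectrum can be computed directly from formula (1):

```python
import numpy as np

def raw_power(spectrum_half):
    """P'_signal(λ, k) = |Y(λ, k)|² for the first 257 of 512 frequency points.
    `spectrum_half` is assumed to be the complex half-spectrum of one frame,
    e.g. the output of the hypothetical frame_spectrum() sketched above."""
    return np.abs(spectrum_half) ** 2
```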
S205: and carrying out smoothing treatment on the original power of each frequency point in the first frequency spectrum to obtain the power of each frequency point in the first frequency spectrum.
After the original power of each frequency point in the first frequency spectrum is calculated, in order to improve the accuracy of the power value of each frequency point and further improve the accuracy of determining the frame data type, the original power of each frequency point in the first frequency spectrum can be subjected to smoothing treatment. Specifically, the processing can be performed by the following formula (2).
P_signal(λ, k) = α·P'_signal(λ, k) + (1 − α)·|Y(λ, k)|²    Formula (2)
where λ denotes the λ-th frame, k denotes the k-th frequency point of the λ-th frame, Y(λ, k) denotes the spectrum at the k-th frequency point of the λ-th frame, P'_signal(λ, k) denotes the original power of the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the smoothed power of the k-th frequency point of the λ-th frame, and α denotes the smoothing factor. Typically, α takes a value between 0.5 and 1. If α is too small, the smoothing effect is poor; if α is too large, the power is over-smoothed and the detail of the original frequency-point power is lost.
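For illustration only, a minimal sketch of this smoothing step is given below; it assumes that the recursion runs against the previous frame's smoothed power, which is the usual first-order recursive reading of formula (2), and the value α = 0.7 is merely an example.

```python
import numpy as np

def smooth_power(prev_smoothed, raw_power, alpha=0.7):
    """First-order recursive smoothing of the per-bin power (formula (2)),
    assuming the α-weighted term is the previous frame's smoothed power."""
    if prev_smoothed is None:          # first frame: nothing to smooth against yet
        return np.asarray(raw_power, dtype=float).copy()
    return alpha * prev_smoothed + (1.0 - alpha) * raw_power
```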
Since the power of each frequency point needs to be normalized later, the minimum power of each frequency point needs to be determined in advance, and the minimum power of each frequency point is an estimate of noise power. When determining the minimum power of each frequency point in the first frequency spectrum, a continuous frequency spectrum minimum value tracking mode can be adopted. A specific manner of determining the minimum power of each frequency point in the first spectrum will be described below by taking a certain frequency point in the first spectrum, that is, a target frequency point as an example. However, this is not intended to limit the determination of the minimum power to only one frequency point in the spectrum, but rather the determination of the minimum power for each frequency point in the spectrum is required once.
Of course, the step S206 may be directly performed without performing the smoothing process on the original power of each frequency point in the first spectrum, that is, after the step S204 is performed, the step S205 may be skipped. In this way, the original power of each frequency point in the first frequency spectrum is directly used as the power of each frequency point in the first frequency spectrum for subsequent processing.
S206: when the power of the target frequency point in the first frequency spectrum is larger than the minimum power of the corresponding frequency point in the second frequency spectrum, calculating the minimum power of the target frequency point according to the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum and the minimum power of the corresponding frequency point in the second frequency spectrum.
S207: when the power of the target frequency point in the first frequency spectrum is smaller than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.
The second frame data corresponding to the second frequency spectrum is the last frame data of the first frame data in the audio data. When determining the minimum power of each frequency point corresponding to the first frame data, the last frame data of the first frame data, that is, the minimum power of each frequency point in the second frequency spectrum corresponding to the second frame data, needs to be referred to.
Specifically, it is determined whether the following formula (3) holds.
P_signal,min(λ − 1, k) < P_signal(λ, k)    Formula (3)
where P_signal,min(λ − 1, k) denotes the minimum power of the k-th frequency point of the (λ − 1)-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point of the λ-th frame.
If the above formula (3) holds, it is explained that the power of the frequency point corresponding to the current frame data increases, the minimum power of the frequency point is calculated by the following formula (4).
P_signal,min(λ, k) = γ·P_signal,min(λ − 1, k) + ((1 − γ)/(1 − β))·(P_signal(λ, k) − β·P_signal(λ − 1, k))    Formula (4)
where P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, P_signal,min(λ − 1, k) denotes the minimum power of the k-th frequency point of the (λ − 1)-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, P_signal(λ − 1, k) denotes the power of the k-th frequency point of the (λ − 1)-th frame, λ denotes the λ-th frame, k denotes the k-th frequency point of the λ-th frame, and β and γ are related parameters.
Furthermore, both β and γ can take values between 0 and 1. Preferably, β takes the value 0.96 and γ takes the value 0.998. Of course, β and γ may take other values between 0 and 1; their specific values are not limited herein.
In essence, formula (4) implements a first-order differencing operation, which approximates the derivative in the discrete case and speeds up the calculation. When the power of the noisy signal at the frequency point corresponding to the current frame increases, the derivative is positive and the noise estimate is raised; when that power decreases, the derivative is negative and the noise estimate is lowered.
If the above formula (3) is not satisfied, which indicates that the power of the frequency point corresponding to the current frame data is reduced or unchanged, the minimum power of the frequency point is calculated by the following formula (5).
P_signal,min(λ, k) = P_signal(λ, k)    Formula (5)
where P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point of the λ-th frame.
Note that steps S206 and S207 are alternatives: for each frequency point, only one of them is performed.
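As a non-authoritative sketch of the minimum-power (noise floor) tracking in steps S206 to S207, the branch selection of formula (3) and the updates of formulas (4) and (5) could be combined as follows; the rising-power update mirrors the classic first-order-difference tracking rule and should be read as an assumption rather than a verbatim reproduction of the patent's formula (4).

```python
import numpy as np

def update_min_power(p_min_prev, p_prev, p_curr, beta=0.96, gamma=0.998):
    """Per-bin continuous minimum tracking.
    p_min_prev: minimum power of each bin in frame λ-1;
    p_prev, p_curr: (smoothed) power of each bin in frames λ-1 and λ."""
    rising = p_min_prev < p_curr                                   # formula (3)
    tracked = (gamma * p_min_prev
               + (1.0 - gamma) / (1.0 - beta) * (p_curr - beta * p_prev))
    return np.where(rising, tracked, p_curr)                       # rising: formula (4); otherwise: formula (5)
```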
After the minimum power of each frequency point in the first frequency spectrum is determined, subtracting the minimum power of the corresponding frequency point in the first frequency spectrum from the power of each frequency point in the first frequency spectrum to obtain the noise-removed signal power of each frequency point in the first frequency spectrum. And normalizing the denoising signal power of each frequency point, and judging whether the corresponding frequency point belongs to voice or silence based on the normalized power. Before this, a criterion for the determination, i.e. the preset power range, needs to be determined in advance.
S208: and constructing a probability distribution model of each frequency point in the first frequency spectrum.
S209: and integrating the probability curve in the probability distribution model to obtain two power values corresponding to the preset probability in the probability distribution model.
S210: and taking the two power values as preset power ranges of corresponding frequency points.
For each frequency point in the first spectrum, a probability distribution model needs to be constructed. And further, determining a reference, namely a preset power range, for judging the type of each frequency point based on each constructed probability distribution model. The specific process of determining the preset power range of the 19 th frequency point is described below by taking a probability distribution diagram of normalized power of a certain frequency point, for example, the 19 th frequency point, in the first frequency spectrum as an example. Of course, this is not intended to limit the first spectrum to having the 19 th frequency bin, and the 19 th frequency bin is merely exemplary.
Fig. 3 is a probability distribution curve of the normalized power of the 19th frequency point in the first spectrum in an embodiment of the present application. Referring to Fig. 3, the abscissa is the normalized power and the ordinate is the probability. A horizontal straight line T1 is intersected with the probability distribution curve, cutting off a section of probability curve S1. Integrating the probability curve S1 gives a probability P. The horizontal line T1 is moved until the probability P equals 75%; the range between the two normalized power thresholds thr1 and thr2 corresponding to the intersection points of the probability curve S1 with the horizontal line T1 is then taken as the preset power range.
Of course, the above 75% probability is also merely an exemplary illustration. The specific value of the probability can be set according to actual needs. Specific numerical values of the probabilities are not limited herein.
The preset power range is set in this way because, if a certain frame of data belongs to voice rather than silence, the normalized power of its 19th frequency point, i.e., P_voice(λ, 19), will with high probability fall within the preset power range thr1 to thr2. The preset power range can therefore be used to judge whether the 19th frequency point of this frame belongs to voice, and it can be reused to judge whether the 19th frequency point of other frames belongs to voice. In other words, the preset power range of each frequency point can be determined in advance from one frame of data, and the corresponding frequency points of all other frames can reuse those preset power ranges. This reduces the computation spent on the preset power ranges and improves the efficiency of sound detection.
Here, the probability distribution curve of the normalized power of each frequency point may be obtained by any method of generating a probability distribution curve of the power of the frequency point. The specific acquisition mode of the probability distribution curve of the frequency point power is not limited herein.
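To make the horizontal-line construction around Fig. 3 concrete, the following hypothetical sketch estimates thr1 and thr2 for one frequency point as the narrowest normalized-power interval carrying 75% of the probability mass; the histogram-based estimate, the bin count, and the assumption of a unimodal curve are all choices made for illustration.

```python
import numpy as np

def preset_power_range(norm_powers, mass=0.75, bins=200):
    """Estimate (thr1, thr2) for one frequency point from a large sample of its
    normalized powers taken from known voice frames."""
    hist, edges = np.histogram(norm_powers, bins=bins, density=True)
    widths = np.diff(edges)
    order = np.argsort(hist)[::-1]      # lower the horizontal line step by step
    covered, kept = 0.0, []
    for idx in order:
        covered += hist[idx] * widths[idx]
        kept.append(idx)
        if covered >= mass:             # integrated probability P reaches 75%
            break
    return edges[min(kept)], edges[max(kept) + 1]
```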
After the power of each frequency point is obtained in the step S205, the minimum power of each frequency point is obtained in the steps S206 to S207, and the preset power range of each frequency point is obtained in the steps S208 to S210, the power of each frequency point in the first spectrum may be normalized, so as to determine whether the corresponding frequency point belongs to speech or silence based on the normalized power.
S211: subtracting the power of each frequency point in the first frequency spectrum from the minimum power of the corresponding frequency point to obtain the denoising signal power of each frequency point, and further carrying out normalization processing on the denoising signal power of each frequency point to obtain the normalization power of each frequency point.
Here, the corresponding frequency point is the frequency point in the first spectrum that matches each frequency point. For example, if the first spectrum contains the 1st to 257th frequency points, each of them corresponds to a minimum power: the 1st frequency point corresponds to minimum power a, the 2nd frequency point to minimum power b, ..., and the 257th frequency point to minimum power s. The minimum power is obtained by frequency-point-based nonlinear tracking, i.e., steps S206 to S207 above, and the minimum power of a frequency point represents the noise at that frequency point.
Specifically, the denoising signal power for each frequency point in the first spectrum can be obtained by the following equation (6).
P_voice(λ, k) = P_signal(λ, k) − P_signal,min(λ, k)    Formula (6)
where P_voice(λ, k) denotes the denoised signal power of the k-th frequency point of the λ-th frame, P_signal(λ, k) denotes the power of the k-th frequency point of the λ-th frame, P_signal,min(λ, k) denotes the minimum power of the k-th frequency point of the λ-th frame, λ denotes the λ-th frame, and k denotes the k-th frequency point of the λ-th frame.
After the denoising signal power of each frequency point is obtained, the denoising signal power of each frequency point can be normalized, and then the normalized power of each frequency point is obtained. Here, any normalization processing method may be used to normalize the noise-removed signal power of each frequency point. The specific manner of normalization is not described here.
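A minimal sketch of step S211 is shown below; the patent does not fix a particular normalization, so the max-normalization across the frame's frequency points used here is an assumption.

```python
import numpy as np

def normalized_power(p_signal, p_min):
    """Denoised power P_voice(λ, k) = P_signal(λ, k) - P_signal,min(λ, k),
    clipped at zero and then scaled into [0, 1] (assumed normalization)."""
    p_voice = np.maximum(p_signal - p_min, 0.0)
    peak = p_voice.max()
    return p_voice / peak if peak > 0 else p_voice
```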
S212: when the normalized power of each frequency point in the first frequency spectrum is in a preset power range, marking a first label for the corresponding frequency point.
S213: when the normalized power of each frequency point in the first frequency spectrum is not in the preset power range, marking a second label for the corresponding frequency point.
S214: and generating a tag sequence of each frequency point in the first frequency spectrum.
The preset power range may be obtained by counting a large amount of known voice data and mute data. The first tag is used for representing that the first frame data belongs to voice at a corresponding frequency point. The second tag is used for representing that the first frame data belongs to silence on the corresponding frequency point.
After the normalized power of each frequency point in the first frequency spectrum is obtained, whether the normalized power of each frequency point is located in a corresponding preset power range or not is determined. If the frequency point is determined to be located, and the frequency point is indicated to belong to the voice, the frequency point can be marked as 1. If the frequency point is determined not to be located, which means that the frequency point is not in voice and possibly in silence, the frequency point can be marked as 0.
Of course, other types of labels may be used to distinguish whether the frequency bins belong to speech. The specific type of marking is not limited herein.
After the frequency points in the first spectrum have been marked, a tag sequence of the form [i_1, i_2, ..., i_n] is obtained, where i_1, i_2, ..., i_n each take the value 0 or 1 and n is the number of marked frequency points. In general, regardless of how many frequency points the spectrum contains, the first half of all the frequency points are marked.
S215: and determining the weight corresponding to each frequency point according to the normalized power of each frequency point in the first frequency spectrum, wherein the weight is positively correlated with the normalized power.
Generally, the frequency of voice is concentrated between 300 Hz and 3400 Hz, and when the sampling frequency of the audio data is 16000 Hz, the spectrum after the Fourier transform covers the range 0 Hz to 8000 Hz. That is, the frequencies of voice-type data are mainly distributed over the first 128 frequency points. In addition, different data have different power at different frequency points: the higher the power of some data at a given frequency point, the more important that frequency point is for judging whether the data is voice. In other words, different frequency points contribute differently to the final judgment of whether the data is voice data. Therefore, the weights [w_1, w_2, ..., w_n] corresponding to the frequency points can be determined according to the normalized power of each frequency point in the first spectrum, where n is the number of frequency points.
Specifically, the greater the normalized power of a corresponding frequency point in the first spectrum, the greater the weight corresponding to the corresponding frequency point. Similarly, the smaller the normalized power of the corresponding frequency point in the first frequency spectrum, the smaller the weight corresponding to the corresponding frequency point.
S216: and carrying out weighted average on the tag sequence of the first frequency spectrum and the weight of each frequency point in the first frequency spectrum to obtain the voice confidence coefficient of the first frame data corresponding to the first frequency spectrum.
Specifically, the voice confidence of the first frame data can be calculated by the following formula (7).
C = (ω_1·i_1 + ω_2·i_2 + ... + ω_n·i_n) / (ω_1 + ω_2 + ... + ω_n)    Formula (7)
where C denotes the voice confidence, [ω_1, ω_2, ..., ω_n] denotes the weights of the frequency points in the first spectrum, [i_1, i_2, ..., i_n] denotes the tag sequence of the first spectrum, and n denotes the number of frequency points.
S217: when the voice confidence coefficient of the first frame data is larger than the preset voice confidence coefficient, determining that the type of the first frame data is voice.
S218: and when the voice confidence coefficient of the first frame data is smaller than or equal to the preset voice confidence coefficient, determining that the type of the first frame data is mute.
In practical application, the preset voice confidence level can be set according to practical requirements. For example: 0.4, 0.5, 0.6, etc. When the voice data is required to be more comprehensively obtained from the audio data, the preset voice confidence level can be set lower; when it is desired to more accurately acquire voice data from audio data, the preset voice confidence level may be set higher. Specific values of the preset voice confidence are not limited herein.
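Steps S212 to S218 can be pulled together in the following illustrative sketch; using the normalized power itself as the weight and 0.5 as the preset voice confidence are example choices, not values fixed by the patent.

```python
import numpy as np

def frame_is_voice(p_norm, thr1, thr2, preset_confidence=0.5):
    """Label each frequency point (1 = within its preset power range, 0 = not),
    weight the labels by the normalized power, and threshold the resulting
    voice confidence. thr1 and thr2 are per-bin arrays of the preset range."""
    labels = ((p_norm >= thr1) & (p_norm <= thr2)).astype(float)   # tag sequence [i_1, ..., i_n]
    weights = p_norm                                               # weight grows with normalized power
    confidence = float((weights * labels).sum() / max(weights.sum(), 1e-12))
    return confidence > preset_confidence, confidence
```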
Fig. 4 is a schematic diagram of a voice detection architecture in an embodiment of the present application, and referring to fig. 4, for n frequency points of first frame data in a first frequency spectrum of a frequency domain, whether the n frequency points belong to voice is respectively determined, so as to generate a sequence of n frequency points. At the same time, the weights of the n frequency points are determined. Then, the voice confidence of the first frame data is calculated based on the weights of the n frequency points and the sequence of the n frequency points. And finally, judging a threshold value according to the voice confidence of the first frame data, and outputting a judging result, wherein the judging result is used for representing whether the first frame data belongs to the voice data or not.
Through steps S201 to S207, the noise estimate is continuously updated while the audio data is being processed. Through steps S208 to S218, prior knowledge of voice data (the normalized power range of voice at each frequency point, and which frequency points matter more for judging data as voice) is introduced when the threshold is determined, so that voice frames and mute frames in the audio data can be roughly separated. Only the voice frames are subsequently sent into the deep neural network for sound detection, which greatly reduces the amount of computation and improves detection efficiency while maintaining detection accuracy.
In practical applications, convolutional neural networks may be used to classify and identify target sounds and non-target sounds in speech data.
Taking convolutional neural network as an example, before target sound detection is performed by using the convolutional neural network, a training data set needs to be acquired, a network is built, and a training network is required. After the network is trained, the target sound in the voice data can be detected more accurately. The following describes the process of target sound detection in terms of acquiring a training data set, constructing a network, training the network, and predicting using the network.
First aspect: a training dataset is acquired.
In practice, the training data set may be generated from a subtitle-aligned movie corpus (SAM) for network training. Taking human voice recognition as an example, voice data in movies is obtained from the SAM; it may include pure voice, voice plus noise, voice plus music, and so on, giving about 50,000 samples. Non-human-voice data in the movies is also collected; it may include other sounds that appear in the movies without human voice, and so on, giving about 170,000 samples. Each sample is about 0.63 s long. In this way, the training data set is obtained.
After the training data set is obtained, it can be further processed. Specifically, each 0.63 s speech segment in the training data set is framed with a frame length of 25 ms and a frame shift of 10 ms, so each speech segment is divided into 64 speech frames. Then, for each speech frame, a 64-dimensional mel-frequency cepstral coefficient (MFCC) feature is extracted, so that each speech frame is expressed by its MFCC feature and each speech segment is represented by a 64 × 64 feature map.
Here, the MFCC is based on the auditory characteristics of the human ear, whose perception has a nonlinear correspondence with frequency; the MFCC feature is a spectral feature calculated using this relationship.
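As an illustration only, the 64 × 64 feature map of one 0.63 s segment could be produced as follows; the use of librosa and the exact window parameters are assumptions, since the patent does not name a toolkit.

```python
import librosa

def segment_features(clip, sr=16000):
    """64-dimensional MFCCs over 25 ms windows with a 10 ms shift,
    truncated to 64 frames, giving a 64 x 64 feature map per segment."""
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=64,
                                n_fft=400, hop_length=160)
    return mfcc[:, :64].T            # shape (frames, coefficients) ≈ (64, 64)
```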
Second aspect: and (5) building a network.
Specifically, an 8-layer deep residual network may be employed. Because a deep residual network alleviates the degradation problem of deep networks through residual learning, using it for target voice recognition yields better results.
Third aspect: the network is trained.
Each voice segment in the data set can be substituted into the built deep neural network for training.
Specifically, the AdamW optimization algorithm may be used to train the network until its parameters reach their optimum. A learning-rate warm-up technique is also adopted during training. The weights of the network are randomly initialized at the start of training, so choosing a large learning rate immediately may make the network unstable (oscillate). With learning-rate warm-up, the learning rate is slowly increased from a small initial value to the larger target value at the beginning of training or over some training steps, so the network can stabilize slowly under the small warm-up learning rate. Once the network is relatively stable, training with the larger learning rate speeds up convergence and further improves the network's prediction performance.
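A hypothetical sketch of this training setup (PyTorch is assumed, as are the base learning rate and the number of warm-up steps) is given below:

```python
import torch

def make_optimizer(model, base_lr=1e-3, warmup_steps=2000):
    """AdamW with a linear learning-rate warm-up from ~0 up to base_lr."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler      # call scheduler.step() once per training step
```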
Fourth aspect: the predictions are made using a network.
When the network predicts, the whole voice data is not predicted at one time, but the voice data is split into a plurality of voice fragments according to the preset frame length and frame shift, and the prediction is performed on each frame in each voice fragment. The following describes the network prediction process in detail by taking two adjacent voice segments in the voice data, namely, the first voice segment and the second voice segment as examples.
S219: and inputting the first voice segment in the voice data into the deep neural network to obtain the prediction results of the first voice frame and the second voice frame in the first voice segment.
S220: and inputting the second voice segment in the voice data into the deep neural network to obtain the prediction results of the second voice frame and the third voice frame in the second voice segment.
S221: and determining whether the second voice frame is from the target according to the prediction result of the second voice frame in the first voice segment and the prediction result of the second voice frame in the second voice segment.
The second voice frame in the first voice segment and the second voice frame in the second voice segment are the same frame in the voice data. The prediction results are used to characterize whether the corresponding speech frame is from the target.
That is, the voice data includes a plurality of voice segments, each of which includes a plurality of voice frames, and adjacent voice segments are partially overlapped on some of the voice frames. Fig. 5 is a schematic structural diagram of voice data in an embodiment of the present application, referring to fig. 5, in the voice data, there are a plurality of partially overlapped voice segments, for a certain frame in the voice data, for example: a current frame. There are a plurality of speech segments each containing a current frame. When determining whether the current frame is from the target, it is necessary to predict whether the current frame is from the target from a plurality of voice segments including the current frame, and then based on the prediction result of the current frame in the plurality of voice segments including the current frame, it is finally determined whether the current frame is from the target. For example: in the voice data, the duration of 0.63s is taken as a voice segment, the duration of 25ms is taken as a voice frame, when determining whether the current frame is from a target, the prediction of whether the current frame is from the target is needed to be carried out in 63 voice segments respectively, and then whether the current frame is from the target is finally determined based on the prediction result of the current frame in the 63 voice segments.
Generally, in practical applications, the number of speech frames in the first speech segment may be 64, i.e. the first speech frame and the second speech frame represent 64 speech frames. Accordingly, the number of speech frames in the second speech segment may also be 64, i.e. the second speech frame and the third speech frame represent 64 speech frames. Thus, after the first voice segment in the voice data is input into the deep neural network, the prediction results of 64 voice frames in the first voice segment can be obtained. And inputting the second voice segment in the voice data into the deep neural network, so that the prediction result of 64 voice frames in the second voice segment can be obtained. Since the voice frames in the first voice segment partially overlap with the voice frames in the second voice segment, it is finally determined whether the voice frame is from the target according to the prediction result of the 63 voice segments on the voice frame of one frame. And further, the voice frame data from the target, which is determined from the voice data, is taken as the voice data of the target.
For the deep neural network, after receiving a speech segment, the prediction results of all speech frames in the speech segment can be output. The prediction result of a certain speech frame in the speech segment is also the prediction result of the speech frame in other speech segments. Therefore, all the prediction results of the voice frame are obtained from each voice segment, and whether the voice frame belongs to the target is determined according to all the prediction results of the voice frame, so that the accuracy of voice frame prediction can be improved, and the accuracy of target sound detection can be further improved.
Specifically, among all the prediction results for a speech frame there are at most two kinds of result: belonging to the target and not belonging to the target. Therefore, the number of occurrences of each result can be counted, and the result with the larger count is taken as the final prediction result of the speech frame.
In practical application, after the voice segment is input into the deep neural network, the deep neural network can output the prediction result of each voice frame in the voice segment. The prediction result may be represented by 0 and 1, or may be represented by probability. When the prediction result is represented by 0 and 1, 0 is output, which may indicate that the current speech frame does not belong to the target, and 1 is output, which may indicate that the current speech frame belongs to the target. When the prediction result is represented by a probability, the probability value represents the probability that the current frame belongs to the target. Of course, the predicted result of the network may also be represented by other means, which are not particularly limited herein.
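The aggregation of per-segment predictions into a per-frame decision can be illustrated with the following sketch (a simple majority vote over the overlapping segments, with array shapes and names chosen only for illustration):

```python
import numpy as np

def aggregate_frame_votes(segment_preds, segment_starts, n_frames):
    """segment_preds: one 0/1 prediction array per speech segment (64 entries each);
    segment_starts: index of each segment's first frame within the speech data.
    Returns a 0/1 decision per frame: 1 if the majority of covering segments
    predicted that the frame comes from the target."""
    votes_target = np.zeros(n_frames)
    votes_total = np.zeros(n_frames)
    for preds, start in zip(segment_preds, segment_starts):
        idx = np.arange(start, start + len(preds))
        votes_target[idx] += preds
        votes_total[idx] += 1
    return (2 * votes_target > votes_total).astype(int)
```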
Fig. 6 is a schematic flow chart of target voice recognition on audio data in the embodiment of the application, referring to fig. 6, after the audio data is acquired, firstly, silence detection is performed on the audio data through noise estimation; then, eliminating the mute frame in the audio data to obtain a voice frame; then, generating a plurality of voice fragments based on the voice frames; then, inputting a plurality of voice fragments into a deep neural network to detect target sound; finally, the prediction result of each voice frame is obtained, namely whether each voice frame belongs to the target or not.
S222: and taking the data from the target determined from the voice data as the target sound data.
After determining whether each voice frame in the voice data belongs to the target through the deep neural network, the voice data of the target can be extracted from the voice data.
Based on the same inventive concept, as an implementation of the method, the embodiment of the application further provides a sound detection device. Fig. 7 is a schematic structural diagram of a sound detection device according to an embodiment of the present application, and referring to fig. 7, the device may include:
an acquisition module 701, configured to acquire audio data to be detected.
A determining module 702 is configured to determine a type of each frame of data in the audio data, where the type includes speech and silence.
And a prediction module 703, configured to input, in the audio data, speech data corresponding to a frame belonging to a speech type into a deep neural network, so as to obtain sound data belonging to a target.
Further, as a refinement and expansion of the device shown in fig. 7, the embodiment of the application further provides a sound detection device. Fig. 8 is a schematic diagram of a second structure of a sound detection apparatus according to an embodiment of the present application, and referring to fig. 8, the apparatus may include:
An acquisition module 801, configured to acquire audio data to be detected.
A smoothing module 802, configured to smooth the power of each frequency point in the first spectrum to obtain the smoothed power of each frequency point in the first spectrum.
The first preset module 803 is specifically configured to:
constructing a probability distribution model of each frequency point in the first spectrum;
integrating the probability curve in the probability distribution model to obtain the two power values corresponding to a preset probability in the probability distribution model;
and taking the two power values as the preset power range of the corresponding frequency point.
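A hedged sketch of one way to derive such a preset power range follows: build a histogram of the powers observed at one frequency point, numerically integrate it, and take the interval whose enclosed probability mass equals a preset value. The symmetric-tail choice, the 0.9 mass and the bin count are assumptions rather than values from the disclosure.

import numpy as np

def preset_power_range(observed_powers, preset_prob=0.9, n_bins=100):
    """observed_powers: 1-D array of powers observed at one frequency point;
    returns the two power values bounding the preset probability mass."""
    hist, edges = np.histogram(observed_powers, bins=n_bins, density=True)
    cdf = np.cumsum(hist * np.diff(edges))      # numerical integral of the probability curve

    tail = (1.0 - preset_prob) / 2.0            # symmetric tails (an assumption)
    lo = edges[np.searchsorted(cdf, tail)]
    hi = edges[min(np.searchsorted(cdf, 1.0 - tail), n_bins)]
    return lo, hi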
The second preset module 804 is specifically configured to:
when the power of the target frequency point in the first frequency spectrum is larger than the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is calculated from the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum and the minimum power of the corresponding frequency point in the second frequency spectrum, where the second frame data corresponding to the second frequency spectrum is the frame immediately preceding the first frame data in the audio data.
When the power of the target frequency point in the first frequency spectrum is smaller than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.
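The update rule below is one common nonlinear minimum-statistics tracker that exhibits the dependence described above (current power, previous-frame power and previous minimum); the smoothing constants gamma and beta are illustrative assumptions, not values taken from the application.

def update_min_power(p_cur, p_prev, p_min_prev, gamma=0.998, beta=0.96):
    """p_cur: power of the frequency point in the first spectrum (current frame);
    p_prev: power of the corresponding point in the second spectrum (previous frame);
    p_min_prev: tracked minimum of that point in the previous frame."""
    if p_cur > p_min_prev:
        # The minimum rises slowly, driven by the frame-to-frame power change.
        return gamma * p_min_prev + ((1.0 - gamma) / (1.0 - beta)) * (p_cur - beta * p_prev)
    # Otherwise the minimum follows the current power immediately.
    return p_cur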
A weight module 805, configured to determine weights corresponding to the frequency points according to the normalized power of the frequency points in the first spectrum, where the weights are positively related to the normalized power.
The determination module 806 includes:
an acquisition unit 8061 for acquiring first frame data in the audio data.
The transforming unit 8062 is configured to perform fourier transform on the first frame data to obtain a first frequency spectrum, where the first frequency spectrum includes a plurality of frequency points.
A calculating unit 8063, configured to calculate power of each frequency point in the first spectrum.
The calculating unit 8063 is specifically configured to calculate the power of preset frequency points in the first spectrum, where the preset frequency points are the frequency points in the first half of the first spectrum.
A determining unit 8064, configured to determine a type of the first frame data based on the power of each frequency point in the first spectrum.
The determining unit 8064 is specifically configured to:
subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first spectrum to obtain the denoising signal power of each frequency point, wherein the minimum power is obtained based on nonlinear tracking of the frequency point;
normalizing the denoising signal power of each frequency point to obtain the normalized power of each frequency point;
and determining the type of the first frame data based on the normalized power of each frequency point in the first spectrum.
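The following sketch combines the operations of units 8062-8064 for a single frame: a Fourier transform, per-point power over the first half of the spectrum, subtraction of the tracked minimum, and normalization. Normalizing by the frame's maximum denoised power is an assumption; the text only requires some normalization step.

import numpy as np

def frame_normalized_power(frame, min_power):
    """frame: 1-D array of time-domain samples; min_power: tracked minimum
    power per frequency point (same length as the rfft output)."""
    spectrum = np.fft.rfft(frame)                 # first half of the frequency points
    power = np.abs(spectrum) ** 2                 # power of each frequency point
    denoised = np.maximum(power - min_power, 0.0) # subtract the tracked minimum
    peak = denoised.max()
    return denoised / peak if peak > 0 else denoised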
The determining unit 8064 is specifically configured to:
when the normalized power of each frequency point in the first frequency spectrum is in a preset power range, marking a first label for the corresponding frequency point, wherein the first label is used for representing that the first frame data belongs to voice on the corresponding frequency point.
When the normalized power of each frequency point in the first frequency spectrum is not in the preset power range, marking a second label for the corresponding frequency point, wherein the second label is used for representing that the first frame data belongs to silence on the corresponding frequency point.
And generating a tag sequence of each frequency point in the first frequency spectrum.
The type of the first frame data is determined based on the tag sequence.
The determining unit 8064 is specifically configured to:
and carrying out weighted average on the tag sequence and the weight of the corresponding frequency point to obtain the voice confidence coefficient of the first frame data corresponding to the first frequency spectrum.
And when the voice confidence coefficient of the first frame data is larger than the preset voice confidence coefficient, determining that the type of the first frame data is voice.
And when the voice confidence coefficient of the first frame data is smaller than or equal to the preset voice confidence coefficient, determining that the type of the first frame data is mute.
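A short sketch of the labelling and weighted-average decision just described: frequency points whose normalized power lies within their preset range are tagged 1 (speech), the others 0 (silence), and the tag sequence is averaged with weights that grow with normalized power. Using the normalized power itself as the weight and 0.5 as the preset voice confidence are assumptions for illustration.

import numpy as np

def classify_frame(norm_power, range_low, range_high, conf_threshold=0.5):
    """norm_power, range_low, range_high: arrays with one entry per frequency
    point; returns 'speech' or 'silence' for the frame."""
    # Tag sequence: 1 where the normalized power lies in the preset range.
    tags = ((norm_power >= range_low) & (norm_power <= range_high)).astype(float)
    # Weights are positively related to normalized power, renormalized to sum to 1.
    weights = norm_power / (norm_power.sum() + 1e-12)
    confidence = float(np.dot(tags, weights))      # weighted average of the tags
    return "speech" if confidence > conf_threshold else "silence"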
The prediction module 807 includes:
the first prediction unit 8071 is configured to input a first speech segment in the speech data into the deep neural network, and obtain a prediction result of a first speech frame and a second speech frame in the first speech segment, where the prediction result is used to characterize whether the corresponding speech frame is from the target.
The second prediction unit 8072 is configured to input a second speech segment in the speech data into the deep neural network to obtain a prediction result of a second speech frame and a third speech frame in the second speech segment, where the second speech frame in the first speech segment and the second speech frame in the second speech segment are the same frame in the speech data.
A target prediction unit 8073, configured to determine whether the second speech frame is from the target according to a prediction result of the second speech frame in the first speech segment and a prediction result of the second speech frame in the second speech segment.
An extraction unit 8074, configured to take the data determined to be from the target in the voice data as the sound data of the target.
It should be noted here that the description of the above device embodiments is similar to the description of the method embodiments above and has similar advantageous effects. For technical details not disclosed in the device embodiments of the present application, please refer to the description of the method embodiments of the present application.
Based on the same inventive concept, the embodiment of the application also provides electronic equipment. Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and referring to fig. 9, the electronic device may include: a processor 901, a memory 902, a bus 903; the processor 901 and the memory 902 complete communication with each other through the bus 903; the processor 901 is operative to invoke program instructions in the memory 902 to perform the methods in one or more embodiments described above.
It should be noted here that the description of the above electronic device embodiment is similar to the description of the method embodiments above and has similar advantageous effects. For technical details not disclosed in the electronic device embodiment of the present application, please refer to the description of the method embodiments of the present application.
Based on the same inventive concept, embodiments of the present application also provide a computer-readable storage medium, which may include: a stored program; wherein the program, when executed, controls a device in which the storage medium resides to perform the method of one or more embodiments described above.
It should be noted here that the description of the above storage medium embodiment is similar to the description of the method embodiments above and has similar advantageous effects. For technical details not disclosed in the storage medium embodiment of the present application, please refer to the description of the method embodiments of the present application.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A sound detection method, the method comprising:
acquiring audio data to be detected;
determining a type of each frame of data in the audio data, the type comprising speech and silence;
inputting voice data corresponding to frames belonging to voice types in the audio data into a deep neural network to obtain voice data belonging to a target;
wherein the determining the type of each frame of data in the audio data includes:
acquiring first frame data in the audio data;
performing Fourier transform on the first frame data to obtain a first frequency spectrum, wherein the first frequency spectrum comprises a plurality of frequency points;
calculating the power of each frequency point in the first frequency spectrum;
determining the type of the first frame data based on the power of each frequency point in the first frequency spectrum;
Wherein the determining the type of the first frame data based on the power of each frequency point in the first spectrum includes:
subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first frequency spectrum to obtain the denoising signal power of each frequency point; the minimum power is obtained based on nonlinear tracking of frequency points;
normalizing the denoising signal power of each frequency point to obtain normalized power of each frequency point;
determining the type of the first frame data based on the normalized power of each frequency point in the first frequency spectrum;
wherein the determining the type of the first frame data based on the normalized power of each frequency point in the first spectrum includes:
when the normalized power of each frequency point in the first frequency spectrum is in a preset power range, marking a first label for the corresponding frequency point, wherein the first label is used for representing that the first frame data belongs to voice on the corresponding frequency point;
when the normalized power of each frequency point in the first frequency spectrum is not in the preset power range, marking a second label for the corresponding frequency point, wherein the second label is used for representing that the first frame data belongs to silence on the corresponding frequency point;
Generating a tag sequence of each frequency point in the first frequency spectrum;
the type of the first frame data is determined based on the tag sequence.
2. The method of claim 1, wherein prior to determining whether the normalized power of each frequency point in the first spectrum is within the preset power range, the method further comprises:
constructing a probability distribution model of each frequency point in the first frequency spectrum;
integrating the probability curve in the probability distribution model to obtain two power values corresponding to a preset probability in the probability distribution model;
and taking the two power values as preset power ranges of corresponding frequency points.
3. The method of claim 1, wherein prior to said determining the type of the first frame data based on the tag sequence, the method further comprises:
determining weights corresponding to all frequency points according to the normalized power of all frequency points in the first frequency spectrum, wherein the weights are in positive correlation with the normalized power;
the determining the type of the first frame data based on the tag sequence includes:
carrying out weighted average on the tag sequence and the weight of the corresponding frequency point to obtain the voice confidence coefficient of the first frame data corresponding to the first frequency spectrum;
When the voice confidence coefficient of the first frame data is larger than a preset voice confidence coefficient, determining that the type of the first frame data is voice;
and when the voice confidence coefficient of the first frame data is smaller than or equal to the preset voice confidence coefficient, determining that the type of the first frame data is mute.
4. The method of claim 1, wherein prior to said subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first spectrum, the method further comprises:
when the power of the target frequency point in the first frequency spectrum is larger than the minimum power of the corresponding frequency point in the second frequency spectrum, calculating the minimum power of the target frequency point according to the power of the target frequency point in the first frequency spectrum, the power of the corresponding frequency point in the second frequency spectrum and the minimum power of the corresponding frequency point in the second frequency spectrum, wherein the second frame data corresponding to the second frequency spectrum is the frame immediately preceding the first frame data in the audio data;
when the power of the target frequency point in the first frequency spectrum is smaller than or equal to the minimum power of the corresponding frequency point in the second frequency spectrum, the minimum power of the target frequency point is the power of the target frequency point in the first frequency spectrum.
5. The method of claim 4, wherein prior to determining whether the power of a target frequency point in the first frequency spectrum is less than the minimum power of a corresponding frequency point in the second frequency spectrum, the method further comprises:
and carrying out smoothing treatment on the power of each frequency point in the first frequency spectrum to obtain the smoothed power of each frequency point in the first frequency spectrum.
6. The method according to any one of claims 1 to 5, wherein inputting the voice data corresponding to the frames belonging to the voice type in the audio data into the deep neural network, to obtain the voice data belonging to the target, comprises:
inputting a first voice segment in the voice data into the deep neural network to obtain a prediction result of a first voice frame and a second voice frame in the first voice segment, wherein the prediction result is used for representing whether the corresponding voice frame is from the target;
inputting a second voice segment in the voice data into the deep neural network to obtain a prediction result of a second voice frame and a third voice frame in the second voice segment, wherein the second voice frame in the first voice segment and the second voice frame in the second voice segment are the same frame in the voice data;
Determining whether the second voice frame is from the target according to the prediction result of the second voice frame in the first voice segment and the prediction result of the second voice frame in the second voice segment;
and taking the data determined to be from the target in the voice data as the voice data of the target.
7. A sound detection device, the device comprising:
the acquisition module is used for acquiring the audio data to be detected;
a determining module, configured to determine a type of each frame of data in the audio data, where the type includes speech and silence;
the prediction module is used for inputting voice data corresponding to frames belonging to voice types in the audio data into a deep neural network to obtain voice data belonging to a target;
wherein, the determination module includes:
an acquisition unit configured to acquire first frame data in the audio data;
the transformation unit is used for carrying out Fourier transformation on the first frame data to obtain a first frequency spectrum, wherein the first frequency spectrum comprises a plurality of frequency points;
the calculating unit is used for calculating the power of each frequency point in the first frequency spectrum;
a determining unit, configured to determine a type of the first frame data based on power of each frequency point in the first spectrum;
Wherein, the determining unit is specifically configured to: subtracting the minimum power of the corresponding frequency point from the power of each frequency point in the first spectrum to obtain the denoising signal power of each frequency point, wherein the minimum power is obtained based on nonlinear tracking of the frequency point; normalizing the denoising signal power of each frequency point to obtain normalized power of each frequency point; determining the type of the first frame data based on the normalized power of each frequency point in the first frequency spectrum;
wherein, the determining unit is specifically configured to: when the normalized power of each frequency point in the first frequency spectrum is in a preset power range, marking a first label for the corresponding frequency point, wherein the first label is used for representing that the first frame data belongs to voice on the corresponding frequency point; when the normalized power of each frequency point in the first frequency spectrum is not in the preset power range, marking a second label for the corresponding frequency point, wherein the second label is used for representing that the first frame data belongs to silence on the corresponding frequency point; generating a tag sequence of each frequency point in the first frequency spectrum; the type of the first frame data is determined based on the tag sequence.
CN202111067585.5A 2021-09-13 2021-09-13 Voice detection method and device Active CN113744730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111067585.5A CN113744730B (en) 2021-09-13 2021-09-13 Voice detection method and device

Publications (2)

Publication Number Publication Date
CN113744730A CN113744730A (en) 2021-12-03
CN113744730B (en) 2023-09-08

Family

ID=78738245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111067585.5A Active CN113744730B (en) 2021-09-13 2021-09-13 Voice detection method and device

Country Status (1)

Country Link
CN (1) CN113744730B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067340B (en) * 2022-01-17 2022-05-20 山东北软华兴软件有限公司 Intelligent judgment method and system for information importance
CN116469413B (en) * 2023-04-03 2023-12-01 广州市迪士普音响科技有限公司 Compressed audio silence detection method and device based on artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017083566A (en) * 2015-10-26 2017-05-18 日本電信電話株式会社 Noise suppression device, noise suppression method, and program
CN109036470A (en) * 2018-06-04 2018-12-18 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN112712823A (en) * 2020-12-23 2021-04-27 深圳壹账通智能科技有限公司 Detection method, device and equipment of trailing sound and storage medium
CN112700789A (en) * 2021-03-24 2021-04-23 深圳市中科蓝讯科技股份有限公司 Noise detection method, nonvolatile readable storage medium and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Endpoint detection algorithm for noisy speech based on time-frequency combination"; Wang Yang et al.; Journal of Natural Science of Heilongjiang University; Vol. 33, No. 3; pp. 410-415 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 101, floor 1, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing 100176

Applicant after: Beijing yisiwei Computing Technology Co.,Ltd.

Address before: Room 101, floor 1, building 3, yard 18, Kechuang 10th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing 100176

Applicant before: Beijing yisiwei Computing Technology Co.,Ltd.

GR01 Patent grant