CN114254685A

CN114254685A - Training method and device of sound detection model and detection method of sound event

Info

Publication number: CN114254685A
Application number: CN202011011003.7A
Authority: CN
Inventors: 冯祺徽; 曹海涛
Original assignee: Midea Group Co Ltd
Current assignee: Midea Group Co Ltd
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2022-03-29

Abstract

The invention provides a training method and a device of a sound detection model and a detection method of a sound event, wherein the training method of the sound detection model comprises the following steps: acquiring a training sound signal, carrying out feature extraction on the training sound signal, and establishing a two-dimensional feature map training set; and importing the neural network model into a two-dimensional characteristic diagram training set, and training the neural network model through a loss function based on the hidden Markov model to obtain a sound detection model. According to the embodiment of the invention, the loss function defined based on the hidden Markov model is beneficial to identifying the specific event by sound event detection, and timely responding to the event without personnel intervention, so that sound detection is not dependent on experienced workers, on one hand, the detection efficiency is improved, the detection threshold is reduced, and on the other hand, the auditory sense of people is not damaged.

Description

Training method and device of sound detection model and detection method of sound event

Technical Field

The present invention relates to the field of sound detection technologies, and in particular, to a training method for a sound detection model, a training apparatus for a sound detection model, a detection method for a sound event, a detection apparatus for a sound event, a computer device, a computer-readable storage medium, and an electronic device.

Background

In the related art, sound quality inspection of products belongs to an important link in factory production. Whether quality problems such as screw not tightened, heating pipe not block in support appear can be judged through artifical monitoring sound. However, the manual monitoring method is inefficient and easily damages the human hearing.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art or the related art.

To this end, the first aspect of the present invention provides a training method for a sound detection model.

The second aspect of the present invention provides a training apparatus for a sound detection model.

A third aspect of the invention proposes a method of detecting a sound event.

A fourth aspect of the invention proposes a detection device of a sound event.

A fifth aspect of the invention provides a computer apparatus.

A sixth aspect of the invention is directed to a computer-readable storage medium.

A seventh aspect of the present invention provides an electronic device.

In view of the above, a first aspect of the present invention provides a method for training a voice detection model, including: acquiring a training sound signal, carrying out feature extraction on the training sound signal, and establishing a two-dimensional feature map training set; and importing the neural network model into a two-dimensional characteristic diagram training set, and training the neural network model through a loss function based on the hidden Markov model to obtain a sound detection model.

In the technical scheme, a neural network model capable of automatically realizing sound event detection, namely a sound detection model, is trained, so that automatic sound event detection is favorably realized, and the efficiency of sound event detection is improved.

Specifically, a pre-stored training sound signal is obtained, the training sound signal is audio data containing multiple sound "events", and different training sound signals can be selected for different application scenarios.

Then, feature extraction is performed on the pre-stored training sound signals through a feature extraction method, so that one-dimensional sound information is converted into a two-dimensional feature map, and a two-dimensional feature map training set is formed. During training, a two-dimensional feature map training set after feature extraction is introduced into a preset neural network model, and the neural network model is trained through a loss function defined based on a hidden Markov model, so that a trained sound detection model is finally obtained. The overall length of the sound fragment can be judged based on the action of the loss function defined by the hidden Markov model, so that whether the duration of the sound fragment corresponds to the predicted event or not is judged, and the probability of the event occurrence in the sound fragment is weighted by the duration of the event, so that the trained sound detection model can detect the sound event more accurately.

The "event" may be an accident sound event caused by a production accident, such as an accident sound of falling of an article, glass breakage, or a specific quality problem of a produced product, such as a quality problem of a screw not tightened (when the screw is not tightened, a specific noise may occur during a test operation of the product), a quality problem of a heating pipe not clamped in a bracket (when the heating pipe is loosened, a specific noise may occur due to movement of the heating pipe), or other sounds occurring in a normal production activity of a factory, such as a talk between persons, a noise during normal operation of a machine, a down bell or other warning sounds, and the specific type of the "event" in a sound signal is not limited in the embodiment of the present invention.

According to the embodiment of the invention, the neural network model is trained on the basis of the loss function defined by the hidden Markov model, and the sound detection model obtained by training through the method can realize accurate sound event detection on places such as factories, streets, cells and the like, is beneficial to identifying the specific event by sound event detection, and timely handling the event without personnel intervention, so that the sound detection is not dependent on experienced workers, the detection efficiency is improved, the detection threshold is reduced, and the auditory sense of people is not damaged.

In addition, the training method of the voice detection model in the above technical solution provided by the present invention may further have the following additional technical features:

in the above technical solution, the feature extraction is performed on the sound signal, and specifically includes: framing the training sound signal to obtain a sample frame; and windowing the sample frame through a first window function, and extracting the characteristics of the windowed sample frame to obtain a two-dimensional characteristic map training set.

In the technical scheme, when the characteristics of sound are extracted, firstly, the sound signal is subjected to framing processing to obtain sample frames with a certain length, wherein each sample frame is a sound segment, and a part of overlap can exist between two adjacent sample frames. The frame length (the duration of the sound sample) and the overlap length of the sample frame may be adjusted according to a scene of a specific application for which the sound detection model is trained, for example, the sample frame may be 40ms long, and two adjacent sample frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

After a plurality of sample frames are obtained by a framing method, windowing is further performed on the obtained sample frames through a preset first window function. The window function may be a rectangular window or a non-rectangular window, such as a hamming window, a hanning window, and the like, and the specific type of the window function is not limited in the embodiment of the present invention.

After windowing, feature extraction is further performed on the windowed sample frame through a feature extraction method, a two-dimensional feature map training set is further obtained, the two-dimensional feature map is used as a training material of the neural network model, the training speed of the sound detection model is favorably accelerated, and the recognition accuracy of the finally obtained sound detection model on the sound event can be improved.

In any of the above technical solutions, training a neural network model by using a loss function based on a hidden markov model specifically includes: inputting a sample frame output by the neural network model into a target loss function, wherein the sample frame is the posterior probability of an event frame; and obtaining a current loss value output by the target loss function, and continuously training the neural network model by taking the target loss value range as a target until the current loss value falls into the target loss value range.

In the technical scheme, when a preset neural network model is trained through a two-dimensional feature map training set, the two-dimensional feature map training set is input into the neural network model, and the neural network model outputs a specific result, specifically, the posterior probability that an input sample frame is an event frame, according to the input two-dimensional feature map training set.

Here, if an "event" occurs in a sound segment (a sample frame), the sound segment (sample frame) is determined as an event frame, and accordingly, the higher the posterior probability that a sample frame is an event frame is, the more likely the "event" occurs in the sample frame.

After the posterior probability output by the neural network model is received, the value of the posterior probability is input into a loss function based on the hidden Markov model, and the loss value of the current neural network model can be obtained through the loss function. The loss value can represent the current prediction result of the neural network model, and the difference between the current prediction result and the actual result indicates that the prediction result of the neural network model is more accurate when the loss value is smaller.

After the current loss value of the current neural network model is obtained, the current loss value is compared with a preset target loss value, the neural network model is continuously trained by taking the target loss value as a target, and the prediction accuracy of the neural network model is improved by inputting more training sets. When the posterior probability predicted by the neural network model is obtained and the corresponding current loss value falls within the target loss value range, the prediction accuracy of the current neural network model can meet the requirement, and at the moment, the neural network model obtained by current training is stored and determined as the sound detection model serving as the target.

The embodiment of the invention trains the neural network model based on the loss function defined by the hidden Markov model, so that the finally obtained sound detection model judges whether the duration of a sound segment corresponds to the predicted event, and weights the probability of the occurrence of the event in a section of the sound segment by the duration of the event, thereby being beneficial to improving the accuracy of the sound detection model in recognizing the sound event.

In any of the above technical solutions, before training the preset neural network model through the target loss function, the training method of the acoustic detection model further includes: acquiring a preset initial loss function; and weighting the posterior probability of the sound event according to the continuous frame number of the event frame in any sound event output by the neural network model through the hidden state parameter of the hidden Markov model so as to obtain a loss function based on the hidden Markov model.

In the technical scheme, the loss function finally defined is specifically as follows:

where L is the value of the objective loss function, o_tFor the event state corresponding to the t-th frame, o_t＝(0，1)，

For the sequence of event states from frame 1 to frame T,

is a sequence of two-dimensional feature maps of the 1 st frame to the T th frame,

a posterior probability corresponding to the t-th frame, d_t-1The number of frames for which the event has continued at the time corresponding to the t-1 th frame, D (o)_t|d_t-1) Hidden state parameters of a hidden markov model.

Specifically, when o_tWhen 1, it indicates that the event occurs in the t-th frame, and when o_tWhen the value is 0, it means that no event occurs in the t-th frame. The neural network model is trained by taking the value of the loss function as a target, so that the neural network model can learn the length distribution of each type of event, the finally obtained sound detection model can distinguish and recognize each type of event more accurately, and the detection precision of the sound detection model is improved.

In any of the above technical solutions, the feature extraction performed on the sound signal specifically includes: a mel energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstrum coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma pass (gamma) cepstrum coefficient extraction method; and the neural network model specifically comprises: a convolutional-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden markov model, a gaussian mixture model, or a support vector machine.

In the technical scheme, according to different sound environments or particularly aiming at different sound event types, any one of a Mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a Mel cepstrum coefficient extraction method, a Bark domain energy spectrum extraction method, an Erb domain energy spectrum extraction method or a Gamma atom cepstrum coefficient extraction method is selected to convert one-dimensional sound signals into two-dimensional feature maps. The embodiment of the invention does not limit the specific type of the feature extraction method.

In order to adapt to various application environments, when the sound event is identified, any one of a convolution-recurrent neural network, a convolution neural network, a recurrent neural network, a hidden markov model, a gaussian mixture model or a support vector machine can be used as an applied neural network model, or multiple ones of the neural network models can be selected to form a 'multi-stage' neural network model. The embodiment of the present invention does not limit the specific form of the neural network model.

The second aspect of the present invention provides a training apparatus for a sound detection model, comprising: the signal processing module is used for acquiring a training sound signal, extracting characteristics of the sound signal and establishing a two-dimensional characteristic map training set; and the training module is used for leading a preset neural network model into a two-dimensional characteristic map training set, and training the preset neural network model through a target loss function to obtain a sound detection model.

In the technical scheme, a neural network model capable of automatically realizing sound event detection, namely a sound detection model, is trained through a training device of the sound detection model, so that automatic sound event detection is realized, and the efficiency of sound event detection is improved.

A third aspect of the present invention provides a method for detecting a sound event, including: determining at least one event frame in the sound signal to be detected by the sound detection model obtained by training through the training method of the sound detection model provided by any technical scheme; and determining a corresponding detection result according to the event frame, and outputting the detection result.

In the technical scheme, the voice signal to be detected is detected and identified by utilizing the trained voice detection model, so that automatic voice event detection is facilitated, and the efficiency of voice event detection is improved.

Specifically, the obtained signal to be detected is picked up by sound pickup equipment such as a microphone during the production of products in a factory, so as to obtain an analog signal or a digital signal of the sound during the production in the factory. Through the sound detection model, sound detection analysis can be carried out on a signal to be detected, and at least one event frame contained in the sound signal is further determined.

Determining a detection result through the event frame, such as: "16 detected: 00 to 16: 01 a screw loosening event occurred.

The voice detection model is obtained through training based on a loss function defined by a hidden Markov model, so that whether the duration of a voice segment corresponds to an event contained in the prediction of the voice segment can be judged, and the probability of the event occurring in a section of voice segment is weighted by the duration of the event, so that the voice event is detected more accurately.

According to the embodiment of the invention, the specific event is identified through sound event detection, timely response is made for the event, and meanwhile, personnel intervention is not needed, so that sound detection is not dependent on experienced workers, on one hand, the detection efficiency is improved, the detection threshold is reduced, and on the other hand, the auditory sense of people is not damaged.

In the above technical solution, determining at least one event frame in a sound signal to be detected specifically includes: carrying out feature extraction on a sound signal to be detected, and establishing a two-dimensional feature map data set; and importing the sound detection model into a two-dimensional feature map data set to obtain at least one event frame output by the sound detection model.

According to the technical scheme, when an event frame in the sound signal to be detected is determined through a sound detection model, data processing is firstly carried out on the sound signal to be detected. Specifically, firstly, feature extraction is performed on a sound signal to be detected through a feature extraction method, then a one-dimensional sound signal is converted into a two-dimensional feature map, and finally a corresponding two-dimensional feature map data set is established.

The established two-dimensional feature map data set is input into the sound detection model, the sound detection model can predict whether an event occurs in a corresponding sound segment according to the two-dimensional feature map data set, then outputs a corresponding event frame, accurate detection of the sound event is achieved, manual intervention is not needed in the process, and the problems that the auditory sense of a person is damaged and the like can be effectively avoided.

In any of the above technical solutions, acquiring at least one event frame output by the sound detection model specifically includes: determining a plurality of time frames corresponding to the two-dimensional feature map data set through a sound detection model; and respectively calculating the posterior probability of each time frame as an event frame through a solution space algorithm, and determining the time frame with the posterior probability higher than a probability threshold value as the event frame.

In this embodiment, a "time frame" is a sound segment, each two-dimensional feature map data set may include a plurality of such time frames, and the sound detection model may respectively predict the probability of occurrence of an "event" in each time frame, that is, the posterior probability of a time frame, specifically, an event frame. Specifically, each of the total time frames can be calculated separately by a solution space algorithm, being the posterior probability of the event frame.

When the posterior probability of an event frame is higher than the preset posterior probability threshold value, the event occurs in the sound segment, so that the corresponding time frame is marked as an event frame, and finally, one or more obtained event frames are normalized to form a final output result, which is beneficial to identifying the specific event by the detection result of the sound event and timely dealing with the event.

In any of the above technical solutions, calculating the posterior probability of each time frame as an event frame by a solution space algorithm, specifically includes: establishing a corresponding time frame sequence through a time frame, and establishing a solution space of an event sequence corresponding to the time frame sequence; in a solution space, solving the event sequence to obtain the posterior probability of each time frame as an event frame; and the sound event detection method further comprises: and determining an optimal solution sequence of the event sequence through a dynamic programming algorithm.

In the technical scheme, a solution space of the event sequence is firstly constructed, the solution space comprises all possible combinations of events on all time frames, and the size of the solution space is about 2^TKWhere T is the total length of the time frame sequence and K is the number of categories of all events.

Further, in the above solution space, for the event sequence

A solution is performed in which, among other things,

for the sequence of event states from frame 1 to frame T, o₁For the event state corresponding to frame 1, o_TFor the event state corresponding to the T-th frame when o_tWhen the frame is 1, the t-th frame is an event frame, and when the frame is o_tIf it is 0, the t-th frame is a non-event frame. Specifically, the following calculation formula can be used for solving:

further, the dynamic programming algorithm is utilized to carry out rapid solution search on the formula, and finally, the optimal solution sequence is obtained

I.e. the sequence of solutions that maximizes the objective function.

The objective function is specifically as follows:

since the number of solutions in the objective function is large, in order to simplify the calculation, the initial function is defined as:

and further defining a transfer function, and performing iterative solution on the initial function, wherein the transfer function specifically comprises:

wherein the content of the first and second substances,

o_tfor the event state corresponding to the t-th frame, o_t＝(0，1)，

For the sequence of event states from frame 1 to frame T,

a posterior probability corresponding to the t-th frame, d_t-1The number of frames for which the event has continued at the moment corresponding to the t-1 th frame，D(o_t|d_t-1) Hidden state parameters of a hidden markov model.

For the above dynamic programming algorithm, the time complexity is about KD²T, where K is the number of event classes, D is the maximum length of an event, and T is the length of the sequence, in some cases, to achieve parallel computation, the complexity may be compressed, e.g., to D²T。

The decision result that each time frame is the event frame is solved through a solution space algorithm, and whether each time frame is marked as the event frame in one sequence or not can be expressed through the decision result, so that the detection precision of the sound event is improved, and the detection efficiency is improved.

In any of the above technical solutions, feature extraction is performed on a sound signal to be detected, and a two-dimensional feature map data set is established, which specifically includes: framing a sound signal to be detected to obtain a signal frame; and windowing the signal frame through a second window function, and extracting the characteristics of the signal frame subjected to windowing through a characteristic extraction method to obtain a two-dimensional characteristic map data set.

In the technical scheme, when the characteristics of the sound signal to be detected are extracted, firstly, the sound signal to be detected is subjected to framing processing to obtain signal frames with a certain length, wherein each signal frame is a sound segment, and partial overlap can exist between two adjacent signal frames. The frame length (the duration of the sound sample) and the overlap length of the signal frames can be adjusted according to the actual application-specific scenario, for example, the signal frame may be 40ms long, and two adjacent signal frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

After obtaining a plurality of sample frames by a framing method, further performing windowing processing on the obtained signal frame by a preset second window function. The window function may be a rectangular window or a non-rectangular window, such as a hamming window, a hanning window, and the like, and the specific type of the window function is not limited in the embodiment of the present invention.

After windowing, feature extraction is further performed on the windowed signal frame through a feature extraction method, a two-dimensional feature map data set is further obtained, and the two-dimensional feature map data set is used as the input of a sound detection model, so that the speed and accuracy of sound event recognition are improved.

In any of the above technical solutions, the feature extraction method includes: a mel-frequency energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstral coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma-pass cepstral coefficient extraction method.

A fourth aspect of the present invention provides a device for detecting a sound event, comprising: the detection module is used for determining at least one event frame in the sound signal to be detected through the sound detection model obtained by training through the training method of the sound detection model provided in any technical scheme; and the output module is used for determining a corresponding detection result according to the event frame and outputting the detection result.

A fifth aspect of the present invention provides a computer apparatus comprising: a memory having a computer program stored thereon; a processor configured to implement the steps of the training method for the sound detection model provided in any one of the above technical solutions and/or the steps of the detection method for the sound event provided in any one of the above technical solutions when executing the computer program, so that the computer device includes all the beneficial effects of the training method for the sound detection model provided in any one of the above technical solutions and the detection method for the sound event provided in any one of the above technical solutions, which are not described herein again.

A sixth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when being executed by a processor, can implement the steps of the method for training a sound detection model provided in any one of the above technical solutions and/or the steps of the method for detecting a sound event provided in any one of the above technical solutions, and therefore, the computer-readable storage medium includes all the beneficial effects of the method for training a sound detection model provided in any one of the above technical solutions and the method for detecting a sound event provided in any one of the above technical solutions, which are not described herein again.

A seventh aspect of the present invention provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute, by the computer program, the steps of the training method for a sound detection model provided in any one of the above technical solutions, and/or the steps of the detection method for a sound event provided in any one of the above technical solutions, so that the electronic device includes all the beneficial effects of the training method for a sound detection model provided in any one of the above technical solutions and the detection method for a sound event provided in any one of the above technical solutions, which are not described herein again.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 shows one of the flow diagrams of a method of training a voice detection model according to an embodiment of the invention;

FIG. 2 illustrates a second flowchart of a method for training a voice detection model according to an embodiment of the present invention;

FIG. 3 shows a third flowchart of a method for training a voice detection model according to an embodiment of the invention;

FIG. 4 shows a fourth flowchart of a method of training a voice detection model according to an embodiment of the invention;

FIG. 5 is a block diagram of a training apparatus for a voice detection model according to an embodiment of the present invention;

FIG. 6 shows one of the flow diagrams of a method of detection of a sound event according to an embodiment of the invention;

FIG. 7 illustrates a second flow chart of a method of detecting a sound event according to an embodiment of the present invention;

FIG. 8 shows a third flowchart of a method of detecting a sound event according to an embodiment of the invention;

FIG. 9 shows a fourth flowchart of a method of detection of a sound event according to an embodiment of the invention;

FIG. 10 shows a fifth flowchart of a method of detection of a sound event according to an embodiment of the invention;

fig. 11 is a block diagram showing a configuration of a sound event detection apparatus according to an embodiment of the present invention;

FIG. 12 is a logic diagram illustrating the generation of a sound detection model according to an embodiment of the present invention;

FIG. 13 illustrates a correspondence of sound signals to sound event detection in accordance with an embodiment of the present invention;

FIG. 14 shows a block diagram of a computer device according to an embodiment of the invention;

fig. 15 shows a block diagram of an electronic apparatus according to an embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

A training method of a voice detection model, a training apparatus of a voice detection model, a detection method of a voice event, a detection apparatus of a voice event, a computer device, a computer-readable storage medium, and an electronic device according to some embodiments of the present invention are described below with reference to fig. 1 to 15.

Example one

Fig. 1 shows one of the flowcharts of the training method of the acoustic detection model according to the embodiment of the present invention, and in particular, the training method of the acoustic detection model may include the following steps:

102, acquiring a pre-stored training sound signal, and performing feature extraction on the pre-stored training sound signal to establish a two-dimensional feature map training set;

and 104, importing a preset neural network model into a two-dimensional feature map training set, and training the neural network model through a loss function based on the hidden Markov model to obtain a sound detection model.

In the embodiment of the invention, a neural network model capable of automatically realizing sound event detection, namely a sound detection model, is trained, so that the automatic sound event detection is favorably realized, and the sound event detection efficiency is improved.

Example two

Fig. 2 shows a second flowchart of a training method of a voice detection model according to an embodiment of the present invention, and specifically, the training method of the voice detection model may include the following steps:

step 202, framing the training sound signal to obtain a sample frame;

and 204, windowing the sample frame through a preset first window function, and extracting the characteristics of the windowed sample frame to obtain a two-dimensional characteristic map training set.

In the embodiment of the present invention, when performing feature extraction on a sound, firstly, a sound signal is subjected to framing processing to obtain sample frames with a certain length, where each sample frame is a sound segment, and a partial overlap may exist between two adjacent sample frames. The frame length (the duration of the sound sample) and the overlap length of the sample frame may be adjusted according to a scene of a specific application for which the sound detection model is trained, for example, the sample frame may be 40ms long, and two adjacent sample frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

EXAMPLE III

Fig. 3 shows a third flowchart of a training method of a voice detection model according to an embodiment of the present invention, and in particular, the training method of the voice detection model may include the following steps:

step 302, inputting the posterior probability of the sample frame as an event frame into a target loss function;

and step 304, obtaining a current loss value output by the target loss function, and continuously training a preset neural network model by taking the target loss value range as a target until the current loss value falls into the target loss value range.

In the embodiment of the invention, when a preset neural network model is trained through a two-dimensional feature map training set, the two-dimensional feature map training set is input into the neural network model, and the neural network model outputs a specific result, specifically, the posterior probability that an input sample frame is an event frame, according to the input two-dimensional feature map training set.

Example four

Fig. 4 shows a fourth flowchart of a training method of a voice detection model according to an embodiment of the present invention, and in particular, the training method of the voice detection model may include the following steps:

step 402, acquiring a preset initial loss function;

and step 404, weighting the posterior probability of the sound event according to the number of the continuous frames of the event frames in the sound event output by the neural network model through the hidden state parameters to obtain a loss function.

In the embodiment of the present invention, the final definition of the hidden markov model-based loss function is specifically as follows:

For the sequence of event states from frame 1 to frame T,

EXAMPLE five

In some embodiments of the present invention, the feature extraction is performed on the sound signal, and the sound signal may be subjected to feature extraction by a feature extraction method, where the preset feature extraction method includes: a mel energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstrum coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma pass (gamma) cepstrum coefficient extraction method; correspondingly, the neural network model specifically includes: a convolutional-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden markov model, a gaussian mixture model, or a support vector machine.

In the embodiment of the invention, according to different sound environments or particularly aiming at different sound event types, any one of a mel-energy spectrum feature extraction method, a short-time Fourier transform extraction method, a mel cepstrum coefficient extraction method, a Bark domain energy spectrum extraction method, an Erb domain energy spectrum extraction method or a Gamma-tone cepstrum coefficient extraction method is selected to convert a one-dimensional sound signal into a two-dimensional feature map. The embodiment of the invention does not limit the specific type of the feature extraction method.

EXAMPLE six

Fig. 5 shows a block diagram of a training apparatus for a voice detection model according to an embodiment of the present invention, specifically, the training apparatus 500 for a voice detection model includes: the signal processing module 502 is configured to acquire a training sound signal, perform feature extraction on the sound signal, and establish a two-dimensional feature map training set; the training module 504 is configured to import the preset neural network model into the two-dimensional feature map training set, train the preset neural network model through the target loss function, and obtain the sound detection model.

In the embodiment of the invention, a neural network model capable of automatically realizing sound event detection, namely a sound detection model, is trained through a training device of the sound detection model, so that the automatic sound event detection is realized, and the efficiency of the sound event detection is improved.

The signal processing module 502 is further configured to perform framing on the training sound signal to obtain a sample frame; and windowing the sample frame through a first window function, and extracting the characteristics of the windowed sample frame to obtain a two-dimensional characteristic map training set.

When extracting the characteristics of sound, firstly, framing the sound signal to obtain sample frames with a certain length, wherein each sample frame is a sound segment, and a part of overlap may exist between two adjacent sample frames. The frame length (the duration of the sound sample) and the overlap length of the sample frame may be adjusted according to a scene of a specific application for which the sound detection model is trained, for example, the sample frame may be 40ms long, and two adjacent sample frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

The training module 504 is further configured to input the posterior probability of the event frame, which is the sample frame output by the neural network model, into the target loss function; and obtaining a current loss value output by the target loss function, and continuously training a preset neural network model by taking the target loss value range as a target until the current loss value falls into the target loss value range.

When a preset neural network model is trained through a two-dimensional feature map training set, the two-dimensional feature map training set is input into the neural network model, and the neural network model outputs a specific result, specifically, the posterior probability that an input sample frame is an event frame, aiming at the input two-dimensional feature map training set.

The training module 504 is further configured to obtain a preset initial loss function; and weighting the posterior probability of the sound event according to the continuous frame number of the event frame in any sound event output by the neural network model through the hidden state parameter of the hidden Markov model so as to obtain a loss function based on the hidden Markov model.

The resulting loss function is defined as follows:

For the sequence of event states from frame 1 to frame T,

Wherein, carry out feature extraction to the sound signal, specifically include: a mel energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstrum coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma pass (gamma) cepstrum coefficient extraction method; and the neural network model specifically comprises: a convolutional-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden markov model, a gaussian mixture model, or a support vector machine.

According to different sound environments or particularly aiming at different sound event types, any one feature extraction method of a Mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a Mel cepstrum coefficient extraction method, a Bark domain energy spectrum extraction method, an Erb domain energy spectrum extraction method or a Gamma-tone cepstrum coefficient extraction method is selected to convert the one-dimensional sound signals into the two-dimensional feature maps. The embodiment of the invention does not limit the specific type of the feature extraction method.

EXAMPLE seven

Fig. 6 shows one of the flowcharts of the method for detecting a sound event according to the embodiment of the present invention, and specifically, the method for detecting a sound event specifically includes the following steps:

step 602, determining at least one event frame in a sound signal to be detected through a sound detection model;

and step 604, determining a corresponding detection result according to the event frame, and outputting the detection result.

In the embodiment of the invention, the sound detection model obtained by training is utilized to detect and identify the sound signal to be detected, thereby being beneficial to realizing automatic sound event detection and improving the efficiency of sound event detection.

Example eight

Fig. 7 shows a second flowchart of a method for detecting a sound event according to an embodiment of the present invention, and specifically, the method for detecting a sound event includes the following steps:

step 702, extracting the characteristics of a sound signal to be detected to obtain a two-dimensional characteristic map data set;

step 704, inputting the two-dimensional feature map data set to the sound detection model, and obtaining at least one event frame output by the sound detection model.

In the embodiment of the invention, when the event frame in the sound signal to be detected is determined by the sound detection model, data processing is firstly carried out on the sound signal to be detected. Specifically, firstly, feature extraction is performed on a sound signal to be detected through a feature extraction method, then a one-dimensional sound signal is converted into a two-dimensional feature map, and finally a corresponding two-dimensional feature map data set is established.

Example nine

Fig. 8 shows a third flowchart of a method for detecting a sound event according to an embodiment of the present invention, and specifically, the method for detecting a sound event includes the following steps:

step 802, determining a plurality of time frames corresponding to a two-dimensional feature map data set through a sound detection model;

step 804, finding out the event frame sequence with the highest weighted posterior probability through a solution space algorithm, and using the event frame sequence as an optimal decision for determining whether each time frame is an event frame.

In the embodiment of the present invention, a "time frame" is a sound segment, each two-dimensional feature map data set may include a plurality of such time frames, and the sound detection model may respectively predict the probability of occurrence of an "event" in each time frame, that is, the posterior probability of a time frame, specifically, an event frame. Specifically, the event frame sequence with the highest a posteriori probability can be found through a solution space algorithm.

When the posterior probability of an event frame is higher than the preset posterior probability threshold, the event is considered to occur in the sound fragment, so that the corresponding time frame is marked as an event frame, and finally, the obtained one or more event frames are normalized to form a final output result, which is beneficial to identifying the specific event through the detection result of the sound event and timely coping for the event.

Example ten

Fig. 9 shows a fourth flowchart of a detection method of a sound event according to an embodiment of the present invention, specifically, the detection method of a sound event specifically includes the following steps:

step 902, establishing a corresponding time frame sequence through a time frame, and establishing a solution space of an event sequence corresponding to the time frame sequence;

and 904, solving the event sequence in the solution space, and determining the optimal solution sequence of the event sequence through a dynamic programming algorithm.

In an embodiment of the present invention, a solution space of the event sequence is first constructed, which includes all possible combinations of events over the entire time frame, wherein the size of the solution space is about 2^TKWhere T is the total length of the time frame sequence and K is the number of categories of all events.

Further, in the above solution space, for the event sequence

A solution is performed in which, among other things,

I.e. the sequence of solutions that maximizes the objective function.

The objective function is specifically as follows:

wherein the content of the first and second substances,

o_tfor the event state corresponding to the t-th frame, o_t＝(0，1)，

For the sequence of event states from frame 1 to frame T,

EXAMPLE eleven

Fig. 10 shows a fifth flowchart of a detection method of a sound event according to an embodiment of the present invention, specifically, the detection method of a sound event specifically includes the following steps:

step 1002, framing a sound signal to be detected to obtain a signal frame;

and 1004, performing windowing processing on the signal frame through a second window function, and performing feature extraction on the signal frame subjected to windowing processing through a feature extraction method to obtain a two-dimensional feature map data set.

In the embodiment of the invention, when the characteristics of the sound signal to be detected are extracted, firstly, the sound signal to be detected is subjected to framing processing to obtain signal frames with a certain length, wherein each signal frame is a sound segment, and a part of overlap can exist between two adjacent signal frames. The frame length (the duration of the sound sample) and the overlap length of the signal frames can be adjusted according to the actual application-specific scenario, for example, the signal frame may be 40ms long, and two adjacent signal frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

The feature extraction method comprises the following steps: a mel-frequency energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstral coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma-pass cepstral coefficient extraction method.

Example twelve

Fig. 11 shows a block diagram of a detection apparatus of a sound event according to an embodiment of the present invention, and the detection apparatus 1100 of a sound event includes: a detection module 1102, configured to determine at least one event frame in the to-be-detected sound signal by using the sound detection model obtained through training by the training method of the sound detection model provided in any of the embodiments; and an output module 1104, configured to determine a corresponding detection result according to the event frame, and output the detection result.

The detection module 1102 is further configured to perform feature extraction on the sound signal to be detected, and establish a two-dimensional feature map data set; and importing the sound detection model into a two-dimensional feature map data set to obtain at least one event frame output by the sound detection model.

When determining an event frame in a sound signal to be detected through a sound detection model, data processing is firstly carried out on the sound signal to be detected. Specifically, firstly, feature extraction is performed on a sound signal to be detected through a feature extraction method, then a one-dimensional sound signal is converted into a two-dimensional feature map, and finally a corresponding two-dimensional feature map data set is established.

The detection module 1102 is further configured to determine a plurality of time frames corresponding to the two-dimensional feature map data set through the acoustic detection model; and respectively calculating the posterior probability of each time frame as an event frame through a solution space algorithm, and determining the time frame with the posterior probability higher than a probability threshold value as the event frame.

A "time frame" is a sound segment, each two-dimensional feature map data set may include a plurality of such time frames, and the sound detection model may predict the probability of occurrence of an "event" in each time frame, i.e. the posterior probability of a time frame, specifically an event frame. Specifically, each of the total time frames can be calculated separately by a solution space algorithm, being the posterior probability of the event frame.

The detection module 1102 is further configured to establish a corresponding time frame sequence through the time frame, and establish a solution space of the event sequence corresponding to the time frame sequence; in a solution space, solving the event sequence, and determining an optimal solution sequence of the event sequence through a dynamic programming algorithm; and determining the posterior probability of each time frame in the time frame sequence through the optimal solution sequence.

First, a solution space of the event sequence is constructed, which includes all possible combinations of events over the entire time frame, wherein the size of the solution space is about 2^TKWhere T is the total length of the time frame sequence and K is the number of categories of all events.

Further, in the above solution space, for the event sequence

A solution is performed in which, among other things,

further, by using a dynamic programming algorithm, fast solution search is performed on the above formula, and the optimal solution sequence of the final sphere, that is, the objective function of iterative solution, specifically, the objective function is:

wherein the content of the first and second substances,

o_tfor the event state corresponding to the t-th frame, o_t＝(0，1)，

For the sequence of event states from frame 1 to frame T,

The posterior probability that each time frame is an event frame is solved through a solution space algorithm, so that the detection precision of the sound event is improved, and the detection efficiency is improved.

The detecting module 1102 is further configured to frame the sound signal to be detected to obtain a signal frame; and windowing the signal frame through a second window function, and extracting the characteristics of the signal frame subjected to windowing through a characteristic extraction method to obtain a two-dimensional characteristic map data set.

When the characteristics of the sound signal to be detected are extracted, firstly, the sound signal to be detected is subjected to framing processing to obtain signal frames with a certain length, wherein each signal frame is a sound segment, and partial overlap can exist between two adjacent signal frames. The frame length (the duration of the sound sample) and the overlap length of the signal frames can be adjusted according to the actual application-specific scenario, for example, the signal frame may be 40ms long, and two adjacent signal frames overlap for 20 ms. It can be understood that the frame length and the overlapping frame length are only used as examples for illustration, and the length and the overlapping length of the sample frame are not limited in the embodiment of the present invention.

EXAMPLE thirteen

In some embodiments of the present invention, the embodiments of the present invention are fully described in terms of practical applications.

In particular, sound event detection refers to identifying the category of various events in a sound signal and the starting and stopping time of the event occurrence so as to make corresponding decisions. In the actual production process of a factory, sound quality inspection of products is an important link, for example, when workers carry out power-on test on a washing machine, whether the products have quality problems is judged by monitoring the sound of the whole machine, for example, a heating pipe is not clamped into a support or a motor screw is not tightened, and the like. The traditional manual sound quality inspection method has low efficiency and easily causes damage to human hearing.

The embodiment of the invention provides a method for automatically identifying a sound signal by applying a sound event detection algorithm so as to determine whether a specific 'event' occurs at present. The sound event detection algorithm mainly comprises two processing steps:

1. converting one-dimensional sound signals into a two-dimensional characteristic diagram, such as Mel energy spectrum characteristics, by utilizing the technology in the field of signal processing;

2. and identifying event information contained in the two-dimensional sound characteristic diagram by using a deep learning algorithm, such as a convolution-recurrent neural network model, and outputting an event category and a time label.

The training process of the neural network mainly comprises two steps of operations: the method comprises the steps of firstly, utilizing a convolutional neural network to learn effective local information of a feature map, wherein the effective local information is mainly used for event classification, and secondly, utilizing a recursive neural network to learn context information between time frames, and the effective local information is mainly used for determining time labels.

Fig. 12 is a schematic diagram illustrating a logic diagram of generating a sound detection model according to an embodiment of the present invention, wherein the training process mainly includes the following steps:

1. collecting sound signals as a training set, and performing framing with the frame length of 40ms, overlapping the framing of 20ms and adding a Hamming window on each sound sample;

2. performing feature extraction on the sound signal based on the Mel energy spectrum, and establishing a two-dimensional feature map;

3. defining a new loss function (for the sake of simplicity, the following formula lists only a single sound event type, and since the probabilities of each event are independent, considering only one event type does not affect the generality);

the loss function is specifically as follows:

For the sequence of event states from frame 1 to frame T,

a posterior probability corresponding to the t-th frame, d_t-1The number of frames for which the event has continued at the time corresponding to the t-1 th frame, D (o)_t|d_t-1) Hidden state parameters of a hidden markov model. When o is_tWhen 1, it indicates that the event occurs in the t-th frame, and when o_tWhen the value is 0, it means that no event occurs in the t-th frame.

4. Constructing a convolution-recurrent neural network model, importing a two-dimensional characteristic graph training set, and training by taking a minimum loss function value as a target;

5. and after the training is finished, saving the parameters of the convolution-recurrent neural network model.

The test process includes the following steps:

1. collecting sound signals as a test set, performing framing with the frame length of 20-40ms and overlapping with the framing of 10-20ms and adding a Hamming window to each sound sample;

3. reading parameters of the convolution-recurrent neural network model, and importing a two-dimensional characteristic diagram test set to obtain an output time frame of the model;

4. integrating the posterior probability frame by frame through a dynamically planned solution space search algorithm based on a hidden Markov model, and finally identifying the category and the starting and stopping time of the sound event, wherein the method specifically comprises the following steps:

1) constructing a solution space of the sequence of sound events, i.e. a combination of all events over all time frames, the solution space having a size of about 2^TKWherein T is the sequence length and K is the number of event categories;

2) for sound event sequence

Calculating the fraction of the solution sequence by adopting a calculation formula as follows:

3) and (3) carrying out quick solution search by using a dynamic programming algorithm to obtain an optimal solution sequence, namely, iteratively solving an objective function:

the initial function is defined as:

the transfer function is defined as:

wherein the content of the first and second substances,

o_tfor the event state corresponding to the t-th frame, o_t＝(0，1)，

For the sequence of event states from frame 1 to frame T,

The complexity of the dynamic function is about KD²T, where K is the number of event classes, D is the maximum length of an event, and T is the length of the sequence, in some cases, to achieve parallel computation, the complexity may be compressed, e.g., to D²T。

4) And integrating the frames of 40ms into a segment of 0.1s, if a frame in the segment is identified as an event, determining that the event occurs in the segment, and otherwise, determining that no event occurs in the segment. All segments are then further integrated, eventually identifying the class of the output sound event and the start-stop moment.

Fig. 13 shows a mapping between a sound signal and sound event detection according to an embodiment of the present invention, in which event fragmentation occurring in a sound event detection algorithm is solved to a certain extent by integrating the model prediction event frames of the fragments, so that the event identification accuracy can be improved, and the later-stage actual landing application is facilitated.

Example fourteen

Fig. 14 shows a block diagram of a computer apparatus according to an embodiment of the present invention, and the computer apparatus 1400 includes: a memory 1402 on which a computer program is stored; the processor 1404, when being configured to execute the computer program, is configured to implement the steps of the method provided in any of the above embodiments, and therefore, the computer device 1400 simultaneously includes all the advantages of the method provided in any of the above embodiments, which are not described herein again.

Example fifteen

In some embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, is capable of implementing the steps of the training method for a sound detection model provided in any one of the above embodiments and/or the steps of the detection method for a sound event provided in any one of the above embodiments, and therefore, the computer-readable storage medium includes all the beneficial effects of the training method for a sound detection model provided in any one of the above embodiments and the detection method for a sound event provided in any one of the above embodiments, which are not described herein again.

Example sixteen

FIG. 15 shows a block diagram of an electronic device according to an embodiment of the invention, the electronic device 1500 including but not limited to: a radio frequency unit 1502, a network module 1504, an audio output unit 1506, an input unit 1508, a sensor 1510, a display unit 1512, a user input unit 1514, an interface unit 1516, a memory 1518, a processor 1520, and a power supply 1522. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 15 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present application, the electronic device includes, but is not limited to, a mobile terminal, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, a pedometer, and the like.

Meanwhile, the processor 1520 can run the computer program on the memory 1518 to further implement the steps of the method in any of the above embodiments, so that the electronic device further includes all the advantages of any of the above embodiments, which is not described herein again.

It should be understood that, in the embodiment of the present application, the radio frequency unit 1502 may be configured to send and receive information or send and receive signals during a call, and in particular, receive downlink data of a base station or send uplink data to the base station. Radio frequency unit 1502 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.

The network module 1504 provides wireless broadband internet access to the user, such as assisting the user in emailing, browsing web pages, and accessing streaming media.

The audio output unit 1506 may convert audio data received by the radio frequency unit 1502 or the network module 1504 or stored in the memory 1518 into an audio signal and output as sound. Also, the audio output unit 1506 may also provide audio output related to a specific function performed by the electronic device 1500 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 1506 includes a speaker, a buzzer, a receiver, and the like.

The input unit 1508 is for receiving an audio or video signal. The input Unit 1508 may include a Graphics Processing Unit (GPU) 5082 and a microphone 5084, the Graphics processor 5082 Processing image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 1512 or stored in the memory 1518 (or other storage medium) or transmitted via the radio frequency unit 1502 or the network module 1504. The microphone 5084 may receive sound and may be capable of processing the sound into audio data, and the processed audio data may be converted into a format output transmittable to a mobile communication base station via the radio frequency unit 1502 in case of a phone call mode.

The electronic device 1500 also includes at least one sensor 1510, such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, a light sensor, a motion sensor, and others.

The display unit 1512 is used to display information input by the user or information provided to the user. The display unit 1512 may include a display panel 5122, and the display panel 5122 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.

The user input unit 1514 may be used to receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 1514 includes a touch panel 5142 and other input devices 5144. Touch panel 5142, also referred to as a touch screen, can collect touch operations by a user on or near it. The touch panel 5142 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1520, and receives and executes commands from the processor 1520. Other input devices 5144 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.

Further, the touch panel 5142 can be overlaid on the display panel 5122, and when the touch panel 5142 detects a touch operation thereon or nearby, the touch panel is transmitted to the processor 1520 to determine the type of the touch event, and then the processor 1520 provides a corresponding visual output on the display panel 5122 according to the type of the touch event. The touch panel 5142 and the display panel 5122 can be provided as two separate components or can be integrated into one component.

The interface unit 1516 is an interface for connecting an external device to the electronic apparatus 1500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/0) port, a video I/0 port, an earphone port, and the like. The interface unit 1516 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 1500 or may be used to transmit data between the electronic apparatus 1500 and the external device.

The memory 1518 may be used to store software programs as well as various data. The memory 1518 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile terminal, and the like. Further, the memory 1518 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

The processor 1520 performs various functions of the electronic device 1500 and processes data by running or executing software programs and/or modules stored in the memory 1518 and calling data stored in the memory 1518, thereby performing overall monitoring of the electronic device 1500. Processor 1520 may include one or more processing units; preferably, the processor 1520 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications.

The electronic device 1500 can also include a power supply 1522 that provides power to the various components, and preferably, the power supply 1522 can be logically coupled to the processor 1520 via a power management system that can provide management of charging, discharging, and power consumption.

In an embodiment of the present application, a readable storage medium is provided, on which a program or instructions are stored, which when executed by a processor implement the steps of the function switching method provided in any of the above embodiments.

In this embodiment, the readable storage medium can implement each process of the function switching method provided in the embodiments of the present application, and can achieve the same technical effect, and is not described herein again to avoid repetition.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above embodiment of the function switching method, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.

It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.

In the description of the present invention, the terms "plurality" or "a plurality" refer to two or more, and unless otherwise specifically defined, the terms "upper", "lower", and the like indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the description of the present invention, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In the present invention, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for training a voice detection model, comprising:

acquiring a training sound signal, carrying out feature extraction on the training sound signal, and establishing a two-dimensional feature map training set;

and importing a neural network model into the two-dimensional feature map training set, and training the neural network model through a loss function based on a hidden Markov model to obtain the sound detection model.

2. The method for training the voice detection model according to claim 1, wherein the performing feature extraction on the voice signal specifically includes:

framing the training sound signal to obtain a sample frame;

and windowing the sample frame through a first window function, and extracting the characteristics of the windowed sample frame to obtain the two-dimensional characteristic map training set.

3. The method for training a voice detection model according to claim 2, wherein the training of the neural network model by a hidden markov model-based loss function specifically comprises:

inputting the posterior probability of the sample frame output by the neural network model as an event frame into the loss function;

and obtaining a current loss value output by the loss function, and continuously training the neural network model by taking a target loss value range as a target until the current loss value falls into the target loss value range.

4. The method for training the voice detection model according to any one of claims 1 to 3, wherein before the training the preset neural network model by the objective loss function, the method for training the voice detection model further comprises:

acquiring a preset initial loss function;

and weighting the posterior probability of the sound event according to the continuous frame number of the event frame in any sound event output by the neural network model through the hidden state parameters of the hidden Markov model so as to obtain the loss function based on the hidden Markov model.

5. The method for training the voice detection model according to any one of claims 1 to 3, wherein the performing feature extraction on the voice signal specifically includes:

performing feature extraction on the sound signal by using a feature extraction method, wherein the preset feature extraction method comprises the following steps:

a mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a mel cepstrum coefficient extraction method, a bark domain energy spectrum extraction method, an equivalent rectangular bandwidth domain energy spectrum extraction method or a gamma-pass cepstrum coefficient extraction method; and

the neural network model specifically comprises:

a convolutional-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden markov model, a gaussian mixture model, or a support vector machine.

6. An apparatus for training a voice test model, comprising:

the signal processing module is used for acquiring a training sound signal, extracting features of the sound signal and establishing a two-dimensional feature map training set;

and the training module is used for leading a preset neural network model into the two-dimensional characteristic map training set, training the preset neural network model through a target loss function, and obtaining the sound detection model.

7. A method for detecting a sound event, comprising;

determining at least one event frame in a sound signal to be detected by the sound detection model obtained by training with the training method of the sound detection model according to any one of claims 1 to 5;

and determining a corresponding detection result according to the event frame, and outputting the detection result.

8. The method according to claim 7, wherein the determining at least one event frame in the sound signal to be detected comprises:

carrying out feature extraction on the sound signal to be detected, and establishing a two-dimensional feature map data set;

and importing the sound detection model into the two-dimensional feature map data set to obtain at least one event frame output by the sound detection model.

9. The method for detecting a sound event according to claim 8, wherein the acquiring at least one event frame output by the sound detection model specifically includes:

determining, by the acoustic detection model, a plurality of time frames corresponding to the two-dimensional feature map dataset;

respectively calculating the posterior probability of each time frame as the event frame through a solution space algorithm, and determining the time frame with the posterior probability higher than a probability threshold value as the event frame.

10. The method according to claim 9, wherein the calculating the posterior probability of each time frame as the event frame by the solution space algorithm comprises:

establishing a corresponding time frame sequence through the time frame, and constructing a solution space of the event sequence corresponding to the time frame sequence;

in the solution space, solving the event sequence to obtain the posterior probability of each time frame being the event frame; and

the sound event detection method further comprises:

and determining an optimal solution sequence of the event sequence through a dynamic programming algorithm.

11. The method according to any one of claims 8 to 10, wherein the performing feature extraction on the sound signal to be detected to establish a two-dimensional feature map data set specifically includes:

framing the sound signal to be detected to obtain a signal frame;

and windowing the signal frame through a second window function, and extracting the characteristics of the signal frame subjected to windowing through a characteristic extraction method to obtain the two-dimensional characteristic map data set.

12. The method of detecting a sound event according to claim 11, wherein the feature extraction method comprises:

a mel-energy spectrum feature extraction method, a short-time Fourier transform extraction method, a mel cepstrum coefficient extraction method, a bark domain energy spectrum extraction method, an equivalent rectangular bandwidth domain energy spectrum extraction method or a gamma-pass cepstrum coefficient extraction method.

13. An apparatus for detecting a sound event, comprising:

a detection module for determining at least one event frame in a sound signal to be detected by means of the sound detection model according to any one of claims 1 to 5;

and the output module is used for determining a corresponding detection result according to the event frame and outputting the detection result.

14. A computer device, comprising:

a memory having a computer program stored thereon;

a processor configured to implement a method of training a sound detection model according to any of claims 1 to 5, and/or a method of detecting a sound event according to any of claims 7 to 12 when executing the computer program.

15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of training a sound detection model according to any one of claims 1 to 5 and a method of detecting sound events according to any one of claims 7 to 12.

16. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being configured to execute the method of training a sound detection model according to any of claims 1 to 5 and the method of detecting sound events according to any of claims 7 to 12 by means of the computer program.