CN114252906A - Sound event detection method and device, computer equipment and storage medium - Google Patents

Sound event detection method and device, computer equipment and storage medium

Info

Publication number
CN114252906A
Authority
CN
China
Prior art keywords
frame
event
time
determining
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011008499.2A
Other languages
Chinese (zh)
Inventor
曹海涛 (Cao Haitao)
陈招基 (Chen Zhaoji)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
Original Assignee
Midea Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd
Priority to CN202011008499.2A
Publication of CN114252906A
Legal status: Pending

Classifications

    • G: PHYSICS
        • G01: MEASURING; TESTING
            • G01V: GEOPHYSICS; GRAVITATIONAL MEASUREMENTS; DETECTING MASSES OR OBJECTS; TAGS
                • G01V 1/00: Seismology; seismic or acoustic prospecting or detecting
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/04: Architecture, e.g. interconnection topology
                            • G06N 3/045: Combinations of networks
                        • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Environmental & Geological Engineering (AREA)
  • Geology (AREA)
  • Remote Sensing (AREA)
  • General Life Sciences & Earth Sciences (AREA)
  • Geophysics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a sound event detection method and apparatus, a computer device, and a storage medium. The detection method comprises: determining an endpoint-detection event frame for at least one event in a signal to be detected; determining at least one model-predicted event frame for each event in the signal to be detected; and, for each event, comparing the at least one model-predicted event frame with the endpoint-detection event frame to determine a detection result for the event. By determining the detection result for each event from the comparison between the endpoint-detection event frame and the model-predicted event frames, the embodiments of the invention integrate multiple model-predicted event frames, resolve the problem of event fragmentation, and effectively improve the accuracy and completeness of sound event detection and recognition.

Description

Sound event detection method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of sound detection technologies, and in particular, to a sound event detection method, a sound event detection apparatus, a computer device, a computer-readable storage medium, and an electronic device.
Background
In the related art, acoustic quality inspection of products is an important link in factory production. Quality problems such as a screw that is not tightened or a heating pipe that is not clamped in its bracket can be identified by workers listening to the product sound. However, manual monitoring is inefficient and easily damages the inspectors' hearing.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art or the related art.
To this end, a first aspect of the invention proposes a method of detecting a sound event.
A second aspect of the invention proposes a device for detecting a sound event.
A third aspect of the invention provides a computer apparatus.
A fourth aspect of the present invention is directed to a computer-readable storage medium.
A fifth aspect of the present invention provides an electronic device.
In view of the above, a first aspect of the present invention provides a method for detecting a sound event, comprising: determining an endpoint-detection event frame for at least one event in a signal to be detected; determining at least one model-predicted event frame for each event in the signal to be detected; and, for each event, comparing the at least one model-predicted event frame with the endpoint-detection event frame to determine a detection result for the event.
In this technical solution, an endpoint-detection event frame corresponding to at least one "event" can be identified in the signal to be detected by an endpoint detection method, and one or more model-predicted event frames for each "event" can be predicted in the signal to be detected by an artificial intelligence algorithm or the like.
After the endpoint-detection event frame and the model-predicted event frames are determined, they are compared, and multiple fragmented model-predicted event frames are integrated to obtain a detection result for each event, which is then output.
An "event" may be an accidental sound caused by a production incident, such as a falling object or breaking glass; a sound indicating a specific quality problem of a product, such as a screw that is not tightened (a loose screw produces a characteristic noise during a test run of the product) or a heating pipe that is not clamped in its bracket (a loose heating pipe produces a characteristic noise as it moves); or another sound arising in normal factory production, such as conversation between workers, the noise of normally operating machinery, or an end-of-shift bell or other alert. The embodiments of the present invention do not limit the specific type of "event" in the sound signal.
According to the embodiments of the invention, the detection result for each event is determined from the comparison between the endpoint-detection event frame and the model-predicted event frames. This integrates multiple model-predicted event frames, resolves the problem of event fragmentation, and effectively improves the accuracy and completeness of sound event detection and recognition.
In addition, the method for detecting a sound event in the above technical solution may further have the following additional technical features.
In the above technical solution, determining an endpoint-detection event frame for at least one event in the signal to be detected specifically includes: determining the endpoint-detection event frame for at least one event in the signal to be detected, and determining an endpoint-detection start frame and an endpoint-detection end frame of that event frame. Determining at least one model-predicted event frame for each event in the signal to be detected specifically includes: determining at least one model-predicted event frame for each event in the signal to be detected by a neural network model, and determining a model-predicted start frame and a model-predicted end frame for each model-predicted event frame of each event.
In this technical solution, the signal to be detected is acquired by picking up the sound produced during factory production with a sound pickup device such as a microphone, yielding an analog or digital signal of the factory sound. A pre-trained sound detection model then performs detection analysis on the signal to be detected and determines at least one event contained in the sound signal.
After the events are determined in the signal to be detected, each determined event has at least one endpoint-detection event frame and at least one model-predicted event frame. Both kinds of event frame are drawn from the signal frames obtained by framing a segment of the audio signal; a signal frame is determined to be an event frame when the sound detection model decides that an event occurs at the corresponding time, or when the energy-entropy ratio of the signal frame exceeds a threshold.
After the event frames are determined, the start frame and end frame of each, i.e., the endpoint-detection start frame, endpoint-detection end frame, model-predicted start frame, and model-predicted end frame, are detected. The "start frame" and "end frame" mark the "start point" and "end point" of an event frame, that is, the time at which the event begins and the time at which it ends. Consolidating event frames by their start and end frames effectively integrates fragmented event segments, so that a complete and accurate detection result is finally determined for the "event" to which each model-predicted event frame belongs, for example: "a screw-loosening event was detected from 16:00 to 16:01".
Applying this technical solution, sound in factory production is detected by an artificial-intelligence-based neural network model, i.e., the sound detection model, so that the various situations that may occur during factory production can be monitored automatically and without operators, efficiently and accurately. Because no personnel need to intervene, sound inspection no longer depends on experienced workers: detection efficiency is improved and the barrier to entry is lowered on the one hand, and workers' hearing is protected on the other. Meanwhile, endpoint detection identifies the start frame and end frame of each event frame, and consolidating the fragmented event frames by these start and end frames resolves the problem of event fragmentation and effectively improves the accuracy and completeness of sound event detection and recognition.
In any of the above technical solutions, determining the endpoint-detection event frame for at least one event in the signal to be detected, together with its endpoint-detection start frame and endpoint-detection end frame, specifically includes: framing the signal to be detected to obtain a plurality of signal frames; determining the energy-entropy ratio of each signal frame, and determining the signal frames whose energy-entropy ratio exceeds an energy-entropy-ratio threshold as endpoint-detection event frames; and taking the frame with the earliest time among the endpoint-detection event frames as the endpoint-detection start frame, and the frame with the latest time as the endpoint-detection end frame.
In this technical solution, the signal to be detected is first framed into signal frames of a certain length, each signal frame being one sound sample; two adjacent signal frames may partially overlap. The frame length (the duration of one sound sample) and the overlap length may be adjusted to the application scenario; for example, the frame length may be 40 ms with two adjacent frames overlapping by 20 ms. It should be understood that these values are only examples; the embodiments of the invention do not limit the frame length or the overlap length.
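A minimal sketch of such framing in Python follows; the 40 ms frame length and 20 ms hop are the example values from the text, and everything else (function name, NumPy usage) is an illustrative assumption rather than the patent's implementation.
```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 40.0, hop_ms: float = 20.0) -> np.ndarray:
    """Split a 1-D signal into overlapping frames; each frame is one sound sample."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 40 ms per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # adjacent frames overlap by 20 ms
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])      # shape: (n_frames, frame_len)
```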
After framing yields a plurality of signal frames, the energy-entropy ratio of each signal frame is determined. If the energy-entropy ratio of a signal frame exceeds the energy-entropy-ratio threshold, the frame is regarded as an endpoint-detection event frame in which some "event" occurs. The "endpoints" of the endpoint-detection event frames are then determined: the earliest frame (the frame with the earliest frame time) is the endpoint-detection start frame, and correspondingly the latest frame (the frame with the latest frame time) is the endpoint-detection end frame.
Determining whether an event occurs in a sound segment by framing and computing the energy-entropy ratio of each segment, and then performing endpoint detection on the frames in which an event occurs to obtain the endpoint-detection event frames, allows the start and stop times of an event to be judged effectively and accurately, improving the accuracy of sound event detection.
In any of the above technical solutions, determining at least one model-predicted event frame for each event in the signal to be detected specifically includes: extracting features from the windowed signal frames by a feature extraction method to obtain a corresponding two-dimensional feature map; and inputting the two-dimensional feature map into a neural network model and obtaining the model-predicted event frames, model-predicted start frames, and model-predicted end frames output by the neural network model.
In this technical solution, after the signal to be detected is framed, the resulting signal frames are windowed with a preset window function. The window function may be a rectangular window or a non-rectangular window such as a Hamming or Hanning window; the embodiments of the present invention do not limit the specific type of window function.
After windowing, features are extracted from the windowed signal frames to obtain a two-dimensional feature map, which is input as the original signal into a trained neural network model. Through artificial-intelligence recognition, the neural network model predicts whether an event occurs in the input signal (the processed signal to be detected) and the start and stop times of that event, and finally outputs the model-predicted event frames together with their model-predicted start frames and model-predicted end frames.
According to the embodiments of the invention, sound in factory production is detected by an artificial-intelligence-based neural network model, i.e., the sound detection model, so that the various situations that may occur during factory production can be monitored automatically, accurately, and without operators. Because no personnel need to intervene, sound inspection no longer depends on experienced workers: detection efficiency is improved and the barrier to entry is lowered on the one hand, and workers' hearing is protected on the other.
In any of the above technical solutions, when at least two model-predicted event frames exist, comparing the at least one model-predicted event frame with the endpoint-detection event frame and determining the detection result for each event specifically includes: for each event, comparing the times corresponding to the start and end frames of each model-predicted event frame with the frame times of the start and end frames of the endpoint-detection event frame for that event, and determining the model-predicted event frames whose frame times overlap the endpoint-detection event frame as result event frames; and determining the detection result from the result event frames.
In this technical solution, each of the model-predicted event frames is traversed, and it is determined whether its frame time overlaps that of the endpoint-detection event frame. If two or more model-predicted event frames have frame times overlapping the same endpoint-detection event frame, they are considered to belong to the same sound event, and the model-predicted event frames are integrated, achieving effective integration of fragmented events.
For example, suppose the endpoint-detection event frame comprises 11 time frames, from time 10 to time 20; a first model-predicted event frame comprises 16 time frames, from time 0 to time 15; and a second model-predicted event frame comprises 16 time frames, from time 14 to time 29. Since the first model-predicted event frame and the endpoint-detection event frame share the time frames from time 10 to time 15, and the second model-predicted event frame and the endpoint-detection event frame share the time frames from time 14 to time 20, the first and second model-predicted event frames are integrated into a complete third event frame from time 0 to time 29.
By judging whether two event frames overlap the same endpoint-detection event frame and integrating fragmented model-predicted event frames accordingly, the final detection result is more complete, fragmented events are avoided, and the accuracy and completeness of sound detection are effectively improved.
In any of the above technical solutions, the overlap is determined as follows. For each event, the frame time of the model-predicted event frame is determined to overlap that of the endpoint-detection event frame when: the first time, corresponding to the model-predicted start frame, is earlier than or equal to the second time, corresponding to the endpoint-detection start frame, and the third time, corresponding to the model-predicted end frame, is later than or equal to the second time; or the first time is later than or equal to the second time and the third time is earlier than or equal to the fourth time, corresponding to the endpoint-detection end frame; or the first time is earlier than or equal to the fourth time and the third time is later than or equal to the fourth time; or the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time.
In this technical solution, whether two event frames at least partially overlap can be judged by simplified conditions on the "endpoint" times of the frames, i.e., the times corresponding to their start and end frames.
Specifically, the time corresponding to the model-predicted start frame is recorded as the first time; the time corresponding to the endpoint-detection start frame as the second time; the time corresponding to the model-predicted end frame as the third time; and the time corresponding to the endpoint-detection end frame as the fourth time.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the second time, the frame times of the model-predicted event frame and the endpoint-detection event frame are considered to at least partially overlap.
If the first time is later than or equal to the second time and the third time is earlier than or equal to the fourth time, the frame times of the model-predicted event frame and the endpoint-detection event frame are likewise considered to at least partially overlap.
If the first time is earlier than or equal to the fourth time and the third time is later than or equal to the fourth time, the frame times of the model-predicted event frame and the endpoint-detection event frame at least partially overlap.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time, it can likewise be determined that the frame times of the model-predicted event frame and the endpoint-detection event frame at least partially overlap.
Judging whether two different event frames partially overlap from the endpoints of each event frame, i.e., the frame times of its start and end frames, effectively reduces computation and lowers the demands on system performance while still ensuring that the event frames are consolidated, improving the accuracy and completeness of sound detection.
In any of the above technical solutions, determining the detection result from the result event frames specifically includes: determining the predicted-event start frame with the earliest time among all result event frames as the start frame of the detection result; and determining the predicted-event end frame with the latest time among all result event frames as the end frame of the detection result.
In this technical solution, after the model-predicted event frames whose frame times at least partially overlap the endpoint-detection event frame are consolidated, an integrated, complete result event frame is obtained, from which the final detection result is determined.
Specifically, the start frame and end frame of the result event frame are first obtained: among all model-predicted event frames integrated into the result event frame, the model-predicted start frame with the earliest frame time is taken as the result-event start frame, and the model-predicted end frame with the latest frame time as the result-event end frame.
After the start and end frames of the result event frame, i.e., the start and stop times of the "sound event", are determined, the finally output event type, i.e., the event corresponding to the result event frame, is determined. Specifically, all model-predicted event frames forming the result event frame are traversed, the total frame length of each is determined, and the event type of the longest of them is taken as the event corresponding to the result event frame and marked as the event type of the finally output "event".
Finally, the detection result is determined from the result-event start frame (the start time of the event), the result-event end frame (the end time of the event), and the result event (the finally identified event type), and is output. In this way the various situations that may occur during factory production are monitored automatically, efficiently, and accurately without operators, and managers can keep abreast of events occurring in the factory in time.
In any of the above technical solutions, determining the energy-entropy ratio of a signal frame specifically includes: obtaining the energy and the entropy corresponding to the signal frame, computing the absolute value of the ratio of the energy to the entropy, and determining the energy-entropy ratio from that absolute value.
In this technical solution, the energy-entropy ratio can be calculated by the following formula:
$$Ef_i = \sqrt{1 + \left|\frac{E_i}{H_i}\right|}$$
where $Ef_i$ is the energy-entropy ratio of the $i$-th frame, $E_i = \sum_j |Y_j|^2$ is the energy of the $i$-th frame, $H_i = -\sum_j p_j \log p_j$ is the entropy of the $i$-th frame with $p_j = |Y_j| / \sum_j |Y_j|$, and $Y_j$ denotes the $j$-th spectral line of the signal to be detected.
Calculating the energy-entropy ratio of each signal frame by this formula allows a more accurate judgment of whether a signal frame is an event frame, improving the accuracy of sound event detection.
In any of the above technical solutions, after the energy-entropy ratio is determined from the absolute value, the method for detecting a sound event further includes: normalizing the energy-entropy ratios, and applying median filtering to the normalized energy-entropy ratios.
In this technical solution, after the energy-entropy ratio of each signal frame is calculated, the energy-entropy ratios of all signal frames are normalized and median-filtered several times. This effectively removes the influence of signal noise, makes the obtained energy-entropy ratios more accurate, and thus allows a more accurate judgment of whether a signal frame is an event frame, further improving the accuracy of sound event detection.
In any of the above technical solutions, the neural network model includes: a convolutional-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model, or a support vector machine.
In this technical solution, to adapt to different application environments, any one of these models may be used as the applied model when recognizing sound events, or several of them may be combined into a "multi-stage" model. The embodiments of the present invention do not limit the specific form of the neural network model.
In any of the above technical solutions, the feature extraction method includes: Mel energy-spectrum feature extraction, short-time Fourier transform, Mel-frequency cepstral coefficient extraction, Bark-domain energy-spectrum extraction, equivalent-rectangular-bandwidth (ERB) domain energy-spectrum extraction, or Gammatone cepstral coefficient extraction.
In this technical solution, when extracting features from the framed signal, any one of these methods may be selected, according to the sound environment or the sound event types of interest, to convert the one-dimensional sound signal into a two-dimensional feature map. The embodiments of the present invention do not limit the specific type of feature extraction method.
A second aspect of the present invention provides a device for detecting a sound event, comprising: a determining module configured to determine an endpoint-detection event frame for at least one event in a signal to be detected; and a detection module configured to determine at least one model-predicted event frame for each event in the signal to be detected; the determining module being further configured, for each event, to compare the at least one model-predicted event frame with the endpoint-detection event frame and determine a detection result for the event.
In this technical solution, an endpoint-detection event frame corresponding to at least one "event" can be identified in the signal to be detected by an endpoint detection method, and one or more model-predicted event frames for each "event" can be predicted in the signal to be detected by an artificial intelligence algorithm or the like.
After the endpoint-detection event frame and the model-predicted event frames are determined, they are further compared, and multiple fragmented model-predicted event frames are integrated to obtain a detection result for each event, which is then output.
An "event" may be an accidental sound caused by a production incident, such as a falling object or breaking glass; a sound indicating a specific quality problem of a product, such as a screw that is not tightened (a loose screw produces a characteristic noise during a test run of the product) or a heating pipe that is not clamped in its bracket (a loose heating pipe produces a characteristic noise as it moves); or another sound arising in normal factory production, such as conversation between workers, the noise of normally operating machinery, or an end-of-shift bell or other alert. The embodiments of the present invention do not limit the specific type of "event" in the sound signal.
According to the embodiments of the invention, the detection result for each event is determined from the comparison between the endpoint-detection event frame and the model-predicted event frames. This integrates multiple model-predicted event frames, resolves the problem of event fragmentation, and effectively improves the accuracy and completeness of sound event detection and recognition.
A third aspect of the present invention provides a computer apparatus, comprising: a memory storing a computer program; and a processor configured to implement, when running the computer program, the steps of the method for detecting a sound event provided in any of the above technical solutions. The computer apparatus therefore has all the beneficial effects of that method, which are not repeated here.
A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method for detecting a sound event provided in any of the above technical solutions. The storage medium therefore has all the beneficial effects of that method, which are not repeated here.
A fifth aspect of the present invention provides an electronic device comprising a memory and a processor, the memory storing a computer program and the processor being configured to execute, through the computer program, the steps of the method for detecting a sound event provided in any of the above technical solutions. The electronic device therefore has all the beneficial effects of that method, which are not repeated here.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows the first flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 2 shows the second flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 3 shows the third flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 4 shows the fourth flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 5 shows the fifth flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 6 shows the sixth flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 7 shows the seventh flowchart of a method for detecting a sound event according to an embodiment of the invention;
FIG. 8 shows a logic diagram of the generation of a sound detection model according to an embodiment of the invention;
FIG. 9 shows an example of the relationship between model-predicted event frames before integration and an endpoint-detection event frame according to an embodiment of the invention;
FIG. 10 shows an example of the relationship between the integrated model-predicted event frame and an endpoint-detection event frame according to an embodiment of the invention;
FIG. 11 shows the correspondence between a sound signal and sound event detection according to an embodiment of the invention;
FIG. 12 shows a structural block diagram of a sound event detection device according to an embodiment of the invention;
FIG. 13 shows a block diagram of a computer device according to an embodiment of the invention;
FIG. 14 shows a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention; however, the present invention may be practiced in ways other than those described here, and the scope of the present invention is therefore not limited to the specific embodiments disclosed below.
A detection method of a sound event, a detection apparatus of a sound event, a computer device, a computer-readable storage medium, and an electronic device according to some embodiments of the present invention are described below with reference to fig. 1 to 14.
For convenience of description, the following embodiments take the detection of one event as an example; the technical solution of the present invention does not limit the number of events, and when there are multiple events the processing is similar.
Example One
FIG. 1 shows the first flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 102: acquire a signal to be detected, and determine the endpoint-detection event frames corresponding to one or more events in the signal to be detected;
Step 104: determine one or more model-predicted event frames for each event in the signal to be detected;
Step 106: for each event, compare the model-predicted event frames with the endpoint-detection event frame to determine the detection result for the event.
In the embodiment of the invention, an endpoint-detection event frame corresponding to at least one "event" can be identified in the signal to be detected by an endpoint detection method, and one or more model-predicted event frames for each "event" can be predicted in the signal to be detected by an artificial intelligence algorithm or the like.
After the endpoint-detection event frame and the model-predicted event frames are determined, they are compared, and multiple fragmented model-predicted event frames are integrated to obtain a detection result for each event, which is then output.
An "event" may be an accidental sound caused by a production incident, such as a falling object or breaking glass; a sound indicating a specific quality problem of a product, such as a screw that is not tightened (a loose screw produces a characteristic noise during a test run of the product) or a heating pipe that is not clamped in its bracket (a loose heating pipe produces a characteristic noise as it moves); or another sound arising in normal factory production, such as conversation between workers, the noise of normally operating machinery, or an end-of-shift bell or other alert. The embodiments of the present invention do not limit the specific type of "event" in the sound signal.
According to the embodiments of the invention, the detection result for each event is determined from the comparison between the endpoint-detection event frame and the model-predicted event frames. This integrates multiple model-predicted event frames, resolves the problem of event fragmentation, and effectively improves the accuracy and completeness of sound event detection and recognition.
Example Two
FIG. 2 shows the second flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 202: after framing and windowing the signal to be detected, calculate the energy-entropy ratio of each frame, and obtain the endpoint-detection event frames of one or more events together with the corresponding endpoint-detection start frames and endpoint-detection end frames;
Step 204: predict, by the neural network model, one or more model-predicted event frames for each event, together with the corresponding model-predicted start frames and model-predicted end frames.
In the embodiment of the invention, the signal to be detected is acquired by picking up the sound produced during factory production with a sound pickup device such as a microphone, yielding an analog or digital signal of the factory sound. A pre-trained sound detection model performs detection analysis on the signal to be detected and determines at least one event contained in the sound signal.
After the sound detection model determines the events in the signal to be detected, continued detection finds at least one endpoint-detection event frame and at least one model-predicted event frame for each determined event. Both kinds of event frame are drawn from the signal frames obtained by framing a segment of the audio signal; a signal frame is determined to be an event frame when the sound detection model decides that an event occurs at the corresponding time, or when the energy-entropy ratio of the signal frame exceeds a threshold.
After the event frames are determined, the start frame and end frame of each, i.e., the endpoint-detection start frame, endpoint-detection end frame, model-predicted start frame, and model-predicted end frame, are detected. The "start frame" and "end frame" mark the "start point" and "end point" of an event frame, that is, the time at which the event begins and the time at which it ends. Consolidating event frames by their start and end frames effectively integrates fragmented event segments, so that a complete and accurate detection result is finally determined for the "event" to which each model-predicted event frame belongs, for example: "a screw-loosening event was detected from 16:00 to 16:01".
Applying this technical solution, sound in factory production is detected by an artificial-intelligence-based neural network model, i.e., the sound detection model, so that the various situations that may occur during factory production can be monitored automatically and without operators, efficiently and accurately. Because no personnel need to intervene, sound inspection no longer depends on experienced workers: detection efficiency is improved and the barrier to entry is lowered on the one hand, and workers' hearing is protected on the other. Meanwhile, endpoint detection identifies the start frame and end frame of each event frame, and consolidating the fragmented event frames by these start and end frames resolves the problem of event fragmentation and effectively improves the accuracy and completeness of sound event detection and recognition.
Example Three
FIG. 3 shows the third flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 302: frame the signal to be detected to obtain a plurality of initial signal frames, and apply windowing to obtain windowed signal frames;
Step 304: determine the energy-entropy ratio of each signal frame, and determine all signal frames whose energy-entropy ratio exceeds the energy-entropy-ratio threshold as endpoint-detection event frames;
Step 306: determine the earliest of the endpoint-detection event frames as the endpoint-detection start frame, and the latest as the endpoint-detection end frame.
In the embodiment of the invention, the signal to be detected is first framed into signal frames of a certain length, each signal frame being one sound sample; two adjacent signal frames may partially overlap. The frame length (the duration of one sound sample) and the overlap length may be adjusted to the application scenario; for example, the frame length may be 40 ms with two adjacent frames overlapping by 20 ms. It should be understood that these values are only examples; the embodiments of the invention do not limit the frame length or the overlap length.
After framing yields a plurality of signal frames, the framed signal frames are windowed, and the energy-entropy ratio of each windowed signal frame is determined. If the energy-entropy ratio of a signal frame exceeds the threshold, the frame is regarded as an endpoint-detection event frame in which some "event" occurs. The "endpoints" of the endpoint-detection event frames are then determined: the earliest frame (the frame with the earliest frame time) is the endpoint-detection start frame, and correspondingly the latest frame (the frame with the latest frame time) is the endpoint-detection end frame.
Determining whether an event occurs in a sound segment by framing and computing the energy-entropy ratio of each segment, and then performing endpoint detection on the frames in which an event occurs to obtain the endpoint-detection event frames, allows the start and stop times of an event to be judged effectively and accurately, improving the accuracy of sound event detection.
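A minimal sketch of this endpoint-detection step in Python follows, assuming the frames produced by the framing sketch above and a compute_eef helper such as the one sketched under Example Eight; the Hamming window and the threshold handling are illustrative choices, not fixed by the patent.
```python
import numpy as np

def detect_endpoint_events(frames: np.ndarray, eef_threshold: float):
    """Return (start_index, end_index) pairs of contiguous event-frame runs."""
    window = np.hamming(frames.shape[1])          # windowing, e.g. a Hamming window
    eef = np.array([compute_eef(f * window) for f in frames])
    is_event = eef > eef_threshold                # frames where some "event" occurs
    events, start = [], None
    for i, flag in enumerate(is_event):
        if flag and start is None:
            start = i                             # earliest frame -> start frame
        elif not flag and start is not None:
            events.append((start, i - 1))         # latest frame -> end frame
            start = None
    if start is not None:
        events.append((start, len(is_event) - 1))
    return events
```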
Example Four
FIG. 4 shows the fourth flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 402: extract features from the windowed signal frames to obtain a corresponding two-dimensional feature map;
Step 404: input the two-dimensional feature map into the neural network model to obtain the model-predicted event frames, model-predicted start frames, and model-predicted end frames.
In the embodiment of the invention, features are extracted from the windowed signal frames to obtain a two-dimensional feature map, which is input as the original signal into a trained neural network model. Through artificial-intelligence recognition, the neural network model predicts whether an event occurs in the input signal (the processed signal to be detected) and the start and stop times of that event, and finally outputs the model-predicted event frames together with their model-predicted start frames and model-predicted end frames.
According to the embodiments of the invention, sound in factory production is detected by an artificial-intelligence-based neural network model, i.e., the sound detection model, so that the various situations that may occur during factory production can be monitored automatically, accurately, and without operators. Because no personnel need to intervene, sound inspection no longer depends on experienced workers: detection efficiency is improved and the barrier to entry is lowered on the one hand, and workers' hearing is protected on the other.
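A minimal sketch of this step follows, assuming librosa for the Mel energy spectrum (one of the feature extraction options named elsewhere in this document) and a pre-trained model object with a frame-wise predict() interface; all names, shapes, and the 0.5 decision threshold are illustrative assumptions, not the patent's implementation.
```python
import librosa
import numpy as np

def predict_model_events(signal: np.ndarray, sample_rate: int, model):
    # One-dimensional signal -> two-dimensional feature map (Mel energy spectrum).
    feature_map = librosa.feature.melspectrogram(
        y=signal, sr=sample_rate,
        n_fft=int(0.040 * sample_rate),          # 40 ms analysis window
        hop_length=int(0.020 * sample_rate))     # 20 ms hop, matching the framing
    log_mel = librosa.power_to_db(feature_map)
    # Assumed interface: per frame, the probability that each event is active.
    probs = model.predict(log_mel[np.newaxis, ...])   # (1, n_frames, n_events)
    active = probs[0] > 0.5                           # binarise per frame
    return active   # runs of True frames give the model-predicted event frames
```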
Example Five
FIG. 5 shows the fifth flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 502: for each event, compare the times corresponding to the start and end frames of each model-predicted event frame with the frame times of the start and end frames of the endpoint-detection event frame for that event, and determine the model-predicted event frames whose frame times overlap the endpoint-detection event frame as result event frames;
Step 504: determine the detection result from the result event frames.
In the embodiment of the invention, each of the model-predicted event frames is traversed, and it is determined whether its frame time overlaps that of the endpoint-detection event frame. If two or more model-predicted event frames have frame times overlapping the same endpoint-detection event frame, they are considered to belong to the same sound event, and the model-predicted event frames are integrated, achieving effective integration of fragmented events.
For example, suppose the endpoint-detection event frame comprises 11 time frames, from time 10 to time 20; a first model-predicted event frame comprises 16 time frames, from time 0 to time 15; and a second model-predicted event frame comprises 16 time frames, from time 14 to time 29. Since the first model-predicted event frame and the endpoint-detection event frame share the time frames from time 10 to time 15, and the second model-predicted event frame and the endpoint-detection event frame share the time frames from time 14 to time 20, the first and second model-predicted event frames are integrated into a complete third event frame from time 0 to time 29.
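A minimal sketch of this integration step, reproducing the worked example above and assuming the overlaps() predicate sketched under Example Six; the tuple representation of event frames is an illustrative choice.
```python
def merge_predictions(predicted, endpoint):
    """predicted: list of (start, end) model frames; endpoint: one (start, end)."""
    overlapping = [p for p in predicted if overlaps(p, endpoint)]
    if not overlapping:
        return None
    start = min(p[0] for p in overlapping)   # earliest model-predicted start
    end = max(p[1] for p in overlapping)     # latest model-predicted end
    return (start, end)

print(merge_predictions([(0, 15), (14, 29)], (10, 20)))  # -> (0, 29)
```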
By judging whether two event frames overlap the same endpoint-detection event frame and integrating fragmented model-predicted event frames accordingly, the final detection result is more complete, fragmented events are avoided, and the accuracy and completeness of sound detection are effectively improved.
Example Six
In the embodiment of the present invention, for each event, the frame time of a model-predicted event frame is determined to overlap that of the endpoint-detection event frame when at least one of the following conditions is met:
Condition 1: the first time is earlier than or equal to the second time, and the third time is later than or equal to the second time;
Condition 2: the first time is later than or equal to the second time, and the third time is earlier than or equal to the fourth time;
Condition 3: the first time is earlier than or equal to the fourth time, and the third time is later than or equal to the fourth time;
Condition 4: the first time is earlier than or equal to the second time, and the third time is later than or equal to the fourth time.
In the embodiment of the present invention, whether two event frames at least partially overlap can be judged by simplified conditions on the "endpoint" times of the frames, i.e., the times corresponding to their start and end frames.
Specifically, the time corresponding to the model-predicted start frame is recorded as the first time; the time corresponding to the endpoint-detection start frame as the second time; the time corresponding to the model-predicted end frame as the third time; and the time corresponding to the endpoint-detection end frame as the fourth time.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the second time, the frame times of the model-predicted event frame and the endpoint-detection event frame are considered to at least partially overlap.
If the first time is later than or equal to the second time and the third time is earlier than or equal to the fourth time, the frame times of the model-predicted event frame and the endpoint-detection event frame are likewise considered to at least partially overlap.
If the first time is earlier than or equal to the fourth time and the third time is later than or equal to the fourth time, the frame times of the model-predicted event frame and the endpoint-detection event frame at least partially overlap.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time, it can likewise be determined that the frame times of the model-predicted event frame and the endpoint-detection event frame at least partially overlap.
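A minimal sketch of the four conditions as a Python predicate (the variable names are our own): t1 and t3 are the first and third times (model-predicted start and end), t2 and t4 the second and fourth times (endpoint-detection start and end).
```python
def overlaps(predicted, endpoint):
    t1, t3 = predicted   # first time (model start), third time (model end)
    t2, t4 = endpoint    # second time (endpoint start), fourth time (endpoint end)
    return ((t1 <= t2 <= t3) or          # condition 1: spans the endpoint start
            (t2 <= t1 and t3 <= t4) or   # condition 2: contained in the endpoint frame
            (t1 <= t4 <= t3) or          # condition 3: spans the endpoint end
            (t1 <= t2 and t4 <= t3))     # condition 4: contains the endpoint frame
```
Taken together, the four conditions are equivalent to the standard interval-intersection test (t1 <= t4 and t2 <= t3), which is why comparing only the endpoint times suffices.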
Judging whether two different event frames partially overlap from the endpoints of each event frame, i.e., the frame times of its start and end frames, effectively reduces computation and lowers the demands on system performance while still ensuring that the event frames are consolidated, improving the accuracy and completeness of sound detection.
Example Seven
FIG. 6 shows the sixth flowchart of a method for detecting a sound event according to an embodiment of the present invention. Specifically, the method comprises the following steps:
Step 602: determine the predicted-event start frame with the earliest time among all result event frames as the start frame of the detection result;
Step 604: determine the predicted-event end frame with the latest time among all result event frames as the end frame of the detection result.
In the embodiment of the invention, after the model-predicted event frames whose frame times at least partially overlap the endpoint-detection event frame are consolidated, an integrated, complete result event frame is obtained, from which the final detection result is determined.
Specifically, the start frame and end frame of the result event frame are first obtained: among all model-predicted event frames integrated into the result event frame, the model-predicted start frame with the earliest frame time is taken as the result-event start frame, and the model-predicted end frame with the latest frame time as the result-event end frame.
After the start and end frames of the result event frame, i.e., the start and stop times of the "sound event", are determined, the finally output event type, i.e., the event corresponding to the result event frame, is determined. Specifically, all model-predicted event frames forming the result event frame are traversed, the total frame length of each is determined, and the event type of the longest of them is taken as the event corresponding to the result event frame and marked as the event type of the finally output "event".
Finally, the detection result is determined from the result-event start frame (the start time of the event), the result-event end frame (the end time of the event), and the result event (the finally identified event type), and is output. In this way the various situations that may occur during factory production are monitored automatically, efficiently, and accurately without operators, and managers can keep abreast of events occurring in the factory in time.
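A minimal sketch of composing the final result, assuming each model-predicted frame is represented as an illustrative (start, end, event_type) tuple; the longest constituent frame labels the merged result, as described above.
```python
def build_result(frames):
    start = min(f[0] for f in frames)                 # earliest predicted start
    end = max(f[1] for f in frames)                   # latest predicted end
    longest = max(frames, key=lambda f: f[1] - f[0])  # longest total frame length
    return {"start": start, "end": end, "event": longest[2]}

print(build_result([(0, 15, "screw_loose"), (14, 29, "screw_loose")]))
# -> {'start': 0, 'end': 29, 'event': 'screw_loose'}
```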
Example eight
Fig. 7 shows a seventh flowchart of a method for detecting a sound event according to an embodiment of the present invention, and specifically, the method for detecting a sound event includes the following steps:
Step 702: acquiring the energy and entropy corresponding to any signal frame, and calculating the energy-entropy ratio through a formula;
Step 704: normalizing the calculated energy-entropy ratios, and performing several passes of median filtering on the normalized energy-entropy ratios.
In the embodiment of the present invention, specifically, the energy-entropy ratio may be calculated by the following formula:
Ef_i = sqrt(1 + |E_i / H_i|)
where Ef_i is the energy-entropy ratio of the i-th frame, E_i is the energy of the i-th frame with E_i = ∑_j |Y_j|^2, H_i is the entropy of the i-th frame with H_i = -∑_j p_j log p_j, p_j = Y_j / ∑_j |Y_j|, and Y_j denotes the j-th spectral line of the signal to be detected.
The energy-entropy ratio of each signal frame is calculated through a formula, so that whether one signal frame is an event frame or not can be judged more accurately, and the detection accuracy of the sound event is improved.
After the energy-entropy ratio corresponding to each signal frame is obtained by calculation, the energy-entropy ratios of all the signal frames are normalized and subjected to several passes of median filtering. This effectively removes the influence of signal noise, makes the obtained energy-entropy ratios more accurate, and thus allows a more accurate judgement of whether a signal frame is an event frame, further improving the accuracy of sound event detection.
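A minimal Python sketch of this computation follows, assuming the energy-entropy ratio takes the form Ef_i = sqrt(1 + |E_i/H_i|) given above; the function name, the SciPy median filter, and the small epsilon terms added for numerical stability are illustrative choices, not part of the embodiment:

```python
import numpy as np
from scipy.signal import medfilt

def energy_entropy_ratio(spectrum, passes=3, kernel=5):
    """spectrum: (num_frames, num_bins) magnitudes |Y_j| per signal frame.
    Returns the normalized, median-filtered energy-entropy ratio per frame."""
    mag = np.abs(spectrum)
    energy = np.sum(mag ** 2, axis=1)                       # E_i = sum_j |Y_j|^2
    p = mag / (np.sum(mag, axis=1, keepdims=True) + 1e-12)  # p_j
    entropy = -np.sum(p * np.log(p + 1e-12), axis=1)        # H_i
    ef = np.sqrt(1.0 + np.abs(energy / (entropy + 1e-12)))  # assumed Ef_i form
    ef = (ef - ef.min()) / (ef.max() - ef.min() + 1e-12)    # normalization
    for _ in range(passes):                                 # several median passes
        ef = medfilt(ef, kernel_size=kernel)
    return ef
```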
In some embodiments of the invention, the neural network model comprises: a convolution-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model, or a support vector machine.
In order to adapt to various application environments, any one of a convolution-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model or a support vector machine can be used as the applied neural network model when recognizing sound events, or several of these models can be combined into a "multi-stage" neural network model. The embodiment of the present invention does not limit the specific form of the neural network model.
In some embodiments of the invention, a feature extraction method comprises: a mel-frequency energy spectrum feature extraction method, a short-time fourier transform extraction method, a mel cepstral coefficient extraction method, a Bark (Bark) domain energy spectrum extraction method, an equivalent rectangular bandwidth domain (Erb) energy spectrum extraction method or a gamma-pass cepstral coefficient extraction method.
When feature extraction is performed on the framed signal frames, any one of a Mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a Mel cepstral coefficient extraction method, a Bark domain energy spectrum extraction method, an Erb domain energy spectrum extraction method or a Gammatone cepstral coefficient extraction method can be selected, according to the sound environment or the specific sound event types, to convert the one-dimensional sound signal into a two-dimensional feature map. The embodiment of the invention does not limit the specific type of the feature extraction method.
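As a hedged example of one such choice, the following sketch builds a Mel energy spectrum feature map with the librosa library; the sampling rate, 40 ms frame length, 20 ms hop and 64 Mel bands are assumptions matching the framing used elsewhere in this document, not mandated values:

```python
import librosa

def mel_feature_map(path, sr=16000, frame_ms=40, hop_ms=20, n_mels=64):
    """Convert a one-dimensional sound signal into a two-dimensional
    Mel energy spectrum feature map (frames x Mel bands)."""
    y, sr = librosa.load(path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)   # 40 ms frame length
    hop = int(sr * hop_ms / 1000)       # 20 ms hop, i.e. 20 ms overlap
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, window="hamming",
                                         n_mels=n_mels)
    return librosa.power_to_db(mel).T   # log-Mel map, shape (frames, n_mels)
```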
Example nine
In some embodiments, the solution of the present invention is described in full in terms of a practical application.
In particular, sound event detection refers to identifying the categories of the events in a sound signal and the start and stop times of their occurrence, so that corresponding decisions can be made. In the actual production process of a factory, sound quality inspection of products is an important link: for example, when workers carry out a power-on test of a washing machine, they judge whether the product has quality problems, such as a heating pipe not clamped into its bracket or a motor screw not tightened, by listening to the sound of the whole machine. The traditional manual sound quality inspection method is inefficient and easily damages workers' hearing.
The embodiment of the invention provides a method for automatically recognizing a sound signal by applying a sound event detection algorithm, so as to determine whether a specific "event" is currently occurring. The sound event detection algorithm mainly comprises two processing steps:
1. converting the one-dimensional sound signal into a two-dimensional feature map, such as Mel energy spectrum features, using signal processing techniques;
2. identifying the event information contained in the two-dimensional sound feature map using a deep learning algorithm, such as a convolution-recurrent neural network model, and outputting the event category and time labels.
The training process of the neural network mainly comprises two operations: first, a convolutional neural network learns effective local information of the feature map, which is mainly used for event classification; second, a recurrent neural network learns the context information between time frames, which is mainly used for determining the time labels.
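The following PyTorch sketch shows one possible convolution-recurrent architecture in this spirit; all layer sizes and names are illustrative assumptions, and the actual model of the embodiment is not specified at this level of detail:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolution layers learn local information of the feature map (for
    event classification); a recurrent layer learns context between time
    frames (for time labels). All sizes are illustrative assumptions."""
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)),          # pool along frequency only
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 4)))
        self.gru = nn.GRU(64 * (n_mels // 16), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                     # x: (batch, frames, n_mels)
        x = self.conv(x.unsqueeze(1))         # (batch, 64, frames, n_mels/16)
        x = x.permute(0, 2, 1, 3).flatten(2)  # (batch, frames, features)
        x, _ = self.gru(x)                    # context between time frames
        return torch.sigmoid(self.head(x))    # frame-wise event probabilities
```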
The embodiment of the invention further uses an endpoint detection technique to perform event warping on the output of the convolution-recurrent neural network, so as to solve the problem of event fragmentation.
Detecting sound events requires a corresponding neural network model, referred to hereinafter as the sound detection model. Obtaining the sound detection model generally requires two stages, training and testing, the latter further determining whether the obtained model meets the requirements.
Fig. 8 is a logic diagram of generating a sound detection model according to an embodiment of the present invention, where the training process mainly includes the following steps:
1. collecting sound signals as a training set, framing each sound sample with a frame length of 40 ms and a 20 ms overlap, and applying a Hamming window (a sketch of this preprocessing follows the steps below);
2. performing feature extraction on the sound signal based on the Mel energy spectrum, and establishing a two-dimensional feature map;
3. constructing a convolution-recurrent neural network model, and importing a two-dimensional characteristic diagram training set for training;
4. after training is finished, saving the parameters of the convolution-recurrent neural network model.
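A minimal sketch of the framing and windowing in step 1, using NumPy; the function name and the assumption that each sample is at least one frame long are illustrative:

```python
import numpy as np

def frame_signal(y, sr=16000, frame_ms=40, hop_ms=20):
    """Split a sound sample into 40 ms frames with 20 ms overlap and apply
    a Hamming window to each frame. Assumes len(y) >= one frame."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([y[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])  # (n_frames, frame_len)
```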
The test process includes the following steps:
1. collecting sound signals as a test set, framing each sound sample with a frame length of 40 ms and a 20 ms overlap, and applying a Hamming window;
2. performing endpoint detection on the sound signal based on the energy-entropy ratio to obtain the endpoint information of the sound events, specifically including the following steps:
1) carrying out short-time Fourier transform on the framed signal to obtain a time-frequency diagram of the sound signal;
2) calculating the energy and entropy of each frame of signal to obtain the energy-entropy ratio Efi of each frame, wherein the calculation formula is as follows:
Ef_i = sqrt(1 + |E_i / H_i|)
where Ef_i is the energy-entropy ratio of the i-th frame, E_i is the energy of the i-th frame with E_i = ∑_j |Y_j|^2, H_i is the entropy of the i-th frame with H_i = -∑_j p_j log p_j, p_j = Y_j / ∑_j |Y_j|, and Y_j denotes the j-th spectral line of the signal to be detected.
3) normalizing the energy-entropy ratios of all frames and performing several passes of median filtering to eliminate the influence of noise;
4) setting a threshold a (0.2 by default); a frame whose energy-entropy ratio is greater than the threshold is regarded as an event frame, otherwise as a non-event frame, and the start frame and end frame of each event are recorded;
3. performing feature extraction on the sound signal based on the Mel energy spectrum, and establishing a two-dimensional feature map;
4. reading parameters of the convolution-recurrent neural network model, and importing a two-dimensional characteristic diagram test set to obtain an event result predicted by the model;
5. performing event warping on the event results predicted by the model, using the endpoint information obtained in step 2, to obtain and output the final prediction result, specifically including the following steps (a code sketch follows these steps):
1) based on the event frames obtained by endpoint detection, finding all model prediction event frames satisfying any of the following 4 conditions (FIG. 9 illustrates the relationship between un-integrated model prediction event frames and an endpoint detection event frame):
m is less than or equal to a and n is greater than or equal to a, as in the model prediction event frame 1 in FIG. 9;
m is greater than or equal to a and n is less than or equal to b, as in the model prediction event frame 2 in FIG. 9;
m is less than or equal to b and n is greater than or equal to b, as in the model prediction event frame 3 in FIG. 9;
m is less than or equal to a and n is greater than or equal to b, as in the model prediction event frame 4 in FIG. 9;
where m and n are the start frame and the end frame of the model predicted event frame, respectively, and a and b are the start frame and the end frame of the endpoint detected event frame, respectively.
2) FIG. 10 illustrates the relationship between integrated model prediction event frames and an endpoint detection event frame according to an embodiment of the present invention. The total frame length of all the model prediction event frames in each category is counted, and the event category with the longest total frame length is taken as the recognition result category, such as category 1 in FIG. 10;
among all the model prediction event frames classified into the recognition result category, the earliest start frame is taken as the result start frame and the latest end frame as the result end frame, and the frame result is converted into a time result and output, as shown in FIG. 10.
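The sketch below strings the warping steps together in Python; the data layout (lists of (start, end) and (start, end, category) tuples) and the function name are assumptions for illustration:

```python
def warp_events(endpoint_frames, model_frames):
    """endpoint_frames: list of (a, b) endpoint detection event frames.
    model_frames: list of (m, n, category) model prediction event frames.
    Returns one (start, end, category) result per endpoint event."""
    results = []
    for a, b in endpoint_frames:
        # Step 1): the 4 conditions reduce to an interval intersection test.
        hits = [f for f in model_frames if f[0] <= b and f[1] >= a]
        if not hits:
            continue
        # Step 2): the category with the longest total frame length wins.
        totals = {}
        for m, n, c in hits:
            totals[c] = totals.get(c, 0) + (n - m)
        category = max(totals, key=totals.get)
        kept = [f for f in hits if f[2] == category]
        # Earliest start and latest end of the winning category.
        results.append((min(f[0] for f in kept),
                        max(f[1] for f in kept), category))
    return results
```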
FIG. 11 shows a mapping between a sound signal and sound event detection according to an embodiment of the present invention. By integrating fragmented model prediction event frames, the event fragmentation that occurs in sound event detection algorithms is alleviated to a certain extent, which improves event recognition accuracy and facilitates later practical deployment.
Example ten
Fig. 12 is a block diagram illustrating a structure of a sound event detection apparatus according to an embodiment of the present invention, wherein the sound event detection apparatus 1200 includes: a determining module 1202, configured to determine an endpoint detection event frame of at least one event in a signal to be detected, respectively; a detection module 1204 for determining at least one model predicted event frame for each event in the signal to be detected; the determining module 1202 is further configured to compare, for each event, at least one model predicted event frame with the endpoint detection event frame, and determine a detection result for each event.
In the embodiment of the invention, an endpoint detection event frame corresponding to at least one 'event' can be identified in a signal to be detected through an endpoint detection method, and one or more model prediction event frames under each 'event' can be predicted in the signal to be detected through an artificial intelligence algorithm and the like.
After the endpoint detection event frame and the model prediction event frames are determined, they are further compared, and a plurality of fragmented model prediction event frames are integrated to obtain the detection result corresponding to each event, which is then output.
The "event" may be an accident sound event caused by a production accident, such as an accident sound of falling of an article, glass breakage, or a specific quality problem of a produced product, such as a quality problem of a screw not tightened (when the screw is not tightened, a specific noise may occur during a test operation of the product), a quality problem of a heating pipe not clamped in a bracket (when the heating pipe is loosened, a specific noise may occur due to movement of the heating pipe), or other sounds occurring in a normal production activity of a factory, such as a talk between persons, a noise during normal operation of a machine, a down bell or other warning sounds, and the specific type of the "event" in a sound signal is not limited in the embodiment of the present invention.
According to the embodiment of the invention, the endpoint detection event frame is compared with the model prediction event frames, and the detection result of each event is determined based on the comparison result, so that a plurality of model prediction event frames are integrated, the problem of event fragmentation is solved, and the accuracy and integrity of sound event detection and recognition are effectively improved.
In some embodiments of the present invention, the determining module 1202 is further configured to determine an endpoint detection event frame of at least one event in the signal to be detected, and determine an endpoint detection start frame and an endpoint detection end frame of the endpoint detection event frame; and to determine at least one model prediction event frame of each event in the signal to be detected through a neural network model, and determine a model prediction start frame and a model prediction end frame of each model prediction event frame of each event.
The signal to be detected is picked up by sound pickup equipment, such as a microphone, during product manufacturing in a factory, yielding an analog or digital signal of the factory production sound. Through the pre-trained sound detection model, sound detection analysis can then be performed on the signal to be detected, and at least one event contained in the sound signal can be determined.
After the events in the signal to be detected are determined by the sound detection model, continued detection of each determined event yields at least one endpoint detection event frame and at least one model prediction event frame. An endpoint detection event frame or model prediction event frame may be one of a plurality of signal frames obtained by framing a section of the audio signal; when the sound detection model determines that an event occurs at the time corresponding to a signal frame, or the energy-entropy ratio of the signal frame is greater than the threshold, the signal frame is determined to be an event frame.
After the event frames are determined, the start frame and end frame of each event frame are detected, i.e. the endpoint detection start frame, the endpoint detection end frame, the model prediction start frame and the model prediction end frame. The "start frame" and "end frame" represent the "start point" and "end point" of an event frame respectively, i.e. the initial time at which the event occurs and the final time at which it ends. Warping the event frames according to their start and end frames effectively integrates fragmented event segments, so that the complete and accurate detection result of the "event" to which each model prediction event frame belongs is finally determined, for example: "a screw loosening event was detected from 16:00 to 16:01".
By applying the technical solution provided by the invention, sound detection is performed on factory production sound using an artificial-intelligence-based neural network model, i.e. the sound detection model, so that various conditions that may occur in the factory production process can be monitored automatically, without human intervention, efficiently and accurately. Because no personnel need to intervene, sound detection no longer depends on experienced workers, which on the one hand improves detection efficiency and lowers the detection threshold, and on the other hand avoids damage to human hearing. Meanwhile, through endpoint detection, the start frame and end frame of each event frame are identified, and a plurality of fragmented event frames are effectively warped according to these start and end frames, which solves the problem of event fragmentation and effectively improves the accuracy and integrity of sound event detection and recognition.
In some embodiments of the present invention, the apparatus 1200 for detecting a sound event further comprises: a framing module 1206, configured to frame a signal to be detected to obtain multiple signal frames; the determining module 1202 is further configured to determine an energy-entropy ratio of any signal frame, and determine a signal frame with the energy-entropy ratio greater than an energy-entropy ratio threshold as an endpoint detection event frame; and determining the frame with the earliest time in the endpoint detection event frames as an endpoint detection starting frame, and taking the frame with the latest time in the endpoint detection event frames as an endpoint detection ending frame.
In the embodiment of the invention, the signal to be detected is first framed to obtain signal frames of a certain length, where each signal frame is a sound sample and two adjacent signal frames may partially overlap. The frame length (the duration of the sound sample) and the overlap length may be adjusted according to the application scenario; for example, a signal frame may have a frame length of 40 ms, with two adjacent signal frames overlapping by 20 ms. It can be understood that these frame and overlap lengths are only examples; the embodiment of the present invention does not limit the length and overlap of signal frames.
After a plurality of signal frames are obtained by framing, the energy-entropy ratio of each signal frame is further determined. If the energy-entropy ratio of a signal frame is greater than the energy-entropy ratio threshold, the current signal frame can be regarded as an endpoint detection event frame in which some "event" occurs. The "endpoints" of the endpoint detection event are then determined: the earliest frame (the frame with the earliest frame time) among the endpoint detection event frames is the endpoint detection start frame, and correspondingly, the latest frame (the frame with the latest frame time) among the endpoint detection event frames is the endpoint detection end frame.
Determining whether an event occurs in a sound segment by framing and calculating the energy-entropy ratio of each segment, and then performing endpoint detection on the frames in which the event occurs to obtain the endpoint detection event frame, makes it possible to judge the start and stop times of the event effectively and accurately, improving the accuracy of sound event detection.
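As an illustrative sketch, the following Python function extracts endpoint detection events from a sequence of per-frame energy-entropy ratios; the 0.2 default threshold mirrors the value used elsewhere in this document, and the function name is an assumption:

```python
def endpoint_events(ef, threshold=0.2):
    """ef: per-frame energy-entropy ratios. Returns (start, end) frame index
    pairs: each run of frames above the threshold is one endpoint detection
    event, its earliest frame the start frame and its latest the end frame."""
    events, start = [], None
    for i, v in enumerate(ef):
        if v > threshold and start is None:
            start = i                       # endpoint detection start frame
        elif v <= threshold and start is not None:
            events.append((start, i - 1))   # endpoint detection end frame
            start = None
    if start is not None:                   # event still open at signal end
        events.append((start, len(ef) - 1))
    return events
```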
In some embodiments of the present invention, the apparatus 1200 for detecting a sound event further comprises: an extraction module 1208, configured to perform windowing on the signal frame through a window function, and perform feature extraction on the windowed signal frame through a feature extraction method to obtain a corresponding two-dimensional feature map; the processing module 1210 is configured to input the two-dimensional feature map into the neural network model, obtain a model predicted event frame output by the neural network model, and obtain a model predicted start frame and a model predicted end frame.
In the embodiment of the invention, before the signal to be detected is input into the neural network model, firstly, the windowing processing is carried out on the signal frame obtained after framing through the preset window function. The window function may be a rectangular window or a non-rectangular window, such as a hamming window, a hanning window, and the like, and the specific type of the window function is not limited in the embodiment of the present invention.
After windowing, feature extraction is further performed on the windowed signal frames by a feature extraction method to obtain a two-dimensional feature map, which is input as the original signal into the trained neural network model. Through artificial-intelligence recognition, the neural network model can predict whether an event occurs in the input signal (the processed signal to be detected) and the start and stop times of the event, and finally outputs the model prediction event frame together with its model prediction start frame and model prediction end frame.
According to the embodiment of the invention, sound detection is performed on factory production sound using the artificial-intelligence-based neural network model, i.e. the sound detection model, so that various conditions that may occur in the factory production process can be monitored automatically, without human intervention, efficiently and accurately. Because no personnel need to intervene, sound detection no longer depends on experienced workers, which on the one hand improves detection efficiency and lowers the detection threshold, and on the other hand avoids damage to human hearing.
In some embodiments of the invention, the determining module 1202 is further configured to: for each event, integrate at least two model prediction event frames into a result event frame when their frame times at least partially overlap the frame time of the endpoint detection event frame, until none of the remaining model prediction event frames has a frame time overlapping the frame time of the endpoint detection event frame; and determine the detection result according to the result event frame.
In the embodiment of the invention, each of the plurality of model prediction event frames is traversed to determine whether it shares any frame time with the endpoint detection event frame. If two or more model prediction event frames have frame times that overlap the same endpoint detection event frame, they are considered to belong to the same sound event, and these model prediction event frames are integrated, thereby effectively integrating fragmented events.
Specifically, suppose for example that the endpoint detection event frame spans the 11 frames from time 10 to time 20, a first model prediction event frame spans the 16 frames from time 0 to time 15, and a second model prediction event frame spans the 16 frames from time 14 to time 29. Since the first model prediction event frame and the endpoint detection event frame share the frames from time 10 to time 15, and the second model prediction event frame and the endpoint detection event frame share the frames from time 14 to time 20, the first and second model prediction event frames are integrated into a complete third event frame from time 0 to time 29.
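The numbers of this example can be checked with a few lines of Python (an illustration only, not the embodiment's implementation):

```python
# Checking the worked example above: both model prediction event frames
# (0-15 and 14-29) overlap the endpoint detection event frame (10-20),
# so they merge into one complete event frame (0-29).
endpoint = (10, 20)
fragments = [(0, 15), (14, 29)]
overlapping = [f for f in fragments
               if f[0] <= endpoint[1] and f[1] >= endpoint[0]]
merged = (min(f[0] for f in overlapping), max(f[1] for f in overlapping))
print(merged)  # -> (0, 29)
```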
By effectively integrating fragmented model prediction event frames based on the judgement of whether two event frames overlap the same endpoint detection event frame, the finally output detection result is more complete, fragmented events are avoided, and the accuracy and integrity of sound detection are effectively improved.
In some embodiments of the present invention, comparing the at least one model prediction event frame with the endpoint detection event frame specifically includes: for each event, determining that the frame time corresponding to the model prediction event frame overlaps the frame time corresponding to the endpoint detection event frame when the first time corresponding to the model prediction start frame is earlier than or equal to the second time corresponding to the endpoint detection start frame and the third time corresponding to the model prediction end frame is later than or equal to the second time; or when the first time is later than or equal to the second time and the third time is earlier than or equal to the fourth time corresponding to the endpoint detection end frame; or when the first time is earlier than or equal to the fourth time and the third time is later than or equal to the fourth time; or when the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time.
In the embodiment of the present invention, whether a first event frame and a second event frame at least partially overlap can be determined by simplified conditions based on the "endpoint" times of the event frames, that is, the times corresponding to their start and end frames.
Specifically, the time corresponding to the model prediction start frame is recorded as a first time; recording the time corresponding to the endpoint detection initial frame as a second time; recording the moment corresponding to the model prediction end frame as a third moment; and recording the time corresponding to the end point detection ending frame as a fourth time.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the second time, the frame times of the model prediction event frame and the endpoint detection event frame are considered to at least partially overlap each other.
If the first time is later than or equal to the second time and the third time is earlier than or equal to the fourth time, the frame times of the model prediction event frame and the endpoint detection event frame are likewise considered to at least partially overlap each other.
If the first time is earlier than or equal to the fourth time and the third time is later than or equal to the fourth time, the frame times of the model prediction event frame and the endpoint detection event frame are considered to at least partially overlap each other.
If the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time, it can also be determined that the frame times of the model prediction event frame and the endpoint detection event frame at least partially overlap each other.
Whether two different event frames partially overlap is judged from the endpoints of each event frame, namely the frame times of its start frame and end frame. This effectively reduces computation, lowers the demand on system performance, ensures effective warping of the event frames, and improves the accuracy and integrity of sound detection.
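For illustration, the four overlap conditions can be collapsed into a single interval-intersection test, assuming a start frame never comes after its end frame; the function name is a hypothetical:

```python
def frames_overlap(m, n, a, b):
    """m, n: start/end times of the model prediction event frame;
    a, b: start/end times of the endpoint detection event frame.
    Assuming m <= n and a <= b, the four enumerated conditions are
    together equivalent to this single interval-intersection test."""
    return m <= b and n >= a
```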
In some embodiments of the invention, the determining module 1202 is further configured to: determine the model prediction start frame with the earliest time among all the model prediction event frames integrated into the result event frame as the result event start frame of the result event frame; determine the model prediction end frame with the latest time among all the model prediction event frames integrated into the result event frame as the result event end frame of the result event frame; determine the event corresponding to the model prediction event frame with the longest frame length among all the model prediction event frames integrated into the result event frame as the result event corresponding to the result event frame; and determine the result event start frame, the result event end frame and the result event as the detection result.
In the embodiment of the invention, after the plurality of model prediction event frames whose frame times at least partially overlap the endpoint detection event frame are unified, an integrated, complete result event frame is obtained, and the final detection result is determined from this result event frame.
Specifically, the start frame and end frame of the result event frame are first acquired. Among all the model prediction event frames integrated into the result event frame, the model prediction start frame with the earliest frame time is taken as the result event start frame, and the model prediction end frame with the latest frame time is taken as the result event end frame.
After the start frame and end frame of the result event frame, i.e. the start and stop times of the "sound event", are determined, the finally output event type, i.e. the event corresponding to the result event frame, is determined. Specifically, all the model prediction event frames integrated into the result event frame are traversed, the frame length of each is determined, and the event type of the model prediction event frame with the longest frame length is taken as the event corresponding to the result event frame and marked as the type of the finally output "event".
Finally, the detection result is determined from the result event start frame (the start time of the event), the result event end frame (the end time of the event) and the result event (the finally recognized event type) corresponding to the result event frame, and the detection result is output. In this way, various conditions that may occur in the factory production process are monitored automatically, without human intervention, efficiently and accurately, and managers can grasp the events occurring in the factory in time.
In some embodiments of the present invention, the energy-entropy ratio may be calculated by the following formula:
Ef_i = sqrt(1 + |E_i / H_i|)
where Ef_i is the energy-entropy ratio of the i-th frame, E_i is the energy of the i-th frame with E_i = ∑_j |Y_j|^2, H_i is the entropy of the i-th frame with H_i = -∑_j p_j log p_j, p_j = Y_j / ∑_j |Y_j|, and Y_j denotes the j-th spectral line of the signal to be detected.
The energy-entropy ratio of each signal frame is calculated through a formula, so that whether one signal frame is an event frame or not can be judged more accurately, and the detection accuracy of the sound event is improved.
In some embodiments of the invention, the processing module 1210 is further configured to: and carrying out normalization processing on the energy-entropy ratio, and carrying out median filtering processing on the energy-entropy ratio after the normalization processing.
In the embodiment of the invention, after the energy-entropy ratio corresponding to each signal frame is obtained by calculation, the energy-entropy ratios of all the signal frames are normalized and subjected to several passes of median filtering. This effectively removes the influence of signal noise, makes the obtained energy-entropy ratios more accurate, and helps to judge more accurately whether a signal frame is an event frame, thereby further improving the accuracy of sound event detection.
In some embodiments of the invention, the neural network model comprises: a convolution-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model, or a support vector machine. The feature extraction method comprises: Mel energy spectrum feature extraction, short-time Fourier transform extraction, Mel cepstral coefficient extraction, Bark domain energy spectrum extraction, Erb domain energy spectrum extraction, or Gammatone cepstral coefficient extraction.
In order to adapt to various application environments, any one of a convolution-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model or a support vector machine can be used as the applied neural network model when recognizing sound events, or several of these models can be combined into a "multi-stage" neural network model. The embodiment of the present invention does not limit the specific form of the neural network model.
When feature extraction is performed on the framed signal frames, any one of a Mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a Mel cepstral coefficient extraction method, a Bark domain energy spectrum extraction method, an Erb domain energy spectrum extraction method or a Gammatone cepstral coefficient extraction method can be selected, according to the sound environment or the specific sound event types, to convert the one-dimensional sound signal into a two-dimensional feature map. The embodiment of the invention does not limit the specific type of the feature extraction method.
Example eleven
Fig. 13 shows a block diagram of a computer apparatus according to an embodiment of the present invention, the computer apparatus 1300 including: a memory 1302 having a computer program stored thereon; the processor 1304 is configured to implement the steps of the sound event detection method provided in any of the above embodiments when executing the computer program, so that the computer device 1300 simultaneously includes all the advantages of the sound event detection method provided in any of the above embodiments, which will not be described herein again.
Example twelve
In some embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when being executed by a processor, can implement the steps of the method for detecting a sound event provided in any of the above embodiments, so that the computer-readable storage medium simultaneously includes all the advantages of the method for detecting a sound event provided in any of the above embodiments, and the details are not repeated herein.
Example thirteen
FIG. 14 shows a block diagram of an electronic device according to an embodiment of the invention, the electronic device 1500 including but not limited to: a radio frequency unit 1502, a network module 1504, an audio output unit 1506, an input unit 1508, a sensor 1510, a display unit 1512, a user input unit 1514, an interface unit 1516, a memory 1518, a processor 1520, and a power supply 1522. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 14 does not constitute a limitation of the electronic device, and that the electronic device may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present application, the electronic device includes, but is not limited to, a mobile terminal, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, a pedometer, and the like.
Meanwhile, the processor 1520 can run the computer program on the memory 1518 to further implement the steps of the method in any of the above embodiments, so that the electronic device further includes all the advantages of any of the above embodiments, which is not described herein again.
It should be understood that, in the embodiment of the present application, the radio frequency unit 1502 may be configured to send and receive information or send and receive signals during a call, and in particular, receive downlink data of a base station or send uplink data to the base station. Radio frequency unit 1502 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
The network module 1504 provides wireless broadband internet access to the user, such as assisting the user in emailing, browsing web pages, and accessing streaming media.
The audio output unit 1506 may convert audio data received by the radio frequency unit 1502 or the network module 1504 or stored in the memory 1518 into an audio signal and output as sound. Also, the audio output unit 1506 may also provide audio output related to a specific function performed by the electronic device 1500 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 1506 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1508 is for receiving audio or video signals. The input unit 1508 may include a graphics processing unit (GPU) 5082 and a microphone 5084. The graphics processor 5082 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 1512, stored in the memory 1518 (or other storage medium), or transmitted via the radio frequency unit 1502 or the network module 1504. The microphone 5084 may receive sound and process it into audio data; in the phone call mode, the processed audio data may be converted into a format transmittable to a mobile communication base station and output via the radio frequency unit 1502.
The electronic device 1500 also includes at least one sensor 1510, such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, a light sensor, a motion sensor, and others.
The display unit 1512 is used to display information input by the user or information provided to the user. The display unit 1512 may include a display panel 5122, and the display panel 5122 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
The user input unit 1514 may be used to receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 1514 includes a touch panel 5142 and other input devices 5144. Touch panel 5142, also referred to as a touch screen, can collect touch operations by a user on or near it. The touch panel 5142 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1520, and receives and executes commands from the processor 1520. Other input devices 5144 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 5142 can be overlaid on the display panel 5122, and when the touch panel 5142 detects a touch operation thereon or nearby, the touch panel is transmitted to the processor 1520 to determine the type of the touch event, and then the processor 1520 provides a corresponding visual output on the display panel 5122 according to the type of the touch event. The touch panel 5142 and the display panel 5122 can be provided as two separate components or can be integrated into one component.
The interface unit 1516 is an interface for connecting an external device to the electronic apparatus 1500. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 1516 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 1500, or may be used to transmit data between the electronic apparatus 1500 and the external device.
The memory 1518 may be used to store software programs as well as various data. The memory 1518 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile terminal, and the like. Further, the memory 1518 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 1520 performs various functions of the electronic device 1500 and processes data by running or executing software programs and/or modules stored in the memory 1518 and calling data stored in the memory 1518, thereby performing overall monitoring of the electronic device 1500. Processor 1520 may include one or more processing units; preferably, the processor 1520 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications.
The electronic device 1500 can also include a power supply 1522 that provides power to the various components, and preferably, the power supply 1522 can be logically coupled to the processor 1520 via a power management system that can provide management of charging, discharging, and power consumption.
In an embodiment of the present application, a readable storage medium is provided, on which a program or instructions are stored; when the program or instructions are executed by a processor, the steps of the sound event detection method provided in any of the above embodiments are implemented.
In this embodiment, the readable storage medium can implement each process of the sound event detection method provided in the embodiments of the present application and achieve the same technical effect, which is not repeated here to avoid repetition.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface being coupled to the processor, and the processor being configured to run a program or instructions to implement each process of the above sound event detection method embodiments and achieve the same technical effect, which is not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
In the description of the present invention, the terms "plurality" or "a plurality" refer to two or more, and unless otherwise specifically defined, the terms "upper", "lower", and the like indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention; the terms "connected," "mounted," "secured," and the like are to be construed broadly and include, for example, fixed connections, removable connections, or integral connections; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the description of the present invention, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In the present invention, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A method for detecting a sound event, comprising:
respectively determining an endpoint detection event frame of at least one event in a signal to be detected;
determining at least one model predicted event frame for each of the events in the signal to be detected;
for each of the events, comparing the at least one model predicted event frame to the endpoint detection event frame to determine a detection result for each of the events.
2. The method according to claim 1, wherein the determining the endpoint detection event frame of at least one event in the signal to be detected respectively comprises:
respectively determining the endpoint detection event frame of at least one event in the signals to be detected, and determining an endpoint detection starting frame and an endpoint detection ending frame of the endpoint detection event frame; and
the determining at least one model predicted event frame for each event in the signal to be detected specifically includes:
and determining at least one model prediction event frame of each event in the signal to be detected through a neural network model, and determining a model prediction starting frame and a model prediction ending frame of each model prediction event frame of each event.
3. The method according to claim 2, wherein the determining the endpoint detection event frame of at least one of the events in the signal to be detected and determining the endpoint detection start frame and the endpoint detection end frame of the endpoint detection event frame respectively comprises:
framing the signal to be detected to obtain a plurality of signal frames, and windowing;
calculating the energy-entropy ratio of any signal frame, and determining the signal frame with the energy-entropy ratio larger than an energy-entropy ratio threshold as the endpoint detection event frame; and
and taking the frame with the earliest time in the endpoint detection event frames as the endpoint detection starting frame, and taking the frame with the latest time in the endpoint detection event frames as the endpoint detection ending frame.
4. The method according to claim 3, wherein the determining at least one model predicted event frame for each event in the signal to be detected comprises:
carrying out feature extraction on the signal frame subjected to windowing processing by a feature extraction method to obtain a corresponding two-dimensional feature map;
and inputting the two-dimensional feature map into the neural network model, and acquiring the model prediction event frame, the model prediction starting frame and the model prediction ending frame output by the neural network model.
5. The method according to any one of claims 2 to 4, wherein, based on the existence of at least two model prediction event frames, the comparing the at least one model prediction event frame with the endpoint detection event frame to determine the detection result of each event specifically comprises:
for each event, comparing the corresponding time of the start frame and the end frame of each model predicted event frame with the frame time of the start frame and the end frame of the endpoint detection event frame corresponding to the event, and determining the model predicted event frame which overlaps with the endpoint detection event frame in frame time as a result event frame;
and determining the detection result according to the result event frame.
6. The method for detecting a sound event according to claim 5, wherein the overlapping specifically comprises:
for each event, determining that the frame time corresponding to the model prediction event frame and the frame time corresponding to the endpoint detection event frame have the overlap, based on a first time corresponding to the model prediction start frame being earlier than or equal to a second time corresponding to the endpoint detection start frame and a third time corresponding to the model prediction end frame being later than or equal to the second time; or
Determining that the overlap exists between the frame time corresponding to the model prediction event frame and the frame time corresponding to the endpoint detection event frame based on the first time being later than or equal to the second time and the third time being earlier than or equal to a fourth time corresponding to the endpoint detection end frame; or
Determining that the overlap exists between the frame time corresponding to the model prediction event frame and the frame time corresponding to the endpoint detection event frame based on the first time being earlier than or equal to the fourth time and the third time being later than or equal to the fourth time; or
And determining that the overlap exists between the frame time corresponding to the model prediction event frame and the frame time corresponding to the endpoint detection event frame based on the case that the first time is earlier than or equal to the second time and the third time is later than or equal to the fourth time.
7. The method for detecting a sound event according to claim 5, wherein the determining as the detection result according to the result event frame specifically includes:
determining a starting frame of a predicted event with the earliest moment in all the result event frames as a starting frame of the detection result; and
and determining the predicted event end frame with the latest time in all the result event frames as the end frame of the detection result.
8. The method according to claim 3, wherein the determining the energy-to-entropy ratio of any signal frame specifically comprises:
and acquiring energy and entropy corresponding to any signal frame, calculating the absolute value of the ratio of the energy to the entropy, and determining the energy-entropy ratio according to the absolute value.
9. The method of detecting a sound event according to claim 8, wherein after the determining the energy-entropy ratio according to the absolute value, the method further comprises:
and carrying out normalization processing on the energy entropy ratio value, and carrying out median filtering processing on the energy entropy ratio value after the normalization processing.
10. The method of detecting a sound event according to any one of claims 2 to 4, wherein the neural network model comprises:
a convolution-recurrent neural network, a convolutional neural network, a recurrent neural network, a hidden Markov model, a Gaussian mixture model, or a support vector machine.
11. The method of detecting a sound event according to claim 4, wherein the feature extraction method comprises:
a Mel energy spectrum feature extraction method, a short-time Fourier transform extraction method, a Mel cepstral coefficient extraction method, a Bark domain energy spectrum extraction method, an equivalent rectangular bandwidth domain energy spectrum extraction method or a Gammatone cepstral coefficient extraction method.
12. An apparatus for detecting a sound event, comprising:
the determining module is used for respectively determining an endpoint detection event frame of at least one event in the signal to be detected;
a detection module for determining at least one model predicted event frame for each of said events in said signal to be detected;
the determining module is further configured to compare, for each of the events, the at least one model predicted event frame with the endpoint detection event frame, and determine a detection result for each of the events.
13. A computer device, comprising:
a memory having a computer program stored thereon;
a processor configured to implement the method of detecting a sound event of any one of claims 1 to 11 when running the computer program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of detecting a sound event according to any one of claims 1 to 11.
15. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is configured to execute the method of detecting a sound event of any one of claims 1 to 11 by the computer program.
CN202011008499.2A 2020-09-23 2020-09-23 Sound event detection method and device, computer equipment and storage medium Pending CN114252906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011008499.2A CN114252906A (en) 2020-09-23 2020-09-23 Sound event detection method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114252906A true CN114252906A (en) 2022-03-29

Family

ID=80788860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011008499.2A Pending CN114252906A (en) 2020-09-23 2020-09-23 Sound event detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114252906A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134702A (en) * 2022-06-30 2022-09-30 歌尔股份有限公司 Earphone control method and device, earphone and computer readable storage medium
CN117388835A (en) * 2023-12-13 2024-01-12 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method
CN117388835B (en) * 2023-12-13 2024-03-08 湖南赛能环测科技有限公司 Multi-spelling fusion sodar signal enhancement method

Similar Documents

Publication Publication Date Title
CN108735209B (en) Wake-up word binding method, intelligent device and storage medium
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN109558512B (en) Audio-based personalized recommendation method and device and mobile terminal
CN108511002B (en) Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium
CN108711430B (en) Speech recognition method, intelligent device and storage medium
CN108229441B (en) Classroom teaching automatic feedback system and feedback method based on image and voice analysis
CN111210021A (en) Audio signal processing method, model training method and related device
CN105869655A (en) Audio device and method for voice detection
CN111325386B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN112735473B (en) Method and system for identifying unmanned aerial vehicle based on voice
CN110570840A (en) Intelligent device awakening method and device based on artificial intelligence
CN110853617A (en) Model training method, language identification method, device and equipment
CN110837758B (en) Keyword input method and device and electronic equipment
CN112735388B (en) Network model training method, voice recognition processing method and related equipment
CN114252906A (en) Sound event detection method and device, computer equipment and storage medium
CN112331208B (en) Personal safety monitoring method, device, electronic equipment and storage medium
CN109754823A (en) A kind of voice activity detection method, mobile terminal
CN110728993A (en) Voice change identification method and electronic equipment
CN113225624B (en) Method and device for determining time consumption of voice recognition
CN114254685A (en) Training method and device of sound detection model and detection method of sound event
CN112948763B (en) Piece quantity prediction method and device, electronic equipment and storage medium
CN111080305A (en) Risk identification method and device and electronic equipment
CN116129942A (en) Voice interaction device and voice interaction method
CN109709561A (en) Distance measuring method, terminal and computer readable storage medium
CN112309374B (en) Service report generation method, device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination