CN110600059B - Acoustic event detection method and device, electronic equipment and storage medium - Google Patents

Acoustic event detection method and device, electronic equipment and storage medium

Info

Publication number
CN110600059B
CN110600059B (application CN201910838074.5A)
Authority
CN
China
Prior art keywords
acoustic event
data
acoustic
sound
characteristic data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910838074.5A
Other languages
Chinese (zh)
Other versions
CN110600059A (en)
Inventor
刘文龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Shanghai Jinsheng Communication Technology Co ltd
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jinsheng Communication Technology Co ltd, Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Shanghai Jinsheng Communication Technology Co ltd
Priority to CN201910838074.5A priority Critical patent/CN110600059B/en
Publication of CN110600059A publication Critical patent/CN110600059A/en
Application granted granted Critical
Publication of CN110600059B publication Critical patent/CN110600059B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Abstract

The application provides an acoustic event detection method and apparatus, an electronic device, and a storage medium. Original sound data of at least one acoustic event collected in a first time period in a first area is acquired; at least one acoustic event feature data of the at least one acoustic event is generated from the frequency domain features in the raw sound data, the at least one acoustic event feature data being used to characterize the acoustic events in the raw sound data; and a category of each of the at least one acoustic event is determined based on the at least one acoustic event feature data. A plurality of acoustic events occurring in the same time period can be detected and their categories determined, so that the efficiency of acoustic event detection is greatly improved.

Description

Acoustic event detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of acoustic recognition, and in particular, to a method and an apparatus for detecting an acoustic event, an electronic device, and a storage medium.
Background
With the development of technology, more and more scenarios require acoustic event detection. Acoustic event detection refers to the process of detecting and labeling segments with definite semantics in a continuous audio signal stream. It is an important basis for a machine to recognize and semantically understand environmental sound scenes, and will play an important role in the semantic understanding of the sound environment by humanoid robots and in the sound perception of the surrounding environment by unmanned vehicles.
In reality, different acoustic events often occur in the same time period; for example, a doorbell may ring while a child is crying, so the sound data of that period contains two acoustic events, crying and a doorbell. Existing acoustic event detection methods mainly perform classification detection through template matching, traditional machine learning classification algorithms, and deep neural network algorithms. Only one type of event can be detected at a time, and multiple acoustic events occurring in the same time period cannot be detected, which is very inconvenient.
Disclosure of Invention
Based on the above problems, the present application provides an acoustic event detection method, apparatus, electronic device, and computer storage medium, which can detect a plurality of acoustic events occurring within the same time period, thereby greatly improving the detection efficiency.
A first aspect of an embodiment of the present application provides an acoustic event detection method, where the method includes:
acquiring original sound data of at least one acoustic event acquired within a first time period of a first region;
generating at least one acoustic event feature data of the at least one acoustic event from the frequency domain features in the raw sound data, the at least one acoustic event feature data being used to characterize the acoustic event in the raw sound data;
determining a category for each of the at least one acoustic event based on the at least one acoustic event characteristic data.
A second aspect of embodiments of the present application provides an acoustic event detection apparatus, comprising a processing unit and a communication unit, wherein,
the processing unit is used for acquiring original sound data of at least one acoustic event collected in a first period of time in a first area through the communication unit; and generating at least one acoustic event feature data for the at least one acoustic event from the frequency domain features in the raw sound data, the at least one acoustic event feature data being used to characterize the acoustic event in the raw sound data; and determining a category for each of the at least one acoustic event from the at least one acoustic event characteristic data.
A third aspect of embodiments of the present application provides an electronic device, comprising an application processor, an input device, an output device, and a memory, wherein the application processor, the input device, the output device, and the memory are interconnected, the memory is configured to store a computer program comprising program instructions, and the application processor is configured to invoke the program instructions to perform the method according to the first aspect of the embodiments of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform some or all of the method steps of any one of the methods of the first aspect of embodiments of the present application.
A fifth aspect of embodiments of the present application provides a computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps as described in any one of the methods of the first aspect of embodiments of the present application. The computer program product may be a software installation package.
By implementing the embodiment of the application, the following beneficial effects can be obtained:
according to the acoustic event detection method, the acoustic event detection device, the electronic equipment and the storage medium, original sound data of at least one acoustic event collected in a first time period of a first area are obtained; generating at least one acoustic event feature data of the at least one acoustic event from the frequency domain features in the raw sound data, the at least one acoustic event feature data being used to characterize the acoustic event in the raw sound data; determining a category for each of the at least one acoustic event based on the at least one acoustic event characteristic data. A plurality of acoustic events occurring in the same time period can be detected, and the categories of the acoustic events can be determined, so that the efficiency of acoustic event detection is greatly improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a system architecture diagram of an acoustic event detection method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an acoustic event detection method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of the sound data processing method based on FIG. 2 in an embodiment of the present application;
FIG. 4 is a schematic flow chart of another acoustic event detection method in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an acoustic event model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application;
fig. 7 is a block diagram illustrating functional units of an acoustic event detection apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The term "first" in the description and claims of this application and the above-described drawings is used to denote any one object not specifically, but rather, to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The electronic device according to the embodiments of the present application may be an electronic device with communication capability, and the electronic device may include various handheld devices with wireless communication function, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and various forms of User Equipment (UE), Mobile Stations (MS), terminal devices (terminal device), and so on.
At present, existing acoustic event detection generally performs classification detection through template matching, traditional machine learning classification algorithms, or deep neural network algorithms, and can only detect one acoustic event at a time. If a plurality of acoustic events occur in the same time period, the existing methods cannot detect each acoustic event simultaneously, so the efficiency of acoustic event detection is very low.
Based on the above problem, the present application provides an acoustic event detection method, and embodiments of the present application are described in detail below.
As shown in fig. 1, fig. 1 is a system architecture diagram of an acoustic event detection method in an embodiment of the present application, and includes a sound collection module 110 and a processor 120. The sound collection module 110 may collect original sound data by using a microphone array; the original sound data includes sound in a first area within a first time period, where the first area and the first time period may be any preset area and any preset time period. The processor 120 is connected to the sound collection module 110, and obtains and processes the original sound data collected by the sound collection module 110. Specifically, the processor 120 may process the raw sound data through an acoustic event model, where the acoustic event model may include a convolutional neural network module and a recurrent neural network module, and may simultaneously identify the features of multiple acoustic events according to the different frequencies of the acoustic events through a Multi-Head Attention mechanism, so as to complete detection of the multiple acoustic events in the same time period.
Through the system framework, a plurality of acoustic events occurring in the same time period can be detected, the types of the acoustic events are determined, and the efficiency of acoustic event detection is greatly improved.
Fig. 2 is a schematic flow chart of an acoustic event detection method in the embodiment of the present application, and specifically includes the following steps:
in step 201, the electronic device acquires raw sound data of at least one acoustic event collected during a first time period in a first region.
The original sound data is an unprocessed original audio signal, and its duration changes with the first time period; if the first time period is set to 10 seconds, the duration of each piece of original sound data is 10 seconds. An acoustic event is a segment with clear semantics in a continuous audio signal stream; for example, "crying" and "explosion" are acoustic events. The original sound data can be collected by a sound collection module, which may exist independently or be integrated into the electronic device; when the sound collection module exists independently, the original sound data needs to be transmitted to the electronic device through a wired or wireless connection.
Optionally, the electronic device may perform a screening on the collected original sound data, filter out original sound data without an acoustic event, and only obtain original sound data including at least one acoustic event.
Specifically, before the electronic device acquires original sound data of at least one acoustic event acquired in a first time period of a first region, the collected original sound data may be filtered and screened. The electronic device may have built-in parameters associated with acoustic events; when original sound data is acquired, whether an acoustic event exists in the original sound data is determined according to these parameters. If no acoustic event exists, the original sound data is filtered out; if at least one acoustic event exists, the original sound data containing the at least one acoustic event is acquired.
The electronic equipment acquires the original sound data of at least one acoustic event collected in the first time period of the first area, so that the detection of the original sound data without the acoustic event can be avoided, and the efficiency of the acoustic event detection method is improved.
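This application does not specify the built-in parameters used for the pre-screening; as one illustrative assumption, a simple short-time-energy check could serve as such a filter. A minimal Python (NumPy) sketch follows, with the frame length and threshold chosen purely for illustration:

```python
import numpy as np

def contains_acoustic_event(raw_audio: np.ndarray,
                            frame_len: int = 400,        # 25 ms at an assumed 16 kHz sampling rate
                            energy_threshold: float = 1e-3) -> bool:
    """Hypothetical pre-screening: keep the clip only if any frame's
    short-time energy exceeds a preset threshold."""
    n_frames = len(raw_audio) // frame_len
    frames = raw_audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    frame_energy = np.mean(frames ** 2, axis=1)          # short-time energy per frame
    return bool(np.any(frame_energy > energy_threshold))

# Clips for which this returns False would be filtered out before detection.
```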
Step 202, the electronic device generates at least one acoustic event feature data of the at least one acoustic event according to the frequency domain features in the original sound data.
The frequency domain characteristics, namely sound frequencies, of different acoustic events are different, and the acoustic event characteristic data is used for representing the acoustic events in the original sound data.
The electronic device first processes the original sound data to determine sound feature data, where the sound feature data includes filter bank Fbank features, a processing process is shown in fig. 3, and fig. 3 is a schematic flow diagram of a sound data processing method based on fig. 2 in an embodiment of the present application, and specifically includes:
Pre-emphasis is carried out frame by frame. Because the high-frequency end of the signal is attenuated by about 6 dB/oct (octave) above roughly 800 Hz, the higher the frequency, the smaller the corresponding component; the high-frequency part of the original sound data therefore needs to be boosted before analysis, which avoids numerical problems in the subsequent fast Fourier transform and improves the high-frequency signal-to-noise ratio.
Framing: to avoid losing signal at the window boundaries, an overlap is required between every two frames. A common choice is a frame length of 25 ms with a frame shift of 10 ms, i.e., an overlap of 15 ms between every two adjacent frames. Framing is needed because speech signals change rapidly, while the Fourier transform is suited to analyzing stationary signals; without overlap between frames, the signal at the frame boundaries would be weakened by windowing and that information could be lost.
Windowing: the Fourier transform assumes a stationary input signal, but the audio signal as a whole is not stationary. Each frame of the signal is therefore multiplied by a smooth window function that attenuates both ends of the frame smoothly toward zero; this reduces the intensity of side lobes after the Fourier transform and yields a higher-quality spectrum. For example, a Hamming window can be applied to each frame to smooth the edges of the frame signal.
Fast Fourier transform: a fast Fourier transform is performed on each windowed frame to obtain an energy spectrum. The transform converts the signal into the frequency domain, decomposing the complex sound wave into components at various frequencies, which facilitates learning by the neural network model.
Mel filter bank filtering: the energy spectrum from the previous step is filtered by a Mel filter bank and the logarithm is then taken to obtain the Log Fbank, which is the Fbank feature.
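The pipeline above can be written out as a short Python (NumPy) sketch. The frame length, frame shift, FFT size, and number of Mel filters below are common illustrative choices rather than values fixed by this application, and librosa.filters.mel is used only as a convenient way to build the Mel filter bank:

```python
import numpy as np
import librosa  # used here only to construct the Mel filter bank

def fbank_features(signal: np.ndarray, sr: int = 16000,
                   frame_len: int = 400, frame_shift: int = 160,  # 25 ms / 10 ms at 16 kHz
                   n_fft: int = 512, n_mels: int = 128) -> np.ndarray:
    # Pre-emphasis: boost the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing with overlap, then Hamming windowing of each frame.
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)

    # Fast Fourier transform -> power (energy) spectrum of each frame.
    power_spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Mel filter bank filtering, then logarithm -> Log Fbank features.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(power_spec @ mel_fb.T + 1e-10)   # shape: (n_frames, n_mels)
```

The resulting matrix of log filter-bank energies corresponds to what is called the Fbank feature above.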
After the Fbank features are obtained, they are input into a pre-trained acoustic event model to generate at least one acoustic event feature data of the at least one acoustic event. The acoustic event model may include a convolutional neural network module: the Fbank features are processed by the convolutional neural network module to obtain processed sound feature data, and the processed sound feature data is classified according to frequency domain features through a Multi-Head Attention mechanism to generate at least one acoustic event feature data of the at least one acoustic event, where the acoustic event feature data includes at least one of: sound frequency, waveform, pitch, sub-band energy, and short-time energy. By applying Multi-Head Attention over the frequency dimension of the acoustic events, different subspaces can attend to acoustic events of different frequencies according to the frequency mapping relations, yielding the subspace feature corresponding to each acoustic event; the subspace features are then concatenated through a Scaled Dot-Product Attention mechanism and the acoustic event feature data is output. Scaled Dot-Product Attention reduces the amount of calculation and improves the speed of acoustic event detection.
It should be noted that specific techniques for recognizing an acoustic event from such feature data, such as speech recognition, are existing technology and are not detailed here.
Therefore, the electronic device generates at least one acoustic event feature data of the at least one acoustic event according to the frequency domain features in the original sound data, and Multi-Head Attention can be performed according to the frequency domain features to determine the acoustic event feature data corresponding to different acoustic events, so that the detection efficiency is greatly improved.
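As an illustration of attention over the frequency dimension, the following PyTorch sketch applies torch.nn.MultiheadAttention, which internally performs scaled dot-product attention per head and concatenates the head outputs, to a tensor of processed sound feature data. The tensor shape and the number of heads are assumptions chosen for illustration, not values prescribed by this application:

```python
import torch
import torch.nn as nn

# Assumed shape of the CNN output: (batch, time_frames, freq_features)
batch, time_frames, freq_features = 8, 160, 32
cnn_out = torch.randn(batch, time_frames, freq_features)

# Multi-head attention: each head can focus on a different frequency subspace;
# the head outputs are combined by scaled dot-product attention and an output
# projection inside nn.MultiheadAttention.
mha = nn.MultiheadAttention(embed_dim=freq_features, num_heads=4, batch_first=True)
event_features, attn_weights = mha(cnn_out, cnn_out, cnn_out)  # self-attention

print(event_features.shape)  # torch.Size([8, 160, 32]) -> per-frame acoustic event features
```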
Step 203, the electronic device determines a category of each acoustic event in the at least one acoustic event according to the at least one acoustic event feature data.
Firstly, the at least one acoustic event feature data is split according to a preset number of frames to obtain split acoustic event feature data. The correspondence between the preset number of frames and the duration of the original sound data can be determined according to the sampling rate, for example, every 40 frames corresponding to 1 s, which is not specifically limited herein; splitting the acoustic event feature data by the preset number of frames reduces the amount of calculation;
then, a recurrent neural network (RNN) module performs recurrent calculation on the split acoustic event feature data to obtain calculation results, and the calculation results corresponding to the split acoustic event feature data are combined to obtain average acoustic event feature data. The recurrent calculation processes the acoustic event feature data corresponding to each frame in turn; since each frame of acoustic event feature data can correspond to a plurality of parameters, the recurrent calculation improves detection accuracy and lays the groundwork for the subsequent output;
finally, the category of each acoustic event in the at least one acoustic event is determined by inputting the average acoustic event feature data into the fully-connected (FC) layer. The category of each acoustic event occurring within the above time period is thus determined automatically by the acoustic event model; the specific categories depend on the acoustic events the model is able to recognize, which is not specifically limited herein.
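A minimal PyTorch sketch of this stage is given below, using the 40-frames-per-second example above for the chunk size; the hidden size, the number of event categories, and the choice of averaging the last hidden state of each chunk are illustrative assumptions rather than details fixed by this application:

```python
import torch
import torch.nn as nn

class EventClassifierHead(nn.Module):
    """Sketch of: split by preset frame count -> GRU recurrent calculation ->
    average the per-chunk results -> fully-connected layer per event category."""
    def __init__(self, feat_dim: int = 32, hidden: int = 64, n_classes: int = 10):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, event_feats: torch.Tensor, chunk: int = 40) -> torch.Tensor:
        b, t, d = event_feats.shape
        chunks = event_feats[:, : (t // chunk) * chunk].reshape(b * (t // chunk), chunk, d)
        _, last_hidden = self.gru(chunks)                    # recurrent calculation per chunk
        per_chunk = last_hidden[-1].reshape(b, t // chunk, -1)
        averaged = per_chunk.mean(dim=1)                     # "average acoustic event feature data"
        return self.fc(averaged)                             # one score per event category

logits = EventClassifierHead()(torch.randn(8, 160, 32))      # -> shape (8, 10)
```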
The electronic equipment determines the category of each acoustic event in the at least one acoustic event according to the at least one acoustic event characteristic data, so that a plurality of acoustic events occurring in the same time period can be detected, and the categories of the acoustic events can be determined, and the efficiency of acoustic event detection is greatly improved.
Next, another acoustic event detection method in this embodiment is described in detail with reference to fig. 4, where fig. 4 is a schematic flow chart of another acoustic event detection method in this embodiment, and specifically includes the following steps:
step 401, the electronic device obtains a trained acoustic event model through training.
Training data, which may be sound data carrying acoustic event labels, is input into the acoustic event model; a loss function, which may be a binary cross-entropy function, is determined according to the difference between the predicted acoustic events output by the acoustic event model and the acoustic event labels, and training is carried out with gradient descent until it is completed. The training method is not detailed here.
The trained acoustic event model is obtained through training, and the accuracy of acoustic event detection can be improved.
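A minimal PyTorch training sketch for this step follows. The binary cross-entropy loss and gradient descent come from the description above, while the stand-in model, data shapes, optimizer, and learning rate are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: `model` maps Fbank features to per-event probabilities,
# `loader` yields (fbank, labels) pairs where labels is a multi-hot event vector.
model = nn.Sequential(nn.Flatten(), nn.Linear(640 * 128, 10), nn.Sigmoid())
loader = [(torch.randn(8, 640, 128), torch.randint(0, 2, (8, 10)).float())]

criterion = nn.BCELoss()                                  # binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # gradient descent

for epoch in range(5):
    for fbank, labels in loader:
        probs = model(fbank)                  # predicted occurrence probability per event
        loss = criterion(probs, labels)       # difference vs. the acoustic event labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```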
In step 402, the electronic device obtains raw sound data for at least one acoustic event collected during a first time period in a first region.
Step 403, the electronic device generates at least one acoustic event feature data of the at least one acoustic event according to the frequency domain features in the original sound data.
Step 404, the electronic device determines a category of each acoustic event of the at least one acoustic event according to the at least one acoustic event feature data.
In step 405, the electronic device determines an occurrence probability corresponding to each acoustic event through an activation function.
The activation function may be a Sigmoid function, and the formula is as follows:
S(x) = 1 / (1 + e^(-x))
The Sigmoid function, also called the Logistic function, is used for hidden-layer neuron output. Its value range is (0, 1): it maps a real number into the interval (0, 1) and can be used for binary classification. It works well when the differences between features are relatively complex or are not particularly large. The occurrence probability of each acoustic event can thus be obtained.
The occurrence probability corresponding to each acoustic event is determined through the activation function, so that the acoustic events with lower occurrence probability can be eliminated, and errors in the detection process are reduced.
In step 406, the electronic device determines whether the occurrence probability corresponding to each acoustic event is greater than a preset threshold.
The preset threshold may be adjusted manually and switched flexibly according to different scene environments. When the occurrence probability corresponding to an acoustic event is greater than the preset threshold, step 408 is executed; when the occurrence probability corresponding to an acoustic event is less than or equal to the preset threshold, step 407 is executed. It should be noted that the occurrence probability corresponding to each acoustic event is independent, and the determination may be performed for all events simultaneously.
Step 407, the electronic device outputs a prompt message.
The above prompt information may be used to indicate that the acoustic event cannot be identified, and the prompt information may be output in the form of audio broadcasting, video playing, picture displaying, display lamp flashing, and the like, which is not specifically limited herein.
At step 408, the electronic device determines that the acoustic event occurred.
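Steps 405 to 408 can be summarized in a few lines of Python; the event names, probabilities, and threshold value below are placeholders for illustration:

```python
# Hypothetical model output: occurrence probability per acoustic event (step 405).
probabilities = {"crying": 0.91, "doorbell": 0.76, "explosion": 0.12}
threshold = 0.5  # preset threshold, adjustable per scene environment (step 406)

for event, prob in probabilities.items():
    if prob > threshold:
        print(f"{event}: event occurred (p={prob:.2f})")        # step 408
    else:
        print(f"{event}: cannot be identified, output prompt")  # step 407
```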
Through the steps, a plurality of acoustic events occurring in the same time period can be detected, the types of the acoustic events are determined, and the efficiency of acoustic event detection is greatly improved.
The steps not described in detail above refer to the method described in fig. 2, and are not described in detail here.
The following describes, by way of example, an acoustic event detection method in the embodiment of the present application with reference to fig. 5, where fig. 5 is a schematic structural diagram of an acoustic event model in the embodiment of the present application, and the description starts with inputting sound feature data into a convolutional neural network module of the acoustic event model:
Assume the sound feature data is 640 frames, each frame corresponding to an Fbank feature of 128 parameters, i.e., (640, 128). First, batch normalization (BatchNorm) is applied to the Fbank features; BatchNorm keeps the input distribution of each layer of the deep neural network consistent during training. The features then pass through a 3 × 3 convolution layer with 16 output channels, followed by average pooling in a pooling layer to obtain dimension-reduced features. The dimension reduction can be understood as compressing the number of frames: 1 frame after reduction corresponds to 4 original frames, which significantly improves detection speed. Multi-Head Attention then yields 160 frames of acoustic event features, each corresponding to 32 parameters.
The acoustic event features are then input into a recurrent neural network (RNN) module for recurrent calculation; the RNN includes a gated recurrent unit (GRU). After Dropout (random deactivation) processing, the features enter a fully-connected (FC) layer to obtain the category of each acoustic event, and finally the occurrence probability of each acoustic event is obtained through the Sigmoid function.
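A PyTorch sketch of an architecture consistent with the dimensions in this example (a (640, 128) Fbank input, BatchNorm, a 3 × 3 convolution with 16 channels, 4× average pooling, attention over (160, 32) features, a GRU, Dropout, a fully-connected layer, and a Sigmoid) is shown below. The way channels are merged before attention, the number of attention heads, the GRU hidden size, the Dropout rate, and the number of event categories are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AcousticEventModel(nn.Module):
    """Sketch matching the worked example: (640, 128) Fbank -> BatchNorm ->
    3x3 conv (16 channels) -> average pooling -> attention over (160, 32)
    features -> GRU -> Dropout -> FC -> Sigmoid probabilities."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.bn = nn.BatchNorm2d(1)
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.AvgPool2d(kernel_size=4)             # (640, 128) -> (160, 32)
        self.attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
        self.gru = nn.GRU(32, 64, batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:    # fbank: (batch, 640, 128)
        x = self.pool(self.conv(self.bn(fbank.unsqueeze(1))))  # (batch, 16, 160, 32)
        x = x.mean(dim=1)                                       # assumed channel merge -> (batch, 160, 32)
        x, _ = self.attn(x, x, x)                               # attention over frequency features
        _, h = self.gru(x)                                      # recurrent calculation
        return torch.sigmoid(self.fc(self.drop(h[-1])))         # per-event occurrence probability

probs = AcousticEventModel()(torch.randn(2, 640, 128))         # -> (2, 10), values in (0, 1)
```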
Referring to fig. 6, consistent with the embodiments shown in fig. 2 and fig. 4, fig. 6 is a schematic structural diagram of an electronic device 600 in an embodiment of the present application, and includes an application processor 610, an input device 620, an output device 630, and a memory 640, where the application processor 610, the input device 620, the output device 630, and the memory 640 are connected to each other, where the memory 640 is used for storing a computer program, the computer program includes program instructions, and the application processor 610 is configured to call the program instructions to execute all or part of the steps described in fig. 2 and fig. 4.
The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions in order to realize the above-mentioned functions. Those of skill in the art will readily appreciate that the present application is capable of hardware or a combination of hardware and computer software implementing the various illustrative elements and algorithm steps described in connection with the embodiments provided herein. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
An acoustic event detection apparatus 700 according to an embodiment of the present application is described in detail below with reference to fig. 7. Fig. 7 is a schematic structural diagram of an acoustic event detection apparatus in an embodiment of the present application, including a processing unit 710 and a communication unit 720, wherein,
the processing unit 710 is configured to acquire, through the communication unit 720, raw sound data of at least one acoustic event collected in a first time period of a first area; and generating at least one acoustic event feature data for the at least one acoustic event from the frequency domain features in the raw sound data, the at least one acoustic event feature data being used to characterize the acoustic event in the raw sound data; and determining a category for each of the at least one acoustic event from the at least one acoustic event characteristic data.
In a possible embodiment, in terms of generating at least one acoustic event feature data of the at least one acoustic event according to the raw sound data, the processing unit 710 is specifically configured to:
processing the original sound data to determine sound characteristic data, wherein the sound characteristic data comprises filter bank Fbank characteristics;
generating at least one acoustic event feature data of the at least one acoustic event by inputting the Fbank feature in a pre-trained acoustic event model.
In one possible embodiment, the acoustic event model includes a convolutional neural network module; in terms of generating the at least one acoustic event feature data of the at least one acoustic event by inputting the Fbank feature into the acoustic event model, the processing unit 710 is specifically configured to:
processing the Fbank characteristics through the convolutional neural network module to obtain processed sound characteristic data;
classifying the processed sound feature data according to the frequency domain features through a multi-head attention mechanism to generate at least one acoustic event feature data of the at least one acoustic event, wherein the acoustic event feature data comprises at least one of the following data: sound frequency, waveform, pitch, subband energy, and short-term energy.
In a possible embodiment, the acoustic event model includes a recurrent neural network module and a full connection layer, the determining a category of each acoustic event in the at least one acoustic event according to the at least one acoustic event feature data includes:
splitting the at least one acoustic event characteristic data according to a preset frame number to obtain split acoustic event characteristic data;
performing cyclic calculation on the split acoustic event characteristic data through the cyclic neural network module to obtain a calculation result, and combining the calculation results corresponding to the split acoustic event characteristic data to obtain average acoustic event characteristic data;
determining a category for each of the at least one acoustic event by inputting the average acoustic event characteristic data in the fully-connected layer.
In a possible embodiment, after determining the category of each acoustic event of the at least one acoustic event by inputting the average acoustic event characteristic data in the fully-connected layer, the processing unit 710 is further configured to:
determining the occurrence probability corresponding to each acoustic event through an activation function;
judging whether the occurrence probability corresponding to each acoustic event is greater than a preset threshold value or not;
and if the occurrence probability corresponding to the acoustic event is greater than a preset threshold value, determining that the acoustic event occurs.
In a possible embodiment, after determining whether the occurrence probability corresponding to each acoustic event is greater than a preset threshold, the processing unit 710 is further configured to:
and if the occurrence probability corresponding to the acoustic event is smaller than or equal to a preset threshold value, outputting prompt information, wherein the prompt information is used for indicating that the acoustic event cannot be identified.
In a possible embodiment, in terms of processing the raw sound data to determine sound feature data, the processing unit 710 is specifically configured to:
perform pre-emphasis, framing, windowing, fast Fourier transform, and Mel filter bank filtering on the sound data to determine the sound feature data, the sound feature data comprising filter bank Fbank features.
The acoustic event detection apparatus 700 may further include a storage unit 730 for storing program codes and data of the electronic device. The processing unit 710 may be a processor, the communication unit 720 may be a touch display screen or a transceiver, and the storage unit 730 may be a memory.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-mentioned method of the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (8)

1. A method of acoustic event detection, the method comprising:
acquiring original sound data of at least one acoustic event acquired within a first time period of a first region;
processing the original sound data to determine sound characteristic data, wherein the sound characteristic data comprises filter bank Fbank characteristics;
processing the Fbank characteristics through a convolutional neural network module to obtain processed sound characteristic data;
classifying the processed sound feature data according to frequency domain features through a multi-head attention mechanism to generate at least one acoustic event feature data of the at least one acoustic event, wherein the acoustic event feature data comprises at least one of the following data: a sound frequency, a waveform map, a fundamental tone, a sub-band energy, and a short-time energy, the at least one acoustic event feature data being used to characterize acoustic events in the original sound data;
determining a category for each of the at least one acoustic event based on the at least one acoustic event characteristic data.
2. The method of claim 1, wherein determining the category of each of the at least one acoustic event from the at least one acoustic event characteristic data comprises:
splitting the at least one acoustic event characteristic data according to a preset frame number to obtain split acoustic event characteristic data;
performing cyclic calculation on the split acoustic event characteristic data through a cyclic neural network module to obtain a calculation result, and combining the calculation results corresponding to the split acoustic event characteristic data to obtain average acoustic event characteristic data;
determining a category for each of the at least one acoustic event by inputting the average acoustic event characteristic data in a fully connected layer.
3. The method of claim 2, wherein after determining the category for each of the at least one acoustic event by inputting the average acoustic event signature data in the fully-connected layer, the method further comprises:
determining the occurrence probability corresponding to each acoustic event through an activation function;
judging whether the occurrence probability corresponding to each acoustic event is greater than a preset threshold value or not;
and if the occurrence probability corresponding to the acoustic event is greater than a preset threshold value, determining that the acoustic event occurs.
4. The method according to claim 3, wherein after determining whether the occurrence probability corresponding to each acoustic event is greater than a preset threshold, the method further comprises:
and if the occurrence probability corresponding to the acoustic event is smaller than or equal to a preset threshold value, outputting prompt information, wherein the prompt information is used for indicating that the acoustic event cannot be identified.
5. The method according to any one of claims 1 to 4, wherein the processing the raw sound data to determine sound feature data comprises:
performing pre-emphasis, framing, windowing, fast Fourier transform, and Mel filter bank filtering on the sound data to determine the sound feature data, the sound feature data comprising the filter bank Fbank features.
6. An acoustic event detection apparatus, characterized in that the apparatus comprises a processing unit and a communication unit, wherein,
the processing unit is used for acquiring original sound data of at least one acoustic event collected in a first period of time in a first area through the communication unit; processing the original sound data to determine sound characteristic data, wherein the sound characteristic data comprises filter bank Fbank characteristics; processing the Fbank characteristics through a convolutional neural network module to obtain processed sound characteristic data; classifying the processed sound feature data according to frequency domain features through a multi-head attention mechanism to generate at least one acoustic event feature data of the at least one acoustic event, wherein the acoustic event feature data comprises at least one of the following data: a sound frequency, a waveform map, a fundamental tone, a sub-band energy, and a short-time energy, the at least one acoustic event feature data being used to characterize acoustic events in the original sound data; determining a category for each of the at least one acoustic event based on the at least one acoustic event characteristic data.
7. An electronic device comprising an application processor, an input device, an output device and a memory, the application processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the application processor being configured to invoke the program instructions to perform the method of any of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1 to 5.
CN201910838074.5A 2019-09-05 2019-09-05 Acoustic event detection method and device, electronic equipment and storage medium Active CN110600059B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838074.5A CN110600059B (en) 2019-09-05 2019-09-05 Acoustic event detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838074.5A CN110600059B (en) 2019-09-05 2019-09-05 Acoustic event detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110600059A CN110600059A (en) 2019-12-20
CN110600059B true CN110600059B (en) 2022-03-15

Family

ID=68857747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838074.5A Active CN110600059B (en) 2019-09-05 2019-09-05 Acoustic event detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110600059B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292767B (en) * 2020-02-10 2023-02-14 厦门快商通科技股份有限公司 Audio event detection method and device and equipment
CN111325386B (en) * 2020-02-11 2023-07-07 Oppo广东移动通信有限公司 Method, device, terminal and storage medium for predicting running state of vehicle
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN113838478B (en) * 2020-06-08 2024-04-09 华为技术有限公司 Abnormal event detection method and device and electronic equipment
CN111899760A (en) * 2020-07-17 2020-11-06 北京达佳互联信息技术有限公司 Audio event detection method and device, electronic equipment and storage medium
CN111933188B (en) * 2020-09-14 2021-02-05 电子科技大学 Sound event detection method based on convolutional neural network
CN112863550B (en) * 2021-03-01 2022-08-16 德鲁动力科技(成都)有限公司 Crying detection method and system based on attention residual learning
CN113362854B (en) * 2021-06-03 2022-11-15 哈尔滨工业大学 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
CN117373488B (en) * 2023-12-08 2024-02-13 富迪科技(南京)有限公司 Audio real-time scene recognition system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN108648748A (en) * 2018-03-30 2018-10-12 沈阳工业大学 Acoustic events detection method under hospital noise environment
CN109473119A (en) * 2017-09-07 2019-03-15 中国科学院声学研究所 A kind of acoustic target event-monitoring method
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019118864A1 (en) * 2017-12-15 2019-06-20 Google Llc Training and/or using an encoder model to determine responsive action(s) for natural language input
CN110120230B (en) * 2019-01-08 2021-06-01 国家计算机网络与信息安全管理中心 Acoustic event detection method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN109473119A (en) * 2017-09-07 2019-03-15 中国科学院声学研究所 A kind of acoustic target event-monitoring method
CN108648748A (en) * 2018-03-30 2018-10-12 沈阳工业大学 Acoustic events detection method under hospital noise environment
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN109859760A (en) * 2019-02-19 2019-06-07 成都富王科技有限公司 Phone robot voice recognition result bearing calibration based on deep learning
CN110070895A (en) * 2019-03-11 2019-07-30 江苏大学 A kind of mixed sound event detecting method based on supervision variation encoder Factor Decomposition

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"LARGE-SCALE WEAKLY SUPERVISED AUDIO CLASSIFICATION USING GATED CONVOLUTIONAL NEURAL NETWORK";Yong Xu等;《arXiv:1710.00343v1 [cs.SD]》;20171001;摘要,第1页右栏倒数第2段,2.1 CRNN基线,4.1 实验设置,4.2.2 弱监督声学事件检测SED,图1、2 *
"Self-attention mechanism based system for dcase2018 challenge task1 and task4";Jun Wang等;《Proc. DCASE Challenge,2018》;20181231;2.3 自注意力结构,3.1 家庭环境中大规模弱标签半监督声学事件检测,图1、2 *
"基于注意力机制的声音场景深度分类模型研究";夏子琪;《中国优秀硕士学位论文全文数据库 信息科技辑》;20190215(第02期);全文 *

Also Published As

Publication number Publication date
CN110600059A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN110600059B (en) Acoustic event detection method and device, electronic equipment and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110853618B (en) Language identification method, model training method, device and equipment
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN110648692B (en) Voice endpoint detection method and system
CN111354371B (en) Method, device, terminal and storage medium for predicting running state of vehicle
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN110097890A (en) A kind of method of speech processing, device and the device for speech processes
CN113205820B (en) Method for generating voice coder for voice event detection
CN110880328B (en) Arrival reminding method, device, terminal and storage medium
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN106548786A (en) A kind of detection method and system of voice data
CN111868823A (en) Sound source separation method, device and equipment
CN112420049A (en) Data processing method, device and storage medium
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN110232909A (en) A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN114373476A (en) Sound scene classification method based on multi-scale residual attention network
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN112735466B (en) Audio detection method and device
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN115910018A (en) Method and device for improving voice privacy of mute cabin
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
CN111782860A (en) Audio detection method and device and storage medium
CN114664325A (en) Abnormal sound identification method, system, terminal equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant