CN112447187A - Device and method for recognizing sound event - Google Patents

Device and method for recognizing sound event

Info

Publication number
CN112447187A
CN112447187A (Application CN201910822623.XA)
Authority
CN
China
Prior art keywords
sound
features
present disclosure
detector
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910822623.XA
Other languages
Chinese (zh)
Inventor
石自强
刘柳
林慧镔
刘汝杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201910822623.XA
Priority to JP2020104793A
Publication of CN112447187A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

Disclosed is a sound event recognition apparatus including: an encoder configured to convert a sound signal having a plurality of sound events contained therein into features in a low-dimensional space; and a detector configured to map the features to a posterior probability for each sound event, wherein the detector performs a plurality of dilated convolution operations on the features. The recognition apparatus according to the present disclosure performs automatic sound event detection more efficiently in an end-to-end manner.

Description

Device and method for recognizing sound event
Technical Field
The present disclosure relates to the field of sound processing, and in particular, to a device and a method for recognizing a sound event.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
Sound carries a great deal of information about the daily environment and the physical events that occur therein. A person may perceive the sound scene (busy street, office, etc.) in which he is located and may identify individual sound events (car passing, footsteps, etc.). Automatic detection of these sound events has many applications in real life. For example, it is very useful for environmental awareness in smart devices, robots, etc. Furthermore, automatic detection of sound events can help build a complete monitoring system in situations where radar or video systems may not work.
Disclosure of Invention
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
An object of the present disclosure is to provide a sound event recognition apparatus and method that perform automatic sound event detection more efficiently in an end-to-end manner. Unlike traditional models based on recurrent neural networks, the device according to the present disclosure is based entirely on a pure one-dimensional convolutional neural network model, which is easier to parallelize and performs better in certain environments. Moreover, the device according to the present disclosure is a complete end-to-end system that requires no human intervention: the input is the original sound signal and the output is the posterior probability of each sound event.
According to an aspect of the present disclosure, there is provided a sound event recognition apparatus including: an encoder configured to convert a sound signal having a plurality of sound events contained therein into features in a low-dimensional space; and a detector configured to map the features to a posterior probability for each sound event, wherein the detector performs a plurality of dilated convolution operations on the features.
According to another aspect of the present disclosure, there is provided a method for recognizing a sound event, including: converting a sound signal having a plurality of sound events therein into features in a low-dimensional space; and mapping the features to a posterior probability for each sound event, wherein a plurality of dilated convolution operations are performed on the features.
According to another aspect of the present disclosure, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform a method of recognition of sound events according to the present disclosure.
According to another aspect of the present disclosure, a machine-readable storage medium is provided, having embodied thereon a program product according to the present disclosure.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Drawings
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 shows a block diagram of a recognition device of a sound event according to one embodiment of the present disclosure;
FIG. 2 illustrates an overall framework of a recognition network of sound events according to one embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a method of recognition of a sound event according to one embodiment of the present disclosure;
and
fig. 4 is a block diagram of an exemplary structure of a general-purpose personal computer in which a recognition apparatus of a sound event and a recognition method of a sound event according to an embodiment of the present disclosure can be implemented.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms, and that neither should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
According to an embodiment of the present disclosure, there is provided a sound event recognition apparatus including: an encoder configured to convert a sound signal having a plurality of sound events contained therein into features in a low-dimensional space; and a detector configured to map the features to a posterior probability for each sound event, wherein the detector performs a plurality of dilated convolution operations on the features.
As illustrated in fig. 1, the apparatus 100 for recognizing a sound event according to the present disclosure may include an encoder 101 and a detector 102.
The encoder 101 may convert a sound signal containing a plurality of sound events into features in a low-dimensional space. Such features allow the sound event recognition task to be performed more efficiently. Here, it should be apparent to those skilled in the art that the plurality of sound events may include two or more different types of sound events (e.g., the sounds of pedestrian footsteps and car horns on the street). The encoder 101 may convert the signal containing these sound events into feature vectors in a low-dimensional space.
Next, the detector 102 may map the feature vectors in the low dimensional space to a posterior probability of each sound event, e.g., a posterior probability of a pedestrian step or a car horn sound on the street for each frame. According to one embodiment of the present disclosure, these a posteriori probabilities may represent the type, start and end times, etc. of the sound event. Here, it should be apparent to those skilled in the art that the above events are merely exemplary, and the present disclosure is not limited thereto.
According to one embodiment of the present disclosure, the detector 102 may perform a plurality of dilated convolution operations on the feature vectors to obtain a posterior probability for each sound event. Dilated convolution, also known as atrous or "hole" convolution, introduces a parameter called the "dilation rate" to the convolution layer, which defines the spacing between the input values sampled by the convolution kernel. According to one embodiment of the present disclosure, the detector 102 may perform three dilated convolution operations on the feature vectors to provide a larger receptive field. In a convolutional neural network (CNN), the receptive field is the region of the input layer that corresponds to one element in the output of a given layer. In other words, a larger receptive field means that more information is taken into account. Here, it should be apparent to those skilled in the art that performing three dilated convolution operations is merely an example, and the present disclosure is not limited thereto. Those skilled in the art may perform more or fewer dilated convolution operations according to the requirements of the actual computation amount and the like.
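The dilated convolution described above can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, kernel values, and causal framing are hypothetical), not the implementation disclosed in the patent:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution: kernel taps are spaced `dilation`
    samples apart, enlarging the receptive field without adding
    parameters."""
    k = len(kernel)
    span = (k - 1) * dilation  # input samples covered, minus one
    y = np.zeros(len(x) - span)
    for t in range(len(y)):
        # sample the input every `dilation` steps
        taps = x[t : t + span + 1 : dilation]
        y[t] = np.dot(taps, kernel)
    return y

x = np.arange(16, dtype=float)
k = np.array([1.0, 1.0, 1.0])
out1 = dilated_conv1d(x, k, dilation=1)  # each output sees 3 samples
out4 = dilated_conv1d(x, k, dilation=4)  # same kernel now spans 9 samples
```

With dilation 4, the same 3-tap kernel covers 9 input samples, which is why stacking such layers widens the receptive field rapidly.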
According to an embodiment of the present disclosure, the encoder 101 may perform a one-dimensional convolution operation, a parametric ReLU (PReLU) operation, a normalization operation, and a 1 × 1 convolution operation on the sound signal to obtain the feature vectors. The normalization operation normalizes the feature vectors in order to speed up training. The 1 × 1 convolution operation may be used to adjust the size of the last dimension of the feature vectors, so that feature vectors processed by the 1 × 1 convolution operation maintain a uniform size. Here, it should be apparent to those skilled in the art that the above-described operations are merely exemplary, and the present disclosure is not limited thereto. Those skilled in the art may add, delete, or replace operations according to actual needs.
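Two of these building blocks, the ReLU operation with parameters (parametric ReLU) and the 1 × 1 convolution, can be sketched as follows. The slope value and the weight matrix here are hypothetical and for illustration only, not taken from the disclosure:

```python
import numpy as np

def prelu(x, alpha=0.25):
    # parametric ReLU: negative inputs are scaled by a (learned)
    # slope alpha instead of being zeroed out
    return np.where(x >= 0, x, alpha * x)

def conv1x1(x, w):
    # x: (time, in_channels); w: (in_channels, out_channels).
    # A 1x1 convolution is a per-frame linear map that changes
    # only the channel (last) dimension, giving a uniform size.
    return x @ w

frames = np.array([[1.0, -2.0], [-0.5, 3.0]])  # 2 frames, 2 channels
w = np.ones((2, 4))                            # project to 4 channels
y = conv1x1(prelu(frames), w)                  # shape becomes (2, 4)
```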
According to an embodiment of the present disclosure, after performing the dilated convolution operations on the features a plurality of times, the detector 102 may further perform a 1 × 1 convolution operation, a fully connected operation, and a Softmax operation to obtain the posterior probability. Here, it should be apparent to those skilled in the art that the above-described operations are merely exemplary, and the present disclosure is not limited thereto. Those skilled in the art may add, delete, or replace operations according to actual needs.
According to an embodiment of the present disclosure, the detector 102 may further perform a 1 × 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation within each dilated convolution operation. Here, it should be apparent to those skilled in the art that the above-described operations are merely exemplary, and the present disclosure is not limited thereto. Those skilled in the art may add, delete, or replace operations according to actual needs.
For example, as shown in fig. 2, an input sound signal containing a plurality of sound events may be subjected to a one-dimensional convolution operation, a parametric ReLU operation, a normalization operation, and a 1 × 1 convolution operation to obtain a feature vector. Next, the obtained feature vector is subjected to three dilated convolution operations, followed by a 1 × 1 convolution operation, a fully connected operation, and a Softmax operation, so as to obtain the posterior probabilities.
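The final Softmax operation in this pipeline normalizes per-frame scores into posterior probabilities. A minimal sketch, using hypothetical logits for three event classes (not values from the disclosure):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax: subtract the row max before exp
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical per-frame logits for 3 sound event classes
logits = np.array([[2.0, 1.0, 0.0]])
post = softmax(logits)  # each row sums to 1, i.e. a probability vector
```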
Referring to fig. 2 again, the dilated convolution operation is illustrated in detail. Each circle represents a time point, so that each row forms a time sequence from left to right, and each convolutional layer has a dilation rate. The dilation rate rises exponentially to ensure that the convolutional layers can gather information over a sufficiently long time span. For example, fig. 2 schematically shows four convolutional layers, where the dilation rate d of the first layer is 1, that of the second layer is 2, that of the third layer is 4, and that of the fourth layer is 8. The dilation rate thus determines how much temporal context each feature vector aggregates. Here, it should be apparent to those skilled in the art that the convolutional layers shown in fig. 2 are merely an example, and the present disclosure is not limited thereto.
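For a stack of such layers, the receptive field grows as 1 + (k − 1) · Σd over the dilation rates d. Assuming a hypothetical kernel size of 3 (the disclosure does not state the kernel size), this can be checked directly for the four layers of Fig. 2:

```python
def receptive_field(kernel_size, dilations):
    # each layer extends the span by (kernel_size - 1) * d samples
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# four layers with exponentially increasing dilation, as in Fig. 2
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])  # 31 samples
```

Doubling the dilation rate per layer thus yields a receptive field that grows exponentially with depth, while the parameter count grows only linearly.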
Then, according to an embodiment of the present disclosure, a 1 × 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation may further be performed within each dilated convolution operation.
With the sound event recognition device according to the present disclosure, automatic sound event detection can be performed more effectively thanks to its end-to-end framework, and the multiple dilated convolution operations adopted therein gather more information over a large time scale, thereby achieving better detection results.
According to the sound event recognition apparatus of one embodiment of the present disclosure, in the training phase the encoder 101 and the detector 102 may be trained using sound data with event labels. In the evaluation phase, the trained encoder 101 and detector 102 may be used to detect each event in an input mixed sound and to evaluate the performance of the trained encoder 101 and detector 102.
A recognition method for a sound event according to an embodiment of the present disclosure will be described below with reference to fig. 3. As shown in fig. 3, the recognition method for a sound event according to an embodiment of the present disclosure starts at step S310.
In step S310, a sound signal containing a plurality of sound events is converted into features in a low-dimensional space.
Next, in step S320, the features are mapped to a posterior probability of each sound event.
In step S320, a plurality of dilated convolution operations are performed on the features.
The recognition method for sound events according to one embodiment of the present disclosure further includes the step of performing a one-dimensional convolution operation, a parametric ReLU operation, a normalization operation, and a 1 × 1 convolution operation on the sound signal to obtain the features.
The recognition method for a sound event according to one embodiment of the present disclosure further includes the step of performing a 1 × 1 convolution operation, a fully connected operation, and a Softmax operation after performing a plurality of dilated convolution operations on the features to obtain the posterior probability.
The recognition method for sound events according to one embodiment of the present disclosure further includes the step of performing three dilated convolution operations on the features.
The recognition method for a sound event according to one embodiment of the present disclosure further includes the step of performing a 1 × 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation within each dilated convolution operation.
A recognition method for sound events according to one embodiment of the present disclosure further includes the step of training the encoder and the detector using sound data having an event tag.
In the recognition method for a sound event according to one embodiment of the present disclosure, the features are based on each frame of the sound signal.
With the recognition method for sound events described above, automatic sound event detection can be carried out more effectively due to the end-to-end framework, and the multiple dilated convolution operations adopted in the method gather more information over a large time scale, thereby achieving better detection results.
Various embodiments of the above-described steps of the recognition method for a sound event according to an embodiment of the present disclosure have been described in detail above, and a description thereof will not be repeated.
It is apparent that the respective operational procedures of the recognition method for a sound event according to the present disclosure can be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved by: a storage medium storing the above executable program code is directly or indirectly supplied to a system or an apparatus, and a computer or a Central Processing Unit (CPU) in the system or the apparatus reads out and executes the program code. At this time, as long as the system or the apparatus has a function of executing a program, the embodiments of the present disclosure are not limited to the program, and the program may also be in any form, for example, an object program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, magnetic disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information, etc.
In addition, the computer can also implement the technical solution of the present disclosure by connecting to a corresponding website on the internet, downloading and installing the computer program code according to the present disclosure into the computer and then executing the program.
Fig. 4 is a block diagram of an exemplary structure of a general-purpose personal computer 1300 in which a recognition apparatus and a recognition method for a sound event according to an embodiment of the present disclosure may be implemented.
As shown in fig. 4, the CPU 1301 executes various processes in accordance with a program stored in a Read Only Memory (ROM)1302 or a program loaded from a storage section 1308 to a Random Access Memory (RAM) 1303. In the RAM 1303, data necessary when the CPU 1301 executes various processes and the like is also stored as necessary. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to bus 1304.
The following components are connected to the input/output interface 1305: an input portion 1306 (including a keyboard, a mouse, and the like), an output portion 1307 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like), a storage portion 1308 (including a hard disk, and the like), a communication portion 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1309 performs communication processing via a network such as the internet. A driver 1310 may also be connected to the input/output interface 1305, as desired. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1310 as needed, so that a computer program read out therefrom is installed in the storage portion 1308 as needed.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 4, in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which programs are stored and which are distributed to users together with the apparatus containing them.
In the systems and methods of the present disclosure, it is apparent that individual components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and do not constitute a limitation of the present disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made in the above-described embodiments without departing from the spirit and scope of the disclosure. Accordingly, the scope of the disclosure is to be defined only by the claims appended hereto, and by their equivalents.
With respect to the embodiments including the above embodiments, the following remarks are also disclosed:
supplementary note 1. a sound event recognition apparatus comprising:
an encoder configured to convert a sound signal having a plurality of sound events contained therein into features in a low-dimensional space; and
a detector configured to map the features to a posterior probability for each sound event,
wherein the detector performs a plurality of dilated convolution operations on the features.
Supplementary note 2. The apparatus according to supplementary note 1, wherein the encoder performs a one-dimensional convolution operation, a parametric ReLU operation, a normalization operation, and a 1 × 1 convolution operation on the sound signal to obtain the features.
Supplementary note 3. The apparatus according to supplementary note 2, wherein the detector further performs a 1 × 1 convolution operation, a fully connected operation, and a Softmax operation after performing a plurality of dilated convolution operations on the features to obtain the posterior probability.
Supplementary note 4. The apparatus according to any one of supplementary notes 1 to 3, wherein the detector performs three dilated convolution operations on the features.
Supplementary note 5. The apparatus according to supplementary note 4, wherein the detector further performs a 1 × 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation in the course of performing each dilated convolution operation.
Supplementary note 6. the apparatus according to supplementary note 1, wherein the encoder and the detector are trained using sound data with an event tag.
Supplementary note 7 the apparatus according to supplementary note 1, wherein the feature is a feature based on each frame of the sound signal.
Supplementary note 8. A method of recognizing a sound event, comprising:
converting an acoustic signal having a plurality of acoustic events therein to features in a low dimensional space; and
mapping the features to a posterior probability for each sound event,
wherein a plurality of dilated convolution operations are performed on the features.
Supplementary note 9. The method according to supplementary note 8, further comprising performing a one-dimensional convolution operation, a parametric ReLU operation, a normalization operation, and a 1 × 1 convolution operation on the sound signal to obtain the features.
Supplementary note 10. The method according to supplementary note 9, further comprising performing a 1 × 1 convolution operation, a fully connected operation, and a Softmax operation after performing a plurality of dilated convolution operations on the features to obtain the posterior probability.
Supplementary note 11. The method according to any one of supplementary notes 8 to 10, wherein three dilated convolution operations are performed on the features.
Supplementary note 12. The method according to supplementary note 11, further comprising performing a 1 × 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation during each dilated convolution operation.
Supplementary note 13 the method according to supplementary note 8, wherein a sound signal having a plurality of sound events therein is converted into features in a low dimensional space by an encoder, the features are mapped to a posterior probability of each sound event by a detector,
the method further includes training the encoder and the detector using sound data having an event tag.
Supplementary note 14. the method according to supplementary note 8, wherein the feature is a feature based on each frame of the sound signal.
Supplementary note 15. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform a method according to any one of supplementary notes 8 to 14.

Claims (9)

1. An apparatus for recognition of a sound event, comprising:
an encoder configured to convert a sound signal having a plurality of sound events contained therein into features in a low-dimensional space; and
a detector configured to map the features to a posterior probability for each sound event,
wherein the detector performs a plurality of dilated convolution operations on the features.
2. The apparatus of claim 1, wherein said encoder performs a one-dimensional convolution operation, a parametric ReLU operation, a normalization operation, and a 1 x 1 convolution operation on the sound signal to obtain the feature.
3. The apparatus of claim 2, wherein the detector further performs a 1 x 1 convolution operation, a fully connected operation, and a Softmax operation to obtain the posterior probability after performing a plurality of dilated convolution operations on the features.
4. The apparatus of any of claims 1 to 3, wherein the detector performs three dilated convolution operations on the features.
5. The apparatus of claim 4, wherein the detector further performs a 1 x 1 convolution operation, a parametric ReLU operation, a normalization operation, and a depthwise convolution operation in the course of performing each dilated convolution operation.
6. The apparatus of claim 1, wherein the encoder and the detector are trained using sound data with event tags.
7. The apparatus of claim 1, wherein the feature is a feature based on each frame of a sound signal.
8. A method of recognition of a sound event, comprising:
converting an acoustic signal having a plurality of acoustic events therein to features in a low dimensional space; and
mapping the features to a posterior probability for each sound event,
wherein a plurality of dilated convolution operations are performed on the features.
9. A machine-readable storage medium having a program product embodied thereon, the program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method of claim 8.
CN201910822623.XA 2019-09-02 2019-09-02 Device and method for recognizing sound event Pending CN112447187A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910822623.XA CN112447187A (en) 2019-09-02 2019-09-02 Device and method for recognizing sound event
JP2020104793A JP2021039332A (en) 2019-09-02 2020-06-17 Device and method for recognizing voice event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910822623.XA CN112447187A (en) 2019-09-02 2019-09-02 Device and method for recognizing sound event

Publications (1)

Publication Number Publication Date
CN112447187A (en) 2021-03-05

Family

ID=74734303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910822623.XA Pending CN112447187A (en) 2019-09-02 2019-09-02 Device and method for recognizing sound event

Country Status (2)

Country Link
JP (1) JP2021039332A (en)
CN (1) CN112447187A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107545890A (en) * 2017-08-31 2018-01-05 桂林电子科技大学 A kind of sound event recognition method
CN108538311A (en) * 2018-04-13 2018-09-14 腾讯音乐娱乐科技(深圳)有限公司 Audio frequency classification method, device and computer readable storage medium
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
US20190043516A1 (en) * 2018-06-22 2019-02-07 Intel Corporation Neural network for speech denoising trained with deep feature losses
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN CHEN et al.: "Environmental sound classification with dilated convolutions", Applied Acoustics, vol. 148, pages 123-132, XP093068656, DOI: 10.1016/j.apacoust.2018.12.019 *

Also Published As

Publication number Publication date
JP2021039332A (en) 2021-03-11

Similar Documents

Publication Publication Date Title
CN106887225B (en) Acoustic feature extraction method and device based on convolutional neural network and terminal equipment
CN108989882B (en) Method and apparatus for outputting music pieces in video
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN111276119A (en) Voice generation method and system and computer equipment
CN111816170B (en) Training of audio classification model and garbage audio recognition method and device
Dogan et al. A novel ternary and signum kernelled linear hexadecimal pattern and hybrid feature selection based environmental sound classification method
JP2002041464A (en) Method and device for identifying end-user transaction
Zhang et al. Learning audio sequence representations for acoustic event classification
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
CN114818864A (en) Gesture recognition method based on small samples
CN107578774B (en) Method and system for facilitating detection of time series patterns
CN112447187A (en) Device and method for recognizing sound event
CN116737895A (en) Data processing method and related equipment
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN113298822B (en) Point cloud data selection method and device, equipment and storage medium
CN113889086A (en) Training method of voice recognition model, voice recognition method and related device
CN109671440B (en) Method, device, server and storage medium for simulating audio distortion
CN113570044A (en) Customer loss analysis model training method and device
CN112905792A (en) Text clustering method, device and equipment based on non-text scene and storage medium
CN112017638A (en) Voice semantic recognition model construction method, semantic recognition method, device and equipment
CN115910042B (en) Method and device for identifying information type of formatted audio file
CN113032401B (en) Big data processing method and device based on special-shaped structure tree and related equipment
CN113536078B (en) Method, apparatus and computer storage medium for screening data
CN116206121A (en) Method and equipment for constructing welding data analysis result image interpretation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination