CN111241336A - Audio scene recognition method and device, electronic equipment and medium


Info

Publication number
CN111241336A
Authority
CN
China
Prior art keywords
audio
audio data
event detection
data
detection result
Prior art date
Legal status
Pending
Application number
CN202010015772.8A
Other languages
Chinese (zh)
Inventor
陈剑超
肖龙源
李稀敏
蔡振华
刘晓葳
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-06-05
Application filed by Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202010015772.8A
Publication of CN111241336A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 - Querying
    • G06F16/632 - Query formulation

Abstract

The application provides an audio scene recognition method and device, an electronic device, and a computer-readable medium. The method comprises the following steps: receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments; performing audio event detection based on the plurality of audio segments to obtain an audio event detection result; and performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.

Description

Audio scene recognition method and device, electronic equipment and medium
Technical Field
The present application relates to the field of audio recognition technologies, and in particular, to an audio scene recognition method and apparatus, an electronic device, and a computer-readable medium.
Background
Audio scene recognition refers to identifying, from a piece of audio data, the environment in which that audio was produced; in other words, it is the perception of the surrounding environment through audio information. The technology has very wide application value: when used on mobile terminal devices, it allows the device to sense its surroundings well and adjust its state automatically.
Text-based audio retrieval stores each piece of audio as an object in a database, generally labeled with an audio name (keyword) and textual information; retrieval then amounts to exact or fuzzy search over the keyword descriptions of the audio. Text-based audio retrieval is therefore carried out entirely by text retrieval techniques, and the audio content itself plays no role in the retrieval process. Most audio retrieval systems are text-based; for example, current music search engines generally retrieve audio in this way.
In real life, the sounds we encounter are extremely varied: natural sounds such as wind, rain, animal calls, and running water; everyday sounds such as the roar of machinery and automobile engines; and the various audio, speech, and synthesized sounds encountered through computers.
In traditional text-based audio retrieval, the audio in the library must be summarized and annotated manually in advance, and the retrieval results depend entirely on manually annotated information such as the audio name, number, and notes. Because every section of audio must be labeled with text, a large audio database requires a great deal of manual labor. Moreover, text labeling is highly subjective: different people may describe the same section of audio differently, so the annotations become inconsistent, and a few words can hardly express the full content of an audio recording.
Disclosure of Invention
The application aims to provide an audio scene identification method and device, electronic equipment and a computer readable medium.
A first aspect of the present application provides an audio scene recognition method, including:
receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
In some possible implementations, performing audio segmentation on the audio data to form a plurality of audio segments includes:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, performing audio event detection based on the plurality of audio segments to obtain an audio event detection result includes:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model includes:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
A second aspect of the present application provides an audio scene recognition apparatus, including:
the segmentation module is used for receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments;
the detection module is used for performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the recognition module is used for carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
In some possible implementations, the segmentation module is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, the detection module is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, the identification module is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
A third aspect of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, performs the method of the first aspect of the application.
A fourth aspect of the present application provides a computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of the first aspect of the present application.
Compared with the prior art, the audio scene recognition method, device, electronic device, and medium provided by the application receive audio data and perform audio segmentation on the audio data to form a plurality of audio segments; perform audio event detection based on the plurality of audio segments to obtain an audio event detection result; and perform scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 illustrates a flow chart of an audio scene recognition method provided by some embodiments of the present application;
Fig. 2 illustrates a flow chart of an audio scene recognition method provided by some embodiments of the present application;
Fig. 3 illustrates a flow diagram for spectral clustering of segmented audio segments provided by some embodiments of the present application;
Fig. 4 shows a schematic diagram of an audio scene recognition apparatus provided by some embodiments of the present application;
Fig. 5 illustrates a schematic diagram of an electronic device provided by some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
In addition, the terms "first" and "second", etc. are used to distinguish different objects rather than to describe a particular order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may also include other steps or elements not listed, or steps or elements inherent to such a process, method, article, or apparatus.
Audio retrieval can be viewed as a pattern matching problem. An audio retrieval system typically includes two stages: training (database generation) and pattern matching (database query).
The first step in audio retrieval is to build the database: features are extracted from the audio data; the audio data is loaded into the original-audio part of the database and the features into the feature-library part; the audio data is then clustered by its features, and the clustering information is loaded into the clustering-parameter part. Once the database is built, the audio information can be retrieved.
Audio retrieval mainly uses query-by-example over features: the user selects a sample through the query interface, sets attribute values, and submits the query. The system extracts features from the sample and combines them with the attribute values to form a query feature vector. The retrieval engine then matches this feature vector against the clustering parameter set, matches a certain amount of corresponding data in the feature library and the original audio library in descending order of correlation, and returns the results to the user through the query interface. The original audio library stores the audio data; the feature library stores the features of the audio data, one record per item; and the clustering parameter library is the parameter set obtained by clustering the audio features.
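As an illustration of this matching stage, here is a minimal sketch in Python (all names are hypothetical; the patent does not specify the similarity measure, so cosine similarity is assumed):

```python
import numpy as np

def retrieve(query_vec, feature_library, top_k=10):
    """Rank records in the feature library by correlation with the query.

    query_vec: (d,) feature vector extracted from the user's sample.
    feature_library: (n, d) matrix, one row per stored audio record.
    Returns the indices of the top_k most similar records, best first.
    """
    # Cosine similarity stands in for the "correlation" described above;
    # this is an assumption, not a detail given by the patent.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    lib = feature_library / (
        np.linalg.norm(feature_library, axis=1, keepdims=True) + 1e-12)
    sims = lib @ q
    return np.argsort(sims)[::-1][:top_k]
```

In a full system, the returned indices would be used to fetch the corresponding records from the feature library and the original audio library.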
The embodiment of the application provides an audio scene recognition method and device, an electronic device and a computer readable medium, which are described below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, which show flowcharts of an audio scene recognition method provided in some embodiments of the present application: as shown in Fig. 1, the audio scene recognition method may include the following steps:
step S101: receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
specifically, the scene sound is generally composed of a structured foreground sound and an unstructured background sound, and the segmentation based on the scene change is performed based on the unstructured background sound. A background sound is first identified from the test audio and then the likely boundaries of the test audio are determined from the background sound. This algorithm avoids feature extraction of the underlying audio. The segmentation of the audio data is achieved by sampling values of the original audio data.
In this embodiment, step S101 may be specifically implemented as:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
Specifically, in the audio segmentation process, a section of background sound in the test audio is first found by a simple background sound recognition algorithm; the waveform values of the test audio are then extracted to form a matrix, and this matrix and the background sound feature vector are projected into a feature space; finally, the normalized distance between the test audio feature vector and the background sound feature vector is computed to determine the positions of the segmentation points, thereby realizing audio segmentation.
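The following is a minimal sketch of this segmentation step, assuming fixed-length frames and a simple distance threshold (the frame length, threshold value, and exact normalization are illustrative choices, not given by the patent):

```python
import numpy as np

def segment_audio(samples, bg_frame, frame_len=1024, threshold=0.5):
    """Split raw audio at points where frames deviate from the background.

    samples: 1-D array of raw waveform sample values.
    bg_frame: a representative background-sound frame of length frame_len,
              standing in for the background feature vector.
    Returns a list of (start, end) sample indices, one pair per segment.
    """
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Normalized distance between each frame and the background sound.
    diff = np.linalg.norm(frames - bg_frame, axis=1)
    norm = np.linalg.norm(frames, axis=1) + np.linalg.norm(bg_frame) + 1e-12
    dist = diff / norm

    # Place a segmentation point wherever the distance crosses the threshold.
    above = dist > threshold
    boundaries = np.flatnonzero(np.diff(above.astype(int))) + 1

    edges = [0, *[int(b) * frame_len for b in boundaries], n_frames * frame_len]
    return list(zip(edges[:-1], edges[1:]))
```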
Step S102: performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
in this embodiment, step S102 may be specifically implemented as:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
Specifically, in audio event detection, similar segmented segments of the test audio are clustered together by a spectral clustering algorithm; each resulting cluster is called an audio event. An audio event has four attributes: frequency of occurrence, total length of time, importance, and label. Frequency of occurrence and total length of time are relatively easy to calculate, and the importance of an audio event is computed by combining these two attributes with the average length of the audio segments in the event. To label an audio event, singular value decomposition (SVD) is applied to compute its dominant feature vectors (DFVs); the similarity between the test audio event and each training audio event is then computed from the DFVs, and the label of the training event with the largest similarity value becomes the label of the test event, as shown in Fig. 2.
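The attribute computation might look as follows; the patent states only that frequency, total duration, and average segment length are combined into the importance score, so the product used here is an assumed combination:

```python
import numpy as np

def event_attributes(segment_lengths):
    """Compute frequency of occurrence, total duration, and importance
    for one audio event (one cluster of similar segments).

    segment_lengths: durations, in seconds, of the segments in the event.
    """
    freq = len(segment_lengths)                  # number of occurrences
    total_time = float(np.sum(segment_lengths))  # total length of time
    avg_len = total_time / freq                  # average segment length
    # Assumed combination; the patent does not give the exact formula.
    importance = freq * total_time * avg_len
    return freq, total_time, importance
```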
Spectral clustering algorithm: spectral clustering is applied twice. The first pass finds the dominant feature vectors of the segmented audio segments; the mean of these dominant feature vectors is then used as the feature vector representing each segment, and the second pass clusters those vectors, which greatly improves the clustering result. Fig. 3 shows the flow of spectral clustering over the segmented audio segments: features are extracted from the segments S1 to SN to give feature vectors x_ij; a first spectral clustering yields intermediate clusters (C_i1 and the like) with their feature vectors; the dominant feature vectors V1 to VN are then computed; and a second spectral clustering produces the final clusters C1 to Ck.
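A sketch of the two-pass procedure using scikit-learn's SpectralClustering follows; the cluster counts and the way each segment is summarized after the first pass are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def two_pass_clustering(segment_frames, n_inner=4, n_events=5):
    """segment_frames: list of (n_frames_i, d) arrays, one per audio
    segment S1..SN. Returns an event label for each segment."""
    segment_vecs = []
    for frames in segment_frames:
        # Pass 1: cluster the frames within one segment, then summarize
        # the segment by the mean vector of its largest cluster (an
        # approximation of "mean of the dominant feature vectors").
        labels = SpectralClustering(n_clusters=n_inner,
                                    random_state=0).fit_predict(frames)
        biggest = np.bincount(labels).argmax()
        segment_vecs.append(frames[labels == biggest].mean(axis=0))

    # Pass 2: cluster the per-segment vectors; each cluster of segments
    # forms one audio event.
    return SpectralClustering(n_clusters=n_events,
                              random_state=0).fit_predict(np.vstack(segment_vecs))
```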
Dominant feature vector representation: an audio event usually comprises several audio segments. After the first spectral clustering pass, each audio segment can be represented by a feature vector, and the audio event is represented by the feature matrix formed from the feature vectors of its segments. To capture the prominent features of the audio event, it is represented by its dominant feature vectors (DFVs), which are computed by applying SVD over the feature space of the audio event.
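The DFV computation and the similarity-based labeling described above can be sketched as follows; taking the leading singular vector as the DFV and using cosine similarity are assumptions consistent with, but not spelled out by, the description:

```python
import numpy as np

def dominant_feature_vector(feature_matrix):
    """feature_matrix: (n_segments, d) matrix of one audio event.
    Returns the singular vector associated with the largest singular
    value, used here as the event's dominant feature vector (DFV)."""
    _, _, vt = np.linalg.svd(feature_matrix, full_matrices=False)
    return vt[0]

def assign_label(test_event_matrix, training_events):
    """training_events: list of (label, feature_matrix) pairs.
    Returns the label of the most similar training audio event."""
    v = dominant_feature_vector(test_event_matrix)
    best_label, best_sim = None, -np.inf
    for label, mat in training_events:
        w = dominant_feature_vector(mat)
        # abs() because singular vectors are defined only up to sign;
        # singular vectors are unit norm, so the dot product is cosine.
        sim = abs(float(v @ w))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```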
Step S103: and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
The preset recognition model may be an HMM (hidden Markov model).
In this embodiment, step S103 may be specifically implemented as:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
Specifically, in audio scene recognition, mislabeled audio segments within audio events are first corrected by a context model. The main idea is to use the order and frequency of occurrence among audio events to construct an audio event occurrence probability matrix and to compute the probability of two audio events occurring together. For a given audio scene training set, the occurrence probability relations of the audio events in a particular scene can be computed; a likelihood function is usually formed and then maximized, yielding the optimal audio event occurrence probability matrix over the training set. The mislabeled audio segments within the audio events are then corrected by a greedy search algorithm. Next, the audio events are sorted by importance and expanded on the event axis in descending order to obtain audio scene data; finally, recognition and classification of the audio scene are achieved through an HMM algorithm.
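A compact sketch of this final stage appears below: the event occurrence probability matrix is estimated from training sequences of event labels, and each candidate scene's HMM is scored with the forward algorithm. The discrete-observation formulation, the smoothing, and all parameter shapes are illustrative assumptions:

```python
import numpy as np

def occurrence_matrix(event_sequences, n_events):
    """Estimate the audio event occurrence probability matrix from
    training sequences of event ids, with add-one smoothing."""
    A = np.ones((n_events, n_events))
    for seq in event_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            A[a, b] += 1
    return A / A.sum(axis=1, keepdims=True)

def log_likelihood(seq, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.
    pi: (k,) initial state probabilities; A: (k, k) state transitions;
    B: (k, m) per-state emission probabilities over m event labels."""
    alpha = pi * B[:, seq[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for obs in seq[1:]:
        alpha = (alpha @ A) * B[:, obs]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

def recognize_scene(seq, scene_models):
    """scene_models: dict mapping scene name -> (pi, A, B).
    Returns the scene whose HMM best explains the event sequence."""
    return max(scene_models, key=lambda s: log_likelihood(seq, *scene_models[s]))
```

Scoring every scene's HMM and taking the argmax realizes the "recognition and classification" step; the greedy relabeling correction is omitted here for brevity.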
The audio scene recognition method can be applied to a client. In the embodiments of the present application, the client may include hardware or software. When the client includes hardware, it may be any of various electronic devices having a display screen and supporting information interaction, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the client includes software, it may be installed in such an electronic device and implemented as multiple pieces of software or software modules, or as a single piece of software or a single module; no specific limitation is imposed here.
Compared with the prior art, the audio scene recognition method provided by the embodiments of the application receives audio data and performs audio segmentation on it to form a plurality of audio segments; performs audio event detection based on the plurality of audio segments to obtain an audio event detection result; and performs scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model. This approach avoids feature extraction of the underlying audio: a background sound is first identified from the audio data, and the likely boundaries within the audio data are then determined from that background sound. Because segmentation operates directly on the sample values of the original audio data, the amount of manual annotation is reduced, the running efficiency of the related algorithms is improved, and a unified standard is provided for audio annotation.
The foregoing embodiments provide an audio scene recognition method; correspondingly, the present application also provides an audio scene recognition apparatus. The apparatus provided by the embodiments of the application can implement the audio scene recognition method above and can be realized through software, hardware, or a combination of the two. For example, the apparatus may comprise integrated or separate functional modules or units to perform the corresponding steps of the above method. Please refer to Fig. 4, which illustrates a schematic diagram of an audio scene recognition apparatus provided by some embodiments of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant points, refer to the description of the method embodiments. The apparatus embodiments described below are merely illustrative.
As shown in Fig. 4, the audio scene recognition apparatus 10 may include:
the segmentation module 101 is configured to receive audio data, and perform audio segmentation on the audio data to form a plurality of audio segments;
the detection module 102 is configured to perform audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the identification module 103 is configured to perform scene identification and labeling on the audio data according to the audio event detection result and a preset identification model.
In some possible implementations, the segmentation module 101 is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
In some possible implementations, the detection module 102 is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
In some possible implementations, the identification module 103 is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
The audio scene recognition device 10 provided in the embodiment of the present application and the audio scene recognition method provided in the foregoing embodiment of the present application have the same beneficial effects and the same inventive concepts.
The embodiment of the present application further provides an electronic device corresponding to the audio scene recognition method provided by the foregoing embodiment, where the electronic device may be an electronic device for a client, such as a mobile phone, a notebook computer, a tablet computer, a desktop computer, and the like, so as to execute the audio scene recognition method.
Please refer to Fig. 5, which illustrates a schematic diagram of an electronic device provided by some embodiments of the present application. As shown in Fig. 5, the electronic device 20 includes: a processor 200, a memory 201, a bus 202, and a communication interface 203, where the processor 200, the communication interface 203, and the memory 201 are connected through the bus 202. The memory 201 stores a computer program executable on the processor 200, and when the processor 200 runs the computer program, it performs the audio scene recognition method provided by any of the foregoing embodiments.
The memory 201 may include high-speed random access memory (RAM) and may further include non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 203 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, and the like.
Bus 202 may be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. The memory 201 stores a program; upon receiving an execution instruction, the processor 200 executes it. The audio scene recognition method disclosed in any embodiment of the present application may be applied in, or implemented by, the processor 200.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 200.
The electronic device provided by the embodiments of the application shares the same inventive concept as the audio scene recognition method provided above and has the same beneficial effects as the method it adopts, runs, or implements.
The present application further provides a computer-readable medium corresponding to the audio scene recognition method provided in the foregoing embodiments, on which a computer program (i.e., a program product) is stored; when executed by a processor, the computer program performs the audio scene recognition method provided in any of the foregoing embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present disclosure and should be covered by the claims and the specification.

Claims (10)

1. An audio scene recognition method, comprising:
receiving audio data, and performing audio segmentation on the audio data to form a plurality of audio segments;
performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
2. The method of claim 1, wherein performing audio segmentation on the audio data to form a plurality of audio segments comprises:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
3. The method of claim 2, wherein the performing audio event detection based on the plurality of audio segments to obtain an audio event detection result comprises:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
4. The method of claim 3, wherein the performing scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model comprises:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
5. An audio scene recognition apparatus, comprising:
the segmentation module is used for receiving audio data and performing audio segmentation on the audio data to form a plurality of audio segments;
the detection module is used for performing audio event detection based on the plurality of audio segments to obtain an audio event detection result;
and the recognition module is used for carrying out scene recognition and labeling on the audio data according to the audio event detection result and a preset recognition model.
6. The apparatus of claim 5, wherein the segmentation module is specifically configured to:
inputting the audio data into a preset background sound recognition model to obtain background sounds in the audio data;
extracting waveform values of the audio data to form a matrix, projecting the matrix and the background sound into a feature space, and obtaining a feature vector of the audio data and a feature vector of the background sound;
calculating a normalized distance between the feature vector of the audio data and the feature vector of the background sound;
and determining the position of a division point of the audio data according to the normalized distance, and performing audio division on the audio data according to the position of the division point to form a plurality of audio segments.
7. The apparatus according to claim 6, wherein the detection module is specifically configured to:
according to a preset spectral clustering algorithm, carrying out audio event detection on the plurality of audio segments to obtain an audio event detection result of each audio segment;
the audio event detection result comprises: frequency of occurrence of audio events, total length of time, importance, and label.
8. The apparatus according to claim 7, wherein the identification module is specifically configured to:
according to the audio event detection result, sorting the audio events by importance and expanding them on the event axis in descending order to obtain audio scene data;
and carrying out scene recognition and labeling on the audio scene data corresponding to the audio data through a preset recognition model.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method according to any of claims 1 to 4.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 4.

Priority Applications (1)

Application Number: CN202010015772.8A
Priority Date: 2020-01-07
Filing Date: 2020-01-07
Title: Audio scene recognition method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number: CN202010015772.8A
Priority Date: 2020-01-07
Filing Date: 2020-01-07
Title: Audio scene recognition method and device, electronic equipment and medium

Publications (1)

Publication Number: CN111241336A
Publication Date: 2020-06-05

Family

ID=70870342

Family Applications (1)

Application Number: CN202010015772.8A
Title: Audio scene recognition method and device, electronic equipment and medium

Country Status (1)

CN: CN111241336A



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477798A (en) * 2009-02-17 2009-07-08 北京邮电大学 Method for analyzing and extracting audio data of set scene
CN103226948A (en) * 2013-04-22 2013-07-31 山东师范大学 Audio scene recognition method based on acoustic events
CN104167211A (en) * 2014-08-08 2014-11-26 南京大学 Multi-source scene sound abstracting method based on hierarchical event detection and context model
US20190115045A1 (en) * 2017-10-12 2019-04-18 Qualcomm Incorporated Audio activity tracking and summaries

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王公友: "Content-based audio analysis and scene recognition" (基于内容的音频分析与场景识别), China Masters' Theses Full-text Database (Information Science and Technology series) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015942A (en) * 2020-08-28 2020-12-01 上海掌门科技有限公司 Audio processing method and device
CN113645439A (en) * 2021-06-22 2021-11-12 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device
CN113645439B (en) * 2021-06-22 2022-07-29 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication

Application publication date: 20200605