CN110853672A - Data expansion method and device for audio scene classification - Google Patents

Data expansion method and device for audio scene classification

Publication number
CN110853672A
Authority
CN
China
Prior art keywords
audio
key
events
document
words
Prior art date
Legal status
Granted
Application number
CN201911088131.9A
Other languages
Chinese (zh)
Other versions
CN110853672B (en)
Inventor
冷严
赵玮玮
林婵
齐广慧
王荣燕
李登旺
华庆
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN201911088131.9A
Publication of CN110853672A
Application granted
Publication of CN110853672B
Current legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor, of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The data expansion method first mines the key audio events in each audio document on the basis of LDA; secondly, for each class of audio scene, it separately counts the probability distribution of the number of occurrences of the key audio events in that class, the probability distribution of the occurrence positions of the key audio events for each occurrence count, and the probability distribution of the durations of the key audio events for each occurrence count; finally, it generates new audio documents according to the statistically obtained probability distributions. The data expansion method provided by the disclosure generates new data from key audio events and non-key audio events according to the probability distributions of the key audio events in the training set. The new data generated in this way closely simulate audio scenes in a real environment, so they effectively improve the classification performance of the model, and the method shows good stability across different network structures and different data sets.

Description

Data expansion method and device for audio scene classification
Technical Field
The present disclosure relates to the field of audio recognition technology, and in particular, to a data expansion method and apparatus for audio scene classification.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Audio scene classification refers to perceiving the surrounding environment through audio, and it has gained widespread attention over the last decade. It has broad application prospects and can be widely applied to human-computer interaction, smart wearables, context-aware computing, and the like. In recent years, with the rapid development of deep learning, traditional classification models have gradually been replaced by deep neural networks; the signal features extracted by a deep neural network have stronger expressive power and therefore yield better classification performance. However, the intra-class samples of audio scenes typically differ significantly: for a park scene, for example, samples collected on weekends and on weekdays may differ considerably. For the collected samples to cover the characteristics of an audio scene under various conditions, audio must be collected at different locations and during different time periods, which consumes a great deal of time and effort; as a result, the collected audio scene training samples are often insufficient relative to the huge number of parameters of a deep neural network.
To address the shortage of training samples for deep neural networks in the audio field, a number of data expansion methods have been proposed, such as pitch shifting, time stretching, adding left-channel information, and adding right-channel information. The inventors find that the data expansion methods commonly used at present have the following problems:
1) Pitch shifting and time stretching are unstable across data sets: on some data sets the newly generated data helps to improve the classification performance of the network, while on other data sets the improvement is negligible, and on some data sets classification performance even degrades. Experiments show that, with a VGG network as the classification model, pitch shifting and time stretching clearly help to improve classification performance on the evaluation data set of the audio scene recognition database of the international DCASE2016 challenge, but cause classification performance to degrade on the development data set.
2) Adding left-channel information or adding right-channel information is unstable across network structures: for some network structures the newly generated data helps to improve classification performance, while for others the improvement is small, and for some structures performance even degrades. Experiments show that both methods clearly help to improve classification performance on a VGG network structure, but on an LSTM network structure, adding left-channel information causes classification performance to degrade, and adding right-channel information improves it only slightly.
3) Adding left-channel or right-channel information is applicable only to multi-channel data sets; for a single-channel data set, the audio data contain no separate left- or right-channel information, so these methods cannot be applied.
Therefore, the existing data expansion methods are unstable across network structures or data sets, lack universality, and cannot reliably improve classification performance.
Disclosure of Invention
The present disclosure provides a data expansion method that mines the key audio events in audio documents based on the Latent Dirichlet Allocation (LDA) model, computes a series of probability distributions of the key audio events within each audio scene class, and finally generates new data according to the computed probability distributions.
To achieve this purpose, the present disclosure adopts the following technical scheme:
One or more embodiments provide a data expansion method for audio scene classification, comprising the following steps:
acquiring a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
extracting key audio events and non-key audio events in the training set by an LDA-based method;
separately counting the probability distributions of the key audio events for each class of audio scene;
and generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
One or more embodiments provide a data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
The present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the above method.
The present disclosure also provides a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the steps of the above-described method.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The data expansion method designed by the present disclosure generates new data from the key audio events and non-key audio events in the training set according to the probability distributions of the key audio events in the training set, thereby expanding the training set. The new data generated in this way closely simulate audio scenes in a real environment, so the generated data are highly effective, and adding them to the training set effectively improves the classification performance of the model.
(2) The data expansion method designed by the disclosure has good stability on different network structures, and the newly generated data can well help to improve the classification performance of the network for different network structures.
(3) The data expansion method designed by the disclosure has good stability on different data sets, and new data generated on different data sets can well help to improve the classification performance of the network.
(4) Unlike data expansion methods commonly used in the audio field at present, such as adding left-channel or right-channel information, the designed method can generate new data amounting to more than twice the original training data, so a user can generate as much new data as required.
(5) The LDA-based method for extracting key audio events from an audio stream has wide applicability: besides generating new data, it can also be used in many audio analysis fields such as audio topic analysis and audio summarization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure.
Fig. 1 is a flowchart of a data augmentation method for audio scene classification according to embodiment 1 of the present disclosure;
fig. 2 is a process diagram of extracting key audio events and non-key audio events according to embodiment 1 of the present disclosure.
Detailed Description
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof. It should be noted that, in the case of no conflict, the embodiments in the present disclosure and the features in the embodiments may be combined with each other. The embodiments are described in detail below with reference to the accompanying drawings.
Example 1
In one or more embodiments, as shown in fig. 1, a data expansion method for audio scene classification includes the following steps:
s1, acquiring a training set; the training set comprises a plurality of audio documents and audio scenes to which the audio documents belong; data collected by the audio data collector may be used.
S2, extracting key audio events and non-key audio events in the training set by adopting an LDA-based method;
s3, respectively counting the probability distribution of the key audio events aiming at each type of audio scene;
and S4, generating a new audio document according to the probability distribution of the key audio events, namely new data.
The above steps are specifically described below.
1. Extracting key and non-key audio events
A process diagram for extracting key audio events and non-key audio events is shown in fig. 2; the process mainly includes the following steps: (1) generating audio words; (2) extracting the topic distributions of the audio words and of the audio documents based on LDA; (3) extracting key audio events and non-key audio events.
(1) Generating audio words
In some embodiments, optionally, the step of generating audio words comprises:
101) performing framing processing on each audio document in the training set; each audio document may be framed according to a set frame length and frame shift, for example a frame length of 40 ms and a frame shift of 20 ms.
102) extracting the spectral feature data of each frame; for example, a 40-dimensional log-mel spectral feature can be extracted from each frame;
103) clustering the spectral feature data of all frames in the training set to obtain M clusters, and taking the cluster centers as audio words. The M clusters thus yield M audio words, which form an audio dictionary, and each frame is represented by the audio word corresponding to the cluster to which it belongs (see the sketch below).
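As an illustration only, this audio-word generation step might be sketched in Python as follows. The 40 ms frame length, 20 ms frame shift, 40-dimensional log-mel features, and clustering of all frames into M cluster centers follow the embodiment above; the sample rate, the value of M, the choice of k-means as the clustering algorithm, and the training_files list are assumptions, not prescribed by the disclosure.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

SR = 44100                       # assumed sample rate
FRAME = int(0.040 * SR)          # 40 ms frame length (embodiment)
SHIFT = int(0.020 * SR)          # 20 ms frame shift (embodiment)
N_MELS = 40                      # 40-dimensional log-mel feature (embodiment)
M = 500                          # number of audio words / clusters (assumed)

def frame_features(path):
    """Frame one audio document and return its log-mel features, one row per frame."""
    y, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=FRAME,
                                         hop_length=SHIFT, n_mels=N_MELS)
    return librosa.power_to_db(mel).T            # shape: (num_frames, N_MELS)

def build_audio_dictionary(training_files):
    """Cluster all frames; the cluster centers are the audio words (the dictionary)."""
    feats = [frame_features(p) for p in training_files]
    kmeans = KMeans(n_clusters=M, n_init=10).fit(np.vstack(feats))
    # each document becomes the sequence of audio-word indices of its frames
    word_seqs = [kmeans.predict(f) for f in feats]
    return kmeans, word_seqs
```

Here kmeans.cluster_centers_ plays the role of the M audio words, and word_seqs represents every training document as a sequence of audio words.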
(2) Extracting the topic distributions of the audio words and of the audio documents based on LDA
201) according to the obtained audio words, counting the occurrence times of each audio word in each audio document, and establishing an audio document-audio word co-occurrence matrix;
Optionally, a histogram of audio words may be used. Assuming there are N audio documents in the training set, each audio document consists of a sequence of audio words once the audio words have been generated. For each audio document, a histogram of its audio words is counted; stacking the audio-word histograms of all audio documents in the training set by column forms the audio document-audio word co-occurrence matrix D:

$$D = \begin{bmatrix} c^{(1)}(w_1) & \cdots & c^{(N)}(w_1) \\ \vdots & \ddots & \vdots \\ c^{(1)}(w_M) & \cdots & c^{(N)}(w_M) \end{bmatrix}$$

where d denotes the d-th audio document in the training set (d = 1, …, N), $w_i$ denotes the i-th audio word (i = 1, …, M), $c^{(d)}(w_i)$ denotes the number of occurrences of audio word $w_i$ in audio document d, and M denotes the number of audio words.
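A minimal sketch of this step, reusing the word_seqs and M from the previous sketch (both assumptions):

```python
import numpy as np

def cooccurrence_matrix(word_seqs, M):
    """Stack per-document audio-word histograms column-wise into D (M x N)."""
    N = len(word_seqs)
    D = np.zeros((M, N), dtype=np.int64)
    for d, seq in enumerate(word_seqs):
        D[:, d] = np.bincount(seq, minlength=M)   # histogram of audio words in doc d
    return D
```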
202) setting the number of topics to K and, according to the co-occurrence matrix D, obtaining through LDA learning the probability value of each audio document on each topic, thereby obtaining the topic distribution of each audio document; likewise obtaining through LDA learning, according to the co-occurrence matrix D, the probability value of each topic on each audio word; and obtaining the topic distribution of each audio word according to the probability values of the audio documents on the topics and of the topics on the audio words;
setting the number of the subjects as K, learning the probability value of each audio document on each subject and the probability value of each subject on each audio word through LDA, and respectively forming an 'audio document-subject' matrix theta and a 'subject-audio word' matrix
Figure BDA0002266047080000073
Figure BDA0002266047080000074
In which is shown
Figure BDA0002266047080000076
Representing the probability value of the audio document d on the K (K-1, …, K) -th topic, the "audio documentEach column theta of the topic matrix theta(d)(d-1, …, N) is the distribution of topics representing each audio document.
Figure BDA0002266047080000077
Indicating that the k topic is in the ith audio word wiThe probability value of (c).
Audio word wiProbability value on K (K-1, …, K) th topicCan be obtained by the following formula:
Figure BDA0002266047080000081
audio word wiThe topic distribution of (i ═ 1, …, M) isWherein the content of the first and second substances,
Figure BDA0002266047080000083
representing the ith audio word wiThe distribution of the subject matter of (a),
Figure BDA0002266047080000084
representing the ith audio word wiProbability value on K-th topic, K-1, …, K, ziRepresenting the ith audio word wiI-1, …, M indicating the number of audio words,representing the probability value of the audio document d on the kth subject, d is 1, …, N, N represents the number of audio documents in the training set,
Figure BDA0002266047080000086
indicating that the k topic is in the ith audio word wiThe probability value of (c).
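This step might be sketched with scikit-learn's LatentDirichletAllocation as below; the number of topics K is an assumption, and the normalizations merely convert the fitted parameters into the Θ, Φ, and word-topic distributions defined above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

K = 30   # number of topics (assumed)

def lda_distributions(D):
    """D: the M x N audio document-audio word co-occurrence matrix."""
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    doc_topic = lda.fit_transform(D.T)            # scikit-learn expects documents as rows
    theta = doc_topic / doc_topic.sum(axis=1, keepdims=True)            # N x K, rows = theta^(d)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x M, rows = topics
    # topic distribution of each audio word: the numerator sums theta over documents,
    # then the result is normalized over the K topics, as in the equation above
    word_topic = phi * theta.sum(axis=0)[:, None]                       # K x M
    word_topic /= word_topic.sum(axis=0, keepdims=True)
    return theta, phi, word_topic
```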
(3) Extracting key and non-key audio events
The method specifically comprises the following steps:
301) setting a similarity threshold, and calculating the similarity between the topic distribution of each audio word and the topic distribution of the audio document;
302) for audio document d (d = 1, …, N), taking the audio words whose similarity with the audio document is greater than the set threshold as the key audio words of the audio document, the remaining audio words being its non-key audio words;
303) for audio document d (d = 1, …, N), consecutive key audio words within it constitute key audio events, and consecutive non-key audio words within it constitute non-key audio events.
In this embodiment, optionally, for each audio document d (d = 1, …, N), the similarity $s\big(\theta^{(d)}, \hat{\theta}(w_i)\big)$ between the topic distribution $\theta^{(d)}$ of the audio document and the topic distribution $\hat{\theta}(w_i)$ of each audio word $w_i$ (i = 1, …, M) is calculated (for example, as the cosine similarity between the two distributions). Setting a similarity threshold T, the audio words satisfying $s\big(\theta^{(d)}, \hat{\theta}(w_i)\big) > T$ are taken as the key audio words of audio document d, and the audio words not satisfying it as its non-key audio words. For audio document d, consecutive key audio words therein constitute key audio events, and consecutive non-key audio words therein constitute non-key audio events.
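A sketch of this extraction under the assumptions above; cosine similarity stands in for the similarity measure, and the threshold T is an assumed tuning parameter:

```python
import numpy as np
from itertools import groupby

T = 0.5   # similarity threshold (assumed)

def extract_events(theta_d, word_topic, word_seq):
    """Split one document's audio-word sequence into key / non-key audio events.

    theta_d    : (K,)   topic distribution of the document
    word_topic : (K, M) topic distribution of every audio word
    word_seq   : audio-word index of each frame in the document
    """
    wt = word_topic[:, word_seq]                           # (K, num_frames)
    sim = (theta_d @ wt) / (np.linalg.norm(theta_d) * np.linalg.norm(wt, axis=0))
    is_key = sim > T                                       # key audio word per frame?
    events, pos = [], 0
    for key, run in groupby(is_key):                       # consecutive words -> events
        length = len(list(run))
        events.append(("key" if key else "non-key", pos, length))
        pos += length
    return events     # list of (type, occurrence position, duration) in frames
```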
2. For each type of audio scene, counting the probability distribution of key audio events
In step S3, optionally, a normalized-histogram method may be used to count the probability distributions of the key audio events, which at least include, for each class of audio scene: the probability distribution of the number of occurrences of key audio events, the probability distribution of the occurrence positions of key audio events for each occurrence count, and the probability distribution of the durations of key audio events for each occurrence count.
In some embodiments, the occurrence position of a key audio event may be the distance between the starting position of the key audio event and the starting position of the audio document in which it occurs.
In this embodiment, for each class of audio scene in the training set, the key audio event segments and non-key audio event segments in each audio document of the scene are collected; the occurrence position and duration of each key audio event in its audio document are recorded, as is the number of occurrences of key audio events in each audio document of the scene.
The probability distribution P(num) of the number of occurrences of key audio events in the same audio scene is counted by the normalized-histogram method, where num denotes the number of occurrences of key audio events.
Audio documents of the same audio scene with the same number of occurrences of key audio events are collected, and the probability distribution P(loc|num) of the occurrence positions of key audio events for each occurrence count in the same audio scene is counted by the normalized-histogram method, where loc denotes the occurrence position of a key audio event.
Similarly, over the audio documents of the same audio scene with the same number of occurrences of key audio events, the probability distribution P(dur|num) of the durations of key audio events for each occurrence count is counted by the normalized-histogram method, where dur denotes the duration of a key audio event.
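A minimal sketch of these statistics, assuming the events come from the extraction step above and that positions and durations are kept as integer frame counts (the binning granularity is an assumption):

```python
from collections import Counter, defaultdict

def scene_statistics(scene_docs):
    """scene_docs: one list per document of that scene's key events [(loc, dur), ...]."""
    num_counts = Counter(len(doc) for doc in scene_docs)        # occurrences per document
    loc_counts = defaultdict(Counter)                           # counts for P(loc|num)
    dur_counts = defaultdict(Counter)                           # counts for P(dur|num)
    for doc in scene_docs:
        for loc, dur in doc:
            loc_counts[len(doc)][loc] += 1
            dur_counts[len(doc)][dur] += 1

    def normalize(counter):                                     # normalized histogram
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    p_num = normalize(num_counts)
    p_loc = {n: normalize(c) for n, c in loc_counts.items()}
    p_dur = {n: normalize(c) for n, c in dur_counts.items()}
    return p_num, p_loc, p_dur
```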
3. For each class of audio scene, generating new data according to the probability distributions
For each class of audio scene, a new audio document is generated according to the probability distributions, specifically:
31. obtaining, by the roulette-wheel method (see the sketch after this list) according to the probability distributions, the number of occurrences of key audio events in the new audio document to be generated, together with the occurrence position and duration of each key audio event;
32. for each key audio event, selecting from the key audio events of this audio scene class the event whose duration is closest to the obtained duration, and placing it at the corresponding occurrence position in the new audio document;
33. for the interval between two key audio events in the new audio document, selecting from the non-key audio events of this class the segment whose duration is closest to the interval duration, and placing it at the interval position;
34. if the occurrence position of the first key audio event in the new audio document is not equal to 0, selecting from the non-key audio event segments of this class the segment whose duration is closest to that occurrence position value, and placing it before the first key audio event.
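The roulette-wheel method draws a value with probability proportional to its weight in the distribution; a minimal sketch over the normalized histograms returned above:

```python
import random

def roulette(dist):
    """Draw one value from a {value: probability} dict by roulette-wheel selection."""
    r = random.random()
    acc = 0.0
    for value, p in dist.items():
        acc += p              # accumulate sector widths until r falls inside one
        if r <= acc:
            return value
    return value              # guard against floating-point round-off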
In this embodiment, for each class of audio scene, a new audio document is generated from the probability distribution P(num) of the number of occurrences of key audio events in the class, the probability distribution P(loc|num) of the occurrence positions for each occurrence count, and the probability distribution P(dur|num) of the durations for each occurrence count, according to the following steps:
(1) generating the number of occurrences num_i of key audio events by the roulette-wheel method according to P(num), where num_i denotes the number of occurrences of key audio events in the i-th newly generated audio document;
(2) generating the occurrence positions loc_j (j = 1, …, num_i) of the key audio events by the roulette-wheel method according to P(loc|num_i), where loc_j denotes the occurrence position of the j-th key audio event in the newly generated audio document;
(3) generating the durations dur_j (j = 1, …, num_i) of the key audio events by the roulette-wheel method according to P(dur|num_i), where dur_j denotes the duration of the j-th key audio event in the newly generated audio document;
(4) for each key audio event j (j = 1, …, num_i), selecting from the key audio event segments within the class the segment whose duration is closest to dur_j, and placing it at position loc_j;
(5) for the interval between two adjacent key audio events, selecting from the non-key audio event segments within the class the segment whose duration is closest to the interval duration, and placing it at the interval position;
(6) for the first key audio event, if its occurrence position loc_1 is not equal to 0, selecting from the non-key audio event segments within the class the segment whose duration is closest to the value of loc_1, and placing it before the first key audio event;
(7) repeating steps (1) to (6) to generate more new audio documents.
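Putting the pieces together, a hedged sketch of steps (1)-(6): roulette and the statistics come from the sketches above, key_segments/nonkey_segments are assumed per-class pools of (duration, frames) pairs, doc_len is an assumed target document length, and closest-duration matching is used without trimming segments to exact length.

```python
def generate_document(p_num, p_loc, p_dur, key_segments, nonkey_segments, doc_len):
    """Assemble one new audio document as a concatenation of event segments."""
    num = roulette(p_num)                                      # step (1)
    locs = sorted(roulette(p_loc[num]) for _ in range(num))    # step (2)
    durs = [roulette(p_dur[num]) for _ in range(num)]          # step (3)

    def closest(segments, target):
        """Pick the segment whose duration is closest to target; return its frames."""
        return min(segments, key=lambda s: abs(s[0] - target))[1]

    doc, cursor = [], 0
    for loc, dur in zip(locs, durs):
        if loc > cursor:                                       # steps (5)/(6): fill gaps
            doc.extend(closest(nonkey_segments, loc - cursor))
        doc.extend(closest(key_segments, dur))                 # step (4)
        cursor = loc + dur
    if cursor < doc_len:                                       # trailing non-key filler
        doc.extend(closest(nonkey_segments, doc_len - cursor))
    return doc
```

Repeating this call, per step (7), yields as many new audio documents as required.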
Example 2
This embodiment provides a data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents according to the probability distributions of the key audio events, the generated new audio documents being the new data.
Example 3
This embodiment provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of embodiment 1.
Example 4
The present embodiment provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of embodiment 1.
The electronic device provided by the present disclosure may be a mobile or non-mobile terminal. Non-mobile terminals include desktop computers; mobile terminals include smart phones (such as Android phones and iOS phones), smart glasses, smart watches, smart bracelets, tablet computers, notebook computers, personal digital assistants, and other mobile internet devices capable of wireless communication.
It should be understood that in the present disclosure, the processor may be a central processing unit CPU, but may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here. Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a division of one logic function, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A data expansion method for audio scene classification is characterized by comprising the following steps:
acquiring a training set; the training set comprises a plurality of audio documents and audio scenes to which the audio documents belong;
extracting key audio events and non-key audio events in a training set by adopting an LDA-based method;
separately counting the probability distributions of the key audio events for each class of audio scene;
and generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
2. The data expansion method for audio scene classification according to claim 1, wherein extracting key audio events and non-key audio events in the training set by an LDA-based method comprises the following steps:
generating audio words according to the audio data in the training set;
according to the obtained audio words, counting the number of occurrences of each audio word in each audio document, and establishing an audio document-audio word co-occurrence matrix;
setting the number of topics, and obtaining, by LDA learning according to the co-occurrence matrix, the probability value of each audio document on each topic, thereby obtaining the topic distribution of each audio document; obtaining, by LDA learning according to the co-occurrence matrix, the probability value of each topic on each audio word; and obtaining the topic distribution of each audio word from the probability values of the audio documents on the topics and of the topics on the audio words;
and obtaining the key audio events and non-key audio events in the training set by calculating the similarity between the topic distributions of the audio words and the topic distributions of the audio documents.
3. The data expansion method for audio scene classification according to claim 2, wherein generating audio words according to the audio data in the training set specifically comprises:
performing framing processing on each audio document in the training set;
extracting the frequency spectrum characteristic data of each frame;
clustering the spectral feature data of all frames in the training set to obtain M clusters, taking the cluster centers as audio words so that the M clusters yield M audio words forming an audio dictionary, and representing each frame by the audio word corresponding to the cluster to which it belongs;
or
obtaining the topic distribution of each audio word according to the probability value of each audio document on each topic and the probability value of each topic on each audio word, specifically:

$$p(z_i = k) = \frac{\sum_{d=1}^{N} \theta_k^{(d)}\,\varphi_k(w_i)}{\sum_{k'=1}^{K} \sum_{d=1}^{N} \theta_{k'}^{(d)}\,\varphi_{k'}(w_i)}$$

wherein the topic distribution of the i-th audio word $w_i$ is $\hat{\theta}(w_i) = (p(z_i = 1), \ldots, p(z_i = K))$; $p(z_i = k)$ denotes the probability value of $w_i$ on the k-th topic, k = 1, …, K; $z_i$ denotes the topic of the i-th audio word $w_i$; i = 1, …, M, with M denoting the number of audio words; $\theta_k^{(d)}$ denotes the probability value of audio document d on the k-th topic, d = 1, …, N, with N denoting the number of audio documents in the training set; and $\varphi_k(w_i)$ denotes the probability value of the k-th topic on the i-th audio word $w_i$.
4. The data expansion method for audio scene classification according to claim 2, wherein obtaining the key audio events and non-key audio events in the training set by calculating the similarity between the topic distributions of the audio words and the topic distributions of the audio documents comprises the following steps:
setting a similarity threshold, and calculating the similarity between the topic distribution of each audio word and the topic distribution of the audio document;
for audio document d, taking the audio words whose similarity is greater than the set threshold as the key audio words of the audio document, the remaining audio words being its non-key audio words;
for audio document d, consecutive key audio words in audio document d constitute key audio events, and consecutive non-key audio words in audio document d constitute non-key audio events.
5. The data expansion method for audio scene classification according to claim 1, wherein, for each class of audio scene, the probability distributions of the key audio events are counted separately by a normalized-histogram method.
6. The data expansion method for audio scene classification according to claim 1, wherein the probability distributions of the key audio events at least comprise, for each class of audio scene: the probability distribution of the number of occurrences of key audio events, the probability distribution of the occurrence positions of key audio events for each occurrence count, and the probability distribution of the durations of key audio events for each occurrence count.
7. The data expansion method for audio scene classification according to claim 1, wherein, for each class of audio scene, generating a new audio document according to the probability distributions of the key audio events specifically comprises:
obtaining, by the roulette-wheel method according to the probability distributions of the key audio events, the number of occurrences of key audio events in the new audio document to be generated, together with the occurrence position and duration of each key audio event;
for each key audio event, selecting from the key audio events of this audio scene class the event whose duration is closest to the obtained duration, and placing it at the corresponding occurrence position in the new audio document;
for the interval between two key audio events in the new audio document, selecting from the non-key audio events of this class the segment whose duration is closest to the interval duration, and placing it at the interval position;
and if the occurrence position of the first key audio event in the new audio document is not equal to 0, selecting from the non-key audio event segments of this class the segment whose duration is closest to that occurrence position value, and placing it before the first key audio event.
8. A data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201911088131.9A 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification Active CN110853672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088131.9A CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088131.9A CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Publications (2)

Publication Number Publication Date
CN110853672A true CN110853672A (en) 2020-02-28
CN110853672B CN110853672B (en) 2022-04-01

Family

ID=69600163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088131.9A Active CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Country Status (1)

Country Link
CN (1) CN110853672B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094233A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Modeling Topics Using Statistical Distributions
CN104269169A (en) * 2014-09-09 2015-01-07 山东师范大学 Classifying method for aliasing audio events
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model
CN106205609A (en) * 2016-07-05 2016-12-07 山东师范大学 A kind of based on audio event and the audio scene recognition method of topic model and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750462A (en) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 Audio processing method, device and equipment
CN112530399A (en) * 2020-11-30 2021-03-19 上海明略人工智能(集团)有限公司 Method and system for expanding voice data, electronic equipment and storage medium
CN113220932A (en) * 2021-04-02 2021-08-06 山东师范大学 Audio scene recognition method and device based on multitask learning
CN113220932B (en) * 2021-04-02 2022-06-10 山东师范大学 Audio scene recognition method and device based on multitask learning
CN117373488A (en) * 2023-12-08 2024-01-09 富迪科技(南京)有限公司 Audio real-time scene recognition system
CN117373488B (en) * 2023-12-08 2024-02-13 富迪科技(南京)有限公司 Audio real-time scene recognition system

Also Published As

Publication number Publication date
CN110853672B (en) 2022-04-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant