CN110853672A - Data expansion method and device for audio scene classification - Google Patents

Data expansion method and device for audio scene classification

Publication number
CN110853672A
Authority
CN
China
Prior art keywords
audio
key
events
document
words
Prior art date
Legal status
Granted
Application number
CN201911088131.9A
Other languages
Chinese (zh)
Other versions
CN110853672B (en)
Inventor
冷严
赵玮玮
林婵
齐广慧
王荣燕
李登旺
华庆
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University
Priority to CN201911088131.9A
Publication of CN110853672A
Application granted
Publication of CN110853672B
Current legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor, of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The data expansion method first mines the key audio events in each audio document on the basis of LDA; secondly, for each class of audio scene, it separately counts the probability distribution of the number of occurrences of the key audio events in that class, the probability distribution of the occurrence positions of the key audio events for each occurrence count, and the probability distribution of the durations of the key audio events for each occurrence count; finally, it generates new audio documents according to the statistically obtained probability distributions. The data expansion method provided by the disclosure generates new data from key audio events and non-key audio events according to the probability distributions of the key audio events in the training set. The new data generated in this way closely simulate audio scenes in a real environment, so they effectively improve the classification performance of the model, and the method shows good stability across different network structures and different data sets.

Description

Data expansion method and device for audio scene classification
Technical Field
The present disclosure relates to the field of audio recognition technology, and in particular, to a data expansion method and apparatus for audio scene classification.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Audio scene classification refers to perceiving the surrounding environment through audio, and it has gained widespread attention over the last decade. It has broad application prospects and can be widely applied to human-computer interaction, smart wearables, context-aware computing, and the like. In recent years, with the rapid development of deep learning, traditional classification models have gradually been replaced by deep neural networks; the signal features extracted by a deep neural network have stronger expressive power and therefore yield better classification performance. However, the intra-class samples of audio scenes typically differ significantly: for a park scene, for example, samples collected on weekends and on weekdays may differ considerably. For the collected samples to cover the characteristics of an audio scene under various conditions, audio must be collected at different locations and during different time periods, which consumes a great deal of time and effort; as a result, the collected audio scene training samples are often insufficient relative to the huge number of parameters of a deep neural network.
To address the shortage of training samples for deep neural networks in the audio field, a number of data expansion methods have been proposed, such as pitch shifting, time stretching, adding left-channel information, and adding right-channel information. The inventors find that the data expansion methods commonly used at present have the following problems:
1) Pitch shifting and time stretching are unstable across data sets: on some data sets the newly generated data helps to improve the classification performance of the network, while on other data sets the improvement is negligible, and on some data sets classification performance even degrades. Experiments show that, with a VGG network as the classification model, pitch shifting and time stretching clearly help to improve classification performance on the evaluation data set of the audio scene recognition database of the international DCASE2016 challenge, but cause classification performance to degrade on the development data set.
2) Adding left-channel information or adding right-channel information is unstable across network structures: for some network structures the newly generated data helps to improve classification performance, while for others the improvement is small, and for some structures performance even degrades. Experiments show that both methods clearly help to improve classification performance on a VGG network structure, but on an LSTM network structure, adding left-channel information causes classification performance to degrade, and adding right-channel information improves it only slightly.
3) Adding left-channel or right-channel information is applicable only to multi-channel data sets; for a single-channel data set, the audio data contain no separate left- or right-channel information, so these methods cannot be applied.
Therefore, the existing data expansion methods are unstable across network structures or data sets, lack universality, and cannot reliably improve classification performance.
Disclosure of Invention
The present disclosure provides a data expansion method that mines the key audio events in audio documents based on the Latent Dirichlet Allocation (LDA) model, computes a series of probability distributions of the key audio events within each audio scene class, and finally generates new data according to the computed probability distributions.
To achieve this purpose, the present disclosure adopts the following technical scheme:
One or more embodiments provide a data expansion method for audio scene classification, comprising the following steps:
acquiring a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
extracting key audio events and non-key audio events in the training set by an LDA-based method;
separately counting the probability distributions of the key audio events for each class of audio scene;
and generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
One or more embodiments provide a data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
The present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the above method.
The present disclosure also provides a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the steps of the above-described method.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The data expansion method designed by the present disclosure generates new data from the key audio events and non-key audio events in the training set according to the probability distributions of the key audio events in the training set, thereby expanding the training set. The new data generated in this way closely simulate audio scenes in a real environment, so the generated data are highly effective, and adding them to the training set effectively improves the classification performance of the model.
(2) The data expansion method designed by the disclosure has good stability on different network structures, and the newly generated data can well help to improve the classification performance of the network for different network structures.
(3) The data expansion method designed by the disclosure has good stability on different data sets, and new data generated on different data sets can well help to improve the classification performance of the network.
(4) Unlike data expansion methods commonly used in the audio field at present, such as adding left-channel or right-channel information, the designed method can generate new data amounting to more than twice the original training data, so a user can generate as much new data as required.
(5) The LDA-based method for extracting key audio events from an audio stream has wide applicability: besides generating new data, it can also be used in many audio analysis fields such as audio topic analysis and audio summarization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure.
Fig. 1 is a flowchart of a data augmentation method for audio scene classification according to embodiment 1 of the present disclosure;
fig. 2 is a process diagram of extracting key audio events and non-key audio events according to embodiment 1 of the present disclosure.
Detailed Description
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof. It should be noted that, in the case of no conflict, the embodiments in the present disclosure and the features in the embodiments may be combined with each other. The embodiments are described in detail below with reference to the accompanying drawings.
Example 1
In one or more embodiments, as shown in fig. 1, a data expansion method for audio scene classification includes the following steps:
s1, acquiring a training set; the training set comprises a plurality of audio documents and audio scenes to which the audio documents belong; data collected by the audio data collector may be used.
S2, extracting key audio events and non-key audio events in the training set by adopting an LDA-based method;
s3, respectively counting the probability distribution of the key audio events aiming at each type of audio scene;
and S4, generating a new audio document according to the probability distribution of the key audio events, namely new data.
The above steps are specifically described below.
1. Extracting key and non-key audio events
A process diagram for extracting key audio events and non-key audio events is shown in fig. 2; the process mainly includes the following steps: (1) generating audio words; (2) extracting the topic distributions of the audio words and of the audio documents based on LDA; (3) extracting key audio events and non-key audio events.
(1) Generating audio words
In some embodiments, optionally, the step of generating audio words comprises:
101) performing framing processing on each audio document in the training set; each audio document may be framed according to a set frame length and frame shift, for example a frame length of 40 ms and a frame shift of 20 ms.
102) extracting the spectral feature data of each frame; for example, a 40-dimensional log-mel spectral feature can be extracted from each frame;
103) clustering the spectral feature data of all frames in the training set to obtain M clusters, and taking the cluster centers as audio words. The M clusters thus yield M audio words, which form an audio dictionary, and each frame is represented by the audio word corresponding to the cluster to which it belongs (see the sketch below).
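As an illustration only, this audio-word generation step might be sketched in Python as follows. The 40 ms frame length, 20 ms frame shift, 40-dimensional log-mel features, and clustering of all frames into M cluster centers follow the embodiment above; the sample rate, the value of M, the choice of k-means as the clustering algorithm, and the training_files list are assumptions, not prescribed by the disclosure.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

SR = 44100                       # assumed sample rate
FRAME = int(0.040 * SR)          # 40 ms frame length (embodiment)
SHIFT = int(0.020 * SR)          # 20 ms frame shift (embodiment)
N_MELS = 40                      # 40-dimensional log-mel feature (embodiment)
M = 500                          # number of audio words / clusters (assumed)

def frame_features(path):
    """Frame one audio document and return its log-mel features, one row per frame."""
    y, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=FRAME,
                                         hop_length=SHIFT, n_mels=N_MELS)
    return librosa.power_to_db(mel).T            # shape: (num_frames, N_MELS)

def build_audio_dictionary(training_files):
    """Cluster all frames; the cluster centers are the audio words (the dictionary)."""
    feats = [frame_features(p) for p in training_files]
    kmeans = KMeans(n_clusters=M, n_init=10).fit(np.vstack(feats))
    # each document becomes the sequence of audio-word indices of its frames
    word_seqs = [kmeans.predict(f) for f in feats]
    return kmeans, word_seqs
```

Here kmeans.cluster_centers_ plays the role of the M audio words, and word_seqs represents every training document as a sequence of audio words.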
(2) Extracting the topic distributions of the audio words and of the audio documents based on LDA
201) according to the obtained audio words, counting the occurrence times of each audio word in each audio document, and establishing an audio document-audio word co-occurrence matrix;
Optionally, a histogram of audio words may be used. Assuming there are N audio documents in the training set, each audio document consists of a sequence of audio words once the audio words have been generated. For each audio document, a histogram of its audio words is counted; stacking the audio-word histograms of all audio documents in the training set by column forms the audio document-audio word co-occurrence matrix D:

$$D = \begin{bmatrix} c^{(1)}(w_1) & \cdots & c^{(N)}(w_1) \\ \vdots & \ddots & \vdots \\ c^{(1)}(w_M) & \cdots & c^{(N)}(w_M) \end{bmatrix}$$

where d denotes the d-th audio document in the training set (d = 1, …, N), $w_i$ denotes the i-th audio word (i = 1, …, M), $c^{(d)}(w_i)$ denotes the number of occurrences of audio word $w_i$ in audio document d, and M denotes the number of audio words.
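A minimal sketch of this step, reusing the word_seqs and M from the previous sketch (both assumptions):

```python
import numpy as np

def cooccurrence_matrix(word_seqs, M):
    """Stack per-document audio-word histograms column-wise into D (M x N)."""
    N = len(word_seqs)
    D = np.zeros((M, N), dtype=np.int64)
    for d, seq in enumerate(word_seqs):
        D[:, d] = np.bincount(seq, minlength=M)   # histogram of audio words in doc d
    return D
```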
202) setting the number of topics to K and, according to the co-occurrence matrix D, obtaining through LDA learning the probability value of each audio document on each topic, thereby obtaining the topic distribution of each audio document; likewise obtaining through LDA learning, according to the co-occurrence matrix D, the probability value of each topic on each audio word; and obtaining the topic distribution of each audio word according to the probability values of the audio documents on the topics and of the topics on the audio words;
setting the number of the subjects as K, learning the probability value of each audio document on each subject and the probability value of each subject on each audio word through LDA, and respectively forming an 'audio document-subject' matrix theta and a 'subject-audio word' matrix
Figure BDA0002266047080000073
Figure BDA0002266047080000074
In which is shown
Figure BDA0002266047080000076
Representing the probability value of the audio document d on the K (K-1, …, K) -th topic, the "audio documentEach column theta of the topic matrix theta(d)(d-1, …, N) is the distribution of topics representing each audio document.
Figure BDA0002266047080000077
Indicating that the k topic is in the ith audio word wiThe probability value of (c).
Audio word wiProbability value on K (K-1, …, K) th topicCan be obtained by the following formula:
Figure BDA0002266047080000081
audio word wiThe topic distribution of (i ═ 1, …, M) isWherein the content of the first and second substances,
Figure BDA0002266047080000083
representing the ith audio word wiThe distribution of the subject matter of (a),
Figure BDA0002266047080000084
representing the ith audio word wiProbability value on K-th topic, K-1, …, K, ziRepresenting the ith audio word wiI-1, …, M indicating the number of audio words,representing the probability value of the audio document d on the kth subject, d is 1, …, N, N represents the number of audio documents in the training set,
Figure BDA0002266047080000086
indicating that the k topic is in the ith audio word wiThe probability value of (c).
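This step might be sketched with scikit-learn's LatentDirichletAllocation as below; the number of topics K is an assumption, and the normalizations merely convert the fitted parameters into the Θ, Φ, and word-topic distributions defined above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

K = 30   # number of topics (assumed)

def lda_distributions(D):
    """D: the M x N audio document-audio word co-occurrence matrix."""
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    doc_topic = lda.fit_transform(D.T)            # scikit-learn expects documents as rows
    theta = doc_topic / doc_topic.sum(axis=1, keepdims=True)            # N x K, rows = theta^(d)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # K x M, rows = topics
    # topic distribution of each audio word: the numerator sums theta over documents,
    # then the result is normalized over the K topics, as in the equation above
    word_topic = phi * theta.sum(axis=0)[:, None]                       # K x M
    word_topic /= word_topic.sum(axis=0, keepdims=True)
    return theta, phi, word_topic
```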
(3) Extracting key and non-key audio events
The method specifically comprises the following steps:
301) setting a similarity threshold, and calculating the similarity between the topic distribution of each audio word and the topic distribution of the audio document;
302) for audio document d (d = 1, …, N), taking the audio words whose similarity with the audio document is greater than the set threshold as the key audio words of the audio document, the remaining audio words being its non-key audio words;
303) for audio document d (d = 1, …, N), consecutive key audio words within it constitute key audio events, and consecutive non-key audio words within it constitute non-key audio events.
In this embodiment, optionally, for each audio document d (d = 1, …, N), the similarity $s\big(\theta^{(d)}, \hat{\theta}(w_i)\big)$ between the topic distribution $\theta^{(d)}$ of the audio document and the topic distribution $\hat{\theta}(w_i)$ of each audio word $w_i$ (i = 1, …, M) is calculated (for example, as the cosine similarity between the two distributions). Setting a similarity threshold T, the audio words satisfying $s\big(\theta^{(d)}, \hat{\theta}(w_i)\big) > T$ are taken as the key audio words of audio document d, and the audio words not satisfying it as its non-key audio words. For audio document d, consecutive key audio words therein constitute key audio events, and consecutive non-key audio words therein constitute non-key audio events.
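A sketch of this extraction under the assumptions above; cosine similarity stands in for the similarity measure, and the threshold T is an assumed tuning parameter:

```python
import numpy as np
from itertools import groupby

T = 0.5   # similarity threshold (assumed)

def extract_events(theta_d, word_topic, word_seq):
    """Split one document's audio-word sequence into key / non-key audio events.

    theta_d    : (K,)   topic distribution of the document
    word_topic : (K, M) topic distribution of every audio word
    word_seq   : audio-word index of each frame in the document
    """
    wt = word_topic[:, word_seq]                           # (K, num_frames)
    sim = (theta_d @ wt) / (np.linalg.norm(theta_d) * np.linalg.norm(wt, axis=0))
    is_key = sim > T                                       # key audio word per frame?
    events, pos = [], 0
    for key, run in groupby(is_key):                       # consecutive words -> events
        length = len(list(run))
        events.append(("key" if key else "non-key", pos, length))
        pos += length
    return events     # list of (type, occurrence position, duration) in frames
```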
2. For each type of audio scene, counting the probability distribution of key audio events
In step S3, optionally, a normalized-histogram method may be used to count the probability distributions of the key audio events, which at least include, for each class of audio scene: the probability distribution of the number of occurrences of key audio events, the probability distribution of the occurrence positions of key audio events for each occurrence count, and the probability distribution of the durations of key audio events for each occurrence count.
In some embodiments, the occurrence position of a key audio event may be the distance between the starting position of the key audio event and the starting position of the audio document in which it occurs.
In this embodiment, for each class of audio scene in the training set, the key audio event segments and non-key audio event segments in each audio document of the scene are collected; the occurrence position and duration of each key audio event in its audio document are recorded, as is the number of occurrences of key audio events in each audio document of the scene.
The probability distribution P(num) of the number of occurrences of key audio events in the same audio scene is counted by the normalized-histogram method, where num denotes the number of occurrences of key audio events.
Audio documents of the same audio scene with the same number of occurrences of key audio events are collected, and the probability distribution P(loc|num) of the occurrence positions of key audio events for each occurrence count in the same audio scene is counted by the normalized-histogram method, where loc denotes the occurrence position of a key audio event.
Similarly, over the audio documents of the same audio scene with the same number of occurrences of key audio events, the probability distribution P(dur|num) of the durations of key audio events for each occurrence count is counted by the normalized-histogram method, where dur denotes the duration of a key audio event.
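A minimal sketch of these statistics, assuming the events come from the extraction step above and that positions and durations are kept as integer frame counts (the binning granularity is an assumption):

```python
from collections import Counter, defaultdict

def scene_statistics(scene_docs):
    """scene_docs: one list per document of that scene's key events [(loc, dur), ...]."""
    num_counts = Counter(len(doc) for doc in scene_docs)        # occurrences per document
    loc_counts = defaultdict(Counter)                           # counts for P(loc|num)
    dur_counts = defaultdict(Counter)                           # counts for P(dur|num)
    for doc in scene_docs:
        for loc, dur in doc:
            loc_counts[len(doc)][loc] += 1
            dur_counts[len(doc)][dur] += 1

    def normalize(counter):                                     # normalized histogram
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    p_num = normalize(num_counts)
    p_loc = {n: normalize(c) for n, c in loc_counts.items()}
    p_dur = {n: normalize(c) for n, c in dur_counts.items()}
    return p_num, p_loc, p_dur
```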
3. For each class of audio scene, generating new data according to the probability distributions
For each class of audio scene, a new audio document is generated according to the probability distributions, specifically:
31. obtaining, by the roulette-wheel method (see the sketch after this list) according to the probability distributions, the number of occurrences of key audio events in the new audio document to be generated, together with the occurrence position and duration of each key audio event;
32. for each key audio event, selecting from the key audio events of this audio scene class the event whose duration is closest to the obtained duration, and placing it at the corresponding occurrence position in the new audio document;
33. for the interval between two key audio events in the new audio document, selecting from the non-key audio events of this class the segment whose duration is closest to the interval duration, and placing it at the interval position;
34. if the occurrence position of the first key audio event in the new audio document is not equal to 0, selecting from the non-key audio event segments of this class the segment whose duration is closest to that occurrence position value, and placing it before the first key audio event.
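The roulette-wheel method draws a value with probability proportional to its weight in the distribution; a minimal sketch over the normalized histograms returned above:

```python
import random

def roulette(dist):
    """Draw one value from a {value: probability} dict by roulette-wheel selection."""
    r = random.random()
    acc = 0.0
    for value, p in dist.items():
        acc += p              # accumulate sector widths until r falls inside one
        if r <= acc:
            return value
    return value              # guard against floating-point round-off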
In this embodiment, for each class of audio scene, a new audio document is generated from the probability distribution P(num) of the number of occurrences of key audio events in the class, the probability distribution P(loc|num) of the occurrence positions for each occurrence count, and the probability distribution P(dur|num) of the durations for each occurrence count, according to the following steps:
(1) generating the number of occurrences num_i of key audio events by the roulette-wheel method according to P(num), where num_i denotes the number of occurrences of key audio events in the i-th newly generated audio document;
(2) generating the occurrence positions loc_j (j = 1, …, num_i) of the key audio events by the roulette-wheel method according to P(loc|num_i), where loc_j denotes the occurrence position of the j-th key audio event in the newly generated audio document;
(3) generating the durations dur_j (j = 1, …, num_i) of the key audio events by the roulette-wheel method according to P(dur|num_i), where dur_j denotes the duration of the j-th key audio event in the newly generated audio document;
(4) for each key audio event j (j = 1, …, num_i), selecting from the key audio event segments within the class the segment whose duration is closest to dur_j, and placing it at position loc_j;
(5) for the interval between two adjacent key audio events, selecting from the non-key audio event segments within the class the segment whose duration is closest to the interval duration, and placing it at the interval position;
(6) for the first key audio event, if its occurrence position loc_1 is not equal to 0, selecting from the non-key audio event segments within the class the segment whose duration is closest to the value of loc_1, and placing it before the first key audio event;
(7) repeating steps (1) to (6) to generate more new audio documents.
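Putting the pieces together, a hedged sketch of steps (1)-(6): roulette and the statistics come from the sketches above, key_segments/nonkey_segments are assumed per-class pools of (duration, frames) pairs, doc_len is an assumed target document length, and closest-duration matching is used without trimming segments to exact length.

```python
def generate_document(p_num, p_loc, p_dur, key_segments, nonkey_segments, doc_len):
    """Assemble one new audio document as a concatenation of event segments."""
    num = roulette(p_num)                                      # step (1)
    locs = sorted(roulette(p_loc[num]) for _ in range(num))    # step (2)
    durs = [roulette(p_dur[num]) for _ in range(num)]          # step (3)

    def closest(segments, target):
        """Pick the segment whose duration is closest to target; return its frames."""
        return min(segments, key=lambda s: abs(s[0] - target))[1]

    doc, cursor = [], 0
    for loc, dur in zip(locs, durs):
        if loc > cursor:                                       # steps (5)/(6): fill gaps
            doc.extend(closest(nonkey_segments, loc - cursor))
        doc.extend(closest(key_segments, dur))                 # step (4)
        cursor = loc + dur
    if cursor < doc_len:                                       # trailing non-key filler
        doc.extend(closest(nonkey_segments, doc_len - cursor))
    return doc
```

Repeating this call, per step (7), yields as many new audio documents as required.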
Example 2
This embodiment provides a data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents according to the probability distributions of the key audio events, the generated new audio documents being the new data.
Example 3
This embodiment provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of embodiment 1.
Example 4
The present embodiment provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of embodiment 1.
The electronic device provided by the present disclosure may be a mobile or non-mobile terminal. Non-mobile terminals include desktop computers; mobile terminals include smart phones (such as Android phones and iOS phones), smart glasses, smart watches, smart bracelets, tablet computers, notebook computers, personal digital assistants, and other mobile internet devices capable of wireless communication.
It should be understood that in the present disclosure, the processor may be a central processing unit CPU, but may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the present disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, this is not described in detail here. Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is merely a division of one logic function, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A data expansion method for audio scene classification is characterized by comprising the following steps:
acquiring a training set; the training set comprises a plurality of audio documents and audio scenes to which the audio documents belong;
extracting key audio events and non-key audio events in a training set by adopting an LDA-based method;
separately counting the probability distributions of the key audio events for each class of audio scene;
and generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
2. The data expansion method for audio scene classification according to claim 1, wherein extracting key audio events and non-key audio events in the training set by an LDA-based method comprises the following steps:
generating audio words according to the audio data in the training set;
according to the obtained audio words, counting the number of occurrences of each audio word in each audio document, and establishing an audio document-audio word co-occurrence matrix;
setting the number of topics, and obtaining, by LDA learning according to the co-occurrence matrix, the probability value of each audio document on each topic, thereby obtaining the topic distribution of each audio document; obtaining, by LDA learning according to the co-occurrence matrix, the probability value of each topic on each audio word; and obtaining the topic distribution of each audio word from the probability values of the audio documents on the topics and of the topics on the audio words;
and obtaining the key audio events and non-key audio events in the training set by calculating the similarity between the topic distributions of the audio words and the topic distributions of the audio documents.
3. The data expansion method for audio scene classification according to claim 2, wherein generating audio words according to the audio data in the training set specifically comprises:
performing framing processing on each audio document in the training set;
extracting the frequency spectrum characteristic data of each frame;
clustering the spectral feature data of all frames in the training set to obtain M clusters, taking the cluster centers as audio words so that the M clusters yield M audio words forming an audio dictionary, and representing each frame by the audio word corresponding to the cluster to which it belongs;
or
obtaining the topic distribution of each audio word according to the probability value of each audio document on each topic and the probability value of each topic on each audio word, specifically:

$$p(z_i = k) = \frac{\sum_{d=1}^{N} \theta_k^{(d)}\,\varphi_k(w_i)}{\sum_{k'=1}^{K} \sum_{d=1}^{N} \theta_{k'}^{(d)}\,\varphi_{k'}(w_i)}$$

wherein the topic distribution of the i-th audio word $w_i$ is $\hat{\theta}(w_i) = (p(z_i = 1), \ldots, p(z_i = K))$; $p(z_i = k)$ denotes the probability value of $w_i$ on the k-th topic, k = 1, …, K; $z_i$ denotes the topic of the i-th audio word $w_i$; i = 1, …, M, with M denoting the number of audio words; $\theta_k^{(d)}$ denotes the probability value of audio document d on the k-th topic, d = 1, …, N, with N denoting the number of audio documents in the training set; and $\varphi_k(w_i)$ denotes the probability value of the k-th topic on the i-th audio word $w_i$.
4. The data expansion method for audio scene classification according to claim 2, wherein obtaining the key audio events and non-key audio events in the training set by calculating the similarity between the topic distributions of the audio words and the topic distributions of the audio documents comprises the following steps:
setting a similarity threshold, and calculating the similarity between the topic distribution of each audio word and the topic distribution of the audio document;
for audio document d, taking the audio words whose similarity is greater than the set threshold as the key audio words of the audio document, the remaining audio words being its non-key audio words;
for audio document d, consecutive key audio words in audio document d constitute key audio events, and consecutive non-key audio words in audio document d constitute non-key audio events.
5. The data expansion method for audio scene classification according to claim 1, wherein, for each class of audio scene, the probability distributions of the key audio events are counted separately by a normalized-histogram method.
6. The data expansion method for audio scene classification according to claim 1, wherein the probability distributions of the key audio events at least comprise, for each class of audio scene: the probability distribution of the number of occurrences of key audio events, the probability distribution of the occurrence positions of key audio events for each occurrence count, and the probability distribution of the durations of key audio events for each occurrence count.
7. The data expansion method for audio scene classification according to claim 1, wherein, for each class of audio scene, generating a new audio document according to the probability distributions of the key audio events specifically comprises:
obtaining, by the roulette-wheel method according to the probability distributions of the key audio events, the number of occurrences of key audio events in the new audio document to be generated, together with the occurrence position and duration of each key audio event;
for each key audio event, selecting from the key audio events of this audio scene class the event whose duration is closest to the obtained duration, and placing it at the corresponding occurrence position in the new audio document;
for the interval between two key audio events in the new audio document, selecting from the non-key audio events of this class the segment whose duration is closest to the interval duration, and placing it at the interval position;
and if the occurrence position of the first key audio event in the new audio document is not equal to 0, selecting from the non-key audio event segments of this class the segment whose duration is closest to that occurrence position value, and placing it before the first key audio event.
8. A data expansion apparatus for audio scene classification, comprising:
an acquisition module: for obtaining a training set, the training set comprising a plurality of audio documents and the audio scenes to which the audio documents belong;
a first extraction module: for extracting key audio events and non-key audio events in the training set by an LDA-based method;
a probability distribution statistics module: for separately counting the probability distributions of the key audio events for each class of audio scene;
a new data generation module: for generating new audio documents, i.e., the new data, according to the probability distributions of the key audio events.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201911088131.9A 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification Active CN110853672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088131.9A CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911088131.9A CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Publications (2)

Publication Number Publication Date
CN110853672A true CN110853672A (en) 2020-02-28
CN110853672B CN110853672B (en) 2022-04-01

Family

ID=69600163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088131.9A Active CN110853672B (en) 2019-11-08 2019-11-08 Data expansion method and device for audio scene classification

Country Status (1)

Country Link
CN (1) CN110853672B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094233A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Modeling Topics Using Statistical Distributions
CN104269169A (en) * 2014-09-09 2015-01-07 山东师范大学 Classifying method for aliasing audio events
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model
CN106205609A (en) * 2016-07-05 2016-12-07 山东师范大学 A kind of based on audio event and the audio scene recognition method of topic model and device
CN106844424A (en) * 2016-12-09 2017-06-13 宁波大学 A kind of file classification method based on LDA

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750462A (en) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 Audio processing method, device and equipment
CN112530399A (en) * 2020-11-30 2021-03-19 上海明略人工智能(集团)有限公司 Method and system for expanding voice data, electronic equipment and storage medium
CN113220932A (en) * 2021-04-02 2021-08-06 山东师范大学 Audio scene recognition method and device based on multitask learning
CN113220932B (en) * 2021-04-02 2022-06-10 山东师范大学 Audio scene recognition method and device based on multitask learning
CN117373488A (en) * 2023-12-08 2024-01-09 富迪科技(南京)有限公司 Audio real-time scene recognition system
CN117373488B (en) * 2023-12-08 2024-02-13 富迪科技(南京)有限公司 Audio real-time scene recognition system

Also Published As

Publication number Publication date
CN110853672B (en) 2022-04-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant