CN114882909A - Environmental sound classification analysis method, device and medium - Google Patents

Environmental sound classification analysis method, device and medium

Info

Publication number
CN114882909A
Authority
CN
China
Prior art keywords
data
environmental sound
sound
mel
environmental
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210403964.5A
Other languages
Chinese (zh)
Inventor
刘立峰
宋卫华
冯志峰
母健康
王文重
张建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Comleader Information Technology Co Ltd
Original Assignee
Zhuhai Comleader Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Comleader Information Technology Co Ltd filed Critical Zhuhai Comleader Information Technology Co Ltd
Priority to CN202210403964.5A priority Critical patent/CN114882909A/en
Publication of CN114882909A publication Critical patent/CN114882909A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324: Details of processing therefor
    • G10L21/0332: Details of processing therefor involving modification of waveforms

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The application discloses an environmental sound classification analysis method, device and medium. The method comprises: performing data enhancement on voice training data; performing data preprocessing and feature extraction on the environmental sound to obtain a feature vector; and performing model training on the feature vector with a deep CNN network to obtain and output a multi-classification model of the environmental sound. Experiments show that, compared with training without data enhancement, the data enhancement approach greatly improves the training effect and increases the practical value of the system.

Description

Environmental sound classification analysis method, device and medium
Technical Field
The present application relates to the field of sound classification, and in particular, to a method, an apparatus, and a medium for classifying and analyzing environmental sounds.
Background
Environmental sound classification can be applied to classifying different musical instruments, robot navigation, medical problems, customer or buyer reminders, crime warning systems, voice activity recognition, audio-based disaster recognition, environmental monitoring, and so on. The number of applications that involve sound classification indicates its importance. Sound classification identifies the sound category of a short audio clip or recording by analyzing in detail the information carried by the received audio signal. Recognizing the environment from the ambient sound and taking immediate action to reduce risk is therefore important.
The environmental sound classification techniques adopted in the related art are mostly realized by extracting the MFCC of the environmental sound and applying a machine learning classification method. This approach is relatively simple, and because environmental sound contains a large amount of noise, the accuracy of its classification analysis is low.
Therefore, the above technical problems of the related art need to be solved.
Disclosure of Invention
The present application is directed to solving one of the technical problems in the related art. Therefore, the embodiment of the application provides a method, a device and a medium for classifying and analyzing environmental sounds, which can classify and analyze the environmental sounds more accurately.
According to an aspect of the embodiments of the present application, there is provided an ambient sound classification analysis method, including:
performing data enhancement on voice training data;
carrying out data preprocessing and carrying out feature extraction on environmental sound to obtain a feature vector;
and performing model training on the feature vector by adopting a deep CNN network to obtain and output a multi-classification model of the environmental sound.
In one embodiment, the data enhancement of the speech training data at least includes:
positive pitch shifting: increasing the pitch of each audio signal in the environmental sound data set by a positive factor;
negative pitch shifting: shifting the pitch of each audio signal in the environmental sound data set by a negative factor;
silence pruning: pruning the silent portions of the audio clip, leaving only the portions containing sound;
fast time stretching: stretching the time of each sound clip of the data set by a factor of 2;
slow time stretching: stretching each sound clip of the data set to 0.7 times the original time;
white noise addition: adding white noise to the data set of the environmental sound.
In one embodiment, the data preprocessing includes adding endpoint detection and de-muting functionality.
In one embodiment, the feature extraction of the environmental sound includes:
and (4) extracting the features of the environmental sound by adopting a Log-MEL feature extraction method.
In one embodiment, after the feature extraction is performed on the environmental sound by using the Log-MEL feature extraction method, the method further includes:
obtaining a Mel frequency spectrum graph and a Mel cepstrum coefficient;
and performing feature fusion on the Mel frequency spectrum diagram and the Mel cepstrum coefficient.
In one embodiment, after obtaining the mel-frequency spectrogram and the mel-frequency cepstral coefficients, the method further comprises:
and inputting the Mel frequency spectrogram and the Mel cepstrum coefficient into a deep CNN network for model training.
In one embodiment, before the data enhancement of the speech training data, the method further comprises:
acquiring environmental sound;
and filtering the environmental sound to filter environmental noise.
According to an aspect of the embodiments of the present application, there is provided an ambient sound classification analysis apparatus, the apparatus including:
the data enhancement module is used for enhancing the data of the voice training data;
the feature extraction module is used for preprocessing data and extracting features of environmental sound to obtain feature vectors;
and the training classification module is used for performing model training on the feature vectors by adopting a deep CNN network to obtain and output a multi-classification model of the environmental sound.
According to an aspect of the embodiments of the present application, there is provided an ambient sound classification analysis apparatus, the apparatus including:
at least one processor;
at least one memory for storing at least one program;
at least one of the programs, when executed by at least one of the processors, implements an ambient sound classification analysis method as described in the previous embodiments.
According to an aspect of the embodiments of the present application, there is provided a medium storing a program executable by a processor, wherein the program executable by the processor implements an environmental sound classification analysis method according to the foregoing embodiments.
The environmental sound classification analysis method, device and medium of the present application have the following beneficial effects. The method comprises: performing data enhancement on voice training data; performing data preprocessing and feature extraction on the environmental sound to obtain a feature vector; and performing model training on the feature vector with a deep CNN network to obtain and output a multi-classification model of the environmental sound. Experiments show that, compared with training without data enhancement, the data enhancement approach greatly improves the training effect and increases the practical value of the system.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of an environmental sound classification analysis method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a working process of an environmental sound classification analysis method according to an embodiment of the present application;
fig. 3 is a schematic view of an environmental sound classification analysis apparatus according to an embodiment of the present disclosure;
fig. 4 is another schematic diagram of an environmental sound classification analysis apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Environmental sound classification can be applied to classifying different musical instruments, robot navigation, medical problems, customer or buyer reminders, crime warning systems, voice activity recognition, audio-based disaster recognition, environmental monitoring, and so on. The number of applications that involve sound classification indicates its importance. Sound classification identifies the sound category of a short audio clip or recording by analyzing in detail the information carried by the received audio signal. Recognizing the environment from the ambient sound and taking immediate action to reduce risk is therefore important.
The environmental sound classification techniques adopted in the related art are mostly realized by extracting the MFCC of the environmental sound and applying a machine learning classification method. This approach is relatively simple, and because environmental sound contains a large amount of noise, the accuracy of its classification analysis is low.
In order to solve the above problem, the present application provides an environmental sound classification analysis method.
For ease of understanding, terms that may appear in this specification are explained as follows:
deep learning: deep learning is one of machine learning, and machine learning is a must-pass path for implementing artificial intelligence. The concept of deep learning is derived from the research of artificial neural networks, and a multi-layer perceptron comprising a plurality of hidden layers is a deep learning structure. Deep learning forms a more abstract class or feature of high-level representation properties by combining low-level features to discover a distributed feature representation of the data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analytical learning, which mimics the mechanism of the human brain to interpret data such as images, sounds, text, and the like.
Speech processing: speech processing is an important research direction in computer science and artificial intelligence. It studies how computers can process speech so that people and computers can communicate effectively. Speech processing is mainly applied to speech classification, noise detection, speech recognition and the like.
Mel-frequency cepstrum (MFCC): in signal processing, the mel-frequency cepstrum is a spectrum that can represent short-term audio; its principle is a logarithmic spectrum expressed on the nonlinear mel scale followed by a linear cosine transform. The mel-frequency cepstral coefficients are the set of key coefficients that make up the mel-frequency cepstrum: from segments of a music signal, a set of cepstra sufficient to represent the signal is obtained, and the mel-frequency cepstral coefficients are derived from this cepstral representation.
CNN: convolutional neural networks (CNN or ConvNet) are a class of deep neural networks most commonly used for analyzing visual images. CNNs use a variant design of the multilayer perceptron, requiring minimal preprocessing. They are also known as shift-invariant or space-invariant artificial neural networks (SIANN), based on their shared weight architecture and shift invariance characteristics.
Fig. 1 is a flowchart of an environmental sound classification analysis method according to an embodiment of the present application, and as shown in fig. 1, the environmental sound classification analysis method provided by the present application includes:
s101, performing data enhancement on voice training data.
In step S101, the data enhancement of the speech training data at least includes: positive pitch shifting, increasing the pitch of each audio signal in the environmental sound data set by a positive factor; negative pitch shifting, shifting the pitch of each audio signal by a negative factor; silence pruning, pruning the silent portions of the audio clip and leaving only the portions containing sound; fast time stretching, stretching the time of each sound clip of the data set by a factor of 2; slow time stretching, stretching each sound clip of the data set to 0.7 times the original time; and white noise addition, adding white noise to the environmental sound data set.
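For illustration only, the six enhancement operations could be realized in Python with the librosa library roughly as sketched below. The shift of plus or minus 2 semitones, the noise amplitude of 0.005 and the mapping of the factors 2 and 0.7 onto librosa's rate parameter are assumptions made for the sketch; the embodiment does not fix these values.

```python
import numpy as np
import librosa

def augment_clip(y, sr):
    """Return augmented variants of one environmental sound clip (a sketch;
    the concrete factors below are illustrative assumptions)."""
    out = {}
    # Positive / negative pitch shifting (here +/- 2 semitones).
    out["pitch_up"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    out["pitch_down"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2)
    # Silence pruning: trim leading and trailing silent parts.
    out["trimmed"], _ = librosa.effects.trim(y, top_db=30)
    # Fast / slow time stretching (in librosa, rate > 1 speeds the clip up,
    # rate < 1 slows it down).
    out["fast"] = librosa.effects.time_stretch(y, rate=2.0)
    out["slow"] = librosa.effects.time_stretch(y, rate=0.7)
    # White noise addition at a small amplitude.
    out["noisy"] = y + 0.005 * np.random.randn(len(y))
    return out
```

Each variant would be written back to the training set alongside the original clip before feature extraction.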
S102, performing data preprocessing and extracting the features of the environmental sound to obtain a feature vector.
The data preprocessing in step S102 includes adding endpoint detection and de-muting functions. The endpoint detection function checks the integrity and correctness of the endpoints of the audio data, prevents incomplete audio data or segments, and notifies technicians in time to repair an endpoint when an abnormal endpoint is detected.
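As a non-limiting sketch of the de-muting part of this preprocessing, an energy-based splitter can drop the silent intervals of a clip; the 30 dB threshold below is an assumed value, not one specified by the embodiment.

```python
import numpy as np
import librosa

def remove_silence(y, top_db=30):
    """Energy-based endpoint detection: keep only the intervals that contain
    sound. An entirely silent clip is flagged so that it can be reviewed."""
    intervals = librosa.effects.split(y, top_db=top_db)  # (start, end) sample pairs
    if len(intervals) == 0:
        return None  # abnormal endpoint: nothing but silence was found
    return np.concatenate([y[start:end] for start, end in intervals])
```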
The feature extraction of the environmental sound is specifically as follows: the features of the environmental sound are extracted with a Log-MEL feature extraction method. If the obtained audio data were recognized automatically without this step, the result would be very poor, because the audio contains a lot of noise and the effective data needed in the audio are not emphasized. MEL feature extraction can extract the effective information in the audio data and filter out useless information; the Log-MEL feature extraction method is based on the principle of simulating the structure of the human ear and filters the audio, so the automatic recognition effect on the processed data is noticeably improved.
Further, after the features of the environmental sound are extracted with the Log-MEL feature extraction method, the method further includes: obtaining a Mel spectrogram and Mel cepstral coefficients, and performing feature fusion on the Mel spectrogram and the Mel cepstral coefficients. After the Mel spectrogram and the Mel cepstral coefficients are obtained, they can be input into a deep CNN network for model training.
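A minimal sketch of this extraction and fusion step is given below. The frame length, hop length, number of Mel bands and number of cepstral coefficients are common defaults assumed for the sketch, and the fusion is shown as a simple stacking of the two feature matrices along the feature axis; the embodiment does not prescribe these details.

```python
import numpy as np
import librosa

def extract_features(y, sr, n_mels=128, n_mfcc=40):
    """Compute the Log-Mel spectrogram and the MFCCs of one clip and fuse
    them by stacking along the feature axis."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                     # Log-MEL feature
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)  # MFCC from the Log-Mel bands
    return np.concatenate([log_mel, mfcc], axis=0)         # shape: (n_mels + n_mfcc, frames)
```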
S103, performing model training on the feature vectors by adopting a deep CNN network to obtain and output a multi-classification model of the environmental sound.
It should be noted that environmental sound often contains a lot of natural noise. If the naturally occurring noise is not filtered first, the accuracy of the final environmental sound classification is affected and misjudgment easily occurs. Therefore, before the data enhancement of the voice training data of this embodiment, the method further includes: acquiring the environmental sound, and filtering the environmental sound to remove environmental noise.
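The embodiment does not name the filter used for this preliminary noise filtering. One plausible choice, shown purely as an assumption, is a zero-phase high-pass Butterworth filter that removes low-frequency rumble before data enhancement.

```python
from scipy.signal import butter, sosfiltfilt

def prefilter(y, sr, cutoff_hz=60.0, order=4):
    """Suppress low-frequency environmental rumble with a zero-phase
    high-pass Butterworth filter (an assumed choice, not the claimed one)."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, y)
```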
With the data enhancement method adopted here, experiments show that the training effect is greatly improved compared with training without data enhancement, which increases the practical value of the system.
Fig. 2 is a schematic diagram of the working process of an environmental sound classification analysis method according to an embodiment of the present application. As shown in Fig. 2, the working process mainly includes: receiving audio data through an external device; performing data enhancement on the received audio data, including but not limited to pitch raising, pitch lowering, silence clipping, fast time stretching, slow time stretching and white noise addition; hashing the enhanced data; and inputting the processed audio data into the audio feature extraction module. The audio feature extraction mainly extracts the Mel spectrogram features and the MFCC of the audio data, then fuses these features and inputs them into a deep CNN model for training. The trained CNN model can finally be applied to the classification and analysis of environmental sounds in a variety of different scenes.
In this embodiment, the specific steps of calculating the MFCC are: framing the signal into short frames; for each frame, calculating a periodogram estimate of the power spectrum; applying a Mel filter bank to the power spectra and summing the energy in each filter; taking the logarithm of all filter-bank energies; taking the DCT of the log filter-bank energies; and keeping DCT coefficients 2-13 while discarding the rest.
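These steps can be followed literally, for example as in the Python sketch below; the 25 ms frame length, 10 ms hop and 26 Mel filters are common defaults rather than values given in this embodiment.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_by_steps(y, sr, frame_len=0.025, hop=0.010, n_mels=26):
    """MFCCs computed step by step as listed above."""
    n_fft = int(frame_len * sr)
    hop_length = int(hop * sr)
    # 1) Frame the signal and estimate the power spectrum of each frame (periodogram).
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    # 2) Apply a Mel filter bank and sum the energy in each filter.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = mel_fb @ power
    # 3) Take the logarithm of all filter-bank energies.
    log_energy = np.log(mel_energy + 1e-10)
    # 4) Take the DCT of the log filter-bank energies.
    coeffs = dct(log_energy, axis=0, norm="ortho")
    # 5) Keep DCT coefficients 2-13 and discard the rest (zero-based slice 1:13).
    return coeffs[1:13]
```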
This embodiment considers the implementation of three audio feature extraction techniques, namely the Mel spectrogram (Mel), the logarithmic Mel (Log-Mel) and the Mel frequency cepstral coefficients (MFCC), and also proposes offline audio-based enhancement and L2 regularization of the transformed data set to reduce the risk of overfitting caused by insufficient data. The same models with the same audio feature extraction techniques are used on these enhanced data sets. A deep CNN that uses Log-Mel features and does not involve any max-pooling function (Model-2), trained on the enhanced data set, can achieve excellent results in the environmental sound classification task.
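The exact configuration of the deep CNN ("Model-2") is not published in this application. The Keras sketch below only illustrates the stated ingredients: Log-Mel input, a deep CNN with no max-pooling operation (strided convolutions reduce the resolution instead), and L2 regularization. The layer counts, filter sizes and the ten-class output are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(n_mels=128, n_frames=128, n_classes=10, weight_decay=1e-3):
    """Deep CNN over Log-Mel inputs without max pooling, with L2 regularization."""
    reg = regularizers.l2(weight_decay)
    model = tf.keras.Sequential([
        layers.Input(shape=(n_mels, n_frames, 1)),           # Log-Mel "image"
        layers.Conv2D(32, 3, strides=2, padding="same", activation="relu",
                      kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.Conv2D(64, 3, strides=2, padding="same", activation="relu",
                      kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.Conv2D(128, 3, strides=2, padding="same", activation="relu",
                      kernel_regularizer=reg),
        layers.GlobalAveragePooling2D(),                      # no max pooling anywhere
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax", kernel_regularizer=reg),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then amount to calling model.fit on the fused feature tensors produced from the augmented data set.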
Fig. 3 is a schematic view of an ambient sound classification analysis apparatus according to an embodiment of the present application, and the present application further provides an ambient sound classification analysis apparatus, as shown in fig. 3, the ambient sound classification analysis apparatus provided by the present application includes:
a data enhancement module 301, configured to perform data enhancement on the speech training data.
The feature extraction module 302 is configured to perform data preprocessing and feature extraction on the environmental sound to obtain a feature vector.
And the training classification module 303 is configured to perform model training on the feature vector by using a deep CNN network, obtain a multi-classification model of the environmental sound, and output the multi-classification model.
Fig. 4 is another schematic diagram of an ambient sound classification analysis apparatus according to an embodiment of the present application, and as shown in fig. 4, the present application further provides an ambient sound classification analysis apparatus, including:
at least one processor 401;
at least one memory 402, the memory 402 for storing at least one program;
when at least one of the programs is executed by at least one of the processors 401, an ambient sound classification analysis method as described in the previous embodiments is implemented.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
In addition, the present application also provides a medium storing a program executable by a processor, and the program executable by the processor realizes an environmental sound classification analysis method according to the foregoing embodiment when being executed by the processor.
Similarly, the contents in the foregoing method embodiments are all applicable to this storage medium embodiment, the functions specifically implemented by this storage medium embodiment are the same as those in the foregoing method embodiments, and the advantageous effects achieved by this storage medium embodiment are also the same as those achieved by the foregoing method embodiments.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present application is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion regarding the actual implementation of each module is not necessary for an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the present application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the application, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: numerous changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An environmental sound classification analysis method, characterized in that the method comprises:
performing data enhancement on voice training data;
carrying out data preprocessing and carrying out feature extraction on environmental sound to obtain a feature vector;
and performing model training on the feature vector by adopting a deep CNN network to obtain and output a multi-classification model of the environmental sound.
2. The method of claim 1, wherein the data enhancement of the speech training data at least comprises:
positive pitch shifting: increasing the pitch of each audio signal in the environmental sound data set by a positive factor;
negative pitch shifting: shifting the pitch of each audio signal in the environmental sound data set by a negative factor;
silence pruning: pruning the silent portions of the audio clip, leaving only the portions containing sound;
fast time stretching: stretching the time of each sound clip of the data set by a factor of 2;
slow time stretching: stretching each sound clip of the data set to 0.7 times the original time;
white noise addition: adding white noise to the data set of the environmental sound.
3. The method of claim 1, wherein the data preprocessing comprises adding endpoint detection and de-muting functionality.
4. The method for classifying and analyzing environmental sound according to claim 1, wherein the extracting the features of the environmental sound comprises:
and (4) extracting the features of the environmental sound by adopting a Log-MEL feature extraction method.
5. The method as claimed in claim 4, wherein after the features of the environmental sound are extracted by the Log-MEL feature extraction method, the method further comprises:
obtaining a Mel frequency spectrum graph and a Mel cepstrum coefficient;
and performing feature fusion on the Mel frequency spectrum diagram and the Mel cepstrum coefficient.
6. The method of claim 5, wherein after obtaining the Mel frequency spectrogram and the Mel cepstrum coefficients, the method further comprises:
and inputting the Mel frequency spectrogram and the Mel cepstrum coefficient into a deep CNN network for model training.
7. The method of claim 1, wherein before the data enhancement of the voice training data, the method further comprises:
acquiring environmental sound;
and filtering the environmental sound to filter environmental noise.
8. An ambient sound classification analysis apparatus, characterized in that the apparatus comprises:
the data enhancement module is used for enhancing the data of the voice training data;
the feature extraction module is used for preprocessing data and extracting features of environmental sound to obtain feature vectors;
and the training classification module is used for performing model training on the feature vectors by adopting a deep CNN network to obtain and output a multi-classification model of the environmental sound.
9. An ambient sound classification analysis apparatus, characterized in that the apparatus comprises:
at least one processor;
at least one memory for storing at least one program;
wherein, when at least one of the programs is executed by at least one of the processors, an ambient sound classification analysis method according to any one of claims 1-7 is implemented.
10. A medium storing a program executable by a processor, the program being executed by the processor to implement an ambient sound classification analysis method according to any one of claims 1 to 7.
CN202210403964.5A 2022-04-18 2022-04-18 Environmental sound classification analysis method, device and medium Pending CN114882909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210403964.5A CN114882909A (en) 2022-04-18 2022-04-18 Environmental sound classification analysis method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210403964.5A CN114882909A (en) 2022-04-18 2022-04-18 Environmental sound classification analysis method, device and medium

Publications (1)

Publication Number Publication Date
CN114882909A true CN114882909A (en) 2022-08-09

Family

ID=82670141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210403964.5A Pending CN114882909A (en) 2022-04-18 2022-04-18 Environmental sound classification analysis method, device and medium

Country Status (1)

Country Link
CN (1) CN114882909A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030800A (en) * 2023-03-30 2023-04-28 南昌航天广信科技有限责任公司 Audio classification recognition method, system, computer and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination