CN116092517A - Audio detection method, audio detection device and computer storage medium - Google Patents

Audio detection method, audio detection device and computer storage medium

Info

Publication number
CN116092517A
Authority
CN
China
Prior art keywords
audio
detection
detected
audio detection
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211734814.9A
Other languages
Chinese (zh)
Inventor
方瑞东
杜海云
吴人杰
史巍
林聚财
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211734814.9A
Publication of CN116092517A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The application discloses an audio detection method, an audio detection device and a computer storage medium. The audio detection method comprises: acquiring audio to be detected; acquiring, by using a feature analysis model, acoustic features in the audio to be detected that belong to a positive sample; and performing classification detection on the acoustic features by using an audio detection model, and outputting the detection type of the audio to be detected based on the classification detection result. In this way, the method and the device can pre-screen the extracted acoustic features of the audio to be detected with the feature analysis model before sending them into the audio detection model for detection, which reduces the computation of the audio detection model, lowers the training complexity of the whole audio detection network structure, and shortens the overall running time of the audio detection algorithm.

Description

Audio detection method, audio detection device and computer storage medium
Technical Field
The present disclosure relates to the field of audio processing, and in particular, to an audio detection method, an audio detection apparatus, and a computer storage medium.
Background
With the continuous development of intelligent acoustic technology, the demand of related technologies for audio event detection keeps increasing. Audio event detection mainly judges whether an event has occurred according to the detected audio signal strength: if the detected signal strength is higher than a set threshold, the algorithm judges that the event has occurred and then prompts the user accordingly. The detection of audio events makes people's lives more convenient and efficient.
In one application scenario, infant crying needs to be detected; when infant crying is detected, warning information is sent to a user so that the user can notice the abnormal state of the infant in time. In an actual scene, however, environmental factors are usually complex and changeable, and besides the sound event to be detected, the surrounding environment typically contains various noise interferences, so the overall audio detection takes too long and costs too much.
Disclosure of Invention
The application mainly solves the technical problem of how to reduce the time consumption of audio detection. To this end, the application provides an audio detection method, an audio detection device and a computer storage medium.
In order to solve the above technical problem, one technical solution adopted by the application is as follows: an audio detection method is provided, the method comprising: acquiring audio to be detected; acquiring, by using a feature analysis model, acoustic features in the audio to be detected that belong to a positive sample; and performing classification detection on the acoustic features by using an audio detection model, and outputting the detection type of the audio to be detected based on the classification detection result.
The classification detection result includes a probability value of being detected as a positive sample; if the probability value is higher than a first preset threshold, the detection type of the audio to be detected is output as a positive sample.
The classification detection result further includes the number of times the audio is detected as a positive sample; if the number of detections and the probability value are higher than a second preset threshold and the first preset threshold respectively, the detection type of the audio to be detected is output as a positive sample.
Before the acoustic features belonging to the positive sample in the audio to be detected are acquired by using the feature analysis model, the method further comprises: extracting the audio features of the audio to be detected by using a feature extraction model.
Extracting the audio features of the audio to be detected by using the feature extraction model comprises: performing frequency-domain and/or cepstral-domain conversion on the audio signal of the audio to be detected by using the feature extraction model to obtain the audio features.
The audio features may be one or more of the logarithmic mel spectrum, mel cepstral coefficients, filter bank features, perceptual linear prediction and channel energy normalization features.
Acquiring the acoustic features belonging to the positive sample in the audio to be detected by using the feature analysis model comprises: obtaining, by the feature analysis model, parameters of a probability density function by using a maximum likelihood estimation method based on a preset positive sample; obtaining the probability density function by training based on the parameters; and obtaining the acoustic features belonging to the positive sample in the audio to be detected by using the probability density function.
The network structure of the audio detection model comprises a grouped convolution layer and a residual layer.
In order to solve the technical problems, another technical scheme adopted by the application is as follows: there is provided an audio detection device comprising a processor and a memory coupled to the processor, the memory storing program data, the processor being configured to execute the program data to implement an audio detection method as described above.
In order to solve the technical problems, another technical scheme adopted by the application is as follows: there is provided a computer readable storage medium storing program data which, when executed, is adapted to carry out the above-described audio detection method.
The beneficial effects of the application are as follows: unlike the prior art, the audio detection method provided by the application is applied to an audio detection device. The audio detection device acquires audio to be detected; acquires, by using a feature analysis model, the acoustic features in the audio to be detected that belong to a positive sample; and performs classification detection on the acoustic features by using an audio detection model, outputting the detection type of the audio to be detected based on the classification detection result. In this way, compared with a conventional audio detection method, the feature analysis model in the audio detection device pre-screens the acoustic features of the audio to be detected, removes the data most likely to be negative samples so that they are not sent into the audio detection model for classification, and sends only the data belonging to positive samples into the audio detection model for further prediction. This saves the computation time of the audio detection model, improves computational efficiency, improves the overall efficiency of audio event prediction, and reduces the waste of time and cost in the detection process.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
Wherein:
fig. 1 is a schematic flow chart of a first embodiment of an audio detection method provided in the present application;
fig. 2 is a schematic flow chart of an audio detection method applied to an audio detection device;
fig. 3 is a schematic structural diagram of an audio detection device provided in the present application;
fig. 4 is a schematic flow chart of a second embodiment of an audio detection method provided in the present application;
fig. 5 is a schematic structural diagram of a first embodiment of an audio detection device provided in the present application;
fig. 6 is a schematic structural diagram of a second embodiment of an audio detection device provided in the present application;
fig. 7 is a schematic structural diagram of an embodiment of a computer readable storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
The audio detection method of the present application is mainly applied to an audio detection device, where the audio detection device may be a server, or a system formed by a server and a terminal device cooperating with each other. Accordingly, each part included in the audio detection device, such as each unit, sub-unit, module and sub-module, may all be disposed in the server, or may be disposed in the server and the terminal device respectively.
Further, the server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules, for example, software or software modules for providing a distributed server, or may be implemented as a single software or software module, which is not specifically limited herein. In some possible implementations, the audio detection method of the embodiments of the present application may be implemented by a processor invoking computer readable instructions stored in a memory.
The audio detection method of the present application is mainly applied to the detection of audio events: whether an event has occurred is judged according to the detected audio signal strength, and if the detected signal strength is higher than a set threshold, the event is judged to have occurred and the user is prompted accordingly. In an actual scene, environmental factors are usually complex and changeable, and besides the sound event to be detected, the surrounding environment typically contains various noise interferences, so suitable audio signal features need to be extracted to reduce the cost of audio event detection.
Referring to fig. 1 to 3, fig. 1 is a schematic flow chart of a first embodiment of an audio detection method provided in the present application, fig. 2 is a schematic flow chart of an audio detection method applied to an audio detection device provided in the present application, and fig. 3 is a schematic structural diagram of an audio detection device provided in the present application.
Step 11: acquiring the audio to be detected.
Specifically, the audio to be detected may be recorded in advance by a recording device, or recorded in real time by a recording device. The recording device may be any device with recording and/or video-recording functions, such as a mobile phone, a microphone or a tablet, which is not limited here. Any application and/or recording equipment with a recording function, such as an audio recorder, a video recorder or instant messaging software (such as WeChat, QQ, etc.), may acquire audio data and record it to obtain the audio to be detected.
Specifically, after the audio detection device obtains the audio to be detected, it further performs labeling, framing and other processing on the audio to be detected, so as to segment the audio data of indefinite length into small segments of fixed length and label each small segment. Framing is required because audio is a long-duration non-stationary sequence; framing makes the non-stationary audio approximately stationary over a short time, so that the subsequent audio detection device can obtain relatively stable feature parameters.
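For illustration only, the following is a minimal Python sketch of such fixed-length framing; it is not part of the original disclosure, and the 1-second segment length, 50 % overlap and 16 kHz sampling rate are assumed values, since the patent does not specify them.

```python
import numpy as np

def frame_audio(signal: np.ndarray, frame_len: int, hop_len: int) -> np.ndarray:
    """Split a 1-D waveform of indefinite length into fixed-length segments.

    The tail is zero-padded so every segment has the same length, which keeps
    the subsequent feature extraction uniform.
    """
    n_frames = max(1, 1 + int(np.ceil((len(signal) - frame_len) / hop_len)))
    padded_len = (n_frames - 1) * hop_len + frame_len
    signal = np.pad(signal, (0, padded_len - len(signal)))
    return np.stack([signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)])

# Illustrative usage: 1-second segments with 50 % overlap at a 16 kHz sampling rate.
sr = 16000
segments = frame_audio(np.random.randn(int(3.4 * sr)), frame_len=sr, hop_len=sr // 2)
```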
Optionally, the audio detection device may further add noise, reverberation, and the like to the acquired audio to be detected, so as to increase the robustness of the audio to be detected.
Optionally, the audio detection device may further perform speed adjustment, noise addition, pitch adjustment, volume adjustment and the like on the audio data to generate training data that is similar to but different from the audio data of the audio to be detected, and then use the augmented audio data as the training data set. By labeling and augmenting the original audio data, the scale of the training data set is enlarged and the generalization ability of the audio detection network during subsequent training is improved.
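As a rough sketch of how such augmentation could be performed (the patent names only the operation types, so the librosa calls and parameter values below are assumptions):

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> dict:
    """Generate variants of a clip by speed, pitch, volume and additive-noise changes.

    The stretch rate, pitch step, gain and noise level are illustrative values only.
    """
    return {
        "speed":  librosa.effects.time_stretch(y, rate=1.1),          # speed adjustment
        "pitch":  librosa.effects.pitch_shift(y, sr=sr, n_steps=2),   # pitch adjustment
        "volume": 0.7 * y,                                            # volume adjustment
        "noise":  y + 0.005 * np.random.randn(len(y)),                # noise addition
    }
```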
Specifically, before the audio to be detected is sent to the feature analysis model, the audio detection device also extracts the audio features of the audio to be detected by using the feature extraction model. The audio features may be one or more of the logarithmic mel spectrum, mel cepstral coefficients, filter bank features, perceptual linear prediction and channel energy normalization features.
The mel cepstral coefficients are based on the auditory characteristics of the human ear: the mel cepstral frequency bands are divided equally on the mel scale, and the logarithmic relation between the mel-frequency scale value and the actual frequency is more consistent with the auditory characteristics of the human ear, so the speech signal can be better represented. The logarithmic mel spectrum is obtained by taking the logarithm of the mel spectrum.
The filter bank features correspond to the mel cepstral coefficients with the last step, the discrete cosine transform, removed, and therefore retain more of the original speech information than the mel cepstral coefficients.
Perceptual linear prediction is a feature parameter based on an auditory model. The parameter is equivalent to a set of linear prediction coefficients, i.e. the coefficients of an all-pole model prediction polynomial. Perceptual linear prediction applies an auditory model of the human ear in the spectral analysis: the input speech signal is processed by the auditory model instead of the time-domain signal used by conventional linear prediction, which is beneficial to the extraction of noise-robust speech features.
Specifically, before the audio detection device acquires the acoustic features belonging to the positive sample in the audio to be detected by using the feature analysis model, the audio signal of the audio to be detected is subjected to frequency domain and/or cepstrum domain conversion by using the feature extraction model so as to obtain the audio features.
In the frequency-domain conversion, the one-dimensional sound signal of the audio to be detected is converted, through the Fourier transform, wavelet transform and the like, into a representation in terms of frequency components. In the cepstral-domain conversion, an inverse Fourier transform is applied to the logarithm of the short-time amplitude spectrum of the signal; the cepstrum can be used to analyze periodic structures in a complex spectrogram and to separate and extract the periodic components of densely frequency-modulated signals.
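For illustration, a minimal sketch of the frequency-domain and cepstral-domain conversions using librosa; the toolkit choice and the frame and mel parameters below are assumptions, since the patent does not name them.

```python
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int = 16000):
    """Convert a 1-D audio segment into frequency-domain and cepstral-domain features."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
    log_mel = librosa.power_to_db(mel)                  # logarithmic mel spectrum (frequency domain)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # mel cepstral coefficients (cepstral domain)
    return log_mel, mfcc
```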
Step 12: acquiring, by using the feature analysis model, the acoustic features in the audio to be detected that belong to a positive sample.
Specifically, the feature analysis model may use a GMM (Gaussian Mixture Model) to cluster the obtained acoustic features and screen out the desired data. Referring to fig. 4, fig. 4 is a schematic flow chart of a second embodiment of the audio detection method provided in the present application.
Step 41: obtaining, by the feature analysis model, the parameters of the probability density function by maximum likelihood estimation based on preset positive samples.
Specifically, maximum likelihood estimation finds the parameter values that are most likely to have produced the known sample distribution, and then uses that distribution.
Specifically, the probability density function of the GMM is

p(x) = \sum_{k=1}^{K} \alpha_k \, \phi(x \mid \mu_k, \sigma_k^2),

where K is the total number of Gaussian probability models, \phi(x \mid \mu_k, \sigma_k^2) is the k-th Gaussian probability model, \mu_k and \sigma_k^2 respectively represent its mean and variance, and \alpha_k is the prior probability of \phi(x \mid \mu_k, \sigma_k^2), with \alpha_k \ge 0 and \sum_{k=1}^{K} \alpha_k = 1.
Step 42: obtaining the probability density function by training based on the parameters.
The GMM is trained on the preset positive samples by maximum likelihood estimation, i.e. the parameters \theta = \{\alpha_k, \mu_k, \sigma_k^2\}_{k=1}^{K} are chosen to maximize the likelihood \prod_{i=1}^{N} p(x_i \mid \theta) of the positive-sample features x_1, \dots, x_N; the probability density function of the positive samples is then obtained by modeling with these parameters.
Step 43: obtaining the acoustic features belonging to the positive sample in the audio to be detected by using the probability density function.
Specifically, the feature analysis model in the audio detection device substitutes the acoustic features of the audio to be detected into the probability density function to cluster and screen the acoustic features, thereby obtaining the acoustic features belonging to the positive sample. Here, a positive sample indicates that the event to be judged has occurred.
Specifically, the audio detection device filters the acoustic features of the audio to be detected in advance through the trained acoustic feature model in the feature analysis model, and divides the sample data to be detected by a manually set probability threshold: if the probability that a sample is predicted to be a positive sample is lower than the manually set threshold, it is regarded as negative-sample data; otherwise, it is judged to be a positive sample and sent into the audio detection model of the next stage for further analysis.
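A minimal sketch of this screening stage, assuming scikit-learn's GaussianMixture as a stand-in for the feature analysis model; the component count and the manually set threshold are illustrative, not values from the disclosure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def screen_features(positive_features: np.ndarray, test_features: np.ndarray,
                    n_components: int = 8, percentile: float = 5.0) -> np.ndarray:
    """Keep only the test frames whose likelihood under the positive-sample GMM is high enough.

    GaussianMixture fits the preset positive samples by maximum likelihood (EM);
    frames of the audio under test whose log-likelihood falls below the manually
    chosen threshold are discarded as negative-sample data.
    """
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=0)
    gmm.fit(positive_features)                                         # (n_frames, n_dims)
    threshold = np.percentile(gmm.score_samples(positive_features), percentile)
    scores = gmm.score_samples(test_features)                          # per-frame log-likelihood
    return test_features[scores >= threshold]
```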
Step 13: performing classification detection on the acoustic features by using the audio detection model, and outputting the detection type of the audio to be detected based on the classification detection result.
Specifically, the network structure of the audio detection model includes a grouped convolution layer and a residual layer.
In grouped convolution, the feature maps of the input layer are divided into groups and each group is convolved with its own convolution kernels, which reduces the computation of the convolution. The residual layer helps to alleviate the problems of vanishing and exploding gradients when the network becomes very deep, so that the accuracy of the model is improved while the number of convolution layers keeps increasing.
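A sketch in PyTorch of one possible residual block built from grouped convolutions; the patent does not disclose the exact architecture, so the channel count, group count and kernel size below are assumptions.

```python
import torch
import torch.nn as nn

class GroupedResidualBlock(nn.Module):
    """Residual block whose two convolutions are grouped to reduce the multiply count."""

    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=groups)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection mitigates vanishing/exploding gradients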
Specifically, the classification detection result includes a probability value detected as a positive sample; if the probability value is higher than a first preset threshold value, outputting that the detection type of the audio to be detected is a positive sample.
Specifically, the classification detection result further includes the number of times of detection detected as a positive sample; if the detection times and the probability value are higher than a second preset threshold value and a first preset threshold value respectively, outputting the detection type of the audio to be detected as a positive sample.
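The two-threshold decision can be sketched as follows; both preset threshold values are illustrative, since the patent leaves them unspecified.

```python
def decide(frame_probs, prob_threshold: float = 0.5, count_threshold: int = 3) -> str:
    """Combine per-frame positive-sample probabilities into a clip-level decision.

    A frame counts as one positive detection when its probability exceeds the first
    preset threshold; the clip is output as a positive sample only if the number of
    such detections also exceeds the second preset threshold.
    """
    detections = [p for p in frame_probs if p > prob_threshold]
    return "positive" if len(detections) > count_threshold else "negative"

# Example: decide([0.2, 0.8, 0.9, 0.7, 0.95]) -> "positive" (4 detections above 0.5)
```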
The audio detection device performs preliminary screening on the acoustic features of the audio to be detected with the feature analysis module, eliminates the acoustic features belonging to the negative sample, and does not send them into the audio detection model. Only the acoustic features belonging to the positive sample are sent into the audio detection model for re-classification, which reduces the overall time consumption of audio event detection and improves the efficiency of audio detection.
Optionally, after the audio detection device obtains the class to which the acoustic features of the audio to be detected belong, it may further send the result for the positive class, that is, the corresponding audio data or the detection result that the event has occurred, to the user's client (such as a mobile phone APP), so as to inform the user that the event has occurred and allow the user to take follow-up actions in time.
In an embodiment of the present application, audio data of infant crying may be collected. The audio detection device divides the audio data into audio segments of equal length, performs frequency-domain transformation on the audio signal of each audio segment, and converts the one-dimensional audio signal into a frequency-domain signal, so as to improve the representation capability of the signal. The audio detection device then extracts suitable audio features from the frequency-domain signal and sends them to the feature analysis module for a first judgment of whether the infant is crying. The audio features judged to contain an infant-crying event are sent into the neural network in the audio detection model for audio detection again, so that the final detection result is obtained.
Unlike the prior art, the audio detection method provided by the application is applied to an audio detection device. The audio detection device acquires audio to be detected; acquires, by using a feature analysis model, the acoustic features in the audio to be detected that belong to a positive sample; and performs classification detection on the acoustic features by using an audio detection model, outputting the detection type of the audio to be detected based on the classification detection result. In this way, compared with a conventional audio detection method, the feature analysis model in the audio detection device pre-screens the acoustic features of the audio to be detected, removes the data most likely to be negative samples so that they are not sent into the audio detection model for classification, and sends only the data belonging to positive samples into the audio detection model for further prediction. This saves the computation time of the audio detection model, improves computational efficiency, improves the overall efficiency of audio event prediction, and reduces the waste of time and cost in the detection process.
The method of the above embodiment may be implemented by an audio detection device, and is described below with reference to fig. 5, where fig. 5 is a schematic structural diagram of a first embodiment of an audio detection device provided in the present application.
As shown in fig. 5, the audio detection apparatus 50 in the embodiment of the present application includes an acquisition module 51, a feature analysis module 52, and an audio detection module 53.
The acquiring module 51 is configured to acquire audio to be detected.
The feature analysis module 52 is configured to obtain acoustic features belonging to a positive sample in the audio to be detected by using the feature analysis model.
The audio detection module 53 is configured to perform classification detection on the acoustic features by using an audio detection model, and output a detection type of the audio to be detected based on a classification detection result.
The method of the above embodiment may be implemented by an audio detection device, and referring to fig. 6, fig. 6 is a schematic structural diagram of a second embodiment of the audio detection device provided in the present application, where the audio detection device 60 includes a memory 61 and a processor 62, the memory 61 is used for storing program data, and the processor 62 is used for executing the program data to implement the following method:
acquiring audio to be detected; acquiring acoustic characteristics belonging to a positive sample in the audio to be detected by using a characteristic analysis model; and carrying out classification detection on the acoustic features by using an audio detection model, and outputting the detection type of the audio to be detected based on the classification detection result.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a computer readable storage medium 70 provided in the present application, where the computer readable storage medium 70 stores program data 71, and the program data 71, when executed by a processor, is configured to implement the following method:
acquiring audio to be detected; acquiring acoustic characteristics belonging to a positive sample in the audio to be detected by using a characteristic analysis model; and carrying out classification detection on the acoustic features by using an audio detection model, and outputting the detection type of the audio to be detected based on the classification detection result.
Embodiments of the present application, when implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The foregoing description is only of embodiments of the present application and is not intended to limit the scope of the patent; all equivalent structures or equivalent process changes made using the description of the present application, or direct or indirect applications in other related technical fields, are likewise included within the scope of patent protection of the present application.

Claims (10)

1. An audio detection method, characterized in that the audio detection method comprises:
acquiring audio to be detected;
acquiring acoustic characteristics belonging to a positive sample in the audio to be detected by using a characteristic analysis model;
and carrying out classification detection on the acoustic features by using an audio detection model, and outputting the detection type of the audio to be detected based on a classification detection result.
2. The audio detection method of claim 1, wherein,
the classification detection result comprises a probability value detected as a positive sample;
and if the probability value is higher than a first preset threshold value, outputting the detection type of the audio to be detected as a positive sample.
3. The audio detection method of claim 2, wherein,
the classification detection result also comprises detection times of positive samples;
and if the detection times and the probability value are higher than a second preset threshold value and the first preset threshold value respectively, outputting that the detection type of the audio to be detected is a positive sample.
4. The audio detection method according to claim 1, wherein
Before the acoustic features belonging to the positive sample in the audio to be detected are acquired by using the feature analysis model, the method further comprises:
and extracting the audio characteristics of the audio to be detected by using a characteristic extraction model.
5. The audio detection method of claim 4, wherein,
the extracting the audio features of the audio to be detected by using the feature extraction model comprises:
and performing frequency domain and/or cepstrum domain conversion on the audio signal of the audio to be detected by using a feature extraction model to obtain audio features.
6. The audio detection method of claim 4, wherein,
the audio features may be one or more of logarithmic mel-spectrum, mel-cepstral coefficients, filter bank features, perceptual linear prediction, channel energy normalization features.
7. The audio detection method of claim 1, wherein,
the obtaining the acoustic features belonging to the positive sample in the audio to be detected by using the feature analysis model comprises the following steps:
obtaining, by the feature analysis model, parameters of a probability density function by using a maximum likelihood estimation method based on a preset positive sample;
training to obtain the probability density function based on the parameters;
and obtaining the acoustic characteristics belonging to the positive sample in the audio to be detected by using the probability density function.
8. The audio detection method of claim 1, wherein,
the network structure of the audio detection model comprises a grouped convolution layer and a residual layer.
9. An audio detection device, comprising a memory and a processor coupled to the memory;
wherein the memory is for storing program data and the processor is for executing the program data to implement the audio detection method according to any one of claims 1 to 8.
10. A computer storage medium for storing program data which, when executed by a computer, is adapted to carry out the audio detection method according to any one of claims 1 to 8.
CN202211734814.9A 2022-12-30 2022-12-30 Audio detection method, audio detection device and computer storage medium Pending CN116092517A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211734814.9A CN116092517A (en) 2022-12-30 2022-12-30 Audio detection method, audio detection device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211734814.9A CN116092517A (en) 2022-12-30 2022-12-30 Audio detection method, audio detection device and computer storage medium

Publications (1)

Publication Number Publication Date
CN116092517A true CN116092517A (en) 2023-05-09

Family

ID=86200401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211734814.9A Pending CN116092517A (en) 2022-12-30 2022-12-30 Audio detection method, audio detection device and computer storage medium

Country Status (1)

Country Link
CN (1) CN116092517A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination