CN117789755A - Audio data detection method and device and electronic equipment - Google Patents

Audio data detection method and device and electronic equipment

Info

Publication number
CN117789755A
CN117789755A (application number CN202311781734.3A)
Authority
CN
China
Prior art keywords
audio
preset
training
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311781734.3A
Other languages
Chinese (zh)
Inventor
杜海云
吴人杰
方瑞东
黄昀
史巍
林聚财
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202311781734.3A priority Critical patent/CN117789755A/en
Publication of CN117789755A publication Critical patent/CN117789755A/en
Pending legal-status Critical Current


Abstract

A method, an apparatus, and an electronic device for detecting audio data. The method comprises the following steps: obtaining audio data to be detected; inputting the audio data to be detected into a trained target audio detection model for detection to obtain at least one audio event and the audio event probability value corresponding to each audio event; screening a target audio event probability value from all the audio event probability values according to a preset rule; and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected. In this way, the trained target audio detection model is determined in advance, which ensures the accuracy of the model, and the target audio event in the audio data to be detected is then detected by that model, which ensures the accuracy of the determined target audio event.

Description

Audio data detection method and device and electronic equipment
Technical Field
The present disclosure relates to the field of audio event detection technologies, and in particular, to a method and an apparatus for detecting audio data, and an electronic device.
Background
In the process of detecting audio data, at least one audio event needs to be extracted from the audio data. An audio event may be, for example, noise, a crying sound, music, or an alarm sound. Generally, the audio data is derived from audio signals collected by different devices in different scenes.
To extract audio events from audio data, a spectrogram corresponding to the audio data is generally obtained first; the spectrogram reflects the audio features corresponding to the audio events. The distribution-map area of each audio feature is then obtained from the spectrogram, the audio feature distribution maps are sorted from large to small, the audio features whose distribution-map areas exceed a preset area are taken as universality features, and the audio events are detected according to these universality features.
In the above method, the universality features depend on the spectrogram area, the distribution area of each audio feature, and the preset distribution area. In an actual detection process, the spectrogram area is limited by the length of the audio signal data and by the audio event itself: the spectrogram areas and feature distribution maps corresponding to different audio events differ. Consequently, detecting audio events through universality features is accurate only for audio events that actually have universality; when an audio event lacks universality, the accuracy of extracting it from the audio data is low.
Disclosure of Invention
The application provides a method and an apparatus for detecting audio data, and an electronic device, which are used to improve the accuracy of extracting a target audio event from audio data to be detected.
In a first aspect, the present application provides a method for detecting audio data, the method including:
obtaining audio data to be detected;
inputting the audio data to be detected into a trained target audio detection model for detection to obtain at least one audio event and an audio event probability value corresponding to each audio event, wherein the audio event probability value of each audio event is larger than the preset audio event probability value of the corresponding preset audio event;
and screening target audio event probability values from all audio event probability values according to preset rules, and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected.
By the above method, the correspondence between audio events and audio event probability values determined by the target audio detection model is screened, which ensures the accuracy of the determined audio event.
In one possible design, the screening the target audio event probability value from all audio event probability values according to the preset rule includes:
arranging all audio event probability values in a preset sequence, screening out the maximum audio event probability value, and taking the maximum audio event probability value as the target audio event probability value; or
Determining preset audio events consistent with each audio event, calculating probability difference values between each audio event probability value and the corresponding preset audio event probability value, determining the maximum probability difference value from all the probability difference values, and taking the audio event probability value corresponding to the maximum probability difference value as a target audio event probability value.
By the above method, the maximum audio event probability value is screened from all the audio event probability values according to the preset rule and used as the target audio event probability value, which improves the accuracy of the target audio event.
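The two screening rules in this design can be sketched as follows; the function names are assumptions for illustration, not terms from the patent.

```python
# Illustrative sketch of the two preset screening rules; names are assumed.

def screen_by_max(event_probs):
    """Rule 1: arrange all audio event probability values and keep the maximum."""
    # event_probs: dict mapping audio event name -> probability value
    return max(event_probs, key=event_probs.get)

def screen_by_difference(event_probs, preset_probs):
    """Rule 2: keep the event whose probability value exceeds its preset
    audio event probability value by the largest margin."""
    diffs = {event: p - preset_probs[event] for event, p in event_probs.items()}
    return max(diffs, key=diffs.get)

probs = {"noise": 0.4, "crying": 0.4, "music": 0.5}
presets = {"noise": 0.2, "crying": 0.3, "music": 0.1}
print(screen_by_max(probs), screen_by_difference(probs, presets))  # music music
```

Note that the two rules can disagree: an event with a high absolute probability but an equally high preset threshold can lose under rule 2 to an event with a larger margin over its threshold.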
In one possible design, before the inputting of the audio data to be detected into the target audio detection model, the method further includes:
obtaining audio training data, wherein the audio training data comprises event tags;
determining a first target feature set in the audio training data;
inputting the first target feature set and the event tags into a preset audio training model for training according to a preset feature input mode, to obtain at least one audio training model;
and screening a target audio detection model from all the audio training models according to a preset model screening rule.
By the above method, the audio training data is used for training to obtain a plurality of audio training models, which facilitates screening out the target audio detection model with the highest accuracy.
In one possible design, the determining the first set of target features in the audio training data includes:
enhancing the audio training data according to a preset signal enhancement mode to obtain audio signal data corresponding to the audio training data;
inputting the audio signal data into a preset feature extraction model to obtain a first feature vector set corresponding to the audio signal data;
inputting the first feature vector set into a preset feature enhancement model to obtain a second feature vector set corresponding to the first feature vector set;
and inputting the second feature vector set into a preset feature recombination model to obtain a first target feature set corresponding to the audio training data.
By the above method, the audio training data is processed multiple times to obtain the first target feature set, which improves the generalization of the first target feature set.
In one possible design, the inputting the second feature vector set into a preset feature recombination model to obtain a first target feature set corresponding to the audio training data includes:
according to a first preset statistical formula, obtaining first statistics corresponding to the second feature vector set, wherein the first statistics comprise a first mean value and a first standard deviation of the second feature vector set in a preset dimension;
performing preset normalization processing on the second feature vector set based on the first statistics to obtain a third feature vector set corresponding to the second feature vector set;
determining all sequence-number arrangements corresponding to the second feature vector set, performing preset disorder processing on all the sequence-number arrangements, and calculating second statistics corresponding to the second feature vector set after the preset disorder processing;
substituting the first statistics and the second statistics into a second preset statistical formula to calculate combined statistics;
and performing preset inverse normalization processing on the combined statistics and the third feature vector set to obtain the first target feature set.
By the above method, the second feature vector set is input into the preset feature recombination model, and on the basis of the second feature vector set a first target feature set with more feature types and better generalization can be obtained.
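The five steps above can be sketched as follows. This excerpt does not give the two statistical formulas, so the sketch assumes per-vector mean/standard-deviation statistics and a convex combination as the "second preset statistical formula", in the style of feature-statistics mixing; all names are illustrative.

```python
import random

def recombine(features, lam=0.5, seed=0):
    """Sketch of the preset feature recombination model (assumed form).

    features: list of feature vectors (lists of floats).
    lam: mixing weight for the assumed convex-combination formula.
    """
    def stats(v):
        # first statistics: mean and standard deviation in one preset dimension
        m = sum(v) / len(v)
        sd = (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5
        return m, sd

    first = [stats(v) for v in features]
    eps = 1e-8
    # preset normalization -> third feature vector set
    third = [[(x - m) / (sd + eps) for x in v]
             for v, (m, sd) in zip(features, first)]
    # preset disorder processing on the sequence numbers
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    second = [first[i] for i in idx]  # second statistics, in shuffled order
    # assumed "second preset statistical formula": convex combination
    combined = [(lam * m1 + (1 - lam) * m2, lam * s1 + (1 - lam) * s2)
                for (m1, s1), (m2, s2) in zip(first, second)]
    # preset inverse normalization with the combined statistics
    return [[x * sd + m for x in v] for v, (m, sd) in zip(third, combined)]
```

With `lam=1.0` the combined statistics equal the first statistics and the output reduces to the input, which is a convenient sanity check on the normalize/denormalize pair.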
In one possible design, the inputting the first target feature set and the event tag into a preset audio training model for training according to a preset feature input mode to obtain at least one audio training model includes:
n preset network layers are determined from the preset audio training model, wherein N is a positive integer;
inputting the first target feature set and the event labels into the N preset network layers for training to obtain at least one audio training model;
when the audio training model accords with a preset model rule, outputting a target audio detection model corresponding to the preset audio training model.
By the above method, the first target feature set is trained to determine a plurality of audio training models, and the target audio detection model with the highest accuracy is screened from the plurality of audio training models.
In one possible design, the deriving at least one audio training model includes:
after training the first target feature set on an M-th preset network layer, obtaining a first training feature set, wherein M is a positive integer smaller than N;
inputting a plurality of first training features in the first training feature set into the preset feature recombination model for recombination training, and taking the plurality of first training features after recombination training as a second target feature set;
and inputting the second target feature set into an (M+1)-th preset network layer for training according to a preset insertion rule, to obtain at least one audio training model.
By the above method, the second target feature set is determined, and the first target feature set or the second target feature set is trained, which improves the accuracy of the determined audio training models.
In one possible design, when the preset audio training model meets a preset model rule, outputting a target audio detection model corresponding to the preset audio training model includes:
determining the actual iteration times corresponding to the preset audio training model and the loss value corresponding to each iteration training;
when the actual iteration number reaches the preset iteration number, determining that the preset audio training model accords with a preset model rule, and taking the preset audio training model of the last iteration as a target audio detection model; or alternatively
When at least one loss value is in a preset loss value range, determining that the preset audio training model accords with a preset model rule, screening a minimum loss value in the preset loss value range from the at least one loss value, and taking the preset audio training model corresponding to the minimum loss value as a target audio detection model.
By determining the target audio detection model in multiple ways, the model can be applied to a wider range of scenes, and the accuracy of the target audio detection model is improved.
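The two stopping rules can be sketched as follows. Since this excerpt does not state which rule takes precedence when both apply, the loss-range rule is checked first here; that ordering, and all names, are assumptions.

```python
def select_target_model(models, losses, max_iters, loss_range):
    """Pick the target audio detection model per the two preset model rules.

    models, losses: per-iteration model snapshots and their loss values.
    max_iters: the preset iteration count.
    loss_range: (low, high) bounds of the preset loss value range.
    """
    low, high = loss_range
    in_range = [(loss, model) for loss, model in zip(losses, models)
                if low <= loss <= high]
    if in_range:
        # rule 2: at least one loss falls in the preset range ->
        # take the model with the minimum such loss
        return min(in_range, key=lambda pair: pair[0])[1]
    if len(losses) >= max_iters:
        # rule 1: preset iteration count reached -> last iteration's model
        return models[-1]
    return None  # neither rule met: keep training
```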
In a second aspect, the present application provides an apparatus for detecting audio data, the apparatus comprising:
the acquisition module is used for acquiring audio data to be detected;
the input module is used for inputting the audio data to be detected into the target audio detection model for detection to obtain at least one audio event and an audio event probability value corresponding to each audio event;
and the determining module is used for screening out target audio event probability values from all audio event probability values according to preset rules, and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected.
In one possible design, the determining module is specifically configured to rank all audio event probability values according to a preset sequence, screen out a maximum audio event probability value, and use the maximum audio event probability value as a target audio event probability value, or determine a preset audio event consistent with each audio event, calculate a probability difference value between each audio event probability value and a corresponding preset audio event probability value, determine a maximum probability difference value from all the probability difference values, and use an audio event probability value corresponding to the maximum probability difference value as the target audio event probability value.
In one possible design, the obtaining module is specifically configured to obtain audio training data, where the audio training data includes event tags; determine a first target feature set in the audio training data; input the first target feature set and the event tags into a preset audio training model for training according to a preset feature input mode to obtain at least one audio training model; and screen out a target audio detection model from all the audio training models according to a preset model screening rule.
In one possible design, the obtaining module is further configured to enhance the audio training data according to a preset signal enhancement mode, obtain audio signal data corresponding to the audio training data, input the audio signal data into a preset feature extraction model, obtain a first feature vector set corresponding to the audio signal data, input the first feature vector set into a preset feature enhancement model, obtain a second feature vector set corresponding to the first feature vector set, and input the second feature vector set into a preset feature recombination model, so as to obtain a first target feature set corresponding to the audio training data.
In one possible design, the obtaining module is further configured to obtain a first statistics corresponding to the second feature vector set according to a first preset statistics formula, perform preset normalization processing on the second feature vector set based on the first statistics to obtain a third feature vector set corresponding to the second feature vector set, determine all sequence number arrangements corresponding to the second feature vector set, perform preset disorder processing on all sequence number arrangements, calculate a second statistic corresponding to the second feature vector set after the preset disorder processing, bring the first statistics and the second statistic into a second preset statistics formula, calculate a combined statistic, and perform preset inverse normalization processing on the combined statistic and the third feature vector set to obtain a first target feature set.
In one possible design, the obtaining module is further configured to determine N preset network layers from the preset audio training models, input the first target feature set and the event tag into the N preset network layers for training, obtain at least one audio training model, and output a target audio detection model corresponding to the preset audio training model when the audio training model accords with a preset model rule.
In one possible design, the obtaining module is further configured to obtain a first training feature set after the first target feature set is trained on an mth preset network layer, input a plurality of first training features in the first training feature set into the preset feature recombination model for recombination training, take the plurality of first training features after the recombination training as a second target feature set, and input the second target feature set into an (m+1) th preset network layer for training according to a preset insertion rule, so as to obtain at least one audio training model.
In one possible design, the obtaining module is further configured to determine an actual iteration number corresponding to the preset audio training model and a loss value corresponding to each iteration training, determine that the preset audio training model meets a preset model rule when the actual iteration number reaches the preset iteration number, use a preset audio training model of a last iteration as a target audio detection model, or determine that the preset audio training model meets a preset model rule when at least one loss value is in a preset loss value range, screen out a minimum loss value in the preset loss value range from the at least one loss value, and use a preset audio training model corresponding to the minimum loss value as the target audio detection model.
In a third aspect, the present application provides an electronic device, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the method for detecting the audio data when executing the computer program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of the above method for detecting audio data.
For the technical effects of the second to fourth aspects and of each of their possible designs, refer to the technical effects achievable by the first aspect or by the various possible designs of the first aspect; they are not repeated here.
Drawings
Fig. 1 is a flowchart of steps of a method for detecting audio data provided in the present application;
FIG. 2 is a schematic flow chart of determining a target audio detection model provided in the present application;
FIG. 3 is a schematic flow chart of determining a first target feature set provided in the present application;
fig. 4 is a schematic diagram of a training process of the second feature vector set in the preset feature recombination model provided in the present application;
Fig. 5 is a schematic diagram of a training flow of audio training data provided in the present application;
fig. 6 is a schematic structural diagram of an audio data detection device provided in the present application;
fig. 7 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings. The specific operation methods in the method embodiments may also be applied to the apparatus embodiments or the system embodiments. It should be noted that in the description of the present application, "a plurality of" is understood as "at least two". "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. "A is connected with B" may represent two cases: A is directly connected with B, or A is connected with B through C. In addition, in the description of the present application, the words "first", "second" and the like are used merely for distinguishing the descriptions and are not to be construed as indicating or implying relative importance or order.
In the prior art, in the process of extracting audio events from audio data, the universality features depend on the spectrogram area, the distribution area of each audio feature, and the preset distribution area. In an actual detection process, the spectrogram area is limited by the length of the audio signal data and by the audio event itself: the spectrogram areas and feature distribution maps corresponding to different audio events differ. Consequently, detecting audio events through universality features is accurate only for audio events that have universality; when an audio event lacks universality, the accuracy of extracting it from the audio data is low.
To solve the above problems, an embodiment of the present application provides a method for detecting audio data, which is used to accurately extract an audio event from audio data to be detected. The method and the apparatus of the embodiments of the present application are based on the same technical concept; since the principles by which they solve the problem are similar, the embodiments of the apparatus and the method can refer to each other, and repeated descriptions are omitted.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, the present application provides a method for detecting audio data, which can improve accuracy of determining an audio event corresponding to audio data to be detected, and the implementation flow of the method is as follows:
step S1: and obtaining the audio data to be detected.
Step S2: and inputting the audio data to be detected into a target audio detection model for training to obtain at least one audio event and an audio event probability value corresponding to each audio event.
To obtain the audio events corresponding to the audio data to be detected, the audio data to be detected needs to be input into a trained target audio detection model for detection, and the audio data to be detected is preprocessed in the target audio detection model. The preprocessing process is as follows:
First, to increase the data diversity of the audio data to be detected, signal enhancement needs to be performed on it; the signal enhancement mode may be waveform movement, volume change, reverberation addition, noise addition, or the like. Feature extraction is then performed on the signal-enhanced audio data to be detected, for example using the feature extraction modes described below. Finally, feature enhancement is performed on the extracted features; the feature enhancement mode may be distortion enhancement, addition enhancement, or the like, and subsequent processing is performed on the feature-enhanced audio data to be detected.
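The signal enhancement modes named above might be sketched as the minimal pure-Python stand-ins below, operating on a list of samples. The function names are assumptions, and reverberation addition is omitted since it requires convolution with an impulse response.

```python
import random

# Illustrative stand-ins for the named signal enhancement modes.

def shift_waveform(samples, n):
    """Waveform movement: circularly shift the samples by n positions."""
    n %= len(samples)
    return samples[-n:] + samples[:-n] if n else list(samples)

def change_volume(samples, gain):
    """Volume change: scale the amplitude of every sample by a gain factor."""
    return [s * gain for s in samples]

def add_noise(samples, level, seed=0):
    """Noise addition: mix uniform noise of the given level into the signal."""
    rng = random.Random(seed)
    return [s + rng.uniform(-level, level) for s in samples]
```

In practice the enhancement modes would be applied to the raw waveform with randomized parameters, one or more modes per example.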
Further, because the audio data to be detected corresponds to at least one audio event, in order to ensure the accuracy of the audio events output by the target audio detection model, before the model outputs at least one audio event and the audio event probability value corresponding to each audio event, the preset audio event probability value associated with each preset audio event is determined, together with the audio event probability value that the target audio detection model determines for each audio event.
Because each determined audio event can be matched with a consistent preset audio event in the target audio detection model, the audio event probability value needs to be compared with the preset audio event probability value. When the audio event probability value is larger than the corresponding preset value, the probability that the target audio event corresponding to the audio data to be detected is this audio event is high, and the target audio detection model outputs the audio event and its probability value. When the audio event probability value is smaller than the corresponding preset value, the probability that the target audio event is this audio event is low, and the target audio detection model does not output the audio event or its probability value.
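The comparison described above amounts to a threshold filter. A minimal sketch, assuming the probability values are held in dicts keyed by event name (names illustrative):

```python
def filter_events(raw_probs, preset_probs):
    """Output only audio events whose probability value exceeds the preset
    probability value of the matching preset audio event."""
    return {event: p for event, p in raw_probs.items()
            if p > preset_probs.get(event, 1.0)}  # unmatched events are dropped

print(filter_events({"noise": 0.4, "music": 0.05},
                    {"noise": 0.2, "music": 0.1}))  # {'noise': 0.4}
```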
Since the process by which the target audio detection model outputs the audio event probability value is the same for every audio event, only one audio event and its corresponding probability value are described here as an example.
By the above method, the audio event probability value corresponding to each audio event is determined, and the output correspondence is screened before it is output, which ensures the accuracy of the output audio events.
Step S3: and screening target audio event probability values from all audio event probability values according to preset rules, and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected.
After the target audio detection model determines the correspondence between at least one audio event and its audio event probability value, and because the audio data to be detected corresponds to at least one audio event, the target audio event probability value needs to be screened from all the audio event probability values according to a preset rule in order to ensure the accuracy of the determined target audio event. There are several ways to screen out the target audio event probability value, as follows:
Mode one: arrange all the audio event probability values from large to small, screen out the maximum audio event probability value, and take the maximum audio event probability value as the target audio event probability value.
Mode two: for each determined audio event, the target audio detection model stores an association between the preset audio event and the preset audio event probability value corresponding to it, as shown in Table 1 below:
Preset audio event | Preset audio probability value
A                  | 0.2
B                  | 0.3
C                  | 0.1
……                 | ……
TABLE 1
Table 1 lists the preset audio event probability values corresponding to three preset audio events; the preset audio events and their probability values can be adjusted according to the actual situation, and Table 1 is only an illustrative example.
Because of this association between each preset audio event and its preset audio event probability value, the preset audio event consistent with each audio event can be determined; the probability difference between each audio event probability value and the corresponding preset value is then calculated, yielding a plurality of probability differences.
For example, the process of determining the plurality of probability differences is as follows:
Audio event | Audio event probability value | Preset audio probability value | Probability difference
Event 1     | 0.4                           | 0.2                            | 0.2
Event 2     | 0.4                           | 0.3                            | 0.1
Event 3     | 0.5                           | 0.1                            | 0.4
TABLE 2
Table 2 lists the probability differences corresponding to three audio events. The audio event probability value of Event 1 is 0.4, the preset audio probability value consistent with Event 1 is 0.2, and the difference between 0.4 and 0.2 is 0.2, so the probability difference of Event 1 is 0.2. The probability differences of Event 2 and Event 3 in Table 2 are calculated in the same way and are not described here.
After the probability difference corresponding to each audio event is calculated according to the above method, all the probability differences are arranged from large to small, the maximum probability difference is screened out, and the audio event corresponding to the maximum probability difference is taken as the target audio event.
It should be noted that, in both mode one and mode two, when more than one audio event shares the maximum audio event probability value, the detection duration corresponding to each such probability value is detected, the minimum detection duration is screened from all the detection durations, and the maximum audio event probability value corresponding to the minimum detection duration is taken as the target audio event probability value.
After the target audio event probability value is determined, the audio event corresponding to the target audio event probability value is used as the target audio event corresponding to the audio data to be detected, so that the target audio event is detected from the audio data to be detected.
In the embodiment of the present application, the number of target audio events corresponding to the audio data to be detected may be set according to the actual detection requirement; since the way multiple target audio events are determined is consistent with the above process, it is not repeated here.
By the above method, the trained target audio detection model is used to detect the audio data to be detected, and the correspondence between the audio events output by the model and their probability values is screened, which improves the accuracy of determining the target audio event.
In the embodiment of the present application, the target audio detection model is used to detect the audio data to be detected. A flow chart for determining the target audio detection model is shown in fig. 2, and the specific determination process is as follows:
step S21: in order to determine the training sample, various types of time audios need to be obtained, audio training data are generated, and in order to accurately obtain an accurate target audio detection model, the audio training data need to be marked so that event labels are contained in the audio training data.
Step S22: a first set of target features in the audio training data is determined.
In order to augment the audio training data, the audio training data needs to be processed to obtain the first target feature set. A flow chart for determining the first target feature set is shown in fig. 3, which specifically includes the following steps:
step S31: and enhancing the audio training data according to a preset signal enhancement mode to obtain audio signal data corresponding to the audio training data.
In order to increase the diversity of the audio training data, the audio training data needs to be enhanced according to a preset signal enhancement mode, and the enhanced audio training data is used as the audio signal data. At least one of the listed preset signal enhancement modes may be adopted; each preset signal enhancement mode has a corresponding enhancement intensity range, and the enhancement intensity range may also be set according to the actual situation, which is not described in detail here.
For example, the volume of the audio training data can be changed according to the amplitude or energy level of its audio signal, and the enhancement intensity for changing the volume can be set to 1.5; this is likewise not described in detail here.
When increasing the diversity of the audio training data, part of the audio training data may be enhanced offline, the offline-enhanced audio signal data may be stored, and the data size of this audio signal data may be increased to be consistent with the data sizes of the other types of audio signal data.
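As a minimal sketch of the volume-change enhancement mode mentioned above, the waveform amplitude can simply be scaled by a gain factor; the default gain of 1.5 matches the example intensity above, while the clipping to [-1, 1] is an assumption for float-encoded audio:

```python
import numpy as np

def change_volume(samples, gain=1.5):
    """One preset signal enhancement mode: scale the waveform amplitude
    by `gain` and clip to the valid float range [-1, 1]."""
    return np.clip(np.asarray(samples, dtype=float) * gain, -1.0, 1.0)
```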
Step S32: inputting the audio signal data into a preset feature extraction model to obtain a first feature vector set corresponding to the audio signal data.
After all the audio signal data are determined, all the audio signal data need to be input into a preset feature extraction model. The preset feature extraction model may extract the first feature vector set using the logarithmic Mel spectrum (English full name: log-Mel spectrum, abbreviated as: log-Mel), Mel-frequency cepstral coefficients (English full name: Mel-Frequency Cepstral Coefficients, abbreviated as: MFCC), filter bank features (English full name: Filter Bank, abbreviated as: FBANK), perceptual linear prediction (English full name: Perceptual Linear Predictive, abbreviated as: PLP), per-channel energy normalization features (English full name: Per-Channel Energy Normalization, abbreviated as: PCEN), and the like. The above modes of extracting the first feature vector set by the preset feature extraction model are given only as examples and are not described in detail here.
The first feature vector output by the preset feature extraction model may be a single feature or a spliced combination feature. The spliced feature may be obtained by splicing different feature vectors along different dimensions, where the dimensions may be the time dimension and the frequency dimension, or may be obtained by calculating the feature vectors at a specified position.
By the method, the audio signal data is converted from the time domain signal into the first feature vector which is convenient for the neural network to process, so that the training efficiency of the audio training data is improved.
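To illustrate this time-domain-to-feature conversion, a log-magnitude spectrogram (the precursor of a log-Mel feature; the Mel filter bank is omitted here for brevity) can be computed with nothing more than a windowed FFT. The frame size, hop size and floor constant are illustrative assumptions:

```python
import numpy as np

def log_spectrogram(samples, n_fft=512, hop=256):
    """Convert a time-domain signal into a log-magnitude time-frequency
    feature suitable for a neural network (Mel filtering omitted)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = samples[start:start + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))
        frames.append(np.log(mag + 1e-10))  # floor avoids log(0)
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)
```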
Step S33: inputting the first feature vector set into a preset feature enhancement model to obtain a second feature vector set corresponding to the first feature vector set;
after the first feature vector set is determined, the first feature vector set is input into a preset feature enhancement model for enhancement. The feature enhancement modes in the preset feature enhancement model may be distortion enhancement, additive enhancement, random-drop enhancement, and the like; these modes are given only as examples. The preset feature enhancement model may adopt at least one feature enhancement mode to obtain the second feature vector set corresponding to the first feature vector set.
By the method, the feature vector diversity of the first feature vector set is increased, and the detection accuracy and the robustness of the target audio detection model are improved.
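A minimal sketch of the random-drop enhancement mode is to zero out a randomly chosen frequency band of a feature map (a SpecAugment-style mask); the `max_width` parameter is an illustrative assumption, not specified by the patent:

```python
import numpy as np

def random_drop(features, max_width=8, rng=None):
    """Random-drop feature enhancement: zero a random band along the
    frequency axis of a (time, frequency) feature map."""
    rng = rng or np.random.default_rng()
    out = features.copy()
    width = rng.integers(0, max_width + 1)            # band width, may be 0
    start = rng.integers(0, features.shape[1] - width + 1)
    out[:, start:start + width] = 0.0
    return out
```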
Step S34: and inputting the second feature vector set into a preset feature recombination model to obtain a first target feature set corresponding to the audio training data.
After the second feature vector set is determined, the specific process of determining the first target feature set is as follows:
The training flow of the second feature vector set in the preset feature recombination model is shown in fig. 4. A first statistic corresponding to the second feature vector set in a preset dimension is calculated according to a first preset statistical formula, where the preset dimension may be the time dimension or the frequency dimension, and the first statistics include the first mean and the first standard deviation. For a second feature vector B with dimension information (N, C, H, W), the first preset statistical formula is as follows:
B_wm = (1/W) · Σ_w B
B_ws = sqrt( (1/W) · Σ_w (B − B_wm)² )
In the first preset statistical formula, (N, C, H, W) represents the dimension information of the second feature vector, where N represents the number of second feature vectors, C represents the number of channels for training the second feature vectors, H represents the number of matrix rows of the second feature vectors, and W represents the number of matrix columns of the second feature vectors; B_wm represents the first mean of the second feature vector in the preset dimension W, and B_ws represents the first standard deviation of the second feature vector in the preset dimension W.
After the first statistics are calculated according to the first preset statistics formula, the second feature vector set is subjected to preset normalization processing according to the first statistics, so that environmental noise in the second feature vector is reduced, and a third feature vector set corresponding to the second feature vector set is obtained.
The formula for obtaining the third feature vector is as follows:
B_wr = (B − B_wm) / B_ws
In the above formula, B_wr represents the third feature vector, B represents the second feature vector, B_wm represents the first mean in the preset dimension, and B_ws represents the first standard deviation in the preset dimension.
All sequence number arrangements corresponding to the second feature vector set are then determined, preset out-of-order processing is performed on all the sequence number arrangements, and the second statistics corresponding to the second feature vector set after the preset out-of-order processing are calculated by the same formula:
B'_wm = (1/W) · Σ_w B'
B'_ws = sqrt( (1/W) · Σ_w (B' − B'_wm)² )
In the above formulas, B' represents the second feature vector after the out-of-order processing, whose dimension information is still (N, C, H, W); B'_wm represents the second mean of the out-of-order second feature vectors, and B'_ws represents the second standard deviation of the out-of-order second feature vectors.
After the first statistics and the second statistics are determined, the first statistics and the second statistics are substituted into a second preset statistical formula to calculate the combined statistics. The second preset statistical formula is as follows:
B_mm = w · B_wm + (1 − w) · B'_wm
B_ms = w · B_ws + (1 − w) · B'_ws
In the second preset statistical formula, w is a random number in the range [0, 1]; the first statistics before the out-of-order processing are B_wm and B_ws, and the second statistics after the out-of-order processing are B'_wm and B'_ws.
Preset inverse normalization processing is performed on the combined statistics and the third feature vector set, and the third feature vector set after the preset inverse normalization processing is called the first target feature set. The formula for obtaining the first target feature through the preset inverse normalization is as follows:
B″ = B_ms · B_wr + B_mm
In the above formula, B″ represents the first target feature, B_ms represents the combined standard deviation, B_mm represents the combined mean, and B_wr represents the third feature vector.
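Putting the steps of the preset feature recombination model together, the normalization, statistics shuffling, mixing and inverse normalization can be sketched for a batch of features shaped (N, C, H, W). This mirrors MixStyle-type statistic mixing; the `eps` term and the NumPy batch shuffle are implementation assumptions:

```python
import numpy as np

def recombine_features(B, w=None, rng=None, eps=1e-5):
    """Feature recombination: normalize each sample by its own statistics
    over the last (W) dimension, mix those statistics with the statistics
    of a shuffled batch, then de-normalize with the mixed statistics."""
    rng = rng or np.random.default_rng()
    # first statistics over the preset dimension W
    B_wm = B.mean(axis=-1, keepdims=True)
    B_ws = B.std(axis=-1, keepdims=True) + eps
    B_wr = (B - B_wm) / B_ws                 # third feature set (normalized)
    perm = rng.permutation(B.shape[0])       # preset out-of-order processing
    B2_wm, B2_ws = B_wm[perm], B_ws[perm]    # second statistics
    if w is None:
        w = rng.uniform(0.0, 1.0)            # random mixing weight in [0, 1]
    B_mm = w * B_wm + (1 - w) * B2_wm        # combined mean
    B_ms = w * B_ws + (1 - w) * B2_ws        # combined standard deviation
    return B_ms * B_wr + B_mm                # preset inverse normalization
```

With w = 1 the combined statistics equal the original ones, so the output reduces to the input, which is a handy sanity check.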
Step S23: and inputting the first target feature set and the event label into a preset audio training model for training according to a preset feature input mode to obtain at least one audio training model.
After the first target feature set is determined, in order to improve the accuracy of the determined target audio detection model, N preset network layers need to be determined from the preset audio training model, where N is a positive integer and a preset network layer may be a neural network arranged in the preset audio training model. The first target feature set and the event labels are then input into the N preset network layers for training. After the first target feature set is trained on the M-th preset network layer, a first training feature set is obtained, where M is a positive integer smaller than N, and the plurality of first training features in the first training feature set are then input into the preset feature recombination model for training.
After the first training features are trained in the preset feature recombination model, the plurality of first training features after recombination training are taken as a second target feature set, and the second target feature set is input into the (M+1)-th preset network layer for training to obtain at least one audio training model.
When the first training feature is present, the first training feature is input into the next preset network layer for training, and when the first training feature is not present, the first target feature set is input into the next preset network layer for training.
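The layer-insertion scheme above, where recombination is applied after the M-th preset network layer and the result feeds the (M+1)-th layer, can be sketched generically; the layer and recombination callables are placeholders, and applying recombination only during training is an assumption consistent with its role as an augmentation step:

```python
def forward_with_recombination(x, layers, m, recombine, training=True):
    """Run features through the preset network layers; after the M-th
    layer (index m), apply feature recombination during training only."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if training and i == m:
            x = recombine(x)  # output becomes the input of the (M+1)-th layer
    return x
```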
Step S24: and screening a target audio detection model from all the audio training models according to a preset model screening rule.
After a plurality of audio training models are determined, the actual number of iterations of the preset audio training model and the loss value corresponding to each iteration of training are determined. When the actual number of iterations reaches the preset number of iterations, it is determined that the preset audio training model complies with the preset model rule, and the preset audio training model of the last iteration is taken as the target audio detection model.
Alternatively, when at least one loss value is within the preset loss value range, it is determined that the preset audio training model complies with the preset model rule; all the loss values within the preset loss value range are sorted in descending order, the minimum loss value within the preset loss value range is screened out, and the preset audio training model corresponding to the minimum loss value is taken as the target audio detection model.
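The loss-based screening rule can be sketched as follows; the preset loss value range here is an illustrative assumption:

```python
def select_model(models_with_losses, loss_range=(0.0, 0.1)):
    """Screen the target audio detection model: among models whose loss
    falls inside the preset loss value range, keep the minimum-loss one.
    Returns None when no model qualifies under the preset model rule."""
    lo, hi = loss_range
    in_range = [(loss, model) for model, loss in models_with_losses
                if lo <= loss <= hi]
    return min(in_range)[1] if in_range else None
```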
The generalization of the preset audio training model on different data can be enhanced through K-fold cross-validation. For example, the audio training data can be split into 5 audio training sub-data with no intersection between any two of them; 4 of the audio training sub-data are used as the training set for training, and the remaining one is used for testing. This process is repeated so that each audio training sub-data is tested on a trained audio training model, obtaining 5 audio training models. The process of screening the target audio detection model from all the audio training models has been described above, so a repeated description is omitted.
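The 5-fold split described above can be sketched as follows; the interleaved partition is just one simple way to obtain disjoint sub-data:

```python
def k_fold_splits(data, k=5):
    """Yield (train, validation) splits: k disjoint audio training
    sub-data sets, each used once for validation while the remaining
    k-1 sub-data sets train the model."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i
                 for item in fold]
        yield train, val
```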
In fig. 5, the audio training data is first annotated; the annotated audio training data then undergoes preset signal enhancement processing, preset feature extraction processing, preset feature enhancement processing and preset feature recombination model processing to obtain the first target feature set. The first target feature set is then used for model training, in which a preset number of preset network layers, from preset network layer 1 to preset network layer n, can be selected to process the first target feature set. The feature set processed by a preset network layer is input into the preset feature recombination model for processing to obtain the second target feature set; when recombination is not performed at the current layer, the first target feature set is input into the next preset network layer for training, and when recombination is performed, the second target feature set is input into the next preset network layer for training. This process is repeated to obtain at least one audio training model, and the target audio detection model is then determined from the audio training models.
By the method, signal enhancement, feature recombination and the like are performed on the audio training data, so that the feature vector of the audio training data with stronger generalization is obtained, and the accuracy of the target audio detection model is ensured.
Based on the same inventive concept, the embodiment of the present application further provides an audio data detection apparatus, where the audio data detection apparatus is configured to implement a function of an audio data detection method, and referring to fig. 6, the apparatus includes:
an obtaining module 601, configured to obtain audio data to be detected;
the input module 602 is configured to input the audio data to be detected into a target audio detection model for training, so as to obtain at least one audio event and an audio event probability value corresponding to each audio event;
the determining module 603 is configured to screen target audio event probability values from all audio event probability values according to a preset rule, and take an audio event corresponding to the target audio event probability value as a target audio event corresponding to the audio data to be detected.
In one possible design, the determining module 603 is specifically configured to rank all audio event probability values according to a preset sequence, screen out a maximum audio event probability value, and use the maximum audio event probability value as a target audio event probability value, or determine a preset audio event consistent with each audio event, calculate a probability difference between each audio event probability value and a corresponding preset audio event probability value, determine a maximum probability difference from all the probability differences, and use an audio event probability value corresponding to the maximum probability difference as the target audio event probability value.
In one possible design, the obtaining module 601 is specifically configured to obtain audio training data, where the audio training data includes an event tag, determine a first target feature set in the audio training data, input the first target feature set and the event tag into a preset audio training model according to a preset feature input mode to train, obtain at least one audio training model, and screen out target audio detection models from all the audio training models according to a preset model screening rule.
In one possible design, the obtaining module 601 is further configured to enhance the audio training data according to a preset signal enhancement mode, obtain audio signal data corresponding to the audio training data, input the audio signal data into a preset feature extraction model, obtain a first feature vector set corresponding to the audio signal data, input the first feature vector set into a preset feature enhancement model, obtain a second feature vector set corresponding to the first feature vector set, and input the second feature vector set into a preset feature recombination model, so as to obtain a first target feature set corresponding to the audio training data.
In one possible design, the obtaining module 601 is further configured to obtain a first statistics corresponding to the second feature vector set according to a first preset statistics formula, perform preset normalization processing on the second feature vector set based on the first statistics to obtain a third feature vector set corresponding to the second feature vector set, determine all sequence number arrangements corresponding to the second feature vector set, perform preset disorder processing on all sequence number arrangements, calculate a second statistic corresponding to the second feature vector set after the preset disorder processing, bring the first statistics and the second statistic into a second preset statistics formula, calculate a combined statistic, and perform preset inverse normalization processing on the combined statistic and the third feature vector set to obtain a first target feature set.
In one possible design, the obtaining module 601 is further configured to determine N preset network layers from the preset audio training models, input the first target feature set and the event tag into the N preset network layers for training, obtain at least one audio training model, and output a target audio detection model corresponding to the preset audio training model when the audio training model accords with a preset model rule.
In one possible design, the obtaining module 601 is further configured to obtain a first training feature set after the first target feature set is trained on an mth preset network layer, input a plurality of first training features in the first training feature set into the preset feature reorganization model for reorganization training, use the plurality of first training features after reorganization training as a second target feature set, and input the second target feature set into an (m+1) th preset network layer for training according to a preset insertion rule, so as to obtain at least one audio training model.
In one possible design, the obtaining module 601 is further configured to determine an actual iteration number corresponding to the preset audio training model and a loss value corresponding to each iteration training, determine that the preset audio training model meets a preset model rule when the actual iteration number reaches the preset iteration number, use a preset audio training model of a last iteration as a target audio detection model, or determine that the preset audio training model meets a preset model rule when at least one loss value is within a preset loss value range, screen out a minimum loss value within the preset loss value range from the at least one loss value, and use a preset audio training model corresponding to the minimum loss value as the target audio detection model.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, where the electronic device may implement the function of the foregoing audio data detection apparatus, and referring to fig. 7, the electronic device includes:
at least one processor 701, and a memory 702 connected to the at least one processor 701. In the embodiment of the present application, the specific connection medium between the processor 701 and the memory 702 is not limited; in fig. 7, the connection between the processor 701 and the memory 702 through the bus 700 is taken as an example. The bus 700 is shown with a thick line in fig. 7, and the manner of connection between the other components is merely illustrative and not limiting. The bus 700 may be divided into an address bus, a data bus, a control bus, etc.; for ease of representation, only one thick line is drawn in fig. 7, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 701 may also be referred to as a controller, and the name is not limited.
In the embodiment of the present application, the memory 702 stores instructions executable by the at least one processor 701, and the at least one processor 701 can execute a method for detecting audio data as described above by executing the instructions stored in the memory 702. The processor 701 may implement the functions of the various modules in the apparatus shown in fig. 6.
The processor 701 is a control center of the apparatus, and may connect various parts of the entire control device using various interfaces and lines, and by executing or executing instructions stored in the memory 702 and invoking data stored in the memory 702, various functions of the apparatus and processing data, thereby performing overall monitoring of the apparatus.
In one possible design, the processor 701 may include one or more processing units, and the processor 701 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor described above may also not be integrated into the processor 701. In some embodiments, the processor 701 and the memory 702 may be implemented on the same chip, or they may be implemented separately on their own chips.
The processor 701 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, which may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method for detecting audio data disclosed in connection with the embodiments of the present application may be directly embodied and executed by a hardware processor, or may be executed by a combination of hardware and software modules in the processor.
The memory 702 is a non-volatile computer-readable storage medium that can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 702 may include at least one type of storage medium, for example, flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disk, and the like. The memory 702 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 702 in the embodiments of the present application may also be a circuit or any other device capable of implementing a storage function, for storing program instructions and/or data.
By programming the processor 701, the code corresponding to the method for detecting audio data described in the foregoing embodiment may be cured into the chip, so that the chip can execute the step for detecting audio data in the embodiment shown in fig. 1 during operation. How to design and program the processor 701 is a technology well known to those skilled in the art, and will not be described in detail herein.
Based on the same inventive concept, the embodiments of the present application also provide a storage medium storing computer instructions that, when executed on a computer, cause the computer to perform a method of detecting audio data as previously discussed.
In some possible embodiments, aspects of a method of detecting audio data may also be implemented in the form of a program product comprising program code for causing a control apparatus to carry out the steps of a method of detecting audio data according to the various exemplary embodiments of the application as described herein above when the program product is run on a device.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (11)

1. A method for detecting audio data, comprising:
obtaining audio data to be detected;
inputting the audio data to be detected into a target audio detection model for training to obtain at least one audio event and an audio event probability value corresponding to each audio event, wherein the audio event probability value of each audio event is larger than a preset audio event probability value of a corresponding preset audio event;
And screening target audio event probability values from all audio event probability values according to preset rules, and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected.
2. The method of claim 1, wherein the screening the target audio event probability values from all audio event probability values according to the preset rule comprises:
arranging all audio event probability values according to a preset sequence, screening out a maximum audio event probability value, and taking the maximum audio event probability value as a target audio event probability value; or alternatively
Determining preset audio events consistent with each audio event, calculating probability difference values between each audio event probability value and the corresponding preset audio event probability value, determining the maximum probability difference value from all the probability difference values, and taking the audio event probability value corresponding to the maximum probability difference value as a target audio event probability value.
3. The method of claim 1, further comprising, prior to said inputting the audio data to be detected into a target audio detection model for training:
Obtaining audio training data, wherein the audio training data comprises event tags;
determining a first target feature set in the audio training data;
inputting the first target feature set and the event tag into a preset audio training model for training according to a preset feature input mode to obtain at least one audio training model;
and screening a target audio detection model from all the audio training models according to a preset model screening rule.
4. The method of claim 3, wherein the determining the first set of target features in the audio training data comprises:
enhancing the audio training data according to a preset signal enhancement mode to obtain audio signal data corresponding to the audio training data;
inputting the audio signal data into a preset feature extraction model to obtain a first feature vector set corresponding to the audio signal data;
inputting the first feature vector set into a preset feature enhancement model to obtain a second feature vector set corresponding to the first feature vector set;
and inputting the second feature vector set into a preset feature recombination model to obtain a first target feature set corresponding to the audio training data.
5. The method of claim 4, wherein inputting the second feature vector set into a preset feature recombination model to obtain a first target feature set corresponding to the audio training data comprises:
according to a first preset statistical formula, a first statistics corresponding to the second feature vector set is obtained, wherein the first statistics comprises a first mean value and a first standard deviation corresponding to the second feature vector set in a preset dimension;
performing preset normalization processing on the second feature vector set based on the first statistics to obtain a third feature vector set corresponding to the second feature vector set;
determining all sequence number arrangements corresponding to the second feature vector set, carrying out preset disorder processing on all sequence number arrangements, and calculating second statistics corresponding to the second feature vector set after the preset disorder processing;
the first statistic and the second statistic are brought into a second preset statistic formula, and combined statistic is calculated;
and carrying out preset inverse normalization processing on the combined statistic and the third feature vector set to obtain a first target feature set.
6. The method of claim 3, wherein the training the first target feature set and the event tag into a preset audio training model according to a preset feature input mode to obtain at least one audio training model comprises:
N preset network layers are determined from the preset audio training model, wherein N is a positive integer;
inputting the first target feature set and the event labels into the N preset network layers for training to obtain at least one audio training model;
when the audio training model accords with a preset model rule, outputting a target audio detection model corresponding to the preset audio training model.
7. The method of claim 6, wherein the obtaining at least one audio training model comprises:
obtaining a first training feature set after the first target feature set is trained on an M-th preset network layer, wherein M is a positive integer smaller than N;
inputting a plurality of first training features in the first training feature set into the preset feature recombination model for recombination training, and taking the plurality of first training features after recombination training as a second target feature set;
and inputting the second target feature set into an (M+1)-th preset network layer for training according to a preset insertion rule, to obtain the at least one audio training model.
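The layer-insertion scheme of claims 6-7 — train through layer M, recombine the resulting first training features, then feed the second target feature set to layer M+1 — might look like the sketch below. The linear-plus-ReLU layers and the statistics-swapping recombination are illustrative assumptions; the patent specifies neither the layer types nor the recombination formula.

```python
import numpy as np

def recombine(x, rng):
    # assumed recombination: replace each sample's per-feature statistics
    # with those of a randomly chosen (shuffled) sample in the batch
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True) + 1e-6
    perm = rng.permutation(x.shape[0])
    return (x - mu) / sd * sd[perm] + mu[perm]

def forward_with_recombination(x, weights, m, rng):
    # weights: the N preset network layers (here plain linear + ReLU for illustration)
    # m: 0-based index of the layer after which recombination is inserted
    for i, w in enumerate(weights):
        x = np.maximum(x @ w, 0.0)   # pass through the (i+1)-th preset network layer
        if i == m:                   # first training feature set, produced by layer M
            x = recombine(x, rng)    # second target feature set, fed to layer M+1
    return x
```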
8. The method of claim 6, wherein outputting the target audio detection model corresponding to the preset audio training model when the preset audio training model meets a preset model rule comprises:
determining the actual number of iterations corresponding to the preset audio training model and the loss value corresponding to each training iteration;
when the actual number of iterations reaches a preset number of iterations, determining that the preset audio training model meets the preset model rule, and taking the preset audio training model from the last iteration as the target audio detection model; or
when at least one loss value falls within a preset loss value range, determining that the preset audio training model meets the preset model rule, selecting the minimum loss value within the preset loss value range from the at least one loss value, and taking the preset audio training model corresponding to the minimum loss value as the target audio detection model.
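Claim 8's two stopping rules — take the last model once the preset iteration count is reached, or take the model whose loss is the minimum inside a preset loss range — could be expressed as the selection routine below; the per-iteration checkpoint list is a hypothetical representation, not something the patent prescribes.

```python
def select_target_model(checkpoints, losses, preset_iters, loss_range):
    # checkpoints[i]: the preset audio training model after iteration i
    # losses[i]: the loss value of that iteration (assumed layout)
    lo, hi = loss_range
    in_range = [(loss, i) for i, loss in enumerate(losses) if lo <= loss <= hi]
    if in_range:
        # rule 2: minimum loss value within the preset loss value range
        _, best = min(in_range)
        return checkpoints[best]
    if len(losses) >= preset_iters:
        # rule 1: preset iteration count reached -> last iteration's model
        return checkpoints[-1]
    return None  # neither rule met: keep training
```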
9. An audio data detection apparatus, comprising:
the acquisition module is used for acquiring audio data to be detected;
the input module is used for inputting the audio data to be detected into a target audio detection model for detection, to obtain at least one audio event and an audio event probability value corresponding to each audio event;
and the determining module is used for screening out target audio event probability values from all audio event probability values according to preset rules, and taking the audio event corresponding to the target audio event probability value as the target audio event corresponding to the audio data to be detected.
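The "preset rule" used by the determining module is not spelled out in the claim; a common choice, assumed here purely for illustration, is to take the event with the highest probability value, optionally requiring it to exceed a threshold:

```python
def select_target_event(event_probs, threshold=0.5):
    # event_probs: mapping from audio event to its probability value
    # (threshold-and-argmax is an assumed instance of the preset rule)
    event, prob = max(event_probs.items(), key=lambda kv: kv[1])
    return event if prob >= threshold else None
```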
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-8 when executing a computer program stored on said memory.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-8.
CN202311781734.3A 2023-12-21 2023-12-21 Audio data detection method and device and electronic equipment Pending CN117789755A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311781734.3A CN117789755A (en) 2023-12-21 2023-12-21 Audio data detection method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN117789755A true CN117789755A (en) 2024-03-29

Family

ID=90386402


Similar Documents

Publication Publication Date Title
CN109473123B (en) Voice activity detection method and device
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN102394062B (en) Method and system for automatically identifying voice recording equipment source
CN110718228B (en) Voice separation method and device, electronic equipment and computer readable storage medium
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN110675862A (en) Corpus acquisition method, electronic device and storage medium
CN110880329A (en) Audio identification method and equipment and storage medium
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
CN108899033B (en) Method and device for determining speaker characteristics
CN108877783A (en) The method and apparatus for determining the audio types of audio data
CN107293308A (en) A kind of audio-frequency processing method and device
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN108764114B (en) Signal identification method and device, storage medium and terminal thereof
CN110570870A (en) Text-independent voiceprint recognition method, device and equipment
CN111968670A (en) Audio recognition method and device
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN113450822A (en) Voice enhancement method, device, equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
CN116777569A (en) Block chain-based commodity big data voice introduction and intelligent checkout method and system
CN115083422B (en) Voice traceability evidence obtaining method and device, equipment and storage medium
CN114863943B (en) Self-adaptive positioning method and device for environmental noise source based on beam forming
CN117789755A (en) Audio data detection method and device and electronic equipment
CN114420100B (en) Voice detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination