CN112289340B

CN112289340B - Audio detection method and device

Info

Publication number: CN112289340B
Application number: CN202011211792.9A
Authority: CN
Inventors: 陈佳路; 王晓红; 刘鲁鹏; 元海明; 高强; 夏龙; 郭常圳
Original assignee: Beijing Ape Power Future Technology Co Ltd
Current assignee: Beijing Ape Power Future Technology Co Ltd
Priority date: 2020-11-03
Filing date: 2020-11-03
Publication date: 2024-05-07
Anticipated expiration: 2040-11-03
Also published as: CN112289340A

Abstract

The present specification provides an audio detection method and apparatus. The audio detection method includes: acquiring an audio file to be detected; inputting the audio file to be detected into a microphone pop detection model (hereinafter, pop detection model) for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval; and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability. Detecting pop audio segments with this method maintains both detection efficiency and detection accuracy, and thereby meets the audio processing requirements of practical application scenarios.

Description

Audio detection method and device

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to an audio detection method and apparatus.

Background

With the development of Internet technology, audio noise processing is applied in more and more scenarios, such as live streaming, communication, and voice broadcasting, in which noise must be removed from the audio in real time or offline so that users hear higher-quality audio. In the prior art, however, most noise cancellation implementations distinguish a primary sound source from secondary sound sources in the audio and remove unwanted noise by retaining the primary source and deleting the secondary ones. Because microphone popping is noise produced by the primary sound source itself during vocalization, this approach cannot remove pop noise, and the optimized audio still contains content that impairs listening. An effective solution to this problem is therefore needed.

Disclosure of Invention

In view of this, the present embodiments provide an audio detection method. The present disclosure also relates to an audio detection apparatus, a computing device, and a computer-readable storage medium, which solve the technical drawbacks of the prior art.

According to a first aspect of embodiments of the present disclosure, there is provided an audio detection method, including:

acquiring an audio file to be detected;

inputting the audio file to be detected into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval;

and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.

Optionally, the pop detection model is trained as follows:

obtaining a sample audio file, and formatting the sample audio file to obtain a standard audio file;

dividing the standard audio file into a plurality of audio intervals, and adding a pop label or a normal-speech label to each of the plurality of audio intervals according to attribute information of the audio intervals;

and forming sample audio interval pairs from the plurality of audio intervals and the plurality of labeled audio intervals, and training an initial pop detection model with the sample audio interval pairs to obtain the pop detection model.

Optionally, the inputting the audio file to be detected into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval, includes:

transforming the audio file to be detected using a processing window of a preset frame length and a frame shift of a preset duration, to obtain an amplitude spectrum corresponding to the audio file to be detected;

and extracting low-frequency features from the audio file to be detected according to the amplitude spectrum, and inputting the low-frequency features into the pop detection model for processing, to obtain the pop audio interval and the pop probability.

Optionally, the determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability includes:

determining a plurality of pop audio subintervals in the pop audio interval, and determining, according to the pop probability, the subinterval pop probability corresponding to each of the plurality of pop audio subintervals;

comparing the subinterval pop probability corresponding to each of the plurality of pop audio subintervals with a preset probability threshold;

and selecting, according to the comparison result, the pop audio subintervals whose subinterval pop probability is greater than or equal to the preset probability threshold, to form the pop audio segment in the audio file to be detected.

Optionally, the determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability includes:

comparing the subinterval pop probability corresponding to each of the plurality of pop audio subintervals with a preset probability threshold, and determining a plurality of target pop audio subintervals according to the comparison result;

extracting the plurality of target pop audio subintervals from the audio file to be detected to form a continuous pop audio subinterval;

and determining the pop audio segment in the audio file to be detected based on the continuous pop audio subinterval.

Optionally, the extracting the plurality of target pop audio subintervals from the audio file to be detected to form a continuous pop audio subinterval includes:

determining the audio positions of the plurality of target pop audio subintervals in the audio file to be detected, and determining the audio durations corresponding to the plurality of target pop audio subintervals;

judging, according to the audio positions and the audio durations, whether the plurality of target pop audio subintervals are continuous in the time dimension;

and if so, merging the plurality of target pop audio subintervals to obtain the continuous pop audio subinterval.
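The continuity check above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `merge_contiguous`, the (start, duration) representation in seconds, and the tolerance `tol` are all assumptions introduced here.

```python
from typing import List, Tuple


def merge_contiguous(
    subintervals: List[Tuple[float, float]], tol: float = 1e-6
) -> List[Tuple[float, float]]:
    """Merge (start, duration) subintervals that touch in time into (start, end) runs.

    `tol` is a hypothetical tolerance for floating-point gaps between neighbours.
    """
    merged: List[Tuple[float, float]] = []
    for start, duration in sorted(subintervals):
        end = start + duration
        if merged and start - merged[-1][1] <= tol:
            # This subinterval begins where the previous run ends: extend the run.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # There is a gap in time: start a new run.
            merged.append((start, end))
    return merged


# Two adjacent half-second subintervals and one isolated subinterval.
runs = merge_contiguous([(0.0, 0.5), (0.5, 0.5), (2.0, 0.5)])
```

Here the first two subintervals are merged into one run and the third stays separate, because it does not touch the others in time.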

Optionally, the determining the pop audio segment in the audio file to be detected based on the continuous pop audio subinterval includes:

framing the continuous pop audio subinterval to obtain a plurality of target audio frames;

performing pop detection on each of the plurality of target audio frames, comparing the detection results with a pop detection threshold, and forming a pop audio sequence according to the comparison result;

and in a case where the duration of the pop audio sequence is greater than a duration threshold, determining the pop audio segment in the audio file to be detected according to the audio frames that form the pop audio sequence.

Optionally, the determining, in a case where the duration of the pop audio sequence is greater than the duration threshold, the pop audio segment in the audio file to be detected according to the audio frames that form the pop audio sequence includes:

determining the duration of the pop audio sequence;

determining the duration threshold according to a preset condition;

judging whether the duration of the pop audio sequence is greater than or equal to the duration threshold;

and if so, executing the step of determining the pop audio segment in the audio file to be detected according to the audio frames that form the pop audio sequence.
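The frame-level detection and duration check can be sketched as below. This is an illustrative sketch under stated assumptions: `pop_sequences`, the per-frame probabilities, the frame duration `frame_s`, and both threshold values are hypothetical, and the comparison uses "greater than or equal to" as in the judging step above.

```python
from typing import List, Sequence, Tuple


def pop_sequences(
    frame_probs: Sequence[float],
    frame_s: float,
    detect_thr: float,
    min_duration_s: float,
) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs of runs of frames whose
    detection result reaches `detect_thr` and whose total duration reaches
    `min_duration_s`."""
    sequences: List[Tuple[int, int]] = []
    run_start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, p in enumerate(list(frame_probs) + [float("-inf")]):
        if p >= detect_thr and run_start is None:
            run_start = i
        elif p < detect_thr and run_start is not None:
            if (i - run_start) * frame_s >= min_duration_s:
                sequences.append((run_start, i - 1))
            run_start = None
    return sequences


# 10 ms frames; runs shorter than 20 ms are discarded.
seqs = pop_sequences(
    [0.9, 0.2, 0.8, 0.9, 0.7, 0.1], frame_s=0.01, detect_thr=0.5, min_duration_s=0.02
)
```

In this example the single frame at index 0 is too short to count as a pop audio sequence, while frames 2 to 4 form a sequence that passes the duration check.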

Optionally, the determining the pop audio segment in the audio file to be detected according to the audio frames that form the pop audio sequence includes:

framing the pop audio sequence to obtain a plurality of audio frames ordered in time;

selecting, from the plurality of time-ordered audio frames, the first audio frame whose output probability is greater than a first threshold as the start pop audio frame, and determining, from the plurality of time-ordered audio frames, the first audio frame after the start pop audio frame whose output probability is smaller than a second threshold as the end pop audio frame;

and determining the intermediate pop audio frames between the start pop audio frame and the end pop audio frame, and forming the pop audio segment from the start pop audio frame, the end pop audio frame, and the intermediate pop audio frames.
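The two-threshold selection of start and end frames can be sketched as follows. The function name `find_pop_span` and the threshold values are assumptions for illustration; falling back to the last frame when no frame drops below the second threshold is also an assumption, since the text does not say what happens in that case.

```python
from typing import Optional, Sequence, Tuple


def find_pop_span(
    probs: Sequence[float], start_thr: float, end_thr: float
) -> Optional[Tuple[int, int]]:
    """Return (start_frame, end_frame) indices of the pop segment, or None.

    The start frame is the first frame whose probability exceeds `start_thr`;
    the end frame is the first later frame whose probability is below `end_thr`.
    """
    start = next((i for i, p in enumerate(probs) if p > start_thr), None)
    if start is None:
        return None
    end = next(
        (i for i in range(start + 1, len(probs)) if probs[i] < end_thr),
        len(probs) - 1,  # assumption: segment runs to the last frame
    )
    return start, end


span = find_pop_span([0.1, 0.2, 0.9, 0.8, 0.6, 0.2, 0.1], start_thr=0.7, end_thr=0.3)
```

The frames strictly between the two returned indices are the intermediate pop audio frames. Using a lower second threshold than the first keeps the segment from ending on a brief dip in probability.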

Optionally, after the step of determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability is performed, the method further includes:

normalizing the audio file to be detected according to the pop audio segment to obtain a target audio file, and playing the target audio file;

wherein the normalization includes deleting the pop audio segment or reducing the energy of the pop audio segment.
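The two normalization options can be sketched with NumPy as below. This is only an illustration: `normalize_pop`, the `mode` and `gain` parameters, and the use of a constant gain to reduce energy are assumptions, not details given in the text.

```python
import numpy as np


def normalize_pop(
    samples: np.ndarray,
    sample_rate: int,
    start_s: float,
    end_s: float,
    mode: str = "attenuate",
    gain: float = 0.1,
) -> np.ndarray:
    """Delete the pop segment [start_s, end_s) or scale it down by `gain`."""
    a = int(round(start_s * sample_rate))
    b = int(round(end_s * sample_rate))
    if mode == "delete":
        # Cut the segment out; the output is shorter than the input.
        return np.concatenate([samples[:a], samples[b:]])
    # Attenuate: keep the timeline intact but reduce the segment's energy.
    out = samples.astype(np.float32, copy=True)
    out[a:b] *= gain
    return out


audio = np.ones(16000, dtype=np.float32)  # one second at 16 kHz
deleted = normalize_pop(audio, 16000, 0.25, 0.5, mode="delete")
attenuated = normalize_pop(audio, 16000, 0.25, 0.5, mode="attenuate")
```

Deleting shortens the file, which may matter when the audio must stay aligned with video; attenuating preserves the timing.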

Optionally, the audio file to be detected includes at least one of the following:

an audio file uploaded by a client, an audio file to be played, and an audio file to be processed.

According to a second aspect of embodiments of the present specification, there is provided an audio detection apparatus comprising:

the acquisition module is configured to acquire an audio file to be detected;

the processing module is configured to input the audio file to be detected into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval;

and the determining module is configured to determine a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.

According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:

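a memory and a processor;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to perform the following:

acquiring an audio file to be detected;

inputting the audio file to be detected into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval;

and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.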

According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the audio detection method.

According to the audio detection method provided in this specification, after the audio file to be detected is acquired, it is input into the pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and the pop probability of that interval. To locate the pop audio segment accurately, the audio file to be detected is then processed according to the pop audio interval and its corresponding pop probability, to obtain the pop audio segment in the audio file to be detected. This locates the pop audio segment accurately, improves the efficiency of locating it, allows the method to be applied to a wider range of pop processing scenarios, and further improves the user's experience of listening to the audio file.

Drawings

Fig. 1 is a flowchart of an audio detection method according to an embodiment of the present disclosure;

Fig. 2 is a schematic structural diagram of a first pop detection model according to an embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of a second pop detection model according to an embodiment of the present disclosure;

Fig. 4 is a process flow diagram of an audio detection method applied to an audio playing scene according to an embodiment of the present disclosure;

Fig. 5 is a process flow diagram of another audio detection method applied to an audio playing scene according to an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of an audio detection device according to an embodiment of the present disclosure;

Fig. 7 is a block diagram of a computing device according to an embodiment of the present disclosure.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of this specification. However, this specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".

First, terms related to one or more embodiments of the present specification will be explained.

Microphone popping (pop): the phenomenon in which the airflow produced by the nasal and oral cavities during breathing or vocalization strikes the microphone directly and is recorded as a sudden, loud burst of airflow noise.

In the present specification, an audio detection method is provided, and the present specification relates to an audio detection apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.

In practical applications, popping is a common problem when a user records or speaks into a microphone. Popping is usually handled either by manual control or by post-production review. Manual control means that, when using the microphone, the user controls the distance and angle between the mouth and the microphone and fits the microphone with an additional device such as a foam cover, to reduce the direct impact of airflow on the microphone as far as possible and thereby reduce popping. Post-production review means that a technician locates the pop audio segments in the recorded audio and removes the effect of popping by processing the audio frames. Both approaches suffer from poor timeliness, low efficiency, and low precision, so eliminating popping from audio files effectively is particularly important.

Fig. 1 shows a flowchart of an audio detection method according to an embodiment of the present disclosure, which specifically includes the following steps:

Step S102, an audio file to be detected is obtained.

To improve both the accuracy and the efficiency of locating pop audio segments, after the audio file to be detected is acquired, it is input into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and the pop probability of that interval. To locate the pop audio segment accurately, the audio file to be detected is then processed according to the pop audio interval and its corresponding pop probability, to obtain the pop audio segment in the audio file to be detected. This improves the efficiency of locating pop audio segments, allows the method to be applied to a wider range of pop processing scenarios, and improves the user's experience of listening to the audio file.

In a specific implementation, the audio file to be detected is an audio file in which pop audio segments are to be detected. It may be an audio file uploaded by a client. For example, in a recorded-lesson scenario, a teacher records teaching content for students to watch later. After the teacher finishes recording the course, it must be sent to each student to watch or listen to. To keep popping in the recording from disturbing the students, the audio file uploaded by the teacher through the client needs to be checked and processed, so as to eliminate the popping and improve the students' listening experience.

Alternatively, the audio file to be detected may be an audio file to be played. For example, when a user listens to a voice message (the audio file to be played) in an instant messaging scenario, pop detection can be performed on the voice message to remove the pop audio segments from it, so that popping does not make the played speech unclear. The user then hears clear speech, which improves the user experience.

Alternatively, the audio file to be detected may be an audio file to be processed. For example, in a model training scenario, a large number of audio files are needed to train a translation model or a semantic recognition model, and pop audio segments in those files degrade the trained model. To improve the model's predictions, pop detection can be performed on the audio files (the audio files to be processed) in the preprocessing stage and the detected pop audio segments removed, which improves the training result.

In this embodiment, the audio detection method is described using an audio file recorded through a client as the audio file to be detected. The process of locating pop audio segments in other kinds of audio files can refer to the corresponding description in this embodiment and is not repeated here.

Step S104, inputting the audio file to be detected into a pop detection model for processing, to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval.

Specifically, once the audio file to be detected has been acquired, it needs to be processed to obtain a pop audio interval in the audio file and the pop probability of that interval. The pop audio interval is an audio segment in the audio file to be detected and includes at least one pop audio subinterval. The pop probability is the probability that a pop is present, or the probability that no pop is present, in that audio segment, and it allows the pop audio segment in the audio file to be detected to be located accurately in subsequent steps.

Based on this, the pop audio interval includes at least one audio frame. Correspondingly, the pop detection model is a classifier for pop detection: after processing the input audio, it outputs the probability that each audio segment in the audio contains or does not contain a pop, which makes it convenient to determine the pop audio segment in the audio file to be detected. The pop detection model includes, but is not limited to, a VGGish model or a ResNet (deep residual network) model; the structure of the VGGish model is shown in Fig. 2, and the structure of the ResNet model is shown in Fig. 3.

It should be noted that the VGGish model is larger and slower at pop detection but highly accurate, so it can be used in scenarios with lower real-time requirements, such as pop detection for recorded video or pop detection on audio files used for model training. When features are input to the model, they can be framed in units of a set duration, and the resulting audio frames may overlap to varying degrees as needed. Each audio frame contains a set number of STFT frequency bins; that is, a matrix of dimension [batch_size, s, p] is used as the model input feature, where batch_size is the number of frames to be detected in the audio file to be detected, s represents the duration of an audio frame, and p represents the amplitude-spectrum values of the selected feature frequency band. The model outputs a matrix of dimension [batch_size, 2], whose two values give, for each audio subinterval of the pop audio interval, the predicted probability that it contains a pop and the predicted probability that it does not.

Compared with the VGGish model, the ResNet model is smaller and faster at pop detection but somewhat less accurate, so it can be used in scenarios with high real-time requirements, such as pop detection in live streaming or in voice chat. When features are input to the model, they can be framed in units of a set duration with no overlap between audio frames, and each audio frame contains a set number of STFT frequency bins. The model can output a matrix of dimension [batch_size, L/s, 1], where batch_size corresponds to the number of files, L/s corresponds to the number of frames of duration s, and 1 corresponds to the probability that the current audio subinterval contains a pop.

In addition, the trained VGGish and ResNet models can each perform pop detection on their own, and other models not mentioned here can also perform pop detection with suitable configuration and training. The model can therefore be chosen according to the actual application scenario, and this embodiment places no limitation on it.

Further, to improve the predictive capability of the pop detection model, the audio used for training needs to be preprocessed so that other factors do not interfere with the model.

Specifically, the sample audio file is an audio file used for training the model, and contains both pop audio frames and normal audio frames without popping. The standard audio file is the audio file obtained by formatting the sample audio file so that it can be used for training. Formatting here means storing the sample audio file as single-channel PCM audio data sampled at 16 kHz with 16-bit quantization, which is the format required of audio files used for training the model.
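The formatting step can be sketched with NumPy as follows. This is a minimal sketch, not the formatting pipeline of the embodiment: `to_standard_pcm` is a hypothetical helper, the input is assumed to be a float waveform in [-1, 1], and the resampling uses plain linear interpolation without an anti-aliasing filter, which a production pipeline would replace with a proper resampler.

```python
import numpy as np

TARGET_RATE = 16_000  # 16 kHz, as stated in the text


def to_standard_pcm(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert a float waveform of shape (n,) or (n, channels) to
    16 kHz, 16-bit, single-channel PCM samples."""
    x = np.asarray(samples, dtype=np.float64)
    if x.ndim == 2:
        x = x.mean(axis=1)  # downmix to a single channel
    if sample_rate != TARGET_RATE:
        n_out = int(round(len(x) * TARGET_RATE / sample_rate))
        src_t = np.arange(len(x)) / sample_rate
        dst_t = np.arange(n_out) / TARGET_RATE
        # Assumption: naive linear interpolation, no low-pass filtering.
        x = np.interp(dst_t, src_t, x)
    # 16-bit quantization.
    return np.round(np.clip(x, -1.0, 1.0) * 32767).astype(np.int16)


stereo_48k = np.zeros((48000, 2))  # one second of silent stereo audio at 48 kHz
pcm = to_standard_pcm(stereo_48k, 48000)
```

The resulting array can be written out as a WAV file with one channel, a sample width of 2 bytes, and a frame rate of 16000.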

Based on this, the pop detection model predicts the probability that popping is present in an audio file. To improve prediction accuracy, an audio file can be divided into a plurality of audio intervals and the model can predict each interval separately; this interval-level prediction improves the model's ability to detect popping.

Further, after the standard audio file is divided into a plurality of audio intervals, a label is added to each audio interval according to its attribute information, where the attribute information indicates whether the audio interval contains a pop audio frame. If it does, a pop label is added to the audio interval; if it does not, a normal-speech label is added. The labeled audio interval serves as the model output and the audio interval before labeling serves as the model input; together they form a sample audio interval pair, which is used to train the initial pop detection model and obtain a pop detection model that meets the requirements.
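The labeling rule can be sketched as below. The sketch assumes that the positions of pops in the sample audio are already known as annotated (start, end) spans in seconds; `label_intervals`, the label strings, and the fixed interval length are hypothetical names and choices made for illustration.

```python
from typing import List, Tuple

POP, NORMAL = "pop", "normal"  # hypothetical label names


def label_intervals(
    total_s: float, interval_s: float, pop_spans: List[Tuple[float, float]]
) -> List[Tuple[Tuple[float, float], str]]:
    """Split [0, total_s) into fixed-length intervals and label each one.

    An interval receives the pop label if it overlaps any annotated pop span,
    and the normal-speech label otherwise.
    """
    pairs = []
    for k in range(int(total_s // interval_s)):
        start, end = k * interval_s, (k + 1) * interval_s
        has_pop = any(s < end and e > start for s, e in pop_spans)
        pairs.append(((start, end), POP if has_pop else NORMAL))
    return pairs


# A 4-second file with one annotated pop between 1.2 s and 1.4 s.
pairs = label_intervals(4.0, 1.0, [(1.2, 1.4)])
```

Each (interval, label) pair plays the role of one sample audio interval pair: the interval's audio is the model input and the label is the training target.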

It should be noted that model training requires repeated iteration to obtain a model with good predictive capability. When training the initial pop detection model, an iteration stop condition can therefore be set according to actual requirements; the condition can be based on the loss function or on the prediction accuracy of the model. After each training stage is completed, the current model is evaluated with test sample audio. If the iteration stop condition is met, training stops and the current model is taken as the pop detection model; if not, training continues until a model that meets the iteration stop condition is obtained.

Furthermore, the audio file to be detected may be too long to meet the input requirements of the pop detection model, so it can be preprocessed before being input, to obtain an input suitable for the pop detection model. In this embodiment, the specific implementation is as follows:

Specifically, owing to the structural requirements of the pop detection model, the low-frequency part of the short-time Fourier transform (STFT) of the audio file to be detected can be used as the feature input to the model. Accordingly, the audio file to be detected is first transformed using a processing window of a preset frame length and a frame shift of a preset duration to obtain its amplitude spectrum. The corresponding low-frequency features, expressed in matrix form, are then extracted according to the amplitude spectrum and used as the input to the pop detection model. Finally, the pop audio interval output by the model and its corresponding pop probability are obtained, so that the pop audio segment can subsequently be located based on the pop audio interval.

It should be noted that, because the audio file to be detected may be long, it can be divided into a plurality of audio intervals before the short-time Fourier transform is applied. The plurality of audio intervals are then transformed using the processing window of the preset frame length and the frame shift of the preset duration to obtain the amplitude spectrum of each audio interval, which makes it more convenient to extract the low-frequency features of each amplitude spectrum for model processing.

For example, the audio file to be detected is an audio file recorded by a teacher, and popping must be removed from it so that students can listen comfortably. A 512-point short-time Fourier transform with a frame shift of 10 ms is applied to the audio file to obtain a 257-dimensional amplitude spectrum, and a low-frequency band of that spectrum, such as bins [2, 66), is taken as the model input feature. The low-frequency features of the audio frames are input into the pop detection model in the format [batch_size, 96, 64], and the model outputs the pop audio interval and its pop probability. The pop audio interval predicted by the model is then used to determine the pop audio segment in the audio file.
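The feature extraction in this example can be sketched with NumPy as follows. Several details are assumptions rather than statements of the embodiment: a 16 kHz sampling rate (so that a 10 ms frame shift is 160 samples), a Hann window whose length equals the 512-point transform size, the half-open bin range 2 to 66 giving 64 bins, and the grouping of frames into non-overlapping patches of 96 frames. The function names are hypothetical.

```python
import numpy as np


def low_freq_features(pcm, n_fft=512, hop=160, band=(2, 66)):
    """Return the low-frequency amplitude spectrum, shape (n_frames, 64)."""
    x = np.asarray(pcm, dtype=np.float32)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Index matrix selecting one n_fft-sample frame per row, shifted by `hop`.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)        # assumed Hann window
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, 257) amplitude spectrum
    return mag[:, band[0]:band[1]]             # keep the low-frequency bins


def to_patches(features, frames_per_patch=96):
    """Group frames into a batch of shape (batch_size, 96, 64)."""
    n = features.shape[0] // frames_per_patch
    return features[: n * frames_per_patch].reshape(
        n, frames_per_patch, features.shape[1]
    )


two_seconds = np.zeros(32000, dtype=np.int16)  # 2 s of 16 kHz audio
feats = low_freq_features(two_seconds)
batch = to_patches(feats)
```

A real-valued 512-point transform yields 257 frequency bins, which is where the 257-dimensional amplitude spectrum comes from; slicing 64 of them and stacking 96 frames gives the [batch_size, 96, 64] input format.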

In summary, in order to improve the prediction capability of the model and improve the accuracy of detection of the wheat spraying audio frequency interval, the audio file input into the model can be preprocessed in a short-time Fourier transform mode, so that the model can predict the wheat spraying audio frequency interval through frequency spectrum characteristics, and the accuracy of positioning the wheat spraying audio frequency segment in the follow-up process is further improved.

And S106, determining the wheat spraying audio clips in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.

Specifically, after the above-mentioned wheat spraying audio interval and the corresponding wheat spraying probability output by the wheat spraying detection model are obtained, at this time, a wheat spraying audio segment in the audio file to be detected needs to be analyzed based on the wheat spraying audio interval and the wheat spraying probability, so that the audio file to be detected can be processed according to the wheat spraying audio segment to obtain an audio file meeting the playing requirement, where the wheat spraying audio segment specifically refers to an audio segment corresponding to a wheat spraying audio frame in the audio file to be detected.

In addition, different wheat spraying detection models can be selected for different application scenes. Accordingly, when the wheat spraying audio fragment in the audio file to be detected is subsequently located, a different determination manner is adopted for each wheat spraying detection model so as to match that model.

Based on this, on the one hand, when the wheat spraying detection model is a ResNet model, this indicates that the current wheat spraying detection scene has a high real-time requirement, so locating the wheat spraying audio fragment needs to be completed as quickly as possible; that is, detection precision is traded for detection efficiency to meet the wheat spraying detection requirement. In this embodiment, the specific implementation is as follows:

Specifically, when the ResNet model processes the audio file to be detected, the audio file may be divided into a plurality of audio subintervals according to the processing requirements of the model, and the wheat spraying probability is predicted for each audio subinterval. Therefore, after the wheat spraying audio interval output by the model is obtained, a plurality of wheat spraying audio subintervals can be determined from it, and the wheat spraying audio fragment is then determined by checking each wheat spraying audio subinterval.

Based on the above, the wheat spraying audio interval is first parsed to obtain a plurality of wheat spraying audio subintervals, and the sub-wheat spraying probability of each wheat spraying audio subinterval is determined from the wheat spraying probability of the wheat spraying audio interval output by the wheat spraying detection model (the ResNet model), which is the set of wheat spraying probabilities corresponding to the wheat spraying audio subintervals. The sub-wheat spraying probability of each wheat spraying audio subinterval is then compared with a preset probability threshold. Finally, the wheat spraying audio fragment in the audio file to be detected is determined according to the comparison result: the wheat spraying audio subintervals whose sub-wheat spraying probability is greater than or equal to the preset probability threshold are selected as the audio intervals with a higher wheat spraying probability, and the wheat spraying audio fragment is formed from these subintervals.

For example, after an audio file is obtained in chat software, in order to provide the user with better-quality audio, the audio file is split into a plurality of 0.96 s audio intervals, and a short-time Fourier transform is performed on each 0.96 s audio interval to obtain a 257-dimensional matrix that can be input into the model. The matrix corresponding to each audio interval is input into the ResNet model for processing, and the model outputs the wheat spraying probability of each audio interval: 55% for the first audio interval, 80% for the second audio interval, and 15% for the Nth audio interval. The wheat spraying probability of each audio interval is then compared with a preset probability threshold of 78%. The wheat spraying probability of the second audio interval is greater than the preset probability threshold, which indicates that the second audio interval is more likely to contain wheat spraying audio frames, so the second audio interval is determined to be the wheat spraying audio fragment in the audio file, and subsequent quality improvement of the audio file is performed according to the second audio interval.
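The threshold comparison in this example can be sketched as below. The function name and the returned tuple layout are hypothetical; the 0.96 s interval length, the 78% threshold, and the "greater than or equal to" rule come from the text, and the three probabilities are the values quoted in the example.

```python
INTERVAL_SEC = 0.96  # interval length used in the example


def select_intervals(probabilities, threshold=0.78):
    """Return (start_sec, end_sec, probability) for every interval whose
    wheat spraying probability is greater than or equal to the threshold."""
    return [
        (round(i * INTERVAL_SEC, 2), round((i + 1) * INTERVAL_SEC, 2), p)
        for i, p in enumerate(probabilities)
        if p >= threshold
    ]


# Probabilities quoted in the example: 55%, 80% and 15%.
print(select_intervals([0.55, 0.80, 0.15]))
```

Because each interval is compared independently and no further framing is done, this branch costs a single pass over the model outputs, which is consistent with the real-time emphasis of the ResNet case.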

In summary, in a scene with a high real-time requirement, the wheat spraying audio fragment in the audio file to be detected is located by comparing wheat spraying probabilities with a threshold, so that the positioning is completed in a shorter time and the audio detection efficiency is effectively improved.

On the other hand, when the wheat spraying detection model is a VGGish model, this indicates that the current wheat spraying detection scene has a low real-time requirement (that is, a scene that requires high wheat spraying detection accuracy), so the wheat spraying audio fragment needs to be located precisely; that is, detection efficiency is traded for detection accuracy to meet the wheat spraying detection requirement. In this embodiment, the specific implementation is as follows:

(1) Determining a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determining, according to the wheat spraying probability, the sub-wheat spraying probability corresponding to each of the plurality of wheat spraying audio subintervals.

(2) Comparing the sub-wheat spraying probability corresponding to each of the plurality of wheat spraying audio subintervals with a preset probability threshold, and determining a plurality of target wheat spraying audio subintervals according to the comparison result.

Specifically, when the VGGish model processes the audio file to be detected, the audio file may be divided into a plurality of audio subintervals according to the processing requirements of the model, and the wheat spraying probability is predicted for each audio subinterval. Therefore, after the wheat spraying audio interval output by the model is obtained, a plurality of wheat spraying audio subintervals can be determined from it, and the wheat spraying audio fragment is then determined by checking each wheat spraying audio subinterval.

Based on the above, the wheat spraying audio interval is first parsed to obtain a plurality of wheat spraying audio subintervals, and the sub-wheat spraying probability of each wheat spraying audio subinterval is determined from the wheat spraying probability of the wheat spraying audio interval output by the wheat spraying detection model (the VGGish model), which is the set of wheat spraying probabilities corresponding to the wheat spraying audio subintervals. Next, the sub-wheat spraying probability of each wheat spraying audio subinterval is compared with a preset probability threshold. Finally, a plurality of target wheat spraying audio subintervals are determined from the plurality of wheat spraying audio subintervals according to the comparison result: the wheat spraying audio subintervals whose sub-wheat spraying probability is greater than or equal to the preset probability threshold are selected as the target wheat spraying audio subintervals. A target wheat spraying audio subinterval specifically refers to an audio interval that contains wheat spraying audio frames.

For example, after an audio file is obtained in audio-visual software, in order to provide the user with better-quality audio, the audio file is split into a plurality of 0.96 s audio intervals, and a short-time Fourier transform is performed on each 0.96 s audio interval to obtain a 257-dimensional matrix that can be input into the model. The matrix corresponding to each audio interval is input into the VGGish model for processing, and the model outputs the wheat spraying probability of each audio interval: 85% for the first audio interval and 80% for the second audio interval. The wheat spraying probability of each audio interval is then compared with a preset probability threshold of 78%. The wheat spraying probabilities of the first and second audio intervals are both greater than the preset probability threshold, which indicates that both intervals are more likely to contain wheat spraying audio frames, so the first audio interval and the second audio interval are determined to be target wheat spraying audio subintervals, which are used subsequently to locate the wheat spraying audio fragment in the audio file precisely.

(3) Extracting the plurality of target wheat spraying audio subintervals from the audio file to be detected to form a continuous wheat spraying audio subinterval.

Specifically, after the target wheat spraying audio subintervals are obtained, the wheat spraying audio fragment in the audio file to be detected is located by frame-by-frame analysis, which improves the accuracy of the subsequently determined wheat spraying audio fragment and meets the processing requirement of the audio file to be detected. To this end, after the target wheat spraying audio subintervals are obtained, the continuous wheat spraying audio subinterval is formed according to the attributes of the target wheat spraying audio subintervals so as to assist in locating the wheat spraying audio fragment. The continuous wheat spraying audio subinterval specifically refers to an audio interval formed by a plurality of target wheat spraying audio subintervals that are continuous in time in the audio file to be detected.

Further, in the process of extracting the continuous wheat spraying audio subinterval from the audio file to be detected, adjacent target wheat spraying audio subintervals may be selected to form the continuous wheat spraying audio subinterval in order to improve the accuracy of subsequently locating the wheat spraying audio fragment. In this embodiment, the specific implementation is as follows:

Specifically, the audio position of each target wheat spraying audio subinterval in the audio file to be detected and the audio duration corresponding to each target wheat spraying audio subinterval are determined first; it is then determined whether the target wheat spraying audio subintervals are continuous in the time dimension; finally, the target wheat spraying audio subintervals that are continuous in time are merged to form the continuous wheat spraying audio subinterval.

Continuing the above example, when the first audio interval and the second audio interval are determined to be target wheat spraying audio subintervals, the two intervals are adjacent in time (0.00 s-0.96 s and 0.96 s-1.92 s), so they are merged to form the continuous wheat spraying audio subinterval (0.00 s-1.92 s), which is used subsequently to locate the wheat spraying audio fragment in the audio file.
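The merging of temporally adjacent target subintervals can be sketched as follows. This is an illustrative sketch under the assumption that every subinterval has the same 0.96 s length and is identified by its zero-based index in the audio file; the function name is hypothetical. Working on integer indices rather than on second values avoids floating-point comparisons when testing adjacency.

```python
def merge_adjacent(indices, interval_sec=0.96):
    """Merge runs of consecutive interval indices into (start_sec, end_sec)
    spans. Index i covers [i * interval_sec, (i + 1) * interval_sec)."""
    spans = []  # each entry is [first_index, index_after_last]
    for idx in sorted(set(indices)):
        if spans and spans[-1][1] == idx:
            spans[-1][1] = idx + 1      # extends the current run
        else:
            spans.append([idx, idx + 1])  # starts a new run
    return [
        (round(a * interval_sec, 2), round(b * interval_sec, 2))
        for a, b in spans
    ]


# First and second audio intervals from the example (indices 0 and 1).
print(merge_adjacent([0, 1]))
```

For the two intervals of the example this yields the single span 0.00 s to 1.92 s; a non-adjacent index would produce a separate span.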

(4) Determining the wheat spraying audio fragment in the audio file to be detected based on the continuous wheat spraying audio subinterval.

Specifically, after the continuous wheat spraying audio subinterval corresponding to the target wheat spraying audio subintervals is obtained, the wheat spraying audio fragment in the audio file to be detected can be determined by checking the continuous wheat spraying audio subinterval again.

Further, the wheat spraying audio fragment is determined based on the continuous wheat spraying audio subinterval as follows: framing the continuous wheat spraying audio subinterval to obtain a plurality of target audio frames;

Determining the duration of the wheat spraying audio sequence; determining the duration threshold according to preset conditions; judging whether the duration of the wheat spraying audio is greater than or equal to the duration threshold;

If yes, carrying out framing treatment on the wheat spraying audio sequence to obtain a plurality of audio frames ordered according to a time sequence; selecting an audio frame with the first output probability larger than a first threshold value from a plurality of audio frames sequenced according to time sequence as an initial wheat spraying audio frame, and determining an audio frame with the first output probability smaller than a second threshold value after the initial wheat spraying audio frame from the plurality of audio frames sequenced according to time sequence as a terminal wheat spraying audio frame; determining an intermediate wheat spraying audio frame between the initial wheat spraying audio frame and the end wheat spraying audio frame, and forming the wheat spraying audio segment based on the initial wheat spraying audio frame, the end wheat spraying audio frame and the intermediate wheat spraying audio frame, namely the wheat spraying audio sequence formed by the initial wheat spraying audio frame, the intermediate wheat spraying audio frame and the end wheat spraying audio frame is the wheat spraying audio segment;

If not, the wheat spraying audio sequence is not determined to be a wheat spraying audio fragment.

Specifically, the continuous wheat spraying audio subinterval is first re-framed with a fixed frame length and a smaller frame shift to obtain a plurality of audio frames. The features of the audio frames are then input into the VGGish wheat spraying detection model for processing to obtain the output probability of each audio frame. Among the audio frames ordered by time, the first audio frame whose output probability is greater than a first threshold is determined to be the initial wheat spraying audio frame, and the first subsequent audio frame whose output probability is less than a second threshold is determined to be the end wheat spraying audio frame. The intermediate wheat spraying audio frames between the initial wheat spraying audio frame and the end wheat spraying audio frame are then determined, and the wheat spraying audio sequence is formed from the initial, intermediate, and end wheat spraying audio frames.

Based on the above, after the duration threshold is determined according to the preset condition, the duration of the wheat spraying audio sequence is compared with the duration threshold. If the duration of the wheat spraying audio sequence is greater than or equal to the duration threshold, this indicates that most or all of the audio frames contained in the current wheat spraying audio sequence are wheat spraying audio frames, and the wheat spraying audio sequence can be determined to be a wheat spraying audio fragment.

If the duration of the wheat spraying audio sequence is smaller than the duration threshold, the fact that the current wheat spraying audio sequence contains fewer wheat spraying audio frames is indicated, misjudgment is possible, and the wheat spraying audio sequence is not judged to be the wheat spraying audio fragment.

Continuing the above example, after the continuous wheat spraying audio subinterval (0.00 s-1.92 s) is obtained, it is re-framed with a frame length of 0.96 s into 193 overlapping data frames (with zero-padding at the front of the data), and the 193 data frames are input into the model for detection. The output probability first exceeds 0.9 at the 80th frame and first falls below 0.85 at the 181st frame; the 80th frame is therefore the initial wheat spraying audio frame, the 181st frame is the end wheat spraying audio frame, and the 81st to 180th frames between them are the intermediate wheat spraying audio frames. The 80th to 181st frames form the wheat spraying audio sequence, whose duration is 1.02 s. Because the duration threshold determined according to the preset condition is 0.9 s, the duration of the wheat spraying audio sequence is greater than the duration threshold, and the wheat spraying audio sequence is determined to be a wheat spraying audio fragment for subsequent processing.
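The frame-by-frame localization in this example can be sketched as below. The two thresholds (0.9 and 0.85) and the 0.9 s duration threshold are taken from the example; the 10 ms frame shift is an assumption inferred from the stated figures (frames 80 to 181 are 102 frames and last 1.02 s). The function name is hypothetical, and the behaviour when no frame falls below the second threshold is not specified in the text, so the sketch assumes the sequence then runs to the last frame.

```python
def locate_fragment(frame_probs, first_thr=0.9, second_thr=0.85,
                    hop_sec=0.01, min_duration_sec=0.9):
    """Return (start_frame, end_frame, duration_sec) using 1-based frame
    numbers, or None if no sufficiently long sequence is found."""
    # Initial frame: first frame whose probability exceeds the first threshold.
    start = next((i for i, p in enumerate(frame_probs) if p > first_thr), None)
    if start is None:
        return None
    # End frame: first later frame whose probability is below the second
    # threshold. Assumption: without such a frame, use the last frame.
    end = next(
        (i for i in range(start + 1, len(frame_probs))
         if frame_probs[i] < second_thr),
        len(frame_probs) - 1,
    )
    duration = round((end - start + 1) * hop_sec, 2)
    if duration < min_duration_sec:
        return None  # too short; treated as a possible misjudgment
    return start + 1, end + 1, duration


# Synthetic probabilities shaped like the example: 193 frames in total.
probs = [0.1] * 79 + [0.95] * 101 + [0.2] * 13
print(locate_fragment(probs))
```

Using a higher threshold to open the sequence and a lower one to close it keeps the sequence from ending on a brief dip in probability, and the duration check discards short sequences.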

In summary, to locate the wheat spraying audio fragment accurately, the duration threshold is determined according to preset conditions; in addition, after the continuous wheat spraying audio subinterval is framed, the audio frames are analyzed frame by frame to obtain the wheat spraying audio fragment, which further improves the accuracy of locating the wheat spraying audio fragment and makes it easier to eliminate the wheat spraying phenomenon from the audio file to be detected.

In addition, after the wheat spraying audio fragment in the audio file to be detected has been determined, the audio file to be detected may be processed according to the wheat spraying audio fragment to generate and play a target audio file without the wheat spraying audio fragment. In this embodiment, the specific implementation is as follows:

Performing standardized processing on the audio file to be detected according to the wheat spraying audio fragment to obtain a target audio file, and playing the target audio file; wherein the standardized processing includes deleting the wheat spraying audio fragment or reducing the energy of the wheat spraying audio fragment.

For example, a teacher gives a Chinese pinyin lesson in a live class. Because pronouncing pinyin letters such as "t", "p", and "k" produces a strong airflow, wheat spraying occurs. To prevent the wheat spraying from adversely affecting the students attending the class, wheat spraying detection is performed on the audio file recorded by the teacher; after the wheat spraying audio fragment is detected, it is eliminated, and a target audio file without the wheat spraying audio fragment is generated and played, so that the students hear audio with a better playback effect.

In practical applications, the processing of the wheat spraying audio fragment in the audio file to be detected includes, but is not limited to, eliminating the wheat spraying audio fragment or reducing the energy of the wheat spraying audio fragment, which is not limited herein.
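The two processing options named above, deleting the fragment or reducing its energy, can be sketched as follows. The function names, the 16 kHz sampling rate, and the gain value of 0.1 are illustrative assumptions; the text does not specify how much the energy is reduced.

```python
import numpy as np


def remove_fragment(signal, start_sec, end_sec, sample_rate=16_000):
    """Delete the samples between start_sec and end_sec."""
    a = int(round(start_sec * sample_rate))
    b = int(round(end_sec * sample_rate))
    return np.concatenate([signal[:a], signal[b:]])


def attenuate_fragment(signal, start_sec, end_sec, gain=0.1,
                       sample_rate=16_000):
    """Scale the samples between start_sec and end_sec by `gain`
    (assumed value), leaving the rest of the signal unchanged."""
    a = int(round(start_sec * sample_rate))
    b = int(round(end_sec * sample_rate))
    out = signal.astype(float, copy=True)
    out[a:b] *= gain
    return out
```

Deleting shortens the audio and shifts everything after the fragment earlier in time, whereas attenuating preserves the timeline; which is preferable depends on the playback scene.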

The following describes an example of the application of the audio detection method provided in the present specification in an audio playing scene with reference to fig. 4. Fig. 4 shows a process flow chart of an audio detection method applied to an audio playing scene according to an embodiment of the present disclosure, which specifically includes the following steps:

Step S402, an audio file to be detected is acquired.

In this embodiment, the audio detection method is described by taking recorded broadcast audio as the audio file to be detected. Because the audio file to be detected contains wheat spraying audio frames, which make listening unpleasant for the user, the wheat spraying audio frames in the audio file to be detected are detected and eliminated so that an audio file that is comfortable to listen to can be played.

Step S404, the audio file to be detected is segmented to obtain a plurality of audio intervals.

Step S406, a 512-order short-time Fourier transform is performed on each audio interval based on a Hann window with a frame length of 25 ms and a frame shift of 10 ms to obtain the 257-dimensional amplitude spectrum corresponding to each audio interval.

Step S408, the low-frequency band of dimensions [2, 66) in the 257-dimensional amplitude spectrum corresponding to each audio interval is taken as the input feature and input into the VGGish model.

Specifically, in the recording and broadcasting scene, the timeliness requirement on the wheat spraying detection processing is not high, but the accuracy of the wheat spraying detection is required to a certain extent, so that the wheat spraying probability of each audio interval is predicted by adopting a VGGish model when the wheat spraying detection is carried out, and the accuracy of the wheat spraying detection processing is improved.

Step S410, the wheat spraying probability corresponding to each audio interval output by the VGGish model is obtained.

Step S412, selecting the audio interval with the wheat spraying probability greater than the preset wheat spraying probability threshold as the wheat spraying audio interval.

Step S414, a continuous wheat spraying audio interval is determined in the audio file to be detected based on the wheat spraying audio interval.

Step S416, framing the continuous wheat spraying audio interval to obtain a plurality of continuous audio frames.

Step S418, the features of the plurality of continuous audio frames are input into the model for detection, and the audio frames whose model output probabilities satisfy the probability threshold form a wheat spraying audio sequence.

Step S420, when the duration of the wheat spraying audio sequence is longer than the duration threshold, the wheat spraying audio sequence is determined to be a wheat spraying audio segment.

Step S422, the audio file to be detected is processed according to the wheat spraying audio fragment, and the processing result is played.

The audio detection method provided in this embodiment can be referred to the corresponding descriptions of the above embodiments, and will not be repeated here.

The following describes an example of the application of the audio detection method provided in the present specification in an audio playing scene with reference to fig. 5. Fig. 5 shows another processing flow chart of an audio detection method applied to an audio playing scene according to an embodiment of the present disclosure, which specifically includes the following steps:

Step S502, an audio file to be detected is obtained.

In this embodiment, the audio detection method is described by taking live audio as the audio file to be detected. Because the audio file to be detected contains wheat spraying audio frames, which make listening unpleasant for the user, the wheat spraying audio frames in the audio file to be detected are detected and eliminated so that an audio file that is comfortable to listen to can be played.

Step S504, the audio file to be detected is segmented to obtain a plurality of audio intervals.

Step S506, 512-order short-time Fourier transform is carried out on each audio interval based on a Hann window with a frame length of 32ms and a frame shift of 16ms, so as to obtain 257-dimensional amplitude spectrums corresponding to each audio interval.

Step S508, the low-frequency band of dimensions [2, 66) in the 257-dimensional amplitude spectrum corresponding to each audio interval is taken as the input feature and input into the ResNet model.

Specifically, in a live broadcast scene, the timeliness requirement on the wheat spraying detection processing is high, so that the wheat spraying probability of each audio interval is predicted by adopting a ResNet model in order to meet the purpose of timely processing and timely feedback, and the efficiency of the wheat spraying detection processing is improved.

Step S510, the wheat spraying probability corresponding to each audio interval output by the ResNet model is obtained.

Step S512, selecting the audio interval with the wheat spraying probability larger than the preset wheat spraying probability threshold as the wheat spraying audio interval.

Step S514, noise reduction processing is carried out on the wheat spraying audio interval, and the target audio file is obtained and played.

Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an audio detection device, and fig. 6 shows a schematic structural diagram of an audio detection device according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:

An acquisition module 602 configured to acquire an audio file to be detected;

The processing module 604 is configured to input the audio file to be detected into a wheat spraying detection model for processing, so as to obtain a wheat spraying audio interval and a wheat spraying probability of the wheat spraying audio interval in the audio file to be detected;

A determining module 606 is configured to determine a wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.

In an alternative embodiment, the wheat spray detection model is trained by:

In an alternative embodiment, the processing module 604 includes:

the transformation unit is configured to transform the audio file to be detected based on a processing window with a preset frame length and a frame shift with a preset time to obtain an amplitude spectrum corresponding to the audio file to be detected;

The processing unit is configured to extract low-frequency characteristics from the audio file to be detected according to the amplitude spectrum, input the low-frequency characteristics into the wheat spraying detection model for processing, and obtain the wheat spraying audio interval and the wheat spraying probability.

In an alternative embodiment, the determining module 606 includes:

the first determining unit is configured to determine a plurality of wheat spraying audio subintervals in the wheat spraying audio interval and determine the sub-wheat spraying probability corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals according to the wheat spraying probability;

The first comparison unit is configured to compare the sub-wheat spraying probability corresponding to the wheat spraying audio subinterval in the plurality of wheat spraying audio subintervals with a preset probability threshold;

The first composition unit is configured to select a wheat spraying audio subinterval which is larger than or equal to the preset probability threshold according to the comparison result to compose a wheat spraying audio fragment in the audio file to be detected.

In an alternative embodiment, the determining module 606 includes:

A second determining unit configured to determine a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determine a sub-wheat spraying probability corresponding to the wheat spraying audio subinterval in the plurality of wheat spraying audio subintervals according to the wheat spraying probability;

The second comparison unit is configured to compare the sub-wheat spraying probability corresponding to the wheat spraying audio subinterval in the plurality of wheat spraying audio subintervals with a preset probability threshold value, and determine a plurality of target wheat spraying audio subintervals according to the comparison result;

the extraction unit is configured to extract the plurality of target wheat spraying audio subintervals from the audio file to be detected to form continuous wheat spraying audio subintervals;

and the second composition unit is configured to determine the wheat spraying audio fragment in the audio file to be detected based on the continuous wheat spraying audio subinterval.

In an alternative embodiment, the extracting unit includes:

The audio position determining subunit is configured to determine audio positions of the target wheat spraying audio subintervals in the audio file to be detected and determine audio time lengths corresponding to the target wheat spraying audio subintervals;

A judging subunit configured to judge whether the plurality of target wheat spraying audio subintervals are continuous in a time dimension according to the audio position and the audio duration;

If yes, operating an integration subunit, wherein the integration subunit is configured to integrate the multiple target wheat spraying audio subintervals to obtain the continuous wheat spraying audio subintervals.

In an alternative embodiment, the second composition unit includes:

The framing processing subunit is configured to perform framing processing on the continuous wheat spraying audio subinterval to obtain a plurality of target audio frames;

The wheat spraying detection subunit is configured to respectively perform wheat spraying detection on the plurality of target audio frames, compare the detection result with a wheat spraying detection threshold value and form a wheat spraying audio sequence according to the comparison result;

and the composition subunit is configured to determine the wheat spraying audio fragment in the audio file to be detected according to the audio frames composing the wheat spraying audio sequence under the condition that the duration time of the wheat spraying audio sequence is longer than a duration threshold value.

In an alternative embodiment, the composition subunit includes:

a first determination submodule configured to determine a duration of the wheat-spraying audio sequence;

A second determining submodule configured to determine the duration threshold according to a preset condition;

The judging submodule is configured to judge whether the duration of the wheat spraying audio sequence is greater than or equal to the duration threshold;

if yes, determining the spray wheat audio fragment in the audio file to be detected according to the audio frames composing the spray wheat audio sequence.

In an alternative embodiment, determining the wheat spraying audio fragment in the audio file to be detected according to the audio frames composing the wheat spraying audio sequence includes:

Carrying out framing treatment on the wheat spraying audio sequence to obtain a plurality of audio frames ordered according to a time sequence; selecting an audio frame with the first output probability larger than a first threshold value from a plurality of audio frames sequenced according to time sequence as an initial wheat spraying audio frame, and determining an audio frame with the first output probability smaller than a second threshold value after the initial wheat spraying audio frame from the plurality of audio frames sequenced according to time sequence as a terminal wheat spraying audio frame; and determining an intermediate wheat spraying audio frame between the initial wheat spraying audio frame and the tail wheat spraying audio frame, and forming the wheat spraying audio fragment based on the initial wheat spraying audio frame, the tail wheat spraying audio frame and the intermediate wheat spraying audio frame.

In an alternative embodiment, the audio detection device further includes:

The playing module is configured to perform standardized processing on the audio file to be detected according to the wheat spraying audio clip, obtain a target audio file and play the target audio file;

In an alternative embodiment, the audio file to be detected includes at least one of:

According to the audio detection device provided by this embodiment, after the audio file to be detected is obtained, it is input into the wheat spraying detection model for processing to obtain the wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of that interval. To locate the wheat spraying audio fragment accurately, the audio file to be detected is then processed according to the wheat spraying audio interval and its corresponding wheat spraying probability to obtain the wheat spraying audio fragment. This achieves accurate positioning of the wheat spraying audio fragment, improves the efficiency of locating it, supports a wider range of wheat spraying processing scenes, and further improves the user's experience of listening to the audio file to be detected.

The foregoing is a schematic solution of an audio detection apparatus of the present embodiment. Note that the technical solution of the audio detection apparatus and the technical solution of the audio detection method belong to the same concept; for details of the audio detection apparatus that are not described here, refer to the description of the technical solution of the audio detection method.

Fig. 7 illustrates a block diagram of a computing device 700 provided in accordance with an embodiment of the present specification. The components of computing device 700 include, but are not limited to, memory 710 and processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.

Computing device 700 also includes access device 740, which enables computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.

In one embodiment of the present description, the above-described components of computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 7 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 700 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.

Wherein the processor 720 is configured to execute the following computer-executable instructions:

acquiring an audio file to be detected;

The foregoing is a schematic illustration of a computing device of this embodiment. Note that the technical solution of the computing device and the technical solution of the audio detection method belong to the same concept; for details of the computing device that are not described here, refer to the description of the technical solution of the audio detection method.

An embodiment of the present disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, are configured to:

acquiring an audio file to be detected;

The above is a schematic solution of a computer-readable storage medium of the present embodiment. Note that the technical solution of the storage medium and the technical solution of the audio detection method belong to the same concept; for details of the storage medium that are not described here, refer to the description of the technical solution of the audio detection method.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions; however, those skilled in the art should understand that the present specification is not limited by the described order of actions, because some steps may be performed in another order or simultaneously in accordance with the present specification. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the present specification.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely intended to help clarify the present specification. The alternative embodiments do not describe every detail exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, thereby enabling others skilled in the art to best understand and use the specification. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims

1. An audio detection method, comprising:

acquiring an audio file to be detected;

transforming the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset duration to obtain an amplitude spectrum corresponding to the audio file to be detected, extracting low-frequency features from the audio file to be detected according to the amplitude spectrum, and inputting the low-frequency features into a wheat spraying detection model for processing to obtain a wheat spraying audio interval in the audio file to be detected and a wheat spraying probability of the wheat spraying audio interval;

determining a wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability;

wherein determining the wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability comprises: determining a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determining, according to the wheat spraying probability, a sub-wheat spraying probability corresponding to each of the plurality of wheat spraying audio subintervals; comparing the sub-wheat spraying probability corresponding to each wheat spraying audio subinterval with a preset probability threshold, and determining a plurality of target wheat spraying audio subintervals according to the comparison result; determining audio positions of the plurality of target wheat spraying audio subintervals in the audio file to be detected, and determining audio durations corresponding to the plurality of target wheat spraying audio subintervals; judging, according to the audio positions and the audio durations, whether the plurality of target wheat spraying audio subintervals are continuous in the time dimension; if yes, integrating the plurality of target wheat spraying audio subintervals to obtain a continuous wheat spraying audio subinterval; and determining the wheat spraying audio fragment in the audio file to be detected based on the continuous wheat spraying audio subinterval.
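The subinterval steps recited in claim 1 (threshold comparison, then a continuity check based on position and duration, then integration) can be illustrated with the sketch below. It is a hedged example only: the tuple layout, the function name, the 0.5 threshold, the strict "greater than" comparison, and the small tolerance used to decide continuity are assumptions, not details taken from the claim.

```python
from typing import List, Tuple

# Hypothetical record layout: (start_seconds, duration_seconds, sub_probability)
Subinterval = Tuple[float, float, float]


def merge_target_subintervals(
    subintervals: List[Subinterval],
    prob_threshold: float = 0.5,  # assumed preset probability threshold
    tolerance: float = 1e-6,      # assumed slack when testing for continuity
) -> List[Tuple[float, float]]:
    """Return continuous (start_seconds, end_seconds) spans of target subintervals."""
    # Keep only subintervals whose probability passes the comparison.
    targets = sorted((s, d) for s, d, p in subintervals if p > prob_threshold)
    merged: List[List[float]] = []
    for start, duration in targets:
        end = start + duration
        if merged and start <= merged[-1][1] + tolerance:
            # Continuous in time with the previous span: integrate the two.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]
```

For example, three target subintervals that start at 0.00 s, 0.25 s and 0.75 s, each 0.25 s long, yield two spans: 0.00 to 0.50 s and 0.75 to 1.00 s.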

2. The audio detection method according to claim 1, wherein the wheat spray detection model is trained by:

3. The audio detection method according to claim 1, characterized in that the method further comprises:

4. The audio detection method according to claim 1, wherein the determining the wheat spraying audio fragment in the audio file to be detected based on the continuous wheat spraying audio subinterval comprises:

5. The audio detection method according to claim 4, wherein the determining the wheat spraying audio fragment in the audio file to be detected from the audio frames constituting the wheat spraying audio sequence in the case where the duration of the wheat spraying audio sequence is longer than a duration threshold comprises:

determining the duration of the wheat spraying audio sequence;

determining the duration threshold according to preset conditions;
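The duration check in claim 5 can be sketched as below, assuming the wheat spraying audio sequence is represented as a count of consecutive frames. The frame length of 25 ms, the frame shift of 10 ms, the 50 ms duration threshold, and both function names are illustrative assumptions. The claim only states that the threshold is determined according to preset conditions.

```python
def sequence_duration_s(
    num_frames: int,
    frame_length_s: float = 0.025,  # assumed frame length
    frame_shift_s: float = 0.010,   # assumed frame shift
) -> float:
    """Duration covered by num_frames overlapping frames, in seconds."""
    if num_frames <= 0:
        return 0.0
    # The first frame contributes its full length; each later frame adds one shift.
    return frame_length_s + (num_frames - 1) * frame_shift_s


def exceeds_duration_threshold(
    num_frames: int,
    duration_threshold_s: float = 0.05,  # assumed preset threshold
) -> bool:
    """True when the sequence is longer than the duration threshold."""
    return sequence_duration_s(num_frames) > duration_threshold_s
```

Under these assumed values, a 3-frame sequence (about 45 ms) falls below the threshold and a 10-frame sequence (about 115 ms) passes it.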

6. The audio detection method according to claim 5, wherein the determining the wheat spraying audio fragment in the audio file to be detected from the audio frames constituting the wheat spraying audio sequence comprises:

7. The audio detection method according to claim 1, wherein after the step of determining the wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability is performed, the method further comprises:

8. The audio detection method according to claim 1, wherein the audio file to be detected comprises at least one of:

9. An audio detection apparatus, comprising:

the acquisition module is configured to acquire an audio file to be detected;

the processing module is configured to transform the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset duration, obtain an amplitude spectrum corresponding to the audio file to be detected, extract low-frequency features from the audio file to be detected according to the amplitude spectrum, input the low-frequency features into a wheat spraying detection model for processing, and obtain a wheat spraying audio interval in the audio file to be detected and a wheat spraying probability of the wheat spraying audio interval;

the determining module is configured to determine a wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability, wherein determining the wheat spraying audio fragment in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability comprises: determining a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determining, according to the wheat spraying probability, a sub-wheat spraying probability corresponding to each of the plurality of wheat spraying audio subintervals; comparing the sub-wheat spraying probability corresponding to each wheat spraying audio subinterval with a preset probability threshold, and determining a plurality of target wheat spraying audio subintervals according to the comparison result; determining audio positions of the plurality of target wheat spraying audio subintervals in the audio file to be detected, and determining audio durations corresponding to the plurality of target wheat spraying audio subintervals; judging, according to the audio positions and the audio durations, whether the plurality of target wheat spraying audio subintervals are continuous in the time dimension; if yes, integrating the plurality of target wheat spraying audio subintervals to obtain a continuous wheat spraying audio subinterval; and determining the wheat spraying audio fragment in the audio file to be detected based on the continuous wheat spraying audio subinterval.
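The transform performed by the processing module (a windowed frame-by-frame transform giving an amplitude spectrum, followed by selection of low-frequency features) can be sketched as below. This is an illustration under stated assumptions: the Hann window, the 200 Hz cutoff, and both function names are introduced here and are not specified by the claim. A direct DFT is used only to keep the example dependency-free; a practical system would normally use an FFT routine.

```python
import cmath
import math
from typing import List


def amplitude_spectrum(
    samples: List[float], frame_length: int, frame_shift: int
) -> List[List[float]]:
    """Windowed short-time DFT magnitudes, one row of bins per frame."""
    # Hann window (an assumption; the claim only requires a processing window).
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length)
        for n in range(frame_length)
    ]
    spectrum: List[List[float]] = []
    for start in range(0, len(samples) - frame_length + 1, frame_shift):
        frame = [samples[start + n] * window[n] for n in range(frame_length)]
        row = []
        for k in range(frame_length // 2 + 1):  # non-negative frequency bins
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                for n in range(frame_length)
            )
            row.append(abs(acc))
        spectrum.append(row)
    return spectrum


def low_frequency_features(
    spectrum: List[List[float]],
    sample_rate: int,
    frame_length: int,
    cutoff_hz: float = 200.0,  # assumed cutoff; not taken from the claim
) -> List[List[float]]:
    """Keep only the bins at or below cutoff_hz for every frame."""
    max_bin = int(cutoff_hz * frame_length / sample_rate)
    return [row[: max_bin + 1] for row in spectrum]
```

Each returned row is the feature vector of one frame, so the rows can be passed to a detection model in time order.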

10. A computing device, comprising:

A memory and a processor;

The memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method of:

acquiring an audio file to be detected;

11. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the audio detection method of any one of claims 1 to 8.