CN112289340A - Audio detection method and device - Google Patents

Audio detection method and device

Info

Publication number
CN112289340A
CN112289340A (application CN202011211792.9A)
Authority
CN
China
Prior art keywords
audio
microphone
microphone pop
pop
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011211792.9A
Other languages
Chinese (zh)
Inventor
陈佳路
王晓红
刘鲁鹏
元海明
高强
夏龙
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011211792.9A priority Critical patent/CN112289340A/en
Publication of CN112289340A publication Critical patent/CN112289340A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: ... characterised by the type of extracted parameters
    • G10L25/27: ... characterised by the analysis technique
    • G10L25/48: ... specially adapted for particular use
    • G10L25/51: ... for comparison or discrimination
    • G10L25/60: ... for measuring the quality of voice signals

Abstract

The present specification provides an audio detection method and an apparatus. The audio detection method includes: acquiring an audio file to be detected; inputting the audio file to be detected into a microphone-pop detection model for processing to obtain a pop audio interval in the audio file to be detected and the pop probability of that interval; and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability. Detecting pop audio segments with this audio detection method ensures both detection efficiency and detection accuracy, and thus meets the audio processing requirements of practical application scenarios.

Description

Audio detection method and device
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio detection method and apparatus.
Background
With the development of Internet technology, noise processing of audio is applied in more and more scenarios, such as live streaming, calls, and voice broadcasting, where noise in the audio needs to be removed in real time or offline so that users can listen to high-quality audio. In the prior art, however, most noise-cancellation schemes distinguish a primary sound source from secondary sound sources in the audio and remove unwanted noise by keeping the primary source and discarding the secondary ones. Microphone pop is noise produced by the primary sound source itself during speech, so this approach cannot remove the influence of pop noise, and the optimized audio still contains sounds that are unpleasant to listen to. An effective scheme is therefore needed to solve this problem.
Disclosure of Invention
In view of this, the embodiments of the present specification provide an audio detection method. The present specification also relates to an audio detection apparatus, a computing device, and a computer-readable storage medium to solve the technical problems of the prior art.
According to a first aspect of embodiments herein, there is provided an audio detection method, comprising:
acquiring an audio file to be detected;
inputting the audio file to be detected into a pop detection model for processing to obtain a pop audio interval in the audio file to be detected and the pop probability of the pop audio interval;
and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.
Optionally, the pop detection model is trained in the following manner:
acquiring a sample audio file, and formatting the sample audio file to obtain a standard audio file;
dividing the standard audio file into a plurality of audio intervals, and adding pop labels or normal-speech labels to the audio intervals according to their attribute information;
forming sample audio interval pairs based on the audio intervals and the labelled audio intervals, and training an initial pop detection model with the sample audio interval pairs to obtain the pop detection model.
Optionally, the inputting the audio file to be detected into a pop detection model for processing to obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval includes:
transforming the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset duration to obtain a magnitude spectrum corresponding to the audio file to be detected;
extracting low-frequency features from the audio file to be detected according to the magnitude spectrum, and inputting the low-frequency features into the pop detection model for processing to obtain the pop audio interval and the pop probability.
Optionally, determining the pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability includes:
determining a plurality of pop audio subintervals in the pop audio interval, and determining the sub-interval pop probability of each pop audio subinterval according to the pop probability;
comparing the sub-interval pop probability of each pop audio subinterval with a preset probability threshold;
and selecting, according to the comparison result, the pop audio subintervals whose probability is greater than or equal to the preset probability threshold to form the pop audio segment in the audio file to be detected.
Optionally, determining the pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability includes:
determining a plurality of pop audio subintervals in the pop audio interval, and determining the sub-interval pop probability of each pop audio subinterval according to the pop probability;
comparing the sub-interval pop probability of each pop audio subinterval with a preset probability threshold, and determining a plurality of target pop audio subintervals according to the comparison result;
extracting the target pop audio subintervals from the audio file to be detected to form a continuous pop audio subinterval;
and determining the pop audio segment in the audio file to be detected based on the continuous pop audio subinterval.
Optionally, the extracting the target pop audio subintervals from the audio file to be detected to form a continuous pop audio subinterval includes:
determining the audio positions of the target pop audio subintervals in the audio file to be detected, and determining the audio durations corresponding to the target pop audio subintervals;
judging whether the target pop audio subintervals are continuous in the time dimension according to the audio positions and the audio durations;
and if so, integrating the target pop audio subintervals to obtain the continuous pop audio subinterval.
Optionally, the determining, based on the continuous pop audio subinterval, the pop audio segment in the audio file to be detected includes:
framing the continuous pop audio subinterval to obtain a plurality of target audio frames;
performing pop detection on each target audio frame, comparing the detection results with a pop detection threshold, and forming a pop audio sequence according to the comparison results;
and, when the duration of the pop audio sequence is greater than a duration threshold, determining the pop audio segment in the audio file to be detected according to the audio frames forming the pop audio sequence.
Optionally, when the duration of the pop audio sequence is greater than a duration threshold, determining the pop audio segment in the audio file to be detected according to the audio frames forming the pop audio sequence includes:
determining the duration of the pop audio sequence;
determining the duration threshold according to a preset condition;
judging whether the duration of the pop audio sequence is greater than or equal to the duration threshold;
and if so, executing the step of determining the pop audio segment in the audio file to be detected according to the audio frames forming the pop audio sequence.
Optionally, the determining, according to the audio frames forming the pop audio sequence, the pop audio segment in the audio file to be detected includes:
framing the pop audio sequence to obtain a plurality of audio frames ordered in time;
selecting, from the time-ordered audio frames, the audio frame whose output probability first exceeds a first threshold as the initial pop audio frame, and the audio frame whose output probability first falls below a second threshold as the terminal pop audio frame;
determining the intermediate pop audio frames between the initial pop audio frame and the terminal pop audio frame, and forming the pop audio segment based on the initial pop audio frame, the terminal pop audio frame, and the intermediate pop audio frames.
Optionally, after the step of determining the pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability is executed, the method further includes:
normalizing the audio file to be detected according to the pop audio segment to obtain a target audio file, and playing the target audio file;
wherein the normalization includes deleting the pop audio segment or reducing the energy of the pop audio segment.
Optionally, the audio file to be detected includes at least one of the following items:
an audio file uploaded by a client, an audio file to be played, or an audio file to be processed.
According to a second aspect of embodiments herein, there is provided an audio detection apparatus comprising:
the acquisition module is configured to acquire an audio file to be detected;
the processing module is configured to input the audio file to be detected into a pop detection model for processing, and obtain a pop audio interval in the audio file to be detected and a pop probability of the pop audio interval;
and the determining module is configured to determine the pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring an audio file to be detected;
inputting the audio file to be detected into a pop detection model for processing to obtain a pop audio interval in the audio file to be detected and the pop probability of the pop audio interval;
and determining a pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the audio detection method.
According to the audio detection method provided in this specification, after the audio file to be detected is acquired, it is input into a pop detection model for processing, yielding a pop audio interval in the audio file and the pop probability of that interval. To locate the pop audio segment accurately, the audio file to be detected is then processed according to the pop audio interval and the corresponding pop probability, so that the pop audio segment in the audio file is obtained. In this way the pop audio segment can be located accurately and efficiently, pop processing can be applied in a wider range of scenarios, and the experience of a user listening to the audio file is improved.
Drawings
Fig. 1 is a flowchart of an audio detection method provided in an embodiment of the present specification;
fig. 2 is a schematic structural diagram of a first pop detection model provided in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a second pop detection model provided in an embodiment of the present disclosure;
fig. 4 is a flowchart illustrating an audio detection method applied in an audio playing scene according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating another audio detection method applied in an audio playing scene according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an audio detection apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. The present description may, however, be embodied in many forms other than those set forth herein; those skilled in the art can make similar extensions without departing from its meaning, and the description is therefore not limited to the specific embodiments disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments to describe various information, this information should not be limited by these terms; the terms are only used to distinguish one type of information from another. For example, without departing from the scope of one or more embodiments of this specification, a first may also be referred to as a second and, similarly, a second may be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terms involved in one or more embodiments of the present specification are explained.
Microphone pop (referred to below as "pop"): the phenomenon in which airflow produced by the nose and mouth during breathing or speech acts directly on the microphone and is recorded as a burst of loud airflow noise.
In the present specification, an audio detection method is provided, and the present specification relates to an audio detection apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
In practical applications, microphone pop is a common problem when a user records or speaks with a microphone. Common ways to eliminate it are manual control and post-hoc review. Manual control means that, while using the microphone, the user controls the distance and angle between the mouth and the microphone and attaches accessories such as a sponge windscreen, reducing the direct impact of airflow on the microphone and thus reducing pop. Post-hoc review means that technicians find the pop segments in the recorded audio and remove their influence by processing the affected audio frames. Both approaches, however, suffer from poor timeliness, low efficiency, and low precision, which makes removing pop noise from audio files particularly important.
Fig. 1 shows a flowchart of an audio detection method provided in an embodiment of the present specification, which specifically includes the following steps:
and S102, acquiring the audio file to be detected.
In order to improve the positioning accuracy of the audio segment of the wheat spraying and improve the positioning processing efficiency, after the audio file to be detected is obtained, inputting the audio file to be detected into a wheat spraying detection model for processing, obtaining a wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval, processing the audio file to be detected according to the wheat spraying audio interval and the corresponding wheat spraying probability in order to accurately position a wheat spraying audio segment, therefore, the wheat spraying audio clip in the audio file to be detected is obtained, the wheat spraying audio clip can be accurately positioned, the positioning processing efficiency of the wheat spraying audio clip is improved, a wheat spraying processing scene with wider application is realized, and the experience effect of listening to the audio file to be detected by a user is improved.
When the method is specifically implemented, the audio file to be detected refers to the audio frequency required to be subjected to wheat spraying audio clip detection, the audio file to be detected can be an audio file uploaded by a client, for example, in a recorded broadcast scene, a teacher records teaching contents in a recorded broadcast mode to enable students to watch the teaching contents so as to achieve the purpose of teaching, after the teacher finishes a course by recording, the course needs to be sent to each student to watch or listen to the course, at the moment, in order to avoid the wheat spraying phenomenon in the recorded broadcast contents from influencing the student listening, the audio file uploaded by the teacher through the client needs to be checked and processed, the wheat spraying phenomenon is eliminated, and the student listening effect is improved.
Or, the audio file to be detected may also be an audio file to be played, for example, when a user listens to a voice signal (audio file to be played) in an instant messaging scene, in order to avoid the problem that the played voice is unclear due to a microphone ejection phenomenon, microphone ejection detection processing may also be performed on the voice signal, so that a microphone ejection audio clip in the voice information is eliminated, the user can listen to clearly voice information, and the experience effect of the user is improved.
Or, the audio file to be detected may also be an audio file to be processed, for example, in a model training scenario, a translation model or a semantic recognition model needs to be trained using a large number of audio files, and at this time, if a wheat-spraying audio clip exists in the audio file, the trained model has a poor effect, and in order to improve the model prediction effect, wheat-spraying detection may be performed on the audio file (the audio file to be processed) in a preprocessing stage, and the detected wheat-spraying audio clip is removed, so that the model training effect is improved.
In this embodiment, the audio detection method will be described by taking the audio file to be detected as the audio file recorded by the client as an example, and the corresponding description contents of this embodiment can be referred to for other processes of positioning the microphone audio clip in the audio file, which is not described herein in any detail.
Step S104, inputting the audio file to be detected into a pop detection model for processing, and obtaining a pop audio interval in the audio file to be detected and the pop probability of the pop audio interval.
Specifically, on the basis of the acquired audio file to be detected, the file is further processed to obtain a pop audio interval in it and the pop probability of that interval. The pop audio interval refers to an audio segment within the audio file to be detected and contains at least one pop audio subinterval; the pop probability refers to the probability that pop noise is or is not present in the audio segment, so that the pop audio segment in the file can subsequently be located accurately.
On this basis, the pop audio interval contains at least one audio frame. Correspondingly, the pop detection model is a classifier for pop detection: after processing the input audio, it outputs for each contained audio segment the probability that pop noise is present or absent, which makes it convenient to determine the pop audio segments in the audio file to be detected. The pop detection model includes but is not limited to a VGGish model or a ResNet (deep residual network) model; the VGGish structure is shown in fig. 2 and the ResNet structure in fig. 3.
It should be noted that the VGGish model is large, detects pops more slowly, and is highly accurate, so it can be used in scenarios with low real-time requirements, such as pop detection of recorded lectures or of audio files used for model training. When features are fed to this model they can be framed in units of a set duration, the resulting audio frames may overlap to varying degrees as required, and each audio frame contains a set number of STFT frequency bins; that is, a matrix of dimensions [batch_size, s, p] is used as the model input, where batch_size is the number of frames to be detected in the audio file (related to its total duration), s represents the duration of an audio frame, and p represents the magnitude-spectrum values of the selected feature band. The output is a matrix of dimensions [batch_size, 2], corresponding respectively to the predicted probability that each audio subinterval of the pop audio interval contains pop noise and the predicted probability that it does not.
The ResNet model is small, detects pops quickly, and is slightly less accurate than the VGGish model, so it can be used in scenarios with high real-time requirements, such as pop detection during live streaming or in voice calls. When features are fed to this model they can be framed in units of a set duration with no overlap between audio frames, and each audio frame contains a set number of STFT frequency bins. The model outputs a matrix of dimensions [batch_size, L/s, 1], where batch_size corresponds to the number of files, L/s to the number of frames of duration s contained in a file, and 1 to the probability that the current audio subinterval contains pop noise.
In addition, since both the trained VGGish model and the trained ResNet model can independently perform pop detection, and other models not mentioned here can also perform pop detection with appropriate configuration and training, the model may be chosen according to the actual application scenario; this embodiment is not limited in this respect.
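The [batch_size, s, p] input and the [batch_size, 2] and [batch_size, L/s, 1] outputs described above can be summarised as tensor shapes. The following sketch is illustrative only and is not part of the patent disclosure; the variable names and example values are assumptions.

    import numpy as np

    # Illustrative shapes only; batch_size, s (frame duration in feature steps) and p
    # (number of selected low-frequency bins) are assumed example values.
    batch_size, s, p = 8, 96, 64

    # VGGish-style classifier: one [s, p] feature matrix per frame to detect,
    # output is [P(pop), P(no pop)] for each of the batch_size frames.
    vggish_input = np.zeros((batch_size, s, p), dtype=np.float32)
    vggish_output = np.zeros((batch_size, 2), dtype=np.float32)

    # ResNet-style detector: one file per batch element, L/s non-overlapping
    # sub-intervals of duration s per file, one pop probability per sub-interval.
    frames_per_file = 120  # i.e. L/s for a hypothetical file of total length L
    resnet_output = np.zeros((batch_size, frames_per_file, 1), dtype=np.float32)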
Further, in order to improve the prediction ability of the pop detection model, the audio frames used to train the model need to be preprocessed to avoid interference from other factors. In this embodiment the pop detection model is trained in the following manner:
acquiring a sample audio file, and formatting the sample audio file to obtain a standard audio file;
dividing the standard audio file into a plurality of audio intervals, and adding pop labels or normal-speech labels to the audio intervals according to their attribute information;
forming sample audio interval pairs based on the audio intervals and the labelled audio intervals, and training an initial pop detection model with the sample audio interval pairs to obtain the pop detection model.
Specifically, the sample audio file is an audio file used to train the model and contains both pop audio frames and normal, pop-free audio frames. The standard audio file is the file obtained after formatting the sample audio file so that it can be used for training; formatting here means storing the sample audio as single-channel PCM audio data sampled at 16 kHz with 16-bit quantization, so that it conforms to the audio format used for training.
On this basis, since the pop detection model predicts the probability that pop noise exists in an audio file, and in order to improve the accuracy of this prediction, one audio file can be divided into a plurality of audio intervals, and the model predicts each interval separately, achieving interval-level prediction and improving the model's ability to predict pop noise.
Further, after the target audio file is divided into a plurality of audio intervals, a label is added to each interval according to its attribute information, where the attribute information indicates whether the interval contains a pop audio frame: if it does, a pop label is added; if not, a normal-speech label is added. The labelled intervals are then used as the model output and the unlabelled intervals as the model input, forming sample audio interval pairs with which the initial pop detection model is trained to obtain a pop detection model that meets the requirements.
It should be noted that, since model training requires repeated iterations to obtain a model with good prediction ability, an iteration stop condition can be set according to actual requirements during the training of the initial pop detection model; the condition may be based on a loss function or on the prediction accuracy of the model. After each stage of training, the current model is evaluated with test sample audio: if the iteration stop condition is met, training stops and the resulting model is used as the pop detection model; if not, training continues until a model meeting the stop condition is obtained.
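The data preparation described above (16 kHz single-channel PCM, splitting into fixed intervals, pop or normal-speech labels) might look like the following sketch. It is a minimal illustration, not the patented training code; soundfile and librosa are assumed third-party dependencies, and pop_ranges stands in for human annotation.

    import soundfile as sf   # assumed dependency for reading audio files
    import librosa           # assumed dependency for resampling

    def prepare_sample(path, interval_s=0.96, pop_ranges=()):
        """Format one sample file and split it into labelled training intervals.

        pop_ranges: (start_s, end_s) tuples marking annotated pop noise;
        16-bit quantization applies when the standard file is written to disk."""
        audio, sr = sf.read(path)
        if audio.ndim > 1:                      # down-mix to mono
            audio = audio.mean(axis=1)
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # 16 kHz
        hop = int(interval_s * 16000)
        samples = []
        for start in range(0, len(audio) - hop + 1, hop):
            interval = audio[start:start + hop]
            t0, t1 = start / 16000, (start + hop) / 16000
            has_pop = any(t0 < e and t1 > b for b, e in pop_ranges)
            label = 1 if has_pop else 0         # 1 = pop label, 0 = normal speech
            samples.append((interval, label))   # (input, target) training pair
        return samples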
Furthermore, because the audio file to be detected may be too long to meet the input requirements of the pop detection model, it can be preprocessed before being fed to the model so as to obtain a suitable model input. In this embodiment the specific implementation is as follows:
transforming the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset duration to obtain a magnitude spectrum corresponding to the audio file to be detected;
extracting low-frequency features from the audio file to be detected according to the magnitude spectrum, and inputting the low-frequency features into the pop detection model for processing to obtain the pop audio interval and the pop probability.
Specifically, owing to the structural requirements of the pop detection model, the low-frequency part of the short-time Fourier transform of the audio file to be detected can be fed to the model as features. On this basis, the audio file is first transformed with a processing window of preset frame length and a frame shift of preset duration to obtain the corresponding magnitude spectrum; the corresponding low-frequency features are then extracted from the audio file according to the magnitude spectrum and used as the model input; finally the pop audio interval output by the model and its pop probability are obtained, so that the pop audio segment can subsequently be located on the basis of that interval. The low-frequency features are expressed in matrix form.
It should be noted that, because the audio file to be detected may be rather long, when applying the short-time Fourier transform it can first be split into a plurality of audio intervals, each of which is then transformed with the processing window of preset frame length and the frame shift of preset duration to obtain its magnitude spectrum, making it easier to extract the low-frequency features of each magnitude spectrum for the model.
For example, the audio file to be detected is a lecture recorded by a teacher, and pop noise must be removed from it so that students can listen to it comfortably. A 512-point short-time Fourier transform is applied to the audio with a Hann window of 25 ms frame length and 10 ms frame shift, giving a 257-dimensional magnitude spectrum; a low-frequency band such as bins [2, 66) of that spectrum is then taken as the model input feature. The low-frequency features of each audio frame are input into the pop detection model to predict the pop audio interval, yielding a pop audio interval with input format [batch_size, 96, 64] and its pop probability; the pop audio intervals in the audio file are thus predicted by the pop detection model for the subsequent determination of the pop audio segments.
In summary, to improve the prediction ability of the model and the detection accuracy of pop audio intervals, the audio file fed to the model can be preprocessed with a short-time Fourier transform, so that the model predicts pop audio intervals from spectral features, which further improves the accuracy of subsequently locating the pop audio segments.
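A minimal sketch of the feature extraction in the example above (25 ms Hann window, 10 ms shift, 512-point STFT, bins [2, 66) of the 257-dimensional magnitude spectrum). librosa is an assumed dependency; this is an illustration, not the implementation claimed in the patent.

    import numpy as np
    import librosa  # assumed dependency

    def low_band_features(audio, sr=16000):
        """257-bin magnitude spectrum via a 512-point Hann-window STFT
        (25 ms frames, 10 ms shift), keeping bins [2, 66) as features."""
        stft = librosa.stft(audio,
                            n_fft=512,
                            win_length=int(0.025 * sr),  # 25 ms frame length
                            hop_length=int(0.010 * sr),  # 10 ms frame shift
                            window="hann")
        mag = np.abs(stft)       # shape (257, T): n_fft/2 + 1 bins
        return mag[2:66, :].T    # low-frequency band [2, 66), shape (T, 64)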
Step S106, determining the pop audio segment in the audio file to be detected according to the pop audio interval and the pop probability.
Specifically, after the pop audio interval output by the pop detection model and its pop probability are obtained, the pop audio segment in the audio file to be detected is analysed on this basis, so that the file can later be processed according to the pop audio segment to obtain audio that meets the playback requirements. The pop audio segment refers to the audio segment corresponding to the pop audio frames in the audio file to be detected.
In addition, different pop detection models can be chosen for different application scenarios; accordingly, when subsequently locating the pop audio segment in the audio file to be detected, different ways of determining the segment are adopted to match the pop detection model used.
On this basis, on the one hand, when the pop detection model is the ResNet model, the current pop detection scenario has a high real-time requirement, so locating the pop audio segment must be completed as quickly as possible, that is, some detection accuracy is traded for detection efficiency. In this embodiment the specific implementation is as follows:
determining a plurality of pop audio subintervals in the pop audio interval, and determining the sub-interval pop probability of each pop audio subinterval according to the pop probability;
comparing the sub-interval pop probability of each pop audio subinterval with a preset probability threshold;
and selecting, according to the comparison result, the pop audio subintervals whose probability is greater than or equal to the preset probability threshold to form the pop audio segment in the audio file to be detected.
Specifically, while processing the audio file to be detected, the ResNet model may split it into a plurality of audio subintervals according to its processing requirements and then predict a pop probability for each subinterval. Therefore, after the pop audio interval output by the model is obtained, a plurality of pop audio subintervals can be determined from it, and the pop audio segment is determined by examining each subinterval.
On this basis, the pop audio interval is first analysed to obtain the pop audio subintervals, and the sub-interval pop probability of each subinterval is determined from the pop probability of the interval output by the pop detection model (the ResNet model), i.e. the set of pop probabilities of the subintervals. Each sub-interval pop probability is then compared with a preset probability threshold, and the pop audio segments in the audio file are derived from the comparison result: the subintervals whose probability is greater than or equal to the threshold are selected as the intervals most likely to contain pop noise, and the pop audio segment is formed from these subintervals.
For example, after a piece of audio is received in a chat application, in order to provide the user with better-quality audio, the audio file is split into a number of 0.96 s audio intervals and a short-time Fourier transform is applied to each to obtain a 257-dimensional matrix suitable for the model. The matrix of each interval is input into the ResNet model, yielding a pop probability for each interval: for instance 55% for the first interval, 80% for the second, ..., and 15% for the nth. Each probability is compared with a preset threshold of 78%; only the second interval exceeds the threshold, so it is most likely to contain pop audio frames and is determined to be a pop audio segment of the file, allowing the audio quality to be improved accordingly.
In summary, in scenarios with high real-time requirements, to meet the need to quickly determine the pop audio segments in the audio file to be detected, the segments are located by comparing pop probabilities, so that the location is completed in a short time and the efficiency of audio detection is effectively improved.
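The fast, threshold-only selection described for the ResNet path can be sketched as follows; the 0.96 s interval length and the 0.78 threshold are the example values used above, not fixed parameters of the method.

    def select_pop_subintervals(probs, interval_s=0.96, threshold=0.78):
        """Keep every sub-interval whose pop probability reaches the preset
        threshold and return its (start_s, end_s) position in the file."""
        segments = []
        for i, p in enumerate(probs):
            if p >= threshold:
                segments.append((i * interval_s, (i + 1) * interval_s))
        return segments

    # e.g. select_pop_subintervals([0.55, 0.80, 0.15]) -> [(0.96, 1.92)]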
On the other hand, when the pop detection model is the VGGish model, the current pop detection scenario has a low real-time requirement (i.e. a high accuracy requirement), so the pop audio segment must be located precisely, that is, detection efficiency is traded for detection accuracy. In this embodiment the specific implementation is as follows:
(1) determining a plurality of pop audio subintervals in the pop audio interval, and determining the sub-interval pop probability of each pop audio subinterval according to the pop probability.
(2) comparing the sub-interval pop probability of each pop audio subinterval with a preset probability threshold, and determining a plurality of target pop audio subintervals according to the comparison result.
Specifically, while processing the audio file to be detected, the VGGish model may split it into a plurality of audio subintervals according to its processing requirements and then predict a pop probability for each subinterval. Therefore, after the pop audio interval output by the model is obtained, a plurality of pop audio subintervals can be determined from it, and the pop audio segment is determined by examining each subinterval.
On this basis, the pop audio interval is first analysed to obtain the pop audio subintervals, and the sub-interval pop probability of each subinterval is determined from the pop probability of the interval output by the pop detection model (the VGGish model), i.e. the set of pop probabilities of the subintervals. Each sub-interval pop probability is then compared with a preset probability threshold, and a number of target pop audio subintervals are determined from the comparison result: the subintervals whose sub-interval pop probability is greater than or equal to the threshold are selected as target pop audio subintervals, i.e. the audio intervals in which pop audio frames are present.
For example, after a piece of audio is received in audio-visual software, in order to provide the user with better-quality audio, the audio file is split into a number of 0.96 s audio intervals and a short-time Fourier transform is applied to each to obtain a 257-dimensional matrix suitable for the model. The matrix of each interval is input into the VGGish model, yielding a pop probability for each interval: for instance 85% for the first interval, 80% for the second, ..., and 15% for the nth. Each probability is compared with a preset threshold of 78%; the first and second intervals exceed the threshold, so they are most likely to contain pop audio frames and are determined to be target pop audio subintervals, to be used later for precisely locating the pop audio segments in the file.
(3) extracting the target pop audio subintervals from the audio file to be detected to form a continuous pop audio subinterval.
Specifically, having obtained the target pop audio subintervals, and in order to locate the pop audio segment in the audio file precisely, a frame-by-frame analysis is used, which improves the accuracy of the subsequently determined pop audio segment and meets the processing requirements for the audio file to be detected. On this basis, after the target pop audio subintervals are obtained, a continuous pop audio subinterval is formed according to their attributes to help locate the pop audio segment. The continuous pop audio subinterval is the audio interval formed by several target pop audio subintervals that are consecutive in time within the audio file to be detected.
Further, when extracting the continuous pop audio subinterval from the audio file to be detected, adjacent target pop audio subintervals can be selected to form it in order to improve the accuracy of the subsequently located pop audio segments. In this embodiment the specific implementation is as follows:
determining the audio positions of the target pop audio subintervals in the audio file to be detected, and determining the audio durations corresponding to the target pop audio subintervals;
judging whether the target pop audio subintervals are continuous in the time dimension according to the audio positions and the audio durations;
and if so, integrating the target pop audio subintervals to obtain the continuous pop audio subinterval.
Specifically, the audio position of each target pop audio subinterval in the audio file to be detected and its audio duration are first determined; next it is determined whether the target pop audio subintervals are consecutive in the time dimension; finally the temporally consecutive target pop audio subintervals are combined to form the continuous pop audio subinterval.
Following the example above, since the first and second audio intervals were determined to be target pop audio subintervals and are adjacent in time (0.00s-0.96s and 0.96s-1.92s), they are combined into a continuous pop audio subinterval (0.00s-1.92s), which is used later to locate the pop audio segments in the file.
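A sketch of the continuity check and merge described above, assuming the target pop audio subintervals are given as (start_s, end_s) pairs:

    def merge_adjacent(subintervals, tol=1e-6):
        """Merge target pop sub-intervals that touch or overlap in time into
        continuous pop sub-intervals. subintervals: list of (start_s, end_s)."""
        merged = []
        for start, end in sorted(subintervals):
            if merged and start <= merged[-1][1] + tol:   # contiguous with previous
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    # merge_adjacent([(0.00, 0.96), (0.96, 1.92)]) -> [(0.0, 1.92)]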
(4) determining the pop audio segment in the audio file to be detected based on the continuous pop audio subinterval.
Specifically, once the continuous pop audio subinterval corresponding to the target pop audio subintervals has been extracted, the pop audio segment in the audio file to be detected can be determined by re-examining the continuous pop audio subinterval.
Further, the specific implementation for determining the pop audio segment based on the continuous pop audio subinterval is as follows: framing the continuous pop audio subinterval to obtain a plurality of target audio frames;
performing pop detection on each target audio frame, comparing the detection results with a pop detection threshold, and forming a pop audio sequence according to the comparison results;
determining the duration of the pop audio sequence; determining the duration threshold according to a preset condition; judging whether the duration of the pop audio sequence is greater than or equal to the duration threshold;
if so, framing the pop audio sequence to obtain a plurality of audio frames ordered in time; selecting, from the time-ordered audio frames, the audio frame whose output probability first exceeds a first threshold as the initial pop audio frame, and the audio frame whose output probability first falls below a second threshold as the terminal pop audio frame; determining the intermediate pop audio frames between the initial pop audio frame and the terminal pop audio frame, and forming the pop audio segment from the initial pop audio frame, the terminal pop audio frame, and the intermediate pop audio frames; that is, the pop audio sequence formed by the initial, intermediate, and terminal pop audio frames is the pop audio segment;
if not, the sequence is not a pop audio segment.
Specifically, the continuous pop audio subinterval is first re-framed with a fixed frame length and a smaller frame shift to obtain a plurality of audio frames; the features of these frames are input into the VGGish pop detection model, which outputs a probability for each frame. From the time-ordered frames, the frame whose output probability first exceeds a first threshold is determined to be the initial pop audio frame, and the subsequent frame whose output probability first falls below a second threshold is determined to be the terminal pop audio frame; the intermediate pop audio frames between them are determined, and a pop audio sequence is formed from the initial, terminal, and intermediate pop audio frames.
On this basis, after the duration threshold has been determined according to a preset condition, the duration of the pop audio sequence is compared with it. If the duration is greater than or equal to the threshold, the current pop audio sequence contains many, or only, pop audio frames, and the sequence can be determined to be a pop audio segment.
If the duration of the pop audio sequence is less than the duration threshold, the sequence contains few pop audio frames and may be a false detection, so it is not determined to be a pop audio segment.
Following the example above, after the continuous pop audio subinterval (0.00s-1.92s) is obtained, the data is split into 193 frames with a frame length of 0.96 s and a frame shift of 0.01 s (the front of the data is zero-padded), and the frames are input into the model for detection. The output probability exceeds 0.9 from the 80th frame onwards and drops below 0.85 at the 181st frame; the 80th frame is called the initial pop audio frame, the 181st the terminal pop audio frame, and the 81st to 180th the intermediate pop audio frames. Frames 80 to 181 form the pop audio sequence, with a duration of 1.02 s. Since the duration threshold determined by the preset condition is 0.9 s and the duration of the pop audio sequence exceeds it, the sequence is determined to be a pop audio segment for subsequent processing.
In summary, to locate the pop audio segment accurately, the duration threshold is determined according to a preset condition; in addition, after the continuous pop audio subinterval is framed, the frames are analysed one by one to obtain the pop audio segment, which further improves the accuracy of locating the segment and makes it easier to remove the pop noise from the audio file to be detected.
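The frame-by-frame refinement in the worked example (0.96 s frames, 0.01 s shift, start threshold 0.9, end threshold 0.85, duration threshold 0.9 s) can be sketched as below. The per-frame probabilities are assumed to come from the VGGish pop detection model; the function itself is illustrative, not the claimed method.

    def refine_pop_segment(frame_probs, frame_shift=0.01,
                           start_thresh=0.9, end_thresh=0.85, min_duration=0.9):
        """Locate the initial and terminal pop frames inside a continuous pop
        sub-interval from per-frame pop probabilities, and reject sequences
        shorter than the duration threshold as likely false detections."""
        start = next((i for i, p in enumerate(frame_probs) if p > start_thresh), None)
        if start is None:
            return None                                # no pop frame found
        end = next((i for i in range(start + 1, len(frame_probs))
                    if frame_probs[i] < end_thresh), len(frame_probs) - 1)
        duration = (end - start + 1) * frame_shift     # e.g. 102 frames -> 1.02 s
        if duration < min_duration:
            return None                                # too short: discard
        return start, end                              # initial and terminal pop frames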
In addition, after the determination of the microphone ejection audio clip in the audio file to be detected is completed, the audio file to be detected may be processed according to the microphone ejection audio clip, so as to generate a target audio file without the microphone ejection audio clip for playing, in this embodiment, a specific implementation manner is as follows:
standardizing the audio file to be detected according to the wheat spraying audio clip to obtain a target audio file, and playing the target audio file; wherein the normalization process comprises deleting the puff audio piece or reducing the energy of the puff audio piece.
For example, a teacher performs teaching of a speech and pinyin course on a live broadcast course, because pinyin "t", "p", "k" and the like can generate a phenomenon of large airflow, thereby causing a wheat spraying situation to occur, in order to avoid the wheat spraying phenomenon from bringing bad influence to students listening to the course, at the moment, the wheat spraying detection is performed on an audio file recorded by the teacher, and after a wheat spraying audio clip is detected, the wheat spraying audio clip is eliminated, thereby generating a target audio file without the wheat spraying audio clip and playing the target audio file, and enabling the students to hear an audio with a good playing effect.
In practical applications, the processing of the wheat spraying audio clip in the audio file to be detected includes, but is not limited to, eliminating the wheat spraying audio clip, reducing the energy of the wheat spraying audio clip, and the like, which is not limited in this embodiment.
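As a minimal sketch of the standardization processing described above, the snippet below either deletes the detected wheat spraying audio clips from the waveform or reduces their energy by a fixed amount before writing out the target audio file. The soundfile-based I/O, the 10 dB attenuation and the (start, end) clip representation in seconds are illustrative assumptions.

```python
import numpy as np
import soundfile as sf  # assumption: any audio I/O library would serve the same purpose

def standardize(in_path, out_path, clips, mode="delete", atten_db=10.0):
    """Delete or attenuate the wheat spraying audio clips given as (start_s, end_s) pairs."""
    audio, sr = sf.read(in_path)

    if mode == "delete":
        keep = np.ones(len(audio), dtype=bool)
        for start_s, end_s in clips:
            keep[int(start_s * sr):int(end_s * sr)] = False
        target = audio[keep]                      # splice out the wheat spraying audio clips
    else:
        target = audio.copy()
        gain = 10.0 ** (-atten_db / 20.0)         # reduce the energy of each clip
        for start_s, end_s in clips:
            target[int(start_s * sr):int(end_s * sr)] *= gain

    sf.write(out_path, target, sr)                # the target audio file is then played
```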
According to the audio detection method provided in this specification, after the audio file to be detected is acquired, it is input into the wheat spraying detection model for processing to obtain the wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval. Then, in order to accurately locate the wheat spraying audio clip, the audio file to be detected is processed according to the wheat spraying audio interval and the corresponding wheat spraying probability to obtain the wheat spraying audio clip in the audio file to be detected. In this way, the wheat spraying audio clip can be accurately located, the efficiency of locating and processing the wheat spraying audio clip is improved, a more widely applicable wheat spraying processing scene is supported, and the listening experience of the user is improved.
The following will further describe the audio detection method with reference to fig. 4 by taking an application of the audio detection method provided in this specification in an audio playing scene as an example. Fig. 4 shows a processing flow chart of an audio detection method applied in an audio playing scene, which is provided in an embodiment of the present specification, and specifically includes the following steps:
Step S402, acquiring the audio file to be detected.
In this embodiment, the audio detection method is described by taking recorded-and-broadcast audio as the audio file to be detected. Wheat spraying audio frames in the audio file to be detected make it unpleasant for the user to listen to; to avoid this affecting the user's listening experience, the wheat spraying audio frames in the audio file to be detected are detected and eliminated, so that an audio file that is convenient for the user to listen to is played.
Step S404, the audio file to be detected is segmented to obtain a plurality of audio intervals.
Step S406, performing a 512-point short-time Fourier transform on each audio interval using a Hann window with a frame length of 25 ms and a frame shift of 10 ms, to obtain the 257-dimensional amplitude spectrum corresponding to each audio interval.
Step S408, taking dimensions [2, 66) of the 257-dimensional amplitude spectrum corresponding to each audio interval, namely the low-frequency band, as the input feature and inputting it into the VGGish model.
Specifically, in a recording-and-broadcasting scene the timeliness requirement on wheat spraying detection processing is not high, but a certain level of detection accuracy is required. Therefore, the VGGish model is used to predict the wheat spraying probability of each audio interval, which improves the accuracy of the wheat spraying detection processing (an illustrative code sketch of this feature extraction and scoring is given after step S422 below).
Step S410, obtaining the wheat spraying probability corresponding to each audio interval output by the VGGish model.
Step S412, selecting an audio interval with the wheat spraying probability larger than a preset wheat spraying probability threshold value as a wheat spraying audio interval.
Step S414, determining a continuous wheat spraying audio interval in the audio file to be detected based on the wheat spraying audio interval.
Step S416, performing framing processing on the continuous wheat spraying audio interval to obtain a plurality of continuous audio frames.
Step S418, inputting the features of the plurality of continuous audio frames into the model for detection, and forming a wheat spraying audio sequence from the audio frames whose output probability meets the probability threshold.
Step S420, determining the wheat spraying audio sequence as a wheat spraying audio clip under the condition that the duration of the wheat spraying audio sequence is greater than the duration threshold.
Step S422, processing the audio file to be detected according to the wheat spraying audio clip, and playing the processing result.
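As noted in the explanation of step S408 above, the following sketch illustrates the feature extraction and interval scoring of steps S406 to S410 for the recorded-and-broadcast scene: a 512-point STFT with a 25 ms Hann window and a 10 ms shift yields a 257-dimensional amplitude spectrum, of which dimensions [2, 66) are kept as the 64-dimensional low-frequency input feature for the classifier. The 16 kHz sampling rate, the librosa-based STFT and the `vggish_like_model` callable (standing in for the trained VGGish model) are assumptions; only the window, shift and feature dimensions come from the steps above.

```python
import numpy as np
import librosa

def interval_pop_probability(interval, vggish_like_model, sr=16000):
    """Predict the wheat spraying probability of one audio interval (recorded scene)."""
    # 512-point STFT, Hann window, 25 ms frame length, 10 ms frame shift -> 257 frequency bins.
    spec = np.abs(librosa.stft(interval,
                               n_fft=512,
                               win_length=int(0.025 * sr),
                               hop_length=int(0.010 * sr),
                               window="hann"))           # shape: (257, n_frames)

    # Keep dimensions [2, 66) of the amplitude spectrum, i.e. the low-frequency band.
    low_freq_feature = spec[2:66, :]                     # shape: (64, n_frames)

    # A VGGish-style classifier maps the feature map to one wheat spraying probability.
    return float(vggish_like_model(low_freq_feature))
```

Intervals whose returned probability exceeds the preset wheat spraying probability threshold would then be selected as wheat spraying audio intervals, as in step S412.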
For the specific steps of the audio detection method provided in this embodiment, reference may be made to the corresponding descriptions of the foregoing embodiments, which are not repeated here.
According to the audio detection method provided in this specification, after the audio file to be detected is acquired, it is input into the wheat spraying detection model for processing to obtain the wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval. Then, in order to accurately locate the wheat spraying audio clip, the audio file to be detected is processed according to the wheat spraying audio interval and the corresponding wheat spraying probability to obtain the wheat spraying audio clip in the audio file to be detected. In this way, the wheat spraying audio clip can be accurately located, the efficiency of locating and processing the wheat spraying audio clip is improved, a more widely applicable wheat spraying processing scene is supported, and the listening experience of the user is improved.
The following will further describe the audio detection method with reference to fig. 5 by taking an application of the audio detection method provided in this specification in an audio playing scene as an example. Fig. 5 shows a processing flow chart of another audio detection method applied in an audio playing scene provided in an embodiment of the present specification, which specifically includes the following steps:
Step S502, acquiring the audio file to be detected.
In this embodiment, the audio detection method is described by taking live audio as the audio file to be detected. Wheat spraying audio frames in the audio file to be detected make it unpleasant for the user to listen to; to avoid this affecting the user's listening experience, the wheat spraying audio frames in the audio file to be detected are detected and eliminated, so that an audio file that is convenient for the user to listen to is played.
Step S504, the audio file to be detected is segmented to obtain a plurality of audio intervals.
Step S506, performing a 512-point short-time Fourier transform on each audio interval using a Hann window with a frame length of 32 ms and a frame shift of 16 ms, to obtain the 257-dimensional amplitude spectrum corresponding to each audio interval.
Step S508, taking dimensions [2, 66) of the 257-dimensional amplitude spectrum corresponding to each audio interval, namely the low-frequency band, as the input feature and inputting it into the ResNet model.
Specifically, in a live broadcast scene the timeliness requirement on wheat spraying detection processing is high, so that timely processing and timely feedback are needed; therefore, the ResNet model is used to predict the wheat spraying probability of each audio interval, which improves the efficiency of the wheat spraying detection processing (an illustrative code sketch of this live processing flow is given after step S514 below).
Step S510, obtaining the wheat spraying probability corresponding to each audio interval output by the ResNet model.
Step S512, selecting an audio interval with the wheat spraying probability greater than a preset wheat spraying probability threshold as a wheat spraying audio interval.
Step S514, performing noise reduction processing on the wheat spraying audio interval to obtain a target audio file, and playing the target audio file.
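As noted after step S508 above, the sketch below gives a minimal streaming counterpart for the live broadcast flow of steps S504 to S514: incoming audio is handled in short intervals, each interval is scored by a lightweight ResNet-style classifier on its low-frequency feature (here assumed to be extracted with the 32 ms / 16 ms Hann-window STFT of step S506), and intervals whose wheat spraying probability exceeds the threshold are attenuated before being handed to the player. The interval length, the 0.5 probability threshold, the 10 dB attenuation and the `resnet_like_model` and `extract_low_freq_feature` callables are illustrative assumptions.

```python
def process_live_stream(chunks, resnet_like_model, extract_low_freq_feature,
                        prob_thresh=0.5, atten_db=10.0):
    """Score each incoming interval and attenuate it immediately if it is judged
    to be a wheat spraying audio interval.

    `chunks` is an iterable of fixed-length sample arrays (e.g. about 1 s of audio
    each); `extract_low_freq_feature` is the 32 ms / 16 ms counterpart of the
    feature extraction sketched for the recorded scene, and both callables are
    stand-ins rather than APIs fixed by this specification.
    """
    gain = 10.0 ** (-atten_db / 20.0)
    for chunk in chunks:
        feature = extract_low_freq_feature(chunk)
        pop_prob = float(resnet_like_model(feature))  # wheat spraying probability of the interval
        if pop_prob > prob_thresh:
            chunk = chunk * gain                      # noise reduction on the wheat spraying interval
        yield chunk                                   # hand the processed interval to the player
```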
For the specific steps of the audio detection method provided in this embodiment, reference may be made to the corresponding descriptions of the foregoing embodiments, which are not repeated here.
According to the audio detection method provided in this specification, after the audio file to be detected is acquired, it is input into the wheat spraying detection model for processing to obtain the wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval. Then, in order to accurately locate the wheat spraying audio clip, the audio file to be detected is processed according to the wheat spraying audio interval and the corresponding wheat spraying probability to obtain the wheat spraying audio clip in the audio file to be detected. In this way, the wheat spraying audio clip can be accurately located, the efficiency of locating and processing the wheat spraying audio clip is improved, a more widely applicable wheat spraying processing scene is supported, and the listening experience of the user is improved.
Corresponding to the above method embodiment, the present specification further provides an audio detection apparatus embodiment, and fig. 6 shows a schematic structural diagram of an audio detection apparatus provided in an embodiment of the present specification. As shown in fig. 6, the apparatus includes:
an obtaining module 602 configured to obtain an audio file to be detected;
a processing module 604 configured to input the audio file to be detected into a wheat spraying detection model for processing, so as to obtain a wheat spraying audio interval in the audio file to be detected and a wheat spraying probability of the wheat spraying audio interval;
a determining module 606 configured to determine a puff audio clip in the audio file to be detected according to the puff audio interval and the puff probability.
In an optional embodiment, the wheat spraying detection model is trained as follows:
acquiring a sample audio file, and formatting the sample audio file to obtain a standard audio file;
dividing the standard audio file into a plurality of audio intervals, and adding wheat spraying labels or normal voice labels to the plurality of audio intervals according to the attribute information of the plurality of audio intervals;
forming sample audio interval pairs based on the plurality of audio intervals and the plurality of audio intervals with the added labels, and training an initial wheat spraying detection model with the sample audio interval pairs to obtain the wheat spraying detection model.
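A minimal sketch of this training procedure, under stated assumptions, is given below: the formatted standard audio file is cut into fixed-length intervals, each interval's low-frequency feature is paired with its wheat spraying or normal speech label, and the pairs are used to fit an initial classifier. The 1 s interval length, the `extract_low_freq_feature` helper and the scikit-learn-style `initial_model` interface are assumptions; the specification does not prescribe a particular training framework.

```python
import numpy as np

def build_training_pairs(standard_audio, sr, labels, extract_low_freq_feature,
                         interval_s=1.0):
    """Cut the formatted standard audio into intervals and pair each interval's
    feature with its label (1 = wheat spraying, 0 = normal speech).

    In practice the labels come from the attribute information of each interval;
    here they are simply passed in as a list with one entry per interval.
    """
    hop = int(interval_s * sr)
    intervals = [standard_audio[i:i + hop]
                 for i in range(0, len(standard_audio) - hop + 1, hop)]
    features = np.stack([extract_low_freq_feature(iv).ravel() for iv in intervals])
    return features, np.asarray(labels[:len(intervals)])

def train_pop_detector(initial_model, features, labels):
    """Fit the initial wheat spraying detection model on the sample interval pairs."""
    initial_model.fit(features, labels)   # any classifier exposing a fit() method
    return initial_model
```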
In an alternative embodiment, the processing module 604 includes:
the conversion unit is configured to transform the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset time, to obtain an amplitude spectrum corresponding to the audio file to be detected;
and the processing unit is configured to extract low-frequency features from the audio file to be detected according to the amplitude spectrum, input the low-frequency features into the wheat spraying detection model and process the low-frequency features to obtain the wheat spraying audio interval and the wheat spraying probability.
In an optional embodiment, the determining module 606 includes:
the first determining unit is configured to determine a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determine, according to the wheat spraying probability, the sub-wheat spraying probabilities corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals;
a first comparing unit, configured to compare a sub-wheat spraying probability corresponding to a wheat spraying audio subinterval in the plurality of wheat spraying audio subintervals with a preset probability threshold;
and the first composing unit is configured to select, according to the comparison result, the wheat spraying audio subintervals whose sub-wheat spraying probability is greater than or equal to the preset probability threshold, to form the wheat spraying audio clip in the audio file to be detected.
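A minimal sketch of these determining, comparing and composing units is given below, assuming the wheat spraying audio interval is split into equal-length subintervals and that one sub-wheat spraying probability per subinterval is available; the 0.5 s subinterval length, the tuple representation and the function names are illustrative assumptions.

```python
def split_into_subintervals(interval_start_s, interval_end_s, sub_len_s=0.5):
    """Determine the wheat spraying audio subintervals within one wheat spraying
    audio interval as equal-length (start_s, end_s) spans; 0.5 s is an assumed length."""
    spans, t = [], interval_start_s
    while t < interval_end_s:
        spans.append((t, min(t + sub_len_s, interval_end_s)))
        t += sub_len_s
    return spans

def compose_pop_clip(subintervals, sub_probs, prob_thresh):
    """Keep the subintervals whose sub-wheat spraying probability is greater than
    or equal to the preset probability threshold; together they form the clip."""
    return [span for span, prob in zip(subintervals, sub_probs) if prob >= prob_thresh]
```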
In an optional embodiment, the determining module 606 includes:
the second determining unit is configured to determine a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determine, according to the wheat spraying probability, the sub-wheat spraying probabilities corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals;
the second comparison unit is configured to compare the sub-wheat-spraying probability corresponding to the wheat-spraying audio subintervals in the plurality of wheat-spraying audio subintervals with a preset probability threshold value, and determine a plurality of target wheat-spraying audio subintervals according to the comparison result;
the extracting unit is configured to extract the target wheat spraying audio subintervals from the audio file to be detected to form a continuous wheat spraying audio subinterval;
and the second composing unit is configured to determine the wheat spraying audio clip in the audio file to be detected based on the continuous wheat spraying audio subinterval.
In an optional embodiment, the extracting unit includes:
the audio position determining subunit is configured to determine audio positions of the target wheat spraying audio subintervals in the audio file to be detected, and determine audio durations corresponding to the target wheat spraying audio subintervals;
a judging subunit, configured to judge whether the multiple target wheat spraying audio subintervals are continuous in the time dimension according to the audio positions and the audio durations;
and if so, an integrating subunit is operated, the integrating subunit being configured to integrate the plurality of target wheat spraying audio subintervals to obtain the continuous wheat spraying audio subinterval.
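The judging and integrating subunits described above amount to merging target wheat spraying audio subintervals that are continuous in the time dimension. A small sketch of that merge is given below, representing each subinterval by its start position and duration in seconds; the tolerance parameter and function name are assumptions rather than details fixed by this specification.

```python
def merge_adjacent_subintervals(subintervals, tol=1e-6):
    """Merge target wheat spraying audio subintervals that are continuous in time.

    `subintervals` is a list of (start_s, duration_s) pairs; the result is a list of
    continuous wheat spraying audio subintervals as (start_s, end_s) pairs.
    """
    spans = sorted((start, start + dur) for start, dur in subintervals)
    merged = []
    for start, end in spans:
        # If this subinterval begins where the previous one ends, they are continuous.
        if merged and start <= merged[-1][1] + tol:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]
```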
In an alternative embodiment, the second composing unit includes:
the framing processing subunit is configured to perform framing processing on the continuous wheat spraying audio subinterval to obtain a plurality of target audio frames;
the wheat spraying detection subunit is configured to respectively perform wheat spraying detection on the target audio frames, compare the detection result with a wheat spraying detection threshold value, and form a wheat spraying audio sequence according to the comparison result;
and the forming subunit is configured to determine the wheat spraying audio clip in the audio file to be detected according to the audio frames forming the wheat spraying audio sequence under the condition that the duration of the wheat spraying audio sequence is greater than a duration threshold.
In an alternative embodiment, the forming subunit includes:
a first determining submodule configured to determine the duration of the wheat spraying audio sequence;
a second determining submodule configured to determine the duration threshold according to a preset condition;
the judging submodule is configured to judge whether the duration of the wheat spraying audio sequence is greater than or equal to the duration threshold;
and if so, determining the wheat spraying audio clip in the audio file to be detected according to the audio frames forming the wheat spraying audio sequence.
In an optional embodiment, the determining, according to the audio frames forming the wheat spraying audio sequence, the wheat spraying audio clip in the audio file to be detected includes:
performing framing processing on the wheat spraying audio sequence to obtain a plurality of audio frames sequenced according to a time sequence; selecting, from the plurality of audio frames sequenced according to the time sequence, the audio frames whose first output probability is larger than a first threshold value and determining them as initial wheat spraying audio frames, and determining the audio frames whose first output probability is smaller than a second threshold value as terminal wheat spraying audio frames; and determining an intermediate wheat spraying audio frame between the initial wheat spraying audio frame and the terminal wheat spraying audio frame, and forming the wheat spraying audio clip based on the initial wheat spraying audio frame, the terminal wheat spraying audio frame and the intermediate wheat spraying audio frame.
In an optional embodiment, the audio detection apparatus further includes:
the playing module is configured to perform standardization processing on the audio file to be detected according to the wheat spraying audio clip to obtain a target audio file, and play the target audio file;
wherein the standardization processing includes deleting the wheat spraying audio clip or reducing the energy of the wheat spraying audio clip.
In an optional embodiment, the audio file to be detected includes at least one of the following:
the audio file uploaded by the client, the audio file to be played and the audio file to be processed.
According to the audio detection apparatus provided in this embodiment, after the audio file to be detected is acquired, it is input into the wheat spraying detection model for processing to obtain the wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval. Then, in order to accurately locate the wheat spraying audio clip, the audio file to be detected is processed according to the wheat spraying audio interval and the corresponding wheat spraying probability to obtain the wheat spraying audio clip in the audio file to be detected. In this way, the wheat spraying audio clip can be accurately located, the efficiency of locating and processing the wheat spraying audio clip is improved, a more widely applicable wheat spraying processing scene is supported, and the listening experience of the user is improved.
The foregoing is a schematic solution of an audio detection apparatus according to the embodiment. It should be noted that the technical solution of the audio detection apparatus and the technical solution of the audio detection method belong to the same concept, and details that are not described in detail in the technical solution of the audio detection apparatus can be referred to the description of the technical solution of the audio detection method.
Fig. 7 illustrates a block diagram of a computing device 700 provided according to an embodiment of the present description. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.
Computing device 700 also includes access device 740, which enables computing device 700 to communicate via one or more networks 760. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 740 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 7 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
Wherein processor 720 is configured to execute the following computer-executable instructions:
acquiring an audio file to be detected;
inputting the audio file to be detected into a wheat spraying detection model for processing, and obtaining a wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval;
and determining the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the audio detection method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the audio detection method.
An embodiment of the present specification also provides a computer readable storage medium storing computer instructions that, when executed by a processor, are operable to:
acquiring an audio file to be detected;
inputting the audio file to be detected into a wheat spraying detection model for processing, and obtaining a wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval;
and determining the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the audio detection method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the audio detection method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present disclosure. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for this description.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the specification and its practical application, to thereby enable others skilled in the art to best understand the specification and its practical application. The specification is limited only by the claims and their full scope and equivalents.

Claims (14)

1. An audio detection method, comprising:
acquiring an audio file to be detected;
inputting the audio file to be detected into a wheat spraying detection model for processing, and obtaining a wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval;
and determining the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.
2. The audio detection method of claim 1, wherein the wheat spraying detection model is trained by:
acquiring a sample audio file, and formatting the sample audio file to obtain a standard audio file;
dividing the standard audio file into a plurality of audio intervals, and adding wheat spraying labels or normal voice labels to the plurality of audio intervals according to the attribute information of the plurality of audio intervals;
forming sample audio interval pairs based on the plurality of audio intervals and the plurality of audio intervals with the added labels, and training an initial wheat spraying detection model with the sample audio interval pairs to obtain the wheat spraying detection model.
3. The audio detection method according to claim 1, wherein the inputting the audio file to be detected into a wheat spraying detection model for processing to obtain a wheat spraying audio interval in the audio file to be detected and a wheat spraying probability of the wheat spraying audio interval comprises:
transforming the audio file to be detected based on a processing window with a preset frame length and a frame shift of a preset time to obtain an amplitude spectrum corresponding to the audio file to be detected;
extracting low-frequency features from the audio file to be detected according to the amplitude spectrum, inputting the low-frequency features into the wheat spraying detection model for processing, and obtaining the wheat spraying audio interval and the wheat spraying probability.
4. The audio detection method according to claim 1, wherein the determining, according to the wheat spraying audio interval and the wheat spraying probability, the wheat spraying audio clip in the audio file to be detected comprises:
determining a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determining the sub-wheat spraying probability corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals according to the wheat spraying probability;
comparing the sub-wheat spraying probability corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals with a preset probability threshold;
and selecting, according to the comparison result, the wheat spraying audio subintervals whose sub-wheat spraying probability is greater than or equal to the preset probability threshold to form the wheat spraying audio clip in the audio file to be detected.
5. The audio detection method according to claim 1, wherein the determining, according to the wheat spraying audio interval and the wheat spraying probability, the wheat spraying audio clip in the audio file to be detected comprises:
determining a plurality of wheat spraying audio subintervals in the wheat spraying audio interval, and determining the sub-wheat spraying probability corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals according to the wheat spraying probability;
comparing the sub-wheat spraying probability corresponding to the wheat spraying audio subintervals in the plurality of wheat spraying audio subintervals with a preset probability threshold, and determining a plurality of target wheat spraying audio subintervals according to the comparison result;
extracting the target wheat spraying audio subintervals from the audio file to be detected to form a continuous wheat spraying audio subinterval;
and determining the wheat spraying audio clip in the audio file to be detected based on the continuous wheat spraying audio subinterval.
6. The audio detection method according to claim 5, wherein the extracting the target wheat spraying audio subintervals from the audio file to be detected to form a continuous wheat spraying audio subinterval comprises:
determining audio positions of the target wheat spraying audio subintervals in the audio file to be detected, and determining audio durations corresponding to the target wheat spraying audio subintervals;
judging whether the target wheat spraying audio subintervals are continuous in the time dimension according to the audio positions and the audio duration;
and if so, integrating the plurality of target wheat spraying audio subintervals to obtain the continuous wheat spraying audio subinterval.
7. The audio detection method according to claim 5, wherein the determining the wheat spraying audio clip in the audio file to be detected based on the continuous wheat spraying audio subinterval comprises:
performing framing processing on the continuous wheat spraying audio subinterval to obtain a plurality of target audio frames;
respectively carrying out wheat spraying detection on the target audio frames, comparing the detection result with a wheat spraying detection threshold value, and forming a wheat spraying audio sequence according to the comparison result;
and under the condition that the duration of the wheat spraying audio sequence is greater than a duration threshold, determining the wheat spraying audio clip in the audio file to be detected according to the audio frames forming the wheat spraying audio sequence.
8. The audio detection method according to claim 7, wherein the determining, according to the audio frames constituting the wheat spraying audio sequence, the wheat spraying audio clip in the audio file to be detected when the duration of the wheat spraying audio sequence is greater than the duration threshold comprises:
determining the duration of the wheat spraying audio sequence;
determining the time length threshold value according to a preset condition;
judging whether the duration of the wheat spraying audio sequence is greater than or equal to the duration threshold;
and if so, executing the step of determining the wheat spraying audio clip in the audio file to be detected according to the audio frames forming the wheat spraying audio sequence.
9. The audio detection method according to claim 8, wherein the determining the wheat spraying audio clip in the audio file to be detected according to the audio frames constituting the wheat spraying audio sequence comprises:
performing framing processing on the wheat spraying audio sequence to obtain a plurality of audio frames which are sequenced according to a time sequence;
selecting, from the plurality of audio frames sequenced according to the time sequence, the audio frames whose first output probability is larger than a first threshold value and determining them as initial wheat spraying audio frames, and determining the audio frames whose first output probability is smaller than a second threshold value as terminal wheat spraying audio frames;
and determining an intermediate wheat spraying audio frame between the initial wheat spraying audio frame and the terminal wheat spraying audio frame, and forming the wheat spraying audio clip based on the initial wheat spraying audio frame, the terminal wheat spraying audio frame and the intermediate wheat spraying audio frame.
10. The audio detection method according to claim 1, wherein after the step of determining the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability is executed, the audio detection method further comprises:
standardizing the audio file to be detected according to the wheat spraying audio clip to obtain a target audio file, and playing the target audio file;
wherein the standardization processing comprises deleting the wheat spraying audio clip or reducing the energy of the wheat spraying audio clip.
11. The audio detection method according to claim 1, wherein the audio file to be detected comprises at least one of the following:
the audio file uploaded by the client, the audio file to be played and the audio file to be processed.
12. An audio detection apparatus, comprising:
the acquisition module is configured to acquire an audio file to be detected;
the processing module is configured to input the audio file to be detected into a wheat spraying detection model for processing, and obtain a wheat spraying audio interval in the audio file to be detected and a wheat spraying probability of the wheat spraying audio interval;
and the determining module is configured to determine the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method of:
acquiring an audio file to be detected;
inputting the audio file to be detected into a wheat spraying detection model for processing, and obtaining a wheat spraying audio interval in the audio file to be detected and the wheat spraying probability of the wheat spraying audio interval;
and determining the wheat spraying audio clip in the audio file to be detected according to the wheat spraying audio interval and the wheat spraying probability.
14. A computer-readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the audio detection method of any of claims 1 to 11.
CN202011211792.9A 2020-11-03 2020-11-03 Audio detection method and device Pending CN112289340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211792.9A CN112289340A (en) 2020-11-03 2020-11-03 Audio detection method and device

Publications (1)

Publication Number Publication Date
CN112289340A true CN112289340A (en) 2021-01-29

Family

ID=74351028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211792.9A Pending CN112289340A (en) 2020-11-03 2020-11-03 Audio detection method and device

Country Status (1)

Country Link
CN (1) CN112289340A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60319796D1 (en) * 2003-01-24 2008-04-30 Sony Ericsson Mobile Comm Ab tsdetektion
CN104409081A (en) * 2014-11-25 2015-03-11 广州酷狗计算机科技有限公司 Speech signal processing method and device
CN107507621A (en) * 2017-07-28 2017-12-22 维沃移动通信有限公司 A kind of noise suppressing method and mobile terminal
US20180167754A1 (en) * 2014-10-08 2018-06-14 Gn Netcom A/S Robust noise cancellation using uncalibrated microphones
CN109741758A (en) * 2019-01-14 2019-05-10 杭州微纳科技股份有限公司 A kind of dual microphone voice de-noising method
CN110335620A (en) * 2019-07-08 2019-10-15 广州欢聊网络科技有限公司 A kind of noise suppressing method, device and mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination