CN111145779A - Target detection method of audio file and related equipment


Info

Publication number: CN111145779A (granted as CN111145779B)
Application number: CN201911399041.1A
Authority: CN (China)
Prior art keywords: audio, detected, target, trained, clip
Legal status: Granted; currently active
Other languages: Chinese (zh)
Inventor: 缪畅宇
Assignee (current and original): Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN201911399041.1A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for processing of video signals

Abstract

The embodiment of the invention discloses a target detection method for an audio file and related equipment. The method comprises the following steps: acquiring a spectrogram of an audio file to be detected, wherein the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected; determining a mutation point in the spectrogram; determining at least one audio segment to be detected of the audio file to be detected according to the mutation point, wherein the mutation point indicates the starting position of an audio segment to be detected; detecting the at least one audio segment to be detected with a trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio segment and a category identifier of each target audio segment, and the category identifier identifies the audio category to which the target audio segment belongs; and outputting the detection result. With the embodiment of the invention, not only can the target audio segment be detected, but the audio category of the target audio segment can also be identified.

Description

Target detection method of audio file and related equipment
Technical Field
The invention relates to the technical field of computers, in particular to a target detection method of an audio file and related equipment.
Background
The composition of an audio file is very rich. For example, in terms of musical structure, an audio file may be composed of audio segments such as a refrain, an interlude and a prelude; in terms of timbre, it may be composed of audio segments such as percussion, strings, keyboard and human voice; in terms of vocal parts, it may be composed of audio segments such as a high vocal part, a middle vocal part and a low vocal part.
To make fuller use of audio files, it is often desirable to decompose an audio file into audio segments. For example, an audio file may alternate among audio segments of three parts, female treble, female midrange and female bass, and a chorus often needs to decompose the audio file into three independent sets of audio segments, one per part, for part-by-part rehearsal. How to decompose an audio file into audio segments more accurately is therefore a problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a target detection method of an audio file and related equipment, which can not only detect a target audio clip, but also identify the audio type of the target audio clip.
In a first aspect, an embodiment of the present invention provides a method for detecting an object of an audio file, where the method includes:
acquiring a spectrogram of an audio file to be detected, wherein the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected; determining a mutation point in the spectrogram; determining at least one audio segment to be detected of the audio file to be detected according to the mutation point, wherein the mutation point is used for indicating the starting position of the audio segment to be detected; detecting the at least one audio segment to be detected according to the trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio segment and a category identifier of each target audio segment, and the category identifier is used for identifying the audio category to which the target audio segment belongs; and outputting the detection result.
In a second aspect, an embodiment of the present invention provides a detection apparatus, where the detection apparatus has the function of implementing the target detection method for an audio file according to the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the above function.
In one implementation, the detection apparatus includes an acquisition unit, a determination unit, a detection unit and an output unit. The acquisition unit is used for acquiring a spectrogram of the audio file to be detected, wherein the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected; the determination unit is used for determining a mutation point in the spectrogram; the determination unit is further used for determining at least one audio segment to be detected of the audio file to be detected according to the mutation point, and the mutation point is used for indicating the starting position of the audio segment to be detected; the detection unit is used for detecting the at least one audio segment to be detected according to the trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio segment and the category identifier of each target audio segment, and the category identifier is used for identifying the audio category to which the target audio segment belongs; and the output unit is used for outputting the detection result.
In a third aspect, an embodiment of the present invention provides a detection apparatus, including a memory and a processor, where the memory is used for storing a computer program; a processor for calling a computer program stored in the memory for executing the method for object detection of an audio file according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing a computer program for a detection apparatus, which contains program instructions for executing the method according to the first aspect.
In the embodiment of the present invention, the detection device may determine, in the acquired spectrogram of the audio file to be detected, a mutation point that indicates the starting position of an audio segment to be detected, and may determine at least one audio segment to be detected in the audio file to be detected according to the mutation point. After determining the at least one audio segment to be detected, the detection device may detect it according to the trained audio target detection model to obtain at least one target audio segment and the category identifier of each target audio segment. With the embodiment of the invention, the detection device can not only detect the target audio segments in the audio file to be detected but also identify the audio category of each target audio segment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic framework diagram of a target detection method for an audio file according to an embodiment of the present invention;
Fig. 2 is an exemplary diagram of a spectrogram of an audio file according to an embodiment of the present invention;
Fig. 3a is a schematic framework diagram of the audio segment framing stage of a target detection method for an audio file according to an embodiment of the present invention;
Fig. 3b is a schematic framework diagram of the audio segment identification stage of a target detection method for an audio file according to an embodiment of the present invention;
Fig. 3c is a schematic flowchart of a method for determining audio segments according to an embodiment of the present invention;
Fig. 3d is an exemplary diagram of determining the spectrogram of an audio segment according to an embodiment of the present invention;
Fig. 4 is a schematic framework diagram of the prediction stage of a target detection method for an audio file according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of a target detection method for an audio file according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a detection apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a detection device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings.
The embodiment of the invention draws on the idea of target detection in image processing and introduces it into the detection of audio files, thereby providing a target detection method for an audio file and related equipment. The embodiment of the invention can not only detect a target audio segment in an audio file, but can also identify the category identifier of the target audio segment.
The target detection method for an audio file provided by the embodiment of the invention can be applied to a detection device, and the detection device may be a mobile terminal, a personal computer (PC), a tablet computer (Tablet PC) or the like. The target detection method for an audio file provided by the embodiment of the invention can be compiled into an independent application program, and the application program can run on the detection device.
Fig. 1 is a schematic diagram of a framework of a method for detecting an object of an audio file according to an embodiment of the present invention. After the detection device acquires the audio file to be detected, the audio file to be detected can be converted into a spectrogram of the audio file to be detected. The detection device may input the spectrogram of the audio file to be detected into the trained audio target detection model. After the spectrogram of the audio file to be detected is detected by the trained audio target detection model, the detection device may output a detection result, where the detection result may include a target audio segment and a category identifier of the target audio segment. The category identification may be used to identify an audio category to which the target audio piece belongs.
It should be noted that, in the schematic framework diagram of the target detection method for an audio file shown in fig. 1, the detection result output by the detection device includes target audio segment 1 and its category identifier and target audio segment 2 and its category identifier only as an example, which does not constitute a limitation to the embodiment of the present invention. The detection result output by the detection device may include one target audio segment or a plurality of target audio segments, and it may include one category identifier or a plurality of category identifiers for each target audio segment.
For example, after the spectrogram of an audio file to be detected is detected by the trained audio target detection model, the detection result output by the detection device may include one target audio segment whose category identifier is "prelude". As another example, the detection result may include one target audio segment whose category identifiers are "prelude" and "violin". As yet another example, the detection result may include two target audio segments, where the category identifier of target audio segment 1 is "male treble" and the category identifiers of target audio segment 2 are "female treble" and "percussion music".
The trained audio target detection model may be obtained by the detection device by training the audio target detection model according to an acquired sample audio file, at least one sample audio segment of the sample audio file and the category identifier of each sample audio segment. The trained audio target detection model may include a trained feature extraction model, a trained binary classification model and a trained multi-classification model; correspondingly, the audio target detection model may include a feature extraction model, a binary classification model and a multi-classification model.
Both the trained feature extraction model and the feature extraction model are used to extract the audio features of audio segments. Both the trained binary classification model and the binary classification model are used to determine, among the audio segments, those whose audio category is identifiable. Both the trained multi-classification model and the multi-classification model are used to identify the category of the audio segments whose audio category is identifiable. The trained feature extraction model and the feature extraction model may use algorithms such as Convolutional Neural Networks (CNN), Deep Neural Networks (DNN) and Recurrent Neural Networks (RNN) to extract the audio features of an audio segment.
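As an illustration only, the following is a minimal sketch of a CNN-based feature extraction model of the kind listed above. The PyTorch framework, the layer sizes and the 128-dimensional output (standing in for the "preset dimension" of the feature pool) are assumptions made for this sketch, not details taken from the patent.

```python
# Minimal sketch of a CNN feature extractor over spectrogram segments (assumed architecture).
import torch
import torch.nn as nn

class SpectrogramFeatureExtractor(nn.Module):
    """Maps a spectrogram segment to a fixed-size audio feature vector."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # makes the output size independent of segment length
        )
        self.fc = nn.Linear(32 * 4 * 4, feature_dim)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, 1, freq_bins, n_frames) spectrogram patch of one audio segment
        x = self.conv(segment)
        return self.fc(x.flatten(start_dim=1))  # (batch, feature_dim) audio features

# Example: a single segment with 257 frequency bins and 40 frames
features = SpectrogramFeatureExtractor()(torch.randn(1, 1, 257, 40))
print(features.shape)  # torch.Size([1, 128])
```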
Fig. 2 is a diagram of a spectrogram of an audio file according to an embodiment of the present invention. The abscissa of the spectrogram is time in seconds and the ordinate is frequency in hertz. The spectrogram can include a plurality of spectral sequences, which can include a plurality of spectral vectors, which can include a plurality of element values. As shown in fig. 2, the selected portion of the black bold frame in the figure is a spectrum sequence G1 of the spectrogram of the audio file, the spectrum sequence G1 may include a plurality of spectrum vectors, and a spectrum vector of the spectrum sequence G1 may include a plurality of element values.
The target detection method for the audio file provided by the embodiment of the invention can comprise a training stage of the target detection method for the audio file and a prediction stage of the target detection method for the audio file. The training phase of the target detection method of the audio file may include an audio clip framing phase of the target detection method of the audio file and an audio clip identification phase of the target detection method of the audio file. Please refer to fig. 3a for a specific description of an execution process of an audio segment framing phase of the target detection method for an audio file; please refer to fig. 3b for a specific description of an implementation procedure of an audio clip identification phase of the target detection method for an audio file; the process of performing the prediction phase of the object detection method for audio files is described in detail with reference to fig. 4.
Fig. 3a is a schematic diagram of a frame of an audio segment framing phase of a target detection method for an audio file according to an embodiment of the present invention. There are two ways in which the detection device obtains at least one sample audio clip of a sample audio file. The first method is that after the detection device obtains the sample audio file, the detection device can perform frequency domain processing on the time domain sequence of the sample audio file according to a preset frequency domain transformation algorithm to obtain a frequency domain sequence of the sample audio file; the detection device can obtain a spectrogram of the sample audio file according to the time domain sequence of the sample audio file and the frequency domain sequence of the sample audio file; the detection device may determine at least one sample audio clip in the sample audio file according to the spectrogram of the sample audio file, and label the category identifier of each sample audio clip. The second is that the detection device can directly obtain the sample audio file, at least one sample audio fragment of the sample audio file and the category identification of each sample audio fragment. As shown in fig. 3a, the selected portion of the black bold frame in the spectrogram is the spectrogram of a sample audio clip of the sample audio file.
The preset frequency domain transform algorithm may include a Fast Fourier Transform (FFT) algorithm, a Mel-Frequency Cepstral Coefficients (MFCC) algorithm, a Discrete Fourier Transform (DFT) algorithm, and the like. Referring to fig. 3c, the process by which the detection device determines at least one sample audio clip in the sample audio file according to the spectrogram of the sample audio file is described in detail.
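For illustration, the sketch below shows one way the frequency domain processing described above could be realized: the time domain sequence is split into frames, each frame is transformed with an FFT, and the per-frame magnitudes are stacked into a spectrogram. The frame length, hop size, Hann window and log scaling are assumptions, not parameters from the patent.

```python
# Minimal sketch: time domain sequence -> spectrogram via a framed FFT (assumed parameters).
import numpy as np

def spectrogram(time_sequence: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Framed-FFT spectrogram: rows are frequency bins, columns are spectrum vectors."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(time_sequence) - frame_len) // hop
    frames = np.stack([time_sequence[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))   # frequency domain sequence per frame
    return np.log1p(magnitudes).T                      # (freq_bins, n_frames)

spec = spectrogram(np.random.randn(48000))   # e.g. 3 seconds of audio sampled at 16 kHz
print(spec.shape)                            # (513, 92)
```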
After the detection device acquires the at least one sample audio segment, the detection device may input the at least one sample audio segment into the feature extraction model. The detection device may extract the audio features of the at least one sample audio clip according to the feature extraction model to obtain a feature pool of the at least one sample audio clip, where the feature pool of any sample audio clip in the feature pool of the at least one sample audio clip includes the audio features of the sample audio clip in a preset dimension.
The detection device may input the feature pool of the at least one sample audio segment into the binary model, and the detection device may detect the feature pool of the at least one sample audio segment according to the binary model, determine the class indication information of each sample audio segment, where the class indication information of any sample audio segment of the at least one sample audio segment is used to indicate whether the sample audio segment is an audio segment of an identifiable audio class. If the class indication information indicates that any sample audio segment in the at least one sample audio segment is an audio segment of an identifiable audio class, the detection device may determine that the sample audio segment is a target audio segment. The detection device may determine at least one target audio segment from the at least one sample audio segment, where the at least one target audio segment is an audio segment of an identifiable audio category.
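A minimal sketch of the binary classification step described above follows, assuming the feature pool is a matrix with one feature vector per sample audio segment; the linear-plus-sigmoid head and the 0.5 decision threshold are assumptions made only for illustration.

```python
# Minimal sketch of the binary classification over the feature pool (assumed model).
import torch
import torch.nn as nn

class BinaryTargetClassifier(nn.Module):
    """Predicts, per segment feature vector, whether the segment has an identifiable audio class."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, feature_pool: torch.Tensor) -> torch.Tensor:
        # feature_pool: (n_segments, feature_dim) -> probability per segment
        return torch.sigmoid(self.head(feature_pool)).squeeze(-1)

classifier = BinaryTargetClassifier()
feature_pool = torch.randn(5, 128)             # audio features of 5 candidate segments
is_target = classifier(feature_pool) > 0.5     # class indication information per segment
target_features = feature_pool[is_target]      # features of the segments kept as target audio segments
```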
Fig. 3b is a schematic framework diagram of the audio segment identification stage of a target detection method for an audio file according to an embodiment of the present invention, which follows the audio segment framing stage shown in fig. 3a. The detection device may input the at least one target audio segment into the multi-classification model and identify the at least one target audio segment according to the multi-classification model to obtain the category identifier of each target audio segment. If the category identifier of each target audio segment obtained by the detection device according to the multi-classification model is the same as the category identifier, acquired or labeled by the detection device, of the sample audio segment corresponding to that target audio segment, it indicates that the detection device has successfully trained the audio target detection model on the sample audio file and the at least one sample audio segment of the sample audio file. If they are different, it indicates that the training of the audio target detection model on the sample audio file and the at least one sample audio segment of the sample audio file has failed.
Based on the framework of the audio segment framing stage shown in fig. 3a and the framework of the audio segment identification stage shown in fig. 3b, the detection device may input a plurality of sample audio files into the audio target detection model (the feature extraction model, the binary classification model and the multi-classification model) and train the audio target detection model according to the input sample audio files. If the parameter values of model performance evaluation parameters such as recall and precision of the trained audio target detection model (the trained feature extraction model, the trained binary classification model and the trained multi-classification model) exceed the corresponding parameter thresholds, the detection device may use the trained audio target detection model to detect an audio file to be detected.
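The sketch below illustrates this evaluation gate: the trained model is only used for detection once recall and precision exceed a parameter threshold. The 0.9 threshold, the exact-match counting and the example category names are assumptions for illustration only.

```python
# Minimal sketch of gating deployment on precision/recall (assumed threshold of 0.9).
def precision_recall(predicted, labeled, positive):
    """Precision and recall of the predicted category identifiers for one positive class."""
    tp = sum(p == positive and t == positive for p, t in zip(predicted, labeled))
    fp = sum(p == positive and t != positive for p, t in zip(predicted, labeled))
    fn = sum(p != positive and t == positive for p, t in zip(predicted, labeled))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Predicted identifiers from the multi-classification model vs. the labeled sample identifiers
predicted = ["prelude", "refrain", "prelude", "interlude"]
labeled   = ["prelude", "refrain", "refrain", "interlude"]
precision, recall = precision_recall(predicted, labeled, positive="prelude")
threshold = 0.9                                          # assumed parameter threshold
model_ready = precision > threshold and recall > threshold
print(precision, recall, model_ready)                    # 0.5 1.0 False
```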
Referring to fig. 3c, a flow chart of a method for determining an audio segment according to an embodiment of the present invention is shown, the method includes, but is not limited to, the following steps:
step S301, the detection device determines a mutation point in a spectrogram of a sample audio file.
In one implementation, a spectrogram of a sample audio file may include a plurality of spectral sequences, a spectral sequence may include a plurality of spectral vectors, and a spectral vector may include a plurality of element values. The target spectrum vector is any one of the spectrum vectors in the spectrogram of the sample audio file. If the similarity between the target spectrum vector and each spectrum vector except the target spectrum vector in the spectrogram of the sample audio file is the lowest, the detection device may determine that the starting point of the target spectrum vector is a mutation point.
For example, the lowest similarity between the target spectrum vector and each of the spectrum vectors in the spectrogram of the sample audio file except the target spectrum vector may be understood as the largest modulus of the difference vector between the target spectrum vector and each of the spectrum vectors in the spectrogram of the sample audio file except the target spectrum vector.
In one implementation, a spectrogram of a sample audio file may include a plurality of spectral sequences, a spectral sequence may include a plurality of spectral vectors, and a spectral vector may include a plurality of element values. The target spectrum vector is any one of the spectrum vectors in the spectrogram of the sample audio file. If the maximum value of the element values of the target spectrum vector is the maximum value of the element values of each spectrum vector except the target spectrum vector in the spectrogram of the sample audio file, the detection device may determine that the starting point of the target spectrum vector is a mutation point.
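The sketch below illustrates the two mutation point criteria described in this step on a spectrogram stored as a NumPy matrix whose columns are spectrum vectors. Scoring the "lowest similarity" as the largest average modulus of the difference vectors is one possible reading of the criterion, and the helper name mutation_point_index is hypothetical.

```python
# Minimal sketch of the two mutation point criteria (one possible interpretation).
import numpy as np

def mutation_point_index(spectrogram: np.ndarray, use_max_element: bool = False) -> int:
    """Returns the frame index whose spectrum vector starts a mutation point."""
    if use_max_element:
        # criterion 2: the vector containing the largest element value of the whole spectrogram
        return int(spectrogram.max(axis=0).argmax())
    # criterion 1: the vector with the lowest similarity to the other vectors,
    # read here as the largest average modulus of the difference vectors
    n_frames = spectrogram.shape[1]
    scores = np.empty(n_frames)
    for t in range(n_frames):
        diffs = spectrogram - spectrogram[:, [t]]         # difference vector to every column
        scores[t] = np.linalg.norm(diffs, axis=0).sum() / (n_frames - 1)
    return int(scores.argmax())

spec = np.abs(np.random.randn(64, 200))   # toy spectrogram: 64 bins x 200 spectrum vectors
print(mutation_point_index(spec), mutation_point_index(spec, use_max_element=True))
```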
Step S302, the detection device determines at least one sample audio segment in the sample audio file according to the first preset sequence number, the second preset sequence number and the mutation point.
In one implementation, the detection device may take the mutation point as the starting position of a sample audio segment; the detection device may determine, in the spectrogram of the sample audio file, the ending position of the sample audio segment according to the first preset sequence number, the second preset sequence number and the mutation point; and the detection device may determine at least one sample audio segment in the sample audio file according to the starting position and the ending position. The first preset sequence number is less than or equal to the second preset sequence number.
For example, suppose the first preset sequence number is 2 and the second preset sequence number is 3, so that the first preset sequence number is smaller than the second preset sequence number. The spectrogram of the sample audio file comprises 4 spectral sequences { spectral sequence G1, spectral sequence G2, spectral sequence G3, spectral sequence G4 }. A spectrum vector in spectral sequence G1 is determined to be the target spectrum vector, and the detection device determines that the starting point of the target spectrum vector is a mutation point. According to the first preset sequence number 2, the second preset sequence number 3 and the mutation point, the detection device determines that the spectral sequences of the sample audio segments in the spectrogram of the sample audio file are { spectral sequence G1, spectral sequence G2 } and { spectral sequence G1, spectral sequence G2, spectral sequence G3 }. The detection device takes the mutation point as the starting position of each sample audio segment and takes the starting point of the terminating spectrum vector of spectral sequence G2, or the starting point of the terminating spectrum vector of spectral sequence G3, as the ending position, so the detection device determines two sample audio segments. The terminating spectrum vector of a spectral sequence is the last spectrum vector in that sequence. Fig. 3d is an exemplary diagram of determining the spectrogram of an audio segment according to an embodiment of the present invention, in which the spectral sequences of the sample audio segment determined by the detection device are { spectral sequence G1, spectral sequence G2, spectral sequence G3 }.
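The sketch below mirrors this example: starting from a mutation point, one candidate segment is cut for every sequence count between the first and second preset sequence numbers. Representing a spectral sequence as a fixed number of frames (frames_per_sequence) is an assumption made only for illustration.

```python
# Minimal sketch of cutting candidate segments from a mutation point (assumed segmentation).
import numpy as np

def candidate_segments(spectrogram: np.ndarray, mutation_frame: int,
                       frames_per_sequence: int,
                       first_preset: int = 2, second_preset: int = 3):
    """Cuts one candidate segment per sequence count between the two preset numbers."""
    assert first_preset <= second_preset
    segments = []
    for n_sequences in range(first_preset, second_preset + 1):
        end = mutation_frame + n_sequences * frames_per_sequence
        segments.append(spectrogram[:, mutation_frame:end])   # starts at the mutation point
    return segments

spec = np.abs(np.random.randn(64, 200))
segs = candidate_segments(spec, mutation_frame=40, frames_per_sequence=20)
print([s.shape for s in segs])   # [(64, 40), (64, 60)] -- like {G1, G2} and {G1, G2, G3}
```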
Fig. 4 is a schematic diagram of a frame of a prediction stage of a target detection method for an audio file according to an embodiment of the present invention. After the detection equipment acquires the audio file to be detected, the detection equipment can also acquire a time domain sequence of the audio file to be detected, and the detection equipment can perform frequency domain processing on the time domain sequence of the audio file to be detected according to a preset frequency domain transformation algorithm to obtain a frequency domain sequence of the audio file to be detected; the detection equipment can obtain a spectrogram of the audio file to be detected according to the time domain sequence of the audio file to be detected and the frequency domain sequence of the audio file to be detected; the detection device may determine at least one audio clip to be detected in the audio file to be detected according to the spectrogram of the audio file to be detected. As shown in fig. 4, the selected part of the black bold-solid frame in the spectrogram is the spectrogram of an audio clip to be detected of the audio file to be detected.
The process by which the detection device determines at least one audio segment to be detected in the audio file to be detected according to the spectrogram of the audio file to be detected may include: the detection device determines a mutation point in the spectrogram of the audio file to be detected; and the detection device determines at least one audio segment to be detected in the audio file to be detected according to the first preset sequence number, the second preset sequence number and the mutation point. This process is the same as the process of determining at least one sample audio segment in the sample audio file according to the spectrogram of the sample audio file in the embodiment shown in fig. 3c; for details, reference may be made to the description of fig. 3c, which is not repeated here.
After the detection device determines at least one to-be-detected audio clip of the to-be-detected audio file, the detection device may input the at least one to-be-detected audio clip into the trained feature extraction model, the detection device may extract the audio features of the at least one to-be-detected audio clip according to the trained feature extraction model to obtain a feature pool of the at least one to-be-detected audio clip, and the feature pool of any one to-be-detected audio clip of the at least one to-be-detected audio clip includes the audio features of the to-be-detected audio clip in a preset dimension.
The detection device can input the feature pool of the at least one audio clip to be detected into the trained binary classification model, the detection device can detect the feature pool of the at least one audio clip to be detected according to the trained binary classification model, the class indication information of each audio clip to be detected is determined, and the class indication information of any one audio clip to be detected of the at least one audio clip to be detected is used for indicating whether the audio clip to be detected is an audio clip with an identifiable audio class. If the class indication information indicates that any one of the at least one to-be-detected audio clip is an audio clip of an identifiable audio class, the detection device may determine that the to-be-detected audio clip is a target audio clip. The detection device can determine at least one target audio segment from at least one audio segment to be detected, wherein the at least one target audio segment is an audio segment capable of identifying audio categories.
The detection device may input at least one target audio segment into the trained multi-class model, and the detection device may identify the at least one target audio segment according to the trained multi-class model to obtain the category identifier of each target audio segment.
The detection device may detect the audio file to be detected according to the trained audio target detection model (the trained feature extraction model, the trained binary model, and the trained multi-classification model) based on the schematic frame diagram of the prediction stage of the method for detecting the target of the audio file shown in fig. 4, and the detection device may not only detect the target audio clip in the audio file to be detected, but also identify the audio category of the target audio clip.
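Tying the prediction stage together, the following sketch chains the helper sketches above (spectrogram, mutation_point_index, candidate_segments, SpectrogramFeatureExtractor, BinaryTargetClassifier). The untrained stand-in models, the category list and frames_per_sequence are assumptions; in practice the trained feature extraction, binary classification and multi-classification models would be loaded instead.

```python
# Minimal end-to-end sketch of the prediction stage, reusing the helper sketches above.
import numpy as np
import torch
import torch.nn as nn

CATEGORIES = ["prelude", "interlude", "refrain", "percussion", "vocal"]  # assumed labels
feature_extractor = SpectrogramFeatureExtractor()   # stand-in for the trained feature extraction model
binary_classifier = BinaryTargetClassifier()        # stand-in for the trained binary classification model
multi_classifier = nn.Linear(128, len(CATEGORIES))  # stand-in for the trained multi-classification model

def detect(audio: np.ndarray):
    spec = spectrogram(audio)                        # spectrogram of the file to be detected
    t0 = mutation_point_index(spec)                  # starting position of the segments to be detected
    detections = []
    for seg in candidate_segments(spec, t0, frames_per_sequence=20):
        x = torch.tensor(seg, dtype=torch.float32)[None, None]           # (1, 1, freq_bins, n_frames)
        features = feature_extractor(x)                                   # audio features of the segment
        if binary_classifier(features) > 0.5:                             # identifiable audio class?
            label = CATEGORIES[int(multi_classifier(features).argmax())]  # category identifier
            detections.append((seg, label))
    return detections

print(detect(np.random.randn(48000)))   # [(target segment, category identifier), ...]
```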
Referring to fig. 5, a schematic flow chart of a method for detecting an object of an audio file according to an embodiment of the present invention is shown based on a schematic frame diagram of a prediction stage of the method for detecting an object of an audio file shown in fig. 4, where the method for detecting an object of an audio file includes, but is not limited to, the following steps:
step S501, the detection equipment acquires a spectrogram of an audio file to be detected.
The detection device may obtain a spectrogram of an audio file to be detected. The spectrogram of the audio file to be detected can be determined according to the time domain sequence of the audio file to be detected and the frequency domain sequence of the audio file to be detected. The frequency domain sequence of the audio file to be detected can be obtained by performing frequency domain processing on the time domain sequence of the audio file to be detected according to a preset frequency domain transformation algorithm.
Step S502, the detection device determines a mutation point in the spectrogram.
The detection device may determine a mutation point in the spectrogram of the audio file to be detected, where the mutation point may be used to indicate the starting position of an audio segment to be detected determined in the audio file to be detected.
In one implementation, a spectrogram of an audio file to be detected comprises a plurality of spectral sequences, a spectral sequence comprising a plurality of spectral vectors, a spectral vector comprising a plurality of element values. The target spectrum vector is any one of the spectrum vectors in the spectrogram of the audio file to be detected. If the similarity between the target spectrum vector and each spectrum vector except the target spectrum vector in the spectrogram of the audio file to be detected is the lowest, the detection device may determine that the starting point of the target spectrum vector is a mutation point.
In one implementation, a spectrogram of an audio file to be detected comprises a plurality of spectral sequences, a spectral sequence comprising a plurality of spectral vectors, a spectral vector comprising a plurality of element values. The target spectrum vector is any one of the spectrum vectors in the spectrogram of the audio file to be detected. If the maximum value of the element values of the target spectrum vector is the maximum value of the element values of each spectrum vector except the target spectrum vector in the spectrogram of the audio file to be detected, the detection device may determine that the starting point of the target spectrum vector is a mutation point.
It should be noted that the process by which the detection device determines the mutation point in the spectrogram of the audio file to be detected is the same as the process by which the detection device determines the mutation point in the spectrogram of the sample audio file; for details, reference may be made to the description of step S301 in the embodiment shown in fig. 3c, which is not repeated here.
Step S503, the detection device determines at least one audio segment to be detected of the audio file to be detected according to the mutation point.
In one implementation, the detection device may determine the at least one audio segment to be detected of the audio file to be detected according to the mutation point as follows: the detection device takes the mutation point as the starting position of an audio segment to be detected; the detection device determines, in the spectrogram of the audio file to be detected, the ending position of the audio segment to be detected according to the first preset sequence number, the second preset sequence number and the mutation point; and the detection device determines the at least one audio segment to be detected in the audio file to be detected according to the starting position and the ending position, where the first preset sequence number is less than or equal to the second preset sequence number.
It should be noted that the determining, by the detection device, an execution process of at least one to-be-detected audio segment of the to-be-detected audio file according to the mutation point is the same as the determining, by the detection device, an execution process of at least one sample audio segment of the sample audio file according to the first preset sequence number, the second preset sequence number, and the mutation point, and a specific execution process may refer to the specific description of step S302 in the embodiment shown in fig. 3c, and is not described herein again.
Step S504, the detection device detects the at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result.
The detection device can detect at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result. The detection result may include at least one target audio segment and a category identifier of each target audio segment, where the category identifier may be used to identify an audio category to which the target audio segment belongs.
In one implementation, the trained audio target detection model may include a trained feature extraction model, a trained two-class model, and a trained multi-class model. The detection equipment can detect at least one audio segment to be detected according to the trained feature extraction model to obtain the audio features of the audio segments to be detected. The detection device can process the audio features of each audio clip to be detected according to the trained binary classification model, and determine at least one target audio clip. The detection device can identify each target audio clip according to the trained multi-classification model to obtain the category identification of each target audio clip.
It should be noted that, for a specific execution process of detecting at least one audio segment to be detected by the detection device according to the trained audio target detection model to obtain a detection result, reference may be made to the specific description of the embodiment shown in fig. 4, which is not described herein again.
Step S505, the detection device outputs the detection result.
The detection device may output a detection result including at least one target audio piece and the class identification of each target audio piece.
In the embodiment of the present invention, the detection device may determine a mutation point in the obtained spectrogram of the audio file to be detected, and determine at least one audio segment to be detected in the audio file to be detected according to the mutation point. The detection device can detect the determined at least one audio clip to be detected according to the trained audio target detection model to obtain at least one target audio clip and the category identification of each target audio clip. By the embodiment of the invention, the detection equipment can not only detect the target audio clip in the audio file to be detected, but also identify the audio category of the target audio clip.
Referring to fig. 6, which is a schematic structural diagram of a detection apparatus 60 according to an embodiment of the present invention, the detection apparatus 60 is used for performing steps performed by detection equipment in the method embodiments corresponding to fig. 1 to fig. 5, and the detection apparatus 60 may include an obtaining unit 601, a determining unit 602, a detecting unit 603, and an output unit 604, wherein,
an obtaining unit 601, configured to obtain a spectrogram of an audio file to be detected, where the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected;
a determining unit 602, configured to determine a mutation point in the spectrogram;
the determining unit 602 is further configured to determine at least one audio segment to be detected of the audio file to be detected according to the mutation point, where the mutation point is used to indicate an initial position of the audio segment to be detected;
the detecting unit 603 is configured to detect at least one audio segment to be detected according to the trained audio target detection model to obtain a detection result, where the detection result includes at least one target audio segment and a category identifier of each target audio segment, and the category identifier is used to identify an audio category to which the target audio segment belongs;
and an output unit 604 for outputting the detection result.
In one implementation, the trained audio target detection model includes a trained binary model, and the trained binary model is used for determining an audio segment capable of identifying an audio category from at least one audio segment to be detected; the detecting unit 603 detects at least one audio segment to be detected according to the trained audio target detection model, and when a detection result is obtained, is specifically configured to process the audio features of each audio segment to be detected according to the trained binary model, and determine at least one target audio segment.
In one implementation, the trained audio target detection model further includes a trained feature extraction model, and the trained feature extraction model is used for extracting audio features of at least one audio segment to be detected; the detecting unit 603 is further configured to, before processing the audio features of each audio segment to be detected according to the trained binary classification model and determining at least one target audio segment, detect at least one audio segment to be detected according to the trained feature extraction model to obtain the audio features of each audio segment to be detected.
In an implementation manner, the detection unit 603 detects at least one to-be-detected audio segment according to the trained feature extraction model, and when the audio features of each to-be-detected audio segment are obtained, is specifically configured to extract the features of each to-be-detected audio segment according to the trained feature extraction model, and obtain a feature pool of each to-be-detected audio segment, where the feature pool of any to-be-detected audio segment in the at least one to-be-detected audio segment includes the audio features of the to-be-detected audio segment in the preset dimension.
In an implementation manner, the detecting unit 603 processes audio features of each audio segment to be detected according to the trained binary classification model, and when at least one target audio segment is determined, is specifically configured to detect a feature pool of each audio segment to be detected according to the trained binary classification model, and determine category indication information of the audio segment to be detected, where the category indication information is used to indicate whether the audio segment to be detected is an audio segment with an identifiable audio category; and if the type indication information indicates that the audio clip to be detected is the audio clip of the identifiable audio type, determining that the audio clip to be detected is the target audio clip.
In one implementation, the trained audio target detection model includes a trained multi-class model, and the trained multi-class model is used for identifying the class of each target audio segment; the detecting unit 603 detects at least one audio segment to be detected according to the trained audio target detection model, and when a detection result is obtained, is specifically configured to identify each target audio segment according to the trained multi-classification model, so as to obtain a category identifier of each target audio segment.
In one implementation, a spectrogram comprises a plurality of spectral sequences, a spectral sequence comprising a plurality of spectral vectors, a spectral vector comprising a plurality of element values; when the determining unit 602 determines a mutation point in the spectrogram, if the similarity between the target spectrum vector and each of the spectrum vectors in the spectrogram except the target spectrum vector is the lowest, the determining unit 602 is specifically configured to determine that the starting point of the target spectrum vector is the mutation point; or, if the maximum value of the element values of the target spectrum vector is the maximum value of the element values of each spectrum vector except the target spectrum vector in the spectrogram, the determining unit 602 is specifically configured to determine that the starting point of the target spectrum vector is a mutation point; the target spectrum vector is any one of spectrum vectors in the spectrogram.
In an implementation manner, when the determining unit 602 determines at least one audio segment to be detected of the audio file to be detected according to the mutation point, the determining unit is specifically configured to take the mutation point as the starting position of the audio segment to be detected; determine the ending position of the audio segment to be detected in the spectrogram according to the first preset sequence number, the second preset sequence number and the mutation point; and determine at least one audio segment to be detected in the audio file to be detected according to the starting position and the ending position, where the first preset sequence number is less than or equal to the second preset sequence number.
It should be noted that details that are not mentioned in the embodiment corresponding to fig. 6 and specific implementation manners of the steps executed by each unit may refer to the embodiments shown in fig. 1 to fig. 5 and the foregoing details, and are not described again here.
In one implementation, the related functions implemented by the units in fig. 6 may be implemented in combination with a processor, a memory and a communication interface. Referring to fig. 7, fig. 7 is a schematic structural diagram of a detection device according to an embodiment of the present invention. The detection device 70 includes a processor 701, a memory 702 and a communication interface 703, and the processor 701, the memory 702 and the communication interface 703 are connected by one or more communication buses.
The processor 701 is configured to support the detection device to perform the corresponding functions of the detection device in the method described in fig. 1 to 5. The Processor 701 may be a Central Processing Unit (CPU), a Network Processor (NP), a hardware chip, or any combination thereof.
The memory 702 is used to store program code and the like. The memory 702 may include a volatile memory, such as a Random Access Memory (RAM); the memory 702 may also include a Non-Volatile Memory (NVM), such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid-State Drive (SSD); the memory 702 may also comprise a combination of the above types of memory.
The communication interface 703 is used to acquire and output data, for example, the communication interface 703 is used to acquire a spectrogram of an audio file to be detected, or the communication interface 703 is used to output a detection result, etc.
In the embodiment of the present invention, the detection device 70 includes a plurality of communication interfaces, wherein the communication interface for acquiring data and the communication interface for outputting data may not be the same communication interface.
The processor 701 may call the program code stored in the memory 702 to perform the following operations:
acquiring a spectrogram of an audio file to be detected through a communication interface 703, wherein the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected;
determining a mutation point in the spectrogram;
determining at least one audio segment to be detected of the audio file to be detected according to the mutation point, wherein the mutation point is used for indicating the initial position of the audio segment to be detected;
detecting at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio clip and the category identification of each target audio clip, and the category identification is used for identifying the audio category to which the target audio clip belongs;
the detection result is output through the communication interface 703.
Further, the processor 701 may further cooperate with the communication interface 703 to execute operations corresponding to the detection device in the embodiments shown in fig. 1 to fig. 5, which may specifically refer to the description in the method embodiments, and is not described herein again.
An embodiment of the present invention further provides a computer-readable storage medium, which can be used to store the computer program instructions called by the processor 701 in the detection device in the embodiment shown in fig. 7, and which contains the program designed for the detection device in the foregoing embodiments.
The computer-readable storage medium includes, but is not limited to, a flash memory, a Hard Disk Drive (HDD) and a Solid-State Drive (SSD).
The embodiment of the present invention further provides a computer program product, and when the computer program product is run by a computer device, the computer program product may execute the target detection method for the audio file designed for the detection device in the embodiments of fig. 1 to 5.
The target detection method for an audio file provided by the embodiment of the present invention may also be applied to a blockchain node device of a blockchain network, and the blockchain node device may be a mobile terminal, a personal computer (PC), a tablet computer (Tablet PC) or the like. The target detection method for an audio file provided by the embodiment of the invention can be compiled into an independent application program, and the application program can run on the blockchain node device.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer and an application services layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts and operation monitoring. The user management module is responsible for managing the identity information of all blockchain participants, including maintaining public and private key generation (account management), key management, and the correspondence between users' real identities and blockchain addresses (authority management); when authorized, it can also supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control auditing). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and to record valid requests to storage after consensus is reached; for a new service request, the basic service first performs interface adaptation, parsing and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution; developers can define contract logic in a programming language, publish it to the blockchain (contract registration), and have the contract triggered and executed according to the logic of its terms by a key or another event, thereby completing the contract logic; the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting and cloud adaptation during product release, and for the visual output of real-time status during product operation, such as alarms, monitoring of network conditions and monitoring of the health status of node devices.
The platform product services layer provides the basic capabilities and an implementation framework for typical applications; based on these basic capabilities, developers can superpose the characteristics of their business and complete the blockchain implementation of their business logic. The application services layer provides blockchain-based application services for business participants to use.
The embodiment of the present invention is described in detail by taking as an example the case where the target detection method for an audio file is applied to a detection device; when the target detection method for an audio file is applied to a blockchain node device, reference may be made to the specific description of the method as applied to a detection device.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A target detection method for an audio file, the method comprising:
acquiring a spectrogram of an audio file to be detected, wherein the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected;
determining a mutation point in the spectrogram;
determining at least one audio clip to be detected of the audio file to be detected according to the mutation point, wherein the mutation point is used for indicating the starting position of the audio clip to be detected;
detecting the at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio clip and a category identifier of each target audio clip, and the category identifier is used for identifying an audio category to which the target audio clip belongs;
and outputting the detection result.
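As a non-authoritative illustration of the flow recited in claim 1, the sketch below wires the steps together in Python. librosa is assumed for loading the file and computing the short-time Fourier transform, and find_mutation_points, split_clips and detection_model are hypothetical placeholders for the operations detailed in the dependent claims; concrete sketches of those placeholder steps appear after the relevant dependent claims below.

```python
import numpy as np
import librosa

def detect_targets(path, find_mutation_points, split_clips, detection_model):
    # Spectrogram determined from the time-domain signal and its frequency-domain transform.
    y, sr = librosa.load(path, sr=None)    # time-domain sequence
    spec = np.abs(librosa.stft(y))         # (n_freq, n_frames) magnitude spectrogram
    # Mutation points indicate the starting positions of the clips to be detected.
    points = find_mutation_points(spec)
    clips = split_clips(spec, points)
    # The trained audio target detection model returns a category identifier per clip,
    # or None when the clip is not a target audio clip (assumption of this sketch).
    results = [(clip, detection_model(clip)) for clip in clips]
    return [(clip, cat) for clip, cat in results if cat is not None]
```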
2. The method according to claim 1, wherein the trained audio target detection model comprises a trained binary classification model, and the trained binary classification model is used for determining, from the at least one audio clip to be detected, audio clips of an identifiable audio category;
the detecting the at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result, including:
and processing the audio features of each audio clip to be detected according to the trained binary classification model, and determining the at least one target audio clip.
3. The method according to claim 2, wherein the trained audio target detection model further comprises a trained feature extraction model, and the trained feature extraction model is used for extracting the audio features of the at least one audio clip to be detected;
before the processing the audio features of each audio clip to be detected according to the trained binary classification model and determining the at least one target audio clip, the method further comprises:
and detecting the at least one audio clip to be detected according to the trained feature extraction model to obtain the audio features of each audio clip to be detected.
4. The method according to claim 3, wherein the detecting the at least one audio clip to be detected according to the trained feature extraction model to obtain the audio features of each audio clip to be detected comprises:
extracting features of each audio clip to be detected according to the trained feature extraction model to obtain a feature pool of each audio clip to be detected, wherein the feature pool of any one of the at least one audio clip to be detected comprises the audio features of that audio clip to be detected in a preset dimension.
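A minimal sketch of one way to build the feature pool of claim 4, assuming numpy and a plain pooling function standing in for the trained feature extraction model; the preset dimensions (32 and 64) and the pooling scheme are illustrative assumptions, not the patented model.

```python
import numpy as np

def feature_pool(clip_spec: np.ndarray, dims=(32, 64)) -> dict:
    """Stand-in for the trained feature extraction model: build a feature pool that
    holds one fixed-length audio feature vector per preset dimension.
    clip_spec: (n_freq, n_frames) magnitude spectrogram of one clip to be detected."""
    pool = {}
    for d in dims:
        # Group the frequency axis into d bands, then average each band over time so
        # every clip yields a d-dimensional feature regardless of its length.
        idx = np.linspace(0, clip_spec.shape[0], num=d + 1, dtype=int)
        bands = [clip_spec[idx[i]:idx[i + 1]].mean() if idx[i] < idx[i + 1] else 0.0
                 for i in range(d)]
        pool[d] = np.asarray(bands, dtype=np.float32)
    return pool
```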
5. The method according to claim 4, wherein the processing the audio features of each audio clip to be detected according to the trained binary classification model to determine the at least one target audio clip comprises:
detecting the feature pool of each audio clip to be detected according to the trained binary classification model, and determining category indication information of the audio clip to be detected, wherein the category indication information is used for indicating whether the audio clip to be detected is an audio clip of an identifiable audio category;
and if the category indication information indicates that the audio clip to be detected is an audio clip of an identifiable audio category, determining that the audio clip to be detected is the target audio clip.
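A hedged sketch of claim 5, assuming an already-trained sklearn-style binary classifier exposing predict(); how the feature pool is flattened into a single vector is an assumption of this sketch rather than a detail fixed by the claim.

```python
import numpy as np

def filter_target_clips(clips, clip_feature_pools, binary_clf):
    """Apply the trained binary classification model to each clip's feature pool and
    keep only the clips whose category indication is 1, i.e. clips of an
    identifiable audio category."""
    targets = []
    for clip, pool in zip(clips, clip_feature_pools):
        x = np.concatenate([pool[d] for d in sorted(pool)])   # flatten the feature pool
        indication = binary_clf.predict(x.reshape(1, -1))[0]  # 1: identifiable, 0: not
        if indication == 1:
            targets.append(clip)
    return targets
```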
6. The method according to claim 1, wherein the trained audio target detection model comprises a trained multi-classification model, and the trained multi-classification model is used for identifying the category of each target audio clip;
the detecting the at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result, including:
and identifying each target audio clip according to the trained multi-classification model to obtain the category identifier of each target audio clip.
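Similarly, a sketch of claim 6 under the assumption of a trained sklearn-style multi-class classifier exposing predict_proba(); the categories list and the flattening of the feature pool are illustrative choices of this sketch.

```python
import numpy as np

def label_target_clips(target_clips, target_pools, multi_clf, categories):
    """Attach a category identifier to every target audio clip using the trained
    multi-classification model."""
    results = []
    for clip, pool in zip(target_clips, target_pools):
        x = np.concatenate([pool[d] for d in sorted(pool)]).reshape(1, -1)
        probs = multi_clf.predict_proba(x)[0]    # one score per audio category
        cat_id = int(np.argmax(probs))           # category identifier
        results.append((clip, cat_id, categories[cat_id]))
    return results
```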
7. The method according to claim 1, wherein the spectrogram comprises a plurality of spectrum sequences, each spectrum sequence comprises a plurality of spectrum vectors, and each spectrum vector comprises a plurality of element values;
the determining the mutation point in the spectrogram comprises:
if the similarity between the target spectrum vector and each spectrum vector other than the target spectrum vector in the spectrogram is the lowest, determining the starting point of the target spectrum vector as the mutation point; or,
if the maximum value among the element values of the target spectrum vector is greater than the maximum value among the element values of the spectrum vectors other than the target spectrum vector in the spectrogram, determining the starting point of the target spectrum vector as the mutation point;
and the target spectrum vector is any one of the spectrum vectors in the spectrogram.
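An illustrative numpy sketch of the two alternative criteria in claim 7, treating the columns of the spectrogram as spectrum vectors; the use of cosine similarity and the top_k cut-off are assumptions of this sketch, since the claim does not fix the similarity measure or how many mutation points are retained.

```python
import numpy as np

def find_mutation_points(spec: np.ndarray, top_k: int = 5):
    """spec: (n_freq, n_frames) spectrogram whose columns are the spectrum vectors.
    Criterion A: frames whose spectrum vector is least similar to all other vectors.
    Criterion B: the frame whose largest element value exceeds that of every other frame."""
    vecs = spec.T                                   # (n_frames, n_freq) spectrum vectors
    norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
    unit = vecs / norms
    sims = unit @ unit.T                            # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)                     # ignore self-similarity
    mean_sim = sims.mean(axis=1)                    # average similarity to the other vectors
    low_sim_frames = np.argsort(mean_sim)[:top_k]   # criterion A: lowest-similarity frames

    peak_frame = int(np.argmax(vecs.max(axis=1)))   # criterion B: frame holding the global peak

    # Each returned index marks the starting point of a target spectrum vector.
    return sorted(set(low_sim_frames.tolist()) | {peak_frame})
```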
8. The method according to claim 1, wherein the determining at least one audio clip to be detected of the audio file to be detected according to the mutation point comprises:
taking the mutation point as the starting position of the audio clip to be detected;
determining the ending position of the audio clip to be detected in the spectrogram according to a first preset sequence number, a second preset sequence number and the mutation point;
determining the at least one audio clip to be detected in the audio file to be detected according to the starting position and the ending position;
wherein the first preset sequence number is less than or equal to the second preset sequence number.
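One plausible, non-authoritative reading of claim 8 in numpy: each mutation point gives the starting position, and the first and second preset sequence numbers bound how many spectrum sequences (frames) after it the ending position may fall. The concrete values n1 and n2 and the clipping rule are assumptions of this sketch, not values fixed by the claim.

```python
import numpy as np

def clips_from_mutation_points(spec: np.ndarray, points, n1: int = 50, n2: int = 200):
    """spec: (n_freq, n_frames) spectrogram; points: mutation points (frame indices).
    n1, n2: first and second preset sequence numbers, with n1 <= n2."""
    assert n1 <= n2
    n_frames = spec.shape[1]
    clips = []
    for start in points:
        # Ending position: at most n2 spectrum sequences after the starting position,
        # clipped to the spectrogram length.
        end = min(start + n2, n_frames)
        if end - start >= n1:                 # drop fragments shorter than the lower bound
            clips.append(spec[:, start:end])  # one audio clip to be detected
    return clips
```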
9. A detection device, characterized in that the detection device comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a spectrogram of an audio file to be detected, and the spectrogram is determined according to a time domain sequence of the audio file to be detected and a frequency domain sequence of the audio file to be detected;
a determining unit, configured to determine a mutation point in the spectrogram;
the determining unit is further configured to determine at least one audio clip to be detected of the audio file to be detected according to the mutation point, wherein the mutation point is used to indicate the starting position of the audio clip to be detected;
the detection unit is used for detecting the at least one audio clip to be detected according to the trained audio target detection model to obtain a detection result, wherein the detection result comprises at least one target audio clip and a category identifier of each target audio clip, and the category identifier is used for identifying an audio category to which the target audio clip belongs;
and the output unit is used for outputting the detection result.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the target detection method for an audio file according to any one of claims 1 to 8.
CN201911399041.1A 2019-12-26 2019-12-26 Target detection method of audio file and related equipment Active CN111145779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911399041.1A CN111145779B (en) 2019-12-26 2019-12-26 Target detection method of audio file and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911399041.1A CN111145779B (en) 2019-12-26 2019-12-26 Target detection method of audio file and related equipment

Publications (2)

Publication Number Publication Date
CN111145779A true CN111145779A (en) 2020-05-12
CN111145779B CN111145779B (en) 2021-08-24

Family

ID=70522072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911399041.1A Active CN111145779B (en) 2019-12-26 2019-12-26 Target detection method of audio file and related equipment

Country Status (1)

Country Link
CN (1) CN111145779B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN112420071A (en) * 2020-11-09 2021-02-26 上海交通大学 Constant Q transformation based polyphonic electronic organ music note identification method
CN112416116A (en) * 2020-06-01 2021-02-26 上海哔哩哔哩科技有限公司 Vibration control method and system for computer equipment
CN113571091A (en) * 2021-06-30 2021-10-29 青岛海尔科技有限公司 Audio sudden change detection method and device for monitoring and household appliance

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6697564B1 (en) * 2000-03-03 2004-02-24 Siemens Corporate Research, Inc. Method and system for video browsing and editing by employing audio
US20180276540A1 (en) * 2017-03-22 2018-09-27 NextEv USA, Inc. Modeling of the latent embedding of music using deep neural network
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks
CN110070884A (en) * 2019-02-28 2019-07-30 北京字节跳动网络技术有限公司 Audio originates point detecting method and device
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6697564B1 (en) * 2000-03-03 2004-02-24 Siemens Corporate Research, Inc. Method and system for video browsing and editing by employing audio
US20180276540A1 (en) * 2017-03-22 2018-09-27 NextEv USA, Inc. Modeling of the latent embedding of music using deep neural network
CN110148400A (en) * 2018-07-18 2019-08-20 腾讯科技(深圳)有限公司 The pronunciation recognition methods of type, the training method of model, device and equipment
CN110070884A (en) * 2019-02-28 2019-07-30 北京字节跳动网络技术有限公司 Audio originates point detecting method and device
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG YONGJING ET.AL: "Insect sound feature recognition method based on three-dimensional convolutional neural network", 《JOURNAL OF COMPUTER APPLICATION》 *
XU YOUZHENG: "Research and Application of Music Classification Based on Convolutional Neural Network", 《China Excellent Master's Theses Full-text Database, Information Science and Technology Section》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416116A (en) * 2020-06-01 2021-02-26 上海哔哩哔哩科技有限公司 Vibration control method and system for computer equipment
CN112416116B (en) * 2020-06-01 2022-11-11 上海哔哩哔哩科技有限公司 Vibration control method and system for computer equipment
CN111782863A (en) * 2020-06-30 2020-10-16 腾讯音乐娱乐科技(深圳)有限公司 Audio segmentation method and device, storage medium and electronic equipment
CN112420071A (en) * 2020-11-09 2021-02-26 上海交通大学 Constant Q transformation based polyphonic electronic organ music note identification method
CN113571091A (en) * 2021-06-30 2021-10-29 青岛海尔科技有限公司 Audio sudden change detection method and device for monitoring and household appliance
CN113571091B (en) * 2021-06-30 2024-04-19 青岛海尔科技有限公司 Audio mutation detection method and device for monitoring and household appliance

Also Published As

Publication number Publication date
CN111145779B (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN111145779B (en) Target detection method of audio file and related equipment
US10679135B2 (en) Periodicity analysis on heterogeneous logs
CN104639517B (en) The method and apparatus for carrying out authentication using human body biological characteristics
AU2022201831A1 (en) Call classification through analysis of DTMF events
CN111159464B (en) Audio clip detection method and related equipment
CN109584887B (en) Method and device for generating voiceprint information extraction model and extracting voiceprint information
CN112733045B (en) User behavior analysis method and device and electronic equipment
CN109448732B (en) Digital string voice processing method and device
CN107437088B (en) File identification method and device
CN111400542A (en) Audio fingerprint generation method, device, equipment and storage medium
CN113886821A (en) Malicious process identification method and device based on twin network, electronic equipment and storage medium
CN113535582A (en) Interface testing method, device, equipment and computer readable storage medium
WO2020163624A1 (en) Systems and methods of gateway detection in a telephone network
CN116383835A (en) Software vulnerability monitoring method, device, equipment and medium based on multiple security tools
CN114780932B (en) Cross-block chain data interaction verification method, system and equipment for management three-mode platform
CN113312619B (en) Malicious process detection method and device based on small sample learning, electronic equipment and storage medium
CN113421554B (en) Voice keyword detection model processing method and device and computer equipment
CN111899765B (en) Speech sending method and device based on emotion prediction model and computer equipment
WO2019143360A1 (en) Data security using graph communities
CN111339829B (en) User identity authentication method, device, computer equipment and storage medium
CN110956129A (en) Method, apparatus, device and medium for generating face feature vector
US11609948B2 (en) Music streaming, playlist creation and streaming architecture
CN115022002B (en) Verification mode determining method and device, storage medium and electronic equipment
CN114780940B (en) Cross-system data sharing interaction project operation monitoring and early warning method and system
CN116257560A (en) User behavior data analysis device, method, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant