CN109671425B - Audio classification method, device and storage medium

Audio classification method, device and storage medium

Info

Publication number
CN109671425B
CN109671425B (application CN201811632676.7A)
Authority
CN
China
Prior art keywords
audio
classification
target
segment
preset
Prior art date
Legal status
Active
Application number
CN201811632676.7A
Other languages
Chinese (zh)
Other versions
CN109671425A (en)
Inventor
劳振锋
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201811632676.7A
Publication of CN109671425A
Application granted
Publication of CN109671425B
Legal status: Active

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L15/00 Speech recognition
                    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
                    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L15/063 Training
                    • G10L15/08 Speech classification or search
                        • G10L15/16 Speech classification or search using artificial neural networks
                • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L21/0208 Noise filtering
                            • G10L21/0216 Noise filtering characterised by the method used for estimating noise
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
                        • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an audio classification method, an audio classification device and a storage medium, belonging to the field of internet technology. The method comprises the following steps: acquiring at least one target audio segment in the target audio information; performing high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment; and determining a classification identifier of the at least one target audio segment based on an audio classification model and the at least one audio feature, and determining the classification identifier of the target audio information from the classification identifiers of the at least one target audio segment. The classification identifier is either a first identifier, indicating that the corresponding audio information is normal audio information, or a second identifier, indicating that the corresponding audio information is sensitive audio information. Because high-pass filtering is performed before the classification identifier of the target audio information is determined, low-frequency noise in the target audio information can be filtered out, so that low-frequency noise is not mistaken for sensitive audio information and the accuracy of audio classification is improved.

Description

Audio classification method, device and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to an audio classification method, apparatus, and storage medium.
Background
With the rapid development of internet technology, the scale of information on the internet has grown, allowing a great deal of sensitive information, such as sensitive videos and sensitive audio, to spread widely. Such content harms people's mental health, pollutes the network environment, and easily causes network information security problems. Therefore, how to identify sensitive information has become a problem to be solved urgently.
The related art provides an audio classification method that classifies audio information and identifies normal audio information and sensitive audio information. First, a plurality of pieces of sensitive audio information are obtained, feature extraction is performed on each piece to obtain its audio feature, and model training is performed on the resulting audio features to obtain a Gaussian mixture model. Then, target audio information to be identified is obtained and feature extraction is performed on it to obtain the target audio feature. The Mahalanobis distance between the target audio feature and the Gaussian mixture model is computed and compared with a preset threshold: when the distance is greater than the preset threshold, the target audio information is determined to be normal audio information, and when it is not, the target audio information is determined to be sensitive audio information.
When the target audio information contains low-frequency noise, the noise has characteristics similar to those of sensitive audio information, so classification based on the Gaussian mixture model mistakes the low-frequency noise for sensitive audio information, leading to classification errors and low accuracy.
Disclosure of Invention
The embodiment of the invention provides an audio classification method, an audio classification device and a storage medium, which can solve the problems in the related art. The technical scheme is as follows:
in a first aspect, a method for audio classification is provided, the method comprising:
acquiring at least one target audio clip in the target audio information;
carrying out high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment;
determining a classification identification of the at least one target audio segment based on an audio classification model and the at least one audio feature, and determining a classification identification of the target audio information according to the classification identification of the at least one target audio segment;
the classification identifier comprises a first identifier and a second identifier; the first identifier indicates that the corresponding audio information is normal audio information, and the second identifier indicates that the corresponding audio information is sensitive audio information.
Optionally, the obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length;
for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are greater than a first preset frequency in the plurality of fundamental frequencies;
and acquiring the audio clip with the proportion smaller than a first preset proportion from the plurality of audio clips as a target audio clip.
Optionally, the obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
for each audio clip in the multiple audio clips, dividing the audio clip according to a fourth preset length to obtain multiple audio sub-clips with the length equal to the fourth preset length, and acquiring a statistical value of the amplitude of each audio sub-clip; the fourth preset length is smaller than the second preset length;
and acquiring any audio clip with the statistical value larger than a preset value from the plurality of audio clips as a target audio clip.
Optionally, the obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are greater than a first preset frequency in the plurality of fundamental frequencies;
and acquiring the audio clips with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as target audio clips.
Optionally, the performing high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment includes:
performing high-pass filtering on the at least one target audio segment to obtain at least one high-pass filtered audio segment;
dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and performing feature extraction on each audio sub-segment to obtain the audio features of each audio sub-segment.
Optionally, the determining, according to the classification identifier of the at least one target audio segment, the classification identifier of the target audio information includes at least one of:
when the classification identifiers of a first preset number of continuous target audio segments in the at least one target audio segment are the second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identifiers of the second identifiers in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identifiers of the target audio information are the second identifiers.
Optionally, the method further comprises:
obtaining a plurality of sample audio information and classification identifications of the plurality of sample audio information;
carrying out high-pass filtering and feature extraction on the plurality of sample audio information to obtain a plurality of audio features corresponding to the plurality of sample audio information;
and performing model training according to the plurality of audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model.
Optionally, the audio classification model includes a first audio classification model and a second audio classification model, and the performing model training according to the multiple audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model includes:
performing model training according to the audio features of which the classification identifiers are the first classification identifiers in the plurality of audio features to obtain the first audio classification model;
and performing model training according to the audio features of which the classification identifiers are the second classification identifiers in the plurality of audio features to obtain the second audio classification model.
In a second aspect, there is provided an audio classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring at least one target audio clip in the target audio information;
the extraction module is used for carrying out high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment;
the determining module is used for determining the classification identifier of the at least one target audio segment based on an audio classification model and the at least one audio feature, and determining the classification identifier of the target audio information according to the classification identifier of the at least one target audio segment;
the classification identifier comprises a first identifier and a second identifier; the first identifier indicates that the corresponding audio information is normal audio information, and the second identifier indicates that the corresponding audio information is sensitive audio information.
Optionally, the obtaining module includes:
the first dividing unit is used for dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length;
the fundamental frequency obtaining unit is used for obtaining a plurality of fundamental frequencies in the audio frequency segments and obtaining the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies for each audio frequency segment in the plurality of audio frequency segments;
and the obtaining unit is used for obtaining the audio clips with the proportion smaller than a first preset proportion from the plurality of audio clips as target audio clips.
Optionally, the obtaining module includes:
the second dividing unit is used for dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, and any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the second dividing unit is further configured to divide each of the plurality of audio segments according to a fourth preset length to obtain a plurality of audio sub-segments with lengths equal to the fourth preset length, and obtain a statistical value of an amplitude of each of the audio sub-segments; the fourth preset length is smaller than the second preset length;
and the acquisition unit is used for acquiring any audio clip with the statistical value larger than a preset numerical value from the plurality of audio clips as a target audio clip.
Optionally, the obtaining module includes:
the third dividing unit is configured to divide the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with lengths equal to the second preset length, where any two adjacent audio segments in the plurality of audio segments include the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the fundamental frequency obtaining unit is used for obtaining a plurality of fundamental frequencies in the audio frequency segments and obtaining the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies for each audio frequency segment in the plurality of audio frequency segments;
and the obtaining unit is used for obtaining the audio clips with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as target audio clips.
Optionally, the extraction module includes:
the filtering unit is used for carrying out high-pass filtering on the at least one target audio segment to obtain at least one audio segment after high-pass filtering;
the dividing unit is used for dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and the extraction unit is used for extracting the characteristics of each audio sub-segment to obtain the audio characteristics of each audio sub-segment.
Optionally, the determining module is configured to perform at least one of:
when the classification identifiers of a first preset number of continuous target audio segments in the at least one target audio segment are the second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identifiers of the second identifiers in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identifiers of the target audio information are the second identifiers.
Optionally, the apparatus further comprises:
the obtaining module is further configured to obtain a plurality of sample audio information and classification identifiers of the plurality of sample audio information;
the extraction module is further configured to perform high-pass filtering and feature extraction on the multiple sample audio information to obtain multiple audio features corresponding to the multiple sample audio information;
and the training module is used for performing model training according to the plurality of audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model.
Optionally, the audio classification model comprises a first audio classification model and a second audio classification model;
the training module is further configured to perform model training according to the audio features of which the classification identifiers are the first classification identifiers among the multiple audio features, so as to obtain the first audio classification model;
the training module is further configured to perform model training according to the audio features of which the classification identifiers are the second classification identifiers, among the multiple audio features, to obtain the second audio classification model.
In a third aspect, an audio classification apparatus is provided, the apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, the instruction being loaded and executed by the processor to implement the operations performed in the audio classification method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the operations performed in the audio classification method according to the first aspect.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
the method, the device and the storage medium provided by the embodiment of the invention can be used for obtaining at least one target audio frequency segment in the target audio frequency information, carrying out high-pass filtering and feature extraction on the at least one target audio frequency segment, filtering low-frequency noise through the high-pass filtering, carrying out feature extraction to obtain at least one audio frequency feature corresponding to the at least one target audio frequency segment, determining the classification identifier of the at least one target audio frequency segment based on the audio frequency classification model and the at least one audio frequency feature, determining the classification identifier of the target audio frequency information according to the classification identifier of the at least one target audio frequency segment, thereby determining the target audio frequency information as normal audio frequency information or sensitive audio frequency information, and filtering the low-frequency noise of the target audio frequency information because the high-pass filtering is carried out before the classification identifier of the target audio frequency information is determined, thereby avoiding the situation that the low-frequency noise is, the accuracy of audio classification is improved.
In addition, when at least one target audio segment in the target audio information is acquired, the target audio information is divided into a plurality of audio segments, the audio segments are screened against a preset condition, and the audio segments meeting the preset condition are taken as target audio segments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of an audio classification method according to an embodiment of the present invention;
fig. 2 is a flowchart of an audio classification method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an audio classification apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an audio classification method according to an embodiment of the present invention. The execution subject of the embodiment of the invention is a classification device, and referring to fig. 1, the method comprises the following steps:
101. at least one target audio segment in the target audio information is obtained.
102. And carrying out high-pass filtering and feature extraction on at least one target audio segment to obtain at least one audio feature corresponding to at least one target audio segment.
103. And determining the classification identification of at least one target audio segment based on the audio classification model and at least one audio characteristic, and determining the classification identification of the target audio information according to the classification identification of at least one target audio segment.
The classification identifier comprises a first identifier and a second identifier; the first identifier indicates that the corresponding audio information is normal audio information, and the second identifier indicates that the corresponding audio information is sensitive audio information.
According to the method provided by the embodiment of the invention, at least one target audio segment in the target audio information is obtained, and high-pass filtering and feature extraction are performed on the at least one target audio segment, the high-pass filtering removing low-frequency noise and the feature extraction yielding at least one audio feature corresponding to the at least one target audio segment. The classification identifier of the at least one target audio segment is determined based on an audio classification model and the at least one audio feature, and the classification identifier of the target audio information is determined from the classification identifiers of the target audio segments, so that the target audio information can be determined to be normal audio information or sensitive audio information. Because the high-pass filtering is performed before the classification identifier of the target audio information is determined, the low-frequency noise of the target audio information can be filtered out, so that low-frequency noise is not mistaken for sensitive audio information, and the accuracy of audio classification is improved.
Optionally, obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length;
for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies;
and acquiring an audio segment with the proportion smaller than a first preset proportion from the plurality of audio segments as a target audio segment.
Optionally, obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
for each audio clip in the multiple audio clips, dividing the audio clip according to a fourth preset length to obtain multiple audio sub-clips with the length equal to the fourth preset length, and acquiring a statistical value of the amplitude of each audio sub-clip; the fourth preset length is smaller than the second preset length;
and acquiring any audio clip with the statistical value larger than a preset value from the plurality of audio clips as a target audio clip.
Optionally, obtaining at least one target audio segment in the target audio information includes:
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies;
and acquiring an audio clip with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as a target audio clip.
Optionally, the performing high-pass filtering and feature extraction on at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment includes:
carrying out high-pass filtering on at least one target audio segment to obtain at least one audio segment subjected to high-pass filtering;
dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and performing feature extraction on each audio sub-segment to obtain the audio features of each audio sub-segment.
Optionally, the determining, according to the classification identifier of the at least one target audio segment, a classification identifier of the target audio information includes at least one of:
when the classification identifiers of the continuous first preset number of target audio segments in at least one target audio segment are second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identification as the second identification in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identification of the target audio information is the second identification.
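For illustration, these two decision rules can be sketched in a few lines of Python; the preset values used here (a run of 3 segments, a proportion of 50%) are placeholders, not values taken from the embodiments:

```python
def classify_audio(segment_ids, second_id=1,
                   first_preset_number=3, fourth_preset_ratio=0.5):
    """Decide the audio-level identifier from segment-level identifiers.

    Returns second_id (sensitive) if either rule fires:
      - first_preset_number consecutive segments carry second_id, or
      - the fraction of segments carrying second_id reaches
        fourth_preset_ratio.
    Preset values are illustrative placeholders.
    """
    run = 0
    for sid in segment_ids:
        run = run + 1 if sid == second_id else 0
        if run >= first_preset_number:
            return second_id
    sensitive = sum(1 for sid in segment_ids if sid == second_id)
    if segment_ids and sensitive / len(segment_ids) >= fourth_preset_ratio:
        return second_id
    return 0  # first identifier: normal audio information
```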
Optionally, the method further comprises:
obtaining a plurality of sample audio information and classification identifiers of the plurality of sample audio information;
performing high-pass filtering and feature extraction on the plurality of sample audio information to obtain a plurality of audio features corresponding to the plurality of sample audio information;
and performing model training according to the plurality of audio features and the classification identifier corresponding to each audio feature to obtain an audio classification model.
Optionally, the audio classification model includes a first audio classification model and a second audio classification model, and the model training is performed according to the multiple audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model, including:
performing model training according to the audio features of which the classification identifiers are the first classification identifiers in the multiple audio features to obtain a first audio classification model;
and performing model training according to the audio features of which the classification identifiers are the second classification identifiers in the plurality of audio features to obtain a second audio classification model.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
Fig. 2 is a flowchart of an audio classification method according to an embodiment of the present invention. The execution main body of the embodiment of the invention is a classification device, and the classification device can be a terminal such as a mobile phone, a computer or a tablet computer, and can also be a server. Referring to fig. 2, the method includes:
201. an audio classification model is obtained.
In the embodiment of the invention, any audio information can be classified based on the audio classification model, and whether the audio information is normal audio information or sensitive audio information is determined. The audio classification model is used for determining a classification identifier of the audio information, the classification identifier includes a first identifier and a second identifier, the first identifier is used for indicating that the corresponding audio information is normal audio information, and the second identifier is used for indicating that the corresponding audio information is sensitive audio information.
The first identifier and the second identifier are two different identifiers, for example, the first identifier is 1 and the second identifier is 0, or the first identifier is 0 and the second identifier is 1.
The audio classification model may be trained by the classification device and stored by the classification device, or the audio classification model may be trained by another device and then sent to the classification device and stored by the classification device.
When the audio classification model is trained, a plurality of sample audio information and their classification identifiers are obtained. For each sample audio information, high-pass filtering is performed to filter out its low-frequency noise, and feature extraction is then performed on the high-pass-filtered sample audio information to obtain the corresponding audio feature. In this way, a plurality of audio features corresponding to the plurality of sample audio information are obtained, the classification identifiers of the sample audio information serve as the classification identifiers of the corresponding audio features, and model training is performed according to the audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model.
Each audio feature describes its sample audio information and may be mel-frequency cepstral coefficients, linear prediction cepstral coefficients, or other features capable of describing the audio. Accordingly, when features are extracted from the sample audio information, a mel-frequency cepstral coefficient algorithm, a linear prediction cepstral coefficient algorithm or another feature extraction algorithm may be used.
In addition, various training algorithms may be employed to train the audio classification model, which may be a gaussian mixture model, a neural network model, a decision tree model, or other models.
By carrying out high-pass filtering on the sample audio information, the low-frequency noise in the sample audio information can be filtered, the audio features which can more accurately describe the sample audio information can be extracted, the interference of the low-frequency noise can be avoided, and the accuracy of the trained audio classification model is improved.
Optionally, when training the audio classification model, an initial audio classification model may be constructed first, and a training data set and a test data set are obtained, where the training data set and the test data set both include a plurality of sample audio information.
The method comprises the steps of respectively carrying out high-pass filtering on a plurality of sample audio information in a training data set, carrying out feature extraction on the plurality of sample audio information after the high-pass filtering to obtain a plurality of audio features corresponding to the plurality of sample audio information, taking the plurality of audio features as the input of an audio classification model, and training the audio classification model, so that the audio classification model learns the difference between normal audio information and sensitive audio information, and has the capability of distinguishing the normal audio information and the sensitive audio information.
Then, high-pass filtering is carried out on a plurality of sample audio information in the test data set respectively, feature extraction is carried out on the plurality of sample audio information after the high-pass filtering to obtain a plurality of audio features corresponding to the plurality of sample audio information, the plurality of audio features are input into an audio classification model, a classification identifier of each sample audio information is determined based on the audio classification model, the determined classification identifier is compared with an actual classification identifier, and the audio classification model is updated according to a comparison result.
In the subsequent process, new sample audio information and the classification identification thereof can be obtained, and the audio classification model continues to be trained, so that the accuracy of the audio classification model can be improved.
Optionally, the audio classification model includes a first audio classification model and a second audio classification model. When performing model training, the classification device obtains a plurality of sample audio information and their classification identifiers, performs high-pass filtering and feature extraction on the plurality of sample audio information to obtain a plurality of corresponding audio features, performs model training on the audio features whose classification identifier is the first identifier to obtain the first audio classification model, and performs model training on the audio features whose classification identifier is the second identifier to obtain the second audio classification model.
The first audio classification model can learn the characteristics of normal audio information and has the capability of identifying the normal audio information. The probability that any audio information belongs to normal audio information can be determined based on the first audio classification model. The second audio classification model can learn the characteristics of the sensitive audio information and has the capacity of identifying the sensitive audio information. The probability that any audio information belongs to sensitive audio information can be determined based on the second audio classification model. The target audio information may then be classified based on the first audio classification model and the second audio classification model.
Because the first audio classification model and the second audio classification model are trained on two different sets of sample audio information, each model is more targeted, which improves the accuracy of the audio classification models.
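As a concrete sketch of this two-model training scheme, suppose Gaussian mixture models are chosen (one of the model types the description permits) and scikit-learn's GaussianMixture is used; the feature matrix is assumed to be already extracted, and the hyperparameters are illustrative, not taken from the embodiments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_classification_models(features, labels, n_components=8):
    """Train one GMM per class on pre-extracted audio features.

    features: (n_samples, n_dims) array of audio features (e.g. MFCCs)
    labels:   (n_samples,) array, 0 = normal audio information (first
              identifier), 1 = sensitive audio information (second identifier)
    """
    features, labels = np.asarray(features), np.asarray(labels)
    # First audio classification model: learns normal audio information
    first_model = GaussianMixture(n_components=n_components).fit(
        features[labels == 0])
    # Second audio classification model: learns sensitive audio information
    second_model = GaussianMixture(n_components=n_components).fit(
        features[labels == 1])
    return first_model, second_model
```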
202. At least one target audio segment in the target audio information is obtained.
In the embodiment of the invention, the target audio information is the audio information to be classified, and the target audio information needs to be classified to determine whether the target audio information is normal audio information or sensitive audio information.
In terms of form, the target audio information may be audio information in a single audio file, audio information extracted from a video file, or audio information in another form.
From the information source, the target audio information may be recorded by the sorting apparatus, downloaded from the internet by the sorting apparatus, or transmitted to the sorting apparatus by other devices. For example, in the process of playing live video by the classification device, audio information in the live video can be acquired as target audio information.
The target audio information may include singing audio information, chat audio information, sensitive audio information, noise audio information, and the like in terms of information content.
Alternatively, the classification means may use the complete target audio information as the target audio piece to be classified.
Alternatively, a preset condition may be set that specifies what audio information that may be sensitive audio information satisfies; that is, when an audio segment meets the preset condition, it may contain sensitive audio information, and when it does not meet the preset condition, it does not contain sensitive audio information.
When the classification device acquires the target audio information, in order to improve accuracy it does not classify the entire target audio information directly. Instead, it divides the target audio information into at least one audio segment and judges whether each audio segment meets the preset condition; the audio segments meeting the preset condition are determined to be target audio segments and are classified subsequently, while the other audio segments are not classified.
Optionally, the step 202 comprises at least one of the following steps 2021-2023:
2021. and acquiring other audio segments except the singing audio segment in the target audio information as the target audio segment.
The target audio information is divided according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length, for each audio segment in the plurality of audio segments, a plurality of fundamental frequencies in the audio segment are obtained, and the proportion of the fundamental frequencies larger than a first preset frequency in the plurality of fundamental frequencies is obtained.
The fundamental frequency is the frequency of the fundamental tone of the audio segment. To obtain the proportion of fundamental frequencies greater than the first preset frequency among the plurality of fundamental frequencies, the number of fundamental frequencies greater than the first preset frequency is counted and divided by the total number of fundamental frequencies in the audio segment.
And when the proportion of the fundamental frequency larger than the first preset frequency in the audio segment is not smaller than the first preset proportion, determining the audio segment as a singing segment and not containing sensitive audio information. And when the proportion of the fundamental frequency larger than the first preset frequency in the audio segment is smaller than the first preset proportion, determining that the audio segment is not a singing segment and possibly containing sensitive audio information.
Therefore, the classification device obtains, from the plurality of audio segments, the audio segments in which the proportion of fundamental frequencies greater than the first preset frequency is smaller than the first preset proportion, and takes them as target audio segments.
Wherein the first preset length may be set to 10 seconds, 20 seconds or other time duration. The first preset frequency may be set to 150 hz, 160 hz or other frequencies. The first preset ratio may be set to 60%, 70%, or other percentage.
For example, suppose the first preset length is 20 seconds, the first preset proportion is 60%, and the first preset frequency is 150 Hz. The target audio information is divided into a plurality of 20-second audio segments, and 100 fundamental frequencies are extracted from each audio segment. If 50 fundamental frequencies in a certain audio segment are greater than 150 Hz, the proportion is 50%, which is smaller than the first preset proportion of 60%, so that audio segment is taken as a target audio segment.
By dividing the target audio information into a plurality of audio segments and judging, for each audio segment, whether it is a singing segment from its fundamental frequencies, the singing segments in the target audio information can be excluded, which reduces the amount of computation and improves the accuracy of the subsequent classification of the target audio information.
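As an illustration of this screening step, the following sketch divides the audio into 20-second segments and keeps those whose proportion of fundamental frequencies above 150 Hz is below 60%, matching the example above. The embodiments do not name a fundamental-frequency estimator, so the use of librosa's pyin here is an assumption:

```python
import numpy as np
import librosa

def screen_non_singing_segments(y, sr, seg_len_s=20.0,
                                preset_freq=150.0, preset_ratio=0.60):
    """Return segments whose high-f0 proportion is below preset_ratio.

    Uses librosa.pyin for fundamental-frequency estimation; this choice
    of estimator is an assumption, not specified by the embodiments.
    """
    seg_len = int(seg_len_s * sr)
    targets = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        f0, voiced_flag, _ = librosa.pyin(seg, fmin=65.0, fmax=1000.0, sr=sr)
        f0 = f0[~np.isnan(f0)]  # keep voiced frames only
        if len(f0) == 0:
            continue
        ratio = np.mean(f0 > preset_freq)
        if ratio < preset_ratio:  # likely not a singing segment
            targets.append(seg)
    return targets
```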
2022. And acquiring other audio segments except the silent segments in the target audio information as the target audio segments.
Dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length, dividing the audio segments according to a fourth preset length for each audio segment in the plurality of audio segments to obtain a plurality of audio sub-segments with the length equal to the fourth preset length, and obtaining a statistical value of the amplitude of each audio sub-segment.
The amplitude represents the energy of the audio segment, and the statistical value of the amplitude may be the average of absolute values, the average of squares, or another statistic of the amplitudes of each audio sub-segment.
When the amplitude statistic of every audio sub-segment is smaller than the preset value, the audio segment is determined to be a silent segment that does not contain sensitive audio information. When the statistic of any audio sub-segment is not smaller than the preset value, the audio segment is determined to be a non-silent segment that may contain sensitive audio information.
Therefore, the classification device obtains any audio segment with a statistical value larger than a preset value from a plurality of audio segments as a target audio segment.
The second preset length may be set to 1 second, 2 seconds or other duration, and the third preset length may be set to 0.4 second, 0.5 second or other duration, where the third preset length is smaller than the second preset length. The fourth preset length may be set to 0.1 second, 0.2 second, or other duration, the fourth preset length is smaller than the second preset length, and the audio segment with the second preset length may be divided into audio sub-segments according to the fourth preset length. The preset value may be set to 0.2, 0.3, or other values.
For example, the second preset length is 1 second, the third preset length is 0.5 second, the fourth preset length is 0.2 second, and the preset value is 0.3. And dividing according to the second preset length and the third preset length to obtain a plurality of audio segments which are respectively 0-1 second, 0.5 second-1.5 second and 1 second-2 second, and so on, dividing the target audio information into a plurality of audio segments, then dividing the plurality of audio segments into a plurality of audio sub-segments of 0.2 second, respectively calculating the average value of the amplitudes of the plurality of audio sub-segments in each audio segment, and if the average value of any one audio sub-segment in a certain audio segment is 0.4 and is greater than the preset value of 0.3, taking the audio segment as the target audio segment.
The target audio information is divided according to the second preset length and the third preset length, so that any two adjacent audio segments share the same audio information of the third preset length; each audio segment contains the audio at the end of the previous segment, which avoids breaks between audio segments and preserves the completeness of the audio information.
In addition, by dividing the target audio information into a plurality of audio segments, dividing each audio segment into a plurality of audio sub-segments, and determining whether each audio segment is a silent segment from the statistic of each audio sub-segment, the silent segments in the target audio information can be excluded, which reduces the amount of computation and improves the accuracy of the subsequent classification of the target audio information.
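A minimal sketch of this silence-screening step, using the example values above (1-second segments with 0.5 seconds of overlap, 0.2-second sub-segments, a preset value of 0.3) and the mean of absolute amplitudes as the statistic, might be:

```python
import numpy as np

def screen_non_silent_segments(y, sr, seg_len_s=1.0, hop_s=0.5,
                               sub_len_s=0.2, preset_value=0.3):
    """Return overlapping segments containing at least one sub-segment
    whose mean absolute amplitude exceeds preset_value.

    Adjacent segments share (seg_len_s - hop_s) seconds of audio, matching
    the second/third preset lengths in the example above.
    """
    seg_len, hop, sub_len = (int(t * sr) for t in (seg_len_s, hop_s, sub_len_s))
    targets = []
    for start in range(0, len(y) - seg_len + 1, hop):
        seg = y[start:start + seg_len]
        stats = [np.mean(np.abs(seg[i:i + sub_len]))
                 for i in range(0, seg_len - sub_len + 1, sub_len)]
        if any(s > preset_value for s in stats):  # not a silent segment
            targets.append(seg)
    return targets
```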
2023. And acquiring other audio segments except the noise segment and the singing segment in the target audio information as the target audio segments.
Dividing the target audio information according to the second preset length and the third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length, for each audio segment in the plurality of audio segments, obtaining a plurality of fundamental frequencies in the audio segment, and obtaining the proportion of the fundamental frequencies which are larger than the first preset frequency in the plurality of fundamental frequencies.
And when the proportion of the fundamental frequency larger than the first preset frequency in the audio clip is not larger than a second preset proportion, determining that the audio clip is a noise clip and does not contain sensitive audio information. And when the proportion of the fundamental frequency larger than the first preset frequency in the audio segment is not smaller than a third preset proportion, determining that the audio segment is a singing segment and does not contain sensitive audio information. And when the proportion of the fundamental frequency larger than the first preset frequency in the audio segment is larger than the second preset proportion and smaller than the third preset proportion, determining that the audio segment is not a noise segment or a singing segment, and possibly containing sensitive audio information.
Therefore, the classification device obtains the audio frequency segments with the ratio of the fundamental frequency larger than the first preset frequency larger than the second preset ratio and smaller than the third preset ratio from the plurality of audio frequency segments as the target audio frequency segments.
Wherein the second preset length is 1 second, the third preset length is 0.5 second, and the second preset proportion may be set to 10%, 20% or other percentages. The third preset proportion may be set to 60%, 70% or other percentage. The third predetermined ratio may be the same as or different from the first predetermined ratio in step 2021.
The target audio information is divided according to the second preset length and the third preset length, so that any two adjacent audio segments share the same audio information of the third preset length; each audio segment contains the audio at the end of the previous segment, which avoids breaks between audio segments and preserves the completeness of the audio information.
The target audio information is divided into a plurality of audio segments, and for each audio segment it is judged from the fundamental frequencies whether the segment is a noise segment, a singing segment, or a segment other than these; the noise segments and singing segments in the target audio information are thereby eliminated, which reduces the amount of computation and improves the accuracy of the subsequent classification of the target audio information.
It should be noted that steps 2021 to 2023 may be combined with each other to distinguish noise segments, silent segments, singing segments, and the segments other than these three in the target audio information, so as to exclude the noise, silent and singing segments, determine the target audio segments that need to be classified, and determine from them whether the target audio information is normal audio information or sensitive audio information.
203. And carrying out high-pass filtering on at least one target audio segment to obtain at least one audio segment after high-pass filtering.
The classification device can set a preset cut-off frequency during high-pass filtering, and when the high-pass filtering is carried out on the audio information, if the frequency of the audio information is lower than the preset cut-off frequency, the audio information is filtered, and if the frequency of the audio information is not lower than the preset cut-off frequency, the audio information is reserved.
After at least one target audio segment is determined in step 202, high-pass filtering is performed on the at least one target audio segment according to a preset cut-off frequency, if the frequency of the audio information is lower than the preset cut-off frequency, the audio information is filtered, and if the frequency of the audio information is not lower than the preset cut-off frequency, the audio information is retained, so that low-frequency noise can be filtered, and at least one high-pass filtered audio segment is obtained.
The preset cut-off frequency can be set to 100 hz, 120 hz or other frequencies, and can be set according to the maximum frequency that the low-frequency noise in daily life can reach.
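As an illustration, a Butterworth high-pass filter from SciPy can implement this step; the filter type and order are assumptions, since the embodiments only require that content below the preset cut-off frequency be removed:

```python
from scipy.signal import butter, sosfilt

def highpass_filter(y, sr, cutoff_hz=100.0, order=5):
    """Remove low-frequency noise below the preset cut-off frequency.

    A Butterworth high-pass filter is used here as one concrete choice;
    the cut-off of 100 Hz follows the example above.
    """
    sos = butter(order, cutoff_hz, btype='highpass', fs=sr, output='sos')
    return sosfilt(sos, y)
```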
204. Feature extraction is performed on the at least one high-pass filtered target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment.
The at least one audio feature describes the target audio segment and may be mel-frequency cepstral coefficients, linear prediction cepstral coefficients, or other features capable of describing the target audio segment. Correspondingly, feature extraction on the target audio segment may be performed using a mel-frequency cepstral coefficient algorithm, a linear prediction cepstral coefficient algorithm, or another feature extraction algorithm.
For example, when extraction is performed using the mel-frequency cepstral coefficient algorithm, the dimension is set to 40, and the features of dimensions 1-13 are taken as the features of the target audio segment.
Optionally, each high-pass filtered audio segment is divided according to a fifth preset length to obtain a plurality of audio sub-segments with lengths equal to the fifth preset length; feature extraction is performed on each audio sub-segment to obtain the audio feature of each audio sub-segment, and the audio feature of each audio sub-segment is used as an audio feature of the target audio segment, or the audio features of the plurality of audio sub-segments are combined to form the audio feature of the target audio segment.
For example, the fifth preset length may be 20 milliseconds, 25 milliseconds, or other duration.
Dividing each audio segment according to the fifth preset length to obtain a plurality of audio sub-segments allows the audio to be divided more finely, so that more accurate features are extracted and the classification accuracy is improved.
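To make this step concrete, here is a sketch using librosa, one possible MFCC implementation; reading "features of dimensions 1-13" as the coefficients with indices 1 through 13 is an interpretation, and the 16 kHz sample rate is an assumption.

```python
import librosa

def segment_features(filtered_segment, sample_rate=16000, sub_len_s=0.025):
    """Compute 40-dimensional MFCCs over non-overlapping sub-segments of
    the fifth preset length (25 ms here) and keep dimensions 1-13, as in
    the example above. librosa is an illustrative library choice."""
    frame = int(sub_len_s * sample_rate)   # fifth preset length in samples
    mfcc = librosa.feature.mfcc(y=filtered_segment, sr=sample_rate,
                                n_mfcc=40, n_fft=frame, hop_length=frame)
    # Each column corresponds to one audio sub-segment; the features can
    # be used per sub-segment or combined into one segment-level feature.
    return mfcc[1:14, :].T
```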
205. The classification identifier of the at least one target audio segment is determined based on the audio classification model and the at least one audio feature, and the classification identifier of the target audio information is determined according to the classification identifier of the at least one target audio segment.
The audio features of each target audio segment are processed by the audio classification model to obtain a classification identifier, namely the classification identifier of that target audio segment. In this way, the classification identifier of each of the at least one target audio segment can be acquired, so as to determine whether each target audio segment includes sensitive audio information.
Optionally, when the audio classification model includes a first audio classification model and a second audio classification model, the audio features of the target audio segment are input into the first audio classification model and the second audio classification model respectively; a first probability is output by the first audio classification model and a second probability is output by the second audio classification model, where the first probability represents the probability that the target audio segment belongs to normal audio information, and the second probability represents the probability that the target audio segment belongs to sensitive audio information.
When the first probability is greater than the second probability, the target audio segment is determined to belong to normal audio information, and its classification identifier is the first identifier. When the first probability is smaller than the second probability, the target audio segment is determined to belong to sensitive audio information, and its classification identifier is the second identifier. When the first probability is equal to the second probability, the classification identifier of the target audio segment may be determined to be either the first identifier or the second identifier, or the target audio segment may be reclassified.
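The decision between the two models can be sketched as follows. The text does not fix a model family; this sketch assumes two Gaussian mixture models (in the spirit of the MFCC-and-GMM approach cited among the non-patent references below), with average log-likelihood scores standing in for the first and second probabilities.

```python
from sklearn.mixture import GaussianMixture

FIRST_ID, SECOND_ID = 0, 1   # normal vs. sensitive classification identifiers

def classify_segment(features, first_model, second_model):
    """Score one target segment's audio features with the first and second
    audio classification models. GaussianMixture.score returns an average
    log-likelihood, used here as a stand-in for the first and second
    probabilities; the model family itself is an assumption."""
    first_prob = first_model.score(features)
    second_prob = second_model.score(features)
    if first_prob > second_prob:
        return FIRST_ID    # normal audio information
    if first_prob < second_prob:
        return SECOND_ID   # sensitive audio information
    return FIRST_ID        # equal: either identifier may be chosen,
                           # or the segment may be reclassified
```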
Then, the classification identifier of the target audio information is determined according to the classification identifier of the at least one target audio segment. Optionally, this process may include at least one of the following steps 2051-2052:
2051. When the classification identifiers of a first preset number of consecutive target audio segments among the at least one target audio segment are all the second identifier, the classification identifier of the target audio information is determined to be the second identifier.
When the at least one target audio segment is a plurality of target audio segments, the plurality of target audio segments are traversed in order, and the number of consecutive target audio segments whose classification identifier is the second identifier is counted: when the classification identifier of the traversed target audio segment is the second identifier, 1 is added to the count, and when it is the first identifier, the count is cleared.
When the count reaches the first preset number, the classification identifier of the target audio information is determined to be the second identifier, that is, the target audio information is determined to be sensitive audio information.
The first preset number may be set to 3, 4, or another number. The order of the plurality of target audio segments may be their chronological order in the target audio information, from earliest to latest.
2052. When the proportion of target audio segments whose classification identifier is the second identifier among the at least one target audio segment reaches a fourth preset proportion, the classification identifier of the target audio information is determined to be the second identifier.
That is, when the proportion of the number of target audio segments whose classification identifier is the second identifier reaches the fourth preset proportion, the classification identifier of the target audio information is determined to be the second identifier, and the target audio information is determined to be sensitive audio information.
The fourth preset proportion may be set to 70%, 75%, or another percentage.
It should be noted that steps 2051 and 2052 may be combined: when, among the at least one target audio segment, the classification identifiers of a first preset number of consecutive target audio segments are the second identifier and the proportion of target audio segments whose classification identifier is the second identifier reaches the fourth preset proportion, the classification identifier of the target audio information is determined to be the second identifier. Determining that the target audio information is sensitive audio information only when both conditions are met simultaneously improves the classification accuracy.
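Steps 2051 and 2052, and their combination, can be sketched as one small function; it reuses the identifier constants from the classification sketch above, and the default thresholds are the example values from the text.

```python
def classify_audio_information(segment_ids, first_preset_number=3,
                               fourth_preset_ratio=0.70, combine=False):
    """Derive the classification identifier of the target audio
    information from the per-segment identifiers, ordered from earliest
    to latest, implementing steps 2051 and 2052 above."""
    count, longest = 0, 0
    for cid in segment_ids:
        count = count + 1 if cid == SECOND_ID else 0  # clear on a first identifier
        longest = max(longest, count)
    consecutive_hit = longest >= first_preset_number           # step 2051
    ratio_hit = (segment_ids.count(SECOND_ID) / len(segment_ids)
                 >= fourth_preset_ratio)                       # step 2052
    # In combined mode both conditions must hold; otherwise either suffices.
    sensitive = (consecutive_hit and ratio_hit) if combine else \
                (consecutive_hit or ratio_hit)
    return SECOND_ID if sensitive else FIRST_ID
```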
The embodiment of the present invention can be applied to scenarios such as network live streaming, voice interaction, and video playback. For example, in a live-streaming scenario, audio information is extracted from the live video and classified; when the audio information is determined to be sensitive audio information, the live video is determined to be a sensitive video and the live stream is shut down. In a voice-interaction scenario, the audio information in the voice is extracted and classified; when the audio information is determined to be sensitive audio information, the voice is deleted. In a video-playback scenario, audio information is extracted from the video and classified; when the audio information is determined to be sensitive audio information, the video is determined to be a sensitive video and is closed.
In the method provided by the embodiment of the present invention, at least one target audio segment in the target audio information is acquired, and high-pass filtering and feature extraction are performed on the at least one target audio segment: the high-pass filtering filters out low-frequency noise, and the feature extraction yields at least one audio feature corresponding to the at least one target audio segment. The classification identifier of the at least one target audio segment is then determined based on the audio classification model and the at least one audio feature, and the classification identifier of the target audio information is determined according to the classification identifier of the at least one target audio segment, so that the target audio information can be determined to be normal audio information or sensitive audio information. Because the high-pass filtering is performed before the classification identifier of the target audio information is determined, the low-frequency noise in the target audio information is filtered out, which avoids the situation in which low-frequency noise is determined to be sensitive audio information and thus improves the accuracy of audio classification.
In addition, when the at least one target audio segment in the target audio information is acquired, the target audio information is divided to obtain a plurality of audio segments, the plurality of audio segments are screened against a preset condition, and the audio segments meeting the preset condition are used as the target audio segments.
Fig. 3 is a schematic structural diagram of an audio classification apparatus according to an embodiment of the present invention. Referring to Fig. 3, the apparatus includes: an obtaining module 301, an extracting module 302 and a determining module 303;
an obtaining module 301, configured to obtain at least one target audio segment in the target audio information;
an extracting module 302, configured to perform high-pass filtering and feature extraction on at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment;
a determining module 303, configured to determine a classification identifier of at least one target audio segment based on the audio classification model and the at least one audio feature, and determine a classification identifier of the target audio information according to the classification identifier of the at least one target audio segment;
the classification identification comprises a first identification and a second identification, the first identification is used for indicating that the corresponding audio information is normal audio information, and the second identification is used for indicating that the corresponding audio information is sensitive audio information.
Optionally, the obtaining module 301 includes:
the first dividing unit is used for dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length;
the base frequency obtaining unit is used for obtaining a plurality of base frequencies in the audio frequency segments for each audio frequency segment in the audio frequency segments and obtaining the proportion of the base frequencies larger than a first preset frequency in the base frequencies;
the acquiring unit is used for acquiring an audio segment with the proportion smaller than a first preset proportion from a plurality of audio segments as a target audio segment.
Optionally, the obtaining module 301 includes:
the second dividing unit is used for dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, and any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the second dividing unit is further configured to divide the audio segments according to a fourth preset length for each of the multiple audio segments to obtain multiple audio sub-segments with lengths equal to the fourth preset length, and obtain a statistical value of the amplitude of each audio sub-segment; the fourth preset length is smaller than the second preset length;
the acquiring unit is used for acquiring any audio clip with the statistical value larger than the preset value from the plurality of audio clips as a target audio clip.
Optionally, the obtaining module 301 includes:
the third dividing unit is used for dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, and any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the base frequency obtaining unit is used for obtaining a plurality of base frequencies in the audio frequency segments for each audio frequency segment in the audio frequency segments and obtaining the proportion of the base frequencies larger than a first preset frequency in the base frequencies;
the acquiring unit is used for acquiring an audio clip with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as a target audio clip.
Optionally, the extracting module 302 includes:
the filtering unit is used for carrying out high-pass filtering on at least one target audio segment to obtain at least one audio segment after the high-pass filtering;
the dividing unit is used for dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and the extraction unit is used for extracting the characteristics of each audio sub-segment to obtain the audio characteristics of each audio sub-segment.
Optionally, the determining module 303 is configured to perform at least one of the following:
when the classification identifiers of the continuous first preset number of target audio segments in at least one target audio segment are second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identification as the second identification in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identification of the target audio information is the second identification.
Optionally, the apparatus further comprises:
the obtaining module 301 is further configured to obtain a plurality of sample audio information and a classification identifier of the plurality of sample audio information;
the extracting module 301 is further configured to perform high-pass filtering and feature extraction on the multiple sample audio information to obtain multiple audio features corresponding to the multiple sample audio information;
and the training module is used for carrying out model training according to the plurality of audio features and the classification mark corresponding to each audio feature to obtain an audio classification model.
Optionally, the audio classification model comprises a first audio classification model and a second audio classification model;
the training module is further used for carrying out model training according to the audio features of which the classification identifiers are the first classification identifiers in the plurality of audio features to obtain a first audio classification model;
and the training module is further used for carrying out model training according to the audio features of which the classification identifiers are the second classification identifiers in the plurality of audio features to obtain a second audio classification model.
The device provided by the embodiment of the present invention acquires at least one target audio segment in the target audio information and performs high-pass filtering and feature extraction on the at least one target audio segment: the high-pass filtering filters out low-frequency noise, and the feature extraction yields at least one audio feature corresponding to the at least one target audio segment. The classification identifier of the at least one target audio segment is then determined based on the audio classification model and the at least one audio feature, and the classification identifier of the target audio information is determined according to the classification identifier of the at least one target audio segment, so that the target audio information can be determined to be normal audio information or sensitive audio information. Because the high-pass filtering is performed before the classification identifier of the target audio information is determined, the low-frequency noise in the target audio information is filtered out, which avoids the situation in which low-frequency noise is determined to be sensitive audio information and thus improves the accuracy of audio classification.
In addition, when the at least one target audio segment in the target audio information is acquired, the target audio information is divided to obtain a plurality of audio segments, the plurality of audio segments are screened against a preset condition, and the audio segments meeting the preset condition are used as the target audio segments.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
It should be noted that: in the foregoing embodiment, when the audio classification device classifies audio information, only the division of the functional modules is illustrated, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the classification device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the audio classification device and the audio classification method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.
Fig. 4 is a schematic structural diagram of a server 400 according to an embodiment of the present invention. The server 400 may vary considerably in configuration or performance and may include one or more processors (CPUs) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The server 400 may be configured to perform the steps performed by the classification apparatus in the audio classification method described above.
Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present invention. The terminal 500 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, a desktop computer, a head-mounted device, or any other intelligent terminal. The terminal 500 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 500 includes: a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 502 may include one or more computer-readable storage media, which may be non-transitory. The memory 502 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, which is to be executed by the processor 501 to implement the audio classification method provided by the method embodiments herein.
In some embodiments, the terminal 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, touch screen display 505, camera 506, audio circuitry 507, positioning components 508, and power supply 509.
The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 504 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 504 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 501 as a control signal for processing. In that case, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 505, providing the front panel of the terminal 500; in other embodiments, there may be at least two display screens 505, respectively disposed on different surfaces of the terminal 500 or in a folded design; in still other embodiments, the display screen 505 may be a flexible display disposed on a curved surface or a folded surface of the terminal 500. The display screen 505 can even be arranged as a non-rectangular irregular figure, that is, a shaped screen. The display screen 505 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 506 is used to capture images or video. Optionally, camera assembly 506 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 506 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 507 may also include a headphone jack.
The positioning component 508 is used for determining the current geographic location of the terminal 500 for navigation or LBS (Location Based Service). The positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 509 is used to power the various components in terminal 500. The power source 509 may be alternating current, direct current, disposable or rechargeable. When power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.
The acceleration sensor 511 may detect the magnitude of acceleration on three coordinate axes of the coordinate system established with the terminal 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the touch screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect a body direction and a rotation angle of the terminal 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the terminal 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 513 may be disposed on a side bezel of the terminal 500 and/or an underlying layer of the touch display screen 505. When the pressure sensor 513 is disposed on the side frame of the terminal 500, a user's holding signal of the terminal 500 may be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 514 is used for collecting a user's fingerprint, and the processor 501 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the user's identity according to the collected fingerprint. Upon recognizing the user's identity as a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and so on. The fingerprint sensor 514 may be provided on the front, back, or side of the terminal 500. When a physical button or a vendor logo is provided on the terminal 500, the fingerprint sensor 514 may be integrated with the physical button or the vendor logo.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 505 is turned down. In another embodiment, processor 501 may also dynamically adjust the shooting parameters of camera head assembly 506 based on the ambient light intensity collected by optical sensor 515.
A proximity sensor 516, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 500. The proximity sensor 516 is used to collect the distance between the user and the front surface of the terminal 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 gradually decreases, the processor 501 controls the touch display screen 505 to switch from the bright screen state to the dark screen state; when the proximity sensor 516 detects that the distance between the user and the front surface of the terminal 500 becomes gradually larger, the processor 501 controls the touch display screen 505 to switch from the screen-rest state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of terminal 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The embodiment of the present invention further provides an audio classification device, where the audio classification device includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed in the audio classification method of the foregoing embodiment.
The embodiment of the present invention also provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the instruction is loaded and executed by a processor to implement the operations performed in the audio classification method of the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (12)

1. A method of audio classification, the method comprising:
acquiring at least one target audio clip in the target audio information;
carrying out high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment;
determining a classification identification of the at least one target audio segment based on an audio classification model and the at least one audio feature, and determining a classification identification of the target audio information according to the classification identification of the at least one target audio segment;
the classification identification comprises a first identification and a second identification, the first identification is used for indicating that the corresponding audio information is normal audio information, and the second identification is used for indicating that the corresponding audio information is sensitive audio information;
the obtaining of at least one target audio clip in the target audio information includes:
dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length; for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are greater than a first preset frequency in the plurality of fundamental frequencies; acquiring an audio clip with the proportion smaller than a first preset proportion from the plurality of audio clips as a target audio clip; and/or,
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length; for each audio clip in the multiple audio clips, dividing the audio clip according to a fourth preset length to obtain multiple audio sub-clips with the length equal to the fourth preset length, and acquiring a statistical value of the amplitude of each audio sub-clip; the fourth preset length is smaller than the second preset length; acquiring any audio clip with a statistical value larger than a preset value from the plurality of audio clips as a target audio clip; and/or,
dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, wherein any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length; for each audio clip in the plurality of audio clips, acquiring a plurality of fundamental frequencies in the audio clip, and acquiring the proportion of the fundamental frequencies which are greater than a first preset frequency in the plurality of fundamental frequencies; and acquiring the audio clips with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as target audio clips.
2. The method of claim 1, wherein the high-pass filtering and feature extracting the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment comprises:
performing high-pass filtering on the at least one target audio segment to obtain at least one high-pass filtered audio segment;
dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and performing feature extraction on each audio sub-segment to obtain the audio features of each audio sub-segment.
3. The method according to any of claims 1-2, wherein the determining the classification identifier of the target audio information according to the classification identifier of the at least one target audio segment comprises at least one of:
when the classification identifiers of a first preset number of continuous target audio segments in the at least one target audio segment are the second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identifiers of the second identifiers in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identifiers of the target audio information are the second identifiers.
4. The method according to any one of claims 1-2, further comprising:
obtaining a plurality of sample audio information and classification identifications of the plurality of sample audio information;
carrying out high-pass filtering and feature extraction on the plurality of sample audio information to obtain a plurality of audio features corresponding to the plurality of sample audio information;
and performing model training according to the plurality of audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model.
5. The method of claim 4, wherein the audio classification model comprises a first audio classification model and a second audio classification model, and the performing model training according to the plurality of audio features and the classification identifier corresponding to each audio feature to obtain the audio classification model comprises:
performing model training according to the audio features of which the classification identifiers are the first identifiers in the plurality of audio features to obtain the first audio classification model;
and performing model training according to the audio features of which the classification identifiers are the second identifiers in the plurality of audio features to obtain the second audio classification model.
6. An apparatus for audio classification, the apparatus comprising:
the acquisition module is used for acquiring at least one target audio clip in the target audio information;
the extraction module is used for carrying out high-pass filtering and feature extraction on the at least one target audio segment to obtain at least one audio feature corresponding to the at least one target audio segment;
the determining module is used for determining the classification identifier of the at least one target audio segment based on an audio classification model and the at least one audio feature, and determining the classification identifier of the target audio information according to the classification identifier of the at least one target audio segment;
the classification identification comprises a first identification and a second identification, the first identification is used for indicating that the corresponding audio information is normal audio information, and the second identification is used for indicating that the corresponding audio information is sensitive audio information;
the acquisition module includes:
the first dividing unit is used for dividing the target audio information according to a first preset length to obtain a plurality of audio segments with the length equal to the first preset length;
the fundamental frequency obtaining unit is used for obtaining a plurality of fundamental frequencies in the audio frequency segments and obtaining the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies for each audio frequency segment in the plurality of audio frequency segments;
the acquiring unit is used for acquiring the audio clips with the proportion smaller than a first preset proportion from the plurality of audio clips as target audio clips; and/or,
the second dividing unit is used for dividing the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with the length equal to the second preset length, and any two adjacent audio segments in the plurality of audio segments comprise the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the second dividing unit is further configured to divide each of the plurality of audio segments according to a fourth preset length to obtain a plurality of audio sub-segments with lengths equal to the fourth preset length, and obtain a statistical value of an amplitude of each of the audio sub-segments; the fourth preset length is smaller than the second preset length;
the acquiring unit is used for acquiring any audio clip with a statistical value larger than a preset value from the plurality of audio clips as a target audio clip; and/or,
the third dividing unit is configured to divide the target audio information according to a second preset length and a third preset length to obtain a plurality of audio segments with lengths equal to the second preset length, where any two adjacent audio segments in the plurality of audio segments include the same audio information with the third preset length; the third preset length is smaller than the second preset length;
the fundamental frequency obtaining unit is used for obtaining a plurality of fundamental frequencies in the audio frequency segments and obtaining the proportion of the fundamental frequencies which are larger than a first preset frequency in the plurality of fundamental frequencies for each audio frequency segment in the plurality of audio frequency segments;
and the obtaining unit is used for obtaining the audio clips with the proportion larger than a second preset proportion and smaller than a third preset proportion from the plurality of audio clips as target audio clips.
7. The apparatus of claim 6, wherein the extraction module comprises:
the filtering unit is used for carrying out high-pass filtering on the at least one target audio segment to obtain at least one audio segment after high-pass filtering;
the dividing unit is used for dividing each high-pass filtered audio segment according to a fifth preset length to obtain a plurality of audio sub-segments with the length equal to the fifth preset length;
and the extraction unit is used for extracting the characteristics of each audio sub-segment to obtain the audio characteristics of each audio sub-segment.
8. The apparatus according to any of claims 6-7, wherein the determining means is configured to perform at least one of:
when the classification identifiers of a first preset number of continuous target audio segments in the at least one target audio segment are the second identifiers, determining that the classification identifier of the target audio information is the second identifier;
and when the proportion of the number of the target audio segments with the classification identifiers of the second identifiers in the at least one target audio segment reaches a fourth preset proportion, determining that the classification identifiers of the target audio information are the second identifiers.
9. The apparatus of any of claims 6-7, further comprising:
the obtaining module is further configured to obtain a plurality of sample audio information and classification identifiers of the plurality of sample audio information;
the extraction module is further configured to perform high-pass filtering and feature extraction on the multiple sample audio information to obtain multiple audio features corresponding to the multiple sample audio information;
and the training module is used for carrying out model training according to the plurality of audio features and the classification mark corresponding to each audio feature to obtain the audio classification model.
10. The apparatus of claim 9, wherein the audio classification model comprises a first audio classification model and a second audio classification model;
the training module is further configured to perform model training according to the audio features of which the classification identifiers are the first identifiers among the multiple audio features, so as to obtain the first audio classification model;
and the training module is further configured to perform model training according to the audio features of which the classification identifiers are the second identifiers in the plurality of audio features, so as to obtain the second audio classification model.
11. An apparatus for audio classification, the apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, the instruction being loaded and executed by the processor to perform the operations as recited in any of claims 1 to 5.
12. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to perform operations as recited in any of claims 1 to 5.
CN201811632676.7A 2018-12-29 2018-12-29 Audio classification method, device and storage medium Active CN109671425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811632676.7A CN109671425B (en) 2018-12-29 2018-12-29 Audio classification method, device and storage medium

Publications (2)

Publication Number Publication Date
CN109671425A (en) 2019-04-23
CN109671425B (en) 2021-04-06

Family

ID=66146491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811632676.7A Active CN109671425B (en) 2018-12-29 2018-12-29 Audio classification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109671425B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102543079A (en) * 2011-12-21 2012-07-04 南京大学 Method and equipment for classifying audio signals in real time
CN104347068A (en) * 2013-08-08 2015-02-11 索尼公司 Audio signal processing device, audio signal processing method and monitoring system
CN104538041A (en) * 2014-12-11 2015-04-22 深圳市智美达科技有限公司 Method and system for detecting abnormal sounds
CN105719642A (en) * 2016-02-29 2016-06-29 黄博 Continuous and long voice recognition method and system and hardware equipment
CN107452401A (en) * 2017-05-27 2017-12-08 北京字节跳动网络技术有限公司 A kind of advertising pronunciation recognition methods and device
CN108538311A (en) * 2018-04-13 2018-09-14 腾讯音乐娱乐科技(深圳)有限公司 Audio frequency classification method, device and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Audio Classification Method Based on Machine Learning; Feng Rong, et al.; <2016 International Conference on Intelligent Transportation, Big Data & Smart City>; 2017-09-21; 81-84 *
Sports Audio Classification Based on MFCC and GMM; Liu Jiqing, et al.; <Proceedings of IC-BNMT 2009>; 2009-11-04; 482-485 *
Research on Audio Classification and Speech Recognition Analysis in Video; Jiang Chao, Feng Huamin, Yang Xinghua; <2010 APCID>; 2010-12-01; 37-41 *

Also Published As

Publication number Publication date
CN109671425A (en) 2019-04-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant