CN108538311A - Audio classification method and device, and computer-readable storage medium - Google Patents
- Publication number
- CN108538311A CN108538311A CN201810332491.8A CN201810332491A CN108538311A CN 108538311 A CN108538311 A CN 108538311A CN 201810332491 A CN201810332491 A CN 201810332491A CN 108538311 A CN108538311 A CN 108538311A
- Authority
- CN
- China
- Prior art keywords
- audio
- set categories
- network
- classification
- audio signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses an audio classification method, an audio classification device, and a computer-readable storage medium, belonging to the field of electronic technology. The method includes: collecting an audio signal; truncating or padding the audio signal so that the duration of the audio signal is adjusted to a preset duration; converting the audio signal into a target audio according to the frequency information of the audio signal; extracting audio features of the target audio through a convolutional network included in a preset classifier; extracting temporal features of the audio features through a gated recurrent network included in the preset classifier; determining, according to the temporal features and through a fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels; and determining the preset class identified by the preset category label with the largest probability among the multiple preset category labels as the class of the target audio. Because the target audio does not need to be segmented, the invention preserves the integrity of the target audio and achieves higher classification accuracy.
Description
Technical field
The present invention relates to the field of electronic technology, and in particular to an audio classification method, an audio classification device, and a computer-readable storage medium.
Background technology
With the fast development of electronic technology, people often upload audio in music application.It is searched for the ease of user
Rope and audio is used, music application often classifies to the magnanimity audio of upload.For example, music application can be to upload
The quality of audio distinguishes, or judges that the audio uploaded is voice or musical background etc..
In the related technology, classified to audio using support vector machine classifier, due to support vector machine classifier
It is limited that another characteristic can be known, so classified using support vector machine classifier, operation is relatively complicated, classification effectiveness compared with
It is low.In addition, usually only sufficiently long audio can just reflect its real property, and carried out using support vector machine classifier
Audio is first often divided into a series of segment when classification, to which the integrality of audio can be destroyed, cause to classify accuracy compared with
It is low.
Summary of the invention
Embodiments of the present invention provide an audio classification method, an audio classification device, and a computer-readable storage medium, which can solve the problems of low audio classification efficiency and low accuracy in the related art. The technical solution is as follows:
In one aspect, an audio classification method is provided. The method includes:
collecting an audio signal;
truncating or padding the audio signal so that the duration of the audio signal is adjusted to a preset duration;
converting the audio signal into a target audio according to the frequency information of the audio signal;
extracting audio features of the target audio through a convolutional network included in a preset classifier;
extracting temporal features of the audio features through a gated recurrent network included in the preset classifier;
determining, according to the temporal features and through a fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels; and
determining the preset class identified by the preset category label with the largest probability among the multiple preset category labels as the class of the target audio.
Optionally, extracting the audio features of the target audio through the convolutional network included in the preset classifier includes:
dividing the target audio into multiple audio segments through the convolutional network;
extracting, through the convolutional network, the features of each audio segment among the multiple audio segments as one feature; and
composing, through the convolutional network, the extracted features into the audio features of the target audio.
Optionally, extracting the temporal features of the audio features through the gated recurrent network included in the preset classifier includes:
extracting first temporal features of the audio features through the gated recurrent network;
determining, through the fully-connected network, first classification features corresponding to the first temporal features;
substituting each element of the first classification features into a first preset function to obtain a weight for each element of the first classification features, where the elements of the first temporal features correspond one-to-one with the elements of the first classification features;
for any element A of the first temporal features, multiplying A by the weight of the element of the first classification features corresponding to A, to obtain a first element corresponding to A; and
replacing each element of the first temporal features with its corresponding first element to obtain second temporal features as the temporal features of the audio features.
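The element-wise weighting described above resembles an attention mechanism: a fully-connected layer maps the temporal features to classification features, a normalizing function turns those into weights, and each temporal-feature element is rescaled by its weight. A minimal NumPy sketch; the shapes, the linear layer, and the use of softmax as the unnamed "first preset function" are all assumptions for illustration:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reweight_temporal_features(h, w_fc, b_fc):
    """Rescale temporal features h (shape [T, D]) element by element.

    w_fc, b_fc: parameters of the hypothetical fully-connected layer that
    produces the first classification features (same shape as h, so the
    elements correspond one-to-one, as the text requires).
    """
    c = h @ w_fc + b_fc          # first classification features, [T, D]
    weights = softmax(c)         # "first preset function" -> per-element weights
    return h * weights           # second temporal features, [T, D]

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 4))          # 5 time steps, 4 feature dims
w = rng.standard_normal((4, 4)) * 0.1
b = np.zeros(4)
h2 = reweight_temporal_features(h, w, b)
print(h2.shape)  # (5, 4)
```

Because the weights come from the features themselves, informative time steps can be emphasized without segmenting the audio.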
Optionally, determining, according to the temporal features and through the fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among the multiple preset category labels includes:
determining second classification features of the temporal features through the fully-connected network; and
substituting the elements of the second classification features into a second preset function through the fully-connected network to obtain the probability that the class of the target audio is the preset class identified by each preset category label among the multiple preset category labels.
Optionally, the preset classifier further includes at least one of a batch normalization network and a pooling network.
Optionally, converting the audio signal into the target audio according to the frequency information of the audio signal includes:
determining the Mel-frequency cepstral coefficients (MFCC) of the audio signal and generating the target audio according to the MFCC of the audio signal; or
determining the frequency spectrum of the audio signal and generating the target audio according to the frequency spectrum of the audio signal.
Optionally, before extracting the audio features of the target audio through the convolutional network included in the preset classifier, the method further includes:
obtaining multiple training audio sets, where all training audios included in each training audio set among the multiple training audio sets correspond to the same preset category label; and
training a classification model to be trained using the multiple training audio sets to obtain the preset classifier.
In one aspect, an audio classification device is provided. The device includes:
a collection module, configured to collect an audio signal;
an adjustment module, configured to truncate or pad the audio signal so that the duration of the audio signal is adjusted to a preset duration;
a conversion module, configured to convert the audio signal into a target audio according to the frequency information of the audio signal;
a first extraction module, configured to extract audio features of the target audio through a convolutional network included in a preset classifier;
a second extraction module, configured to extract temporal features of the audio features through a gated recurrent network included in the preset classifier;
a first determining module, configured to determine, according to the temporal features and through a fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels; and
a second determining module, configured to determine the preset class identified by the preset category label with the largest probability among the multiple preset category labels as the class of the target audio.
Optionally, the first extraction module includes:
a splitting submodule, configured to divide the target audio into multiple audio segments through the convolutional network;
a first extracting submodule, configured to extract, through the convolutional network, the features of each audio segment among the multiple audio segments as one feature; and
a composing submodule, configured to compose, through the convolutional network, the extracted features into the audio features of the target audio.
Optionally, the second extraction module includes:
a second extracting submodule, configured to extract first temporal features of the audio features through the gated recurrent network;
a first determining submodule, configured to determine, through the fully-connected network, first classification features corresponding to the first temporal features;
a first substitution submodule, configured to substitute each element of the first classification features into a first preset function to obtain a weight for each element of the first classification features, where the elements of the first temporal features correspond one-to-one with the elements of the first classification features;
a multiplication submodule, configured to, for any element A of the first temporal features, multiply A by the weight of the element of the first classification features corresponding to A, to obtain a first element corresponding to A; and
a replacement submodule, configured to replace each element of the first temporal features with its corresponding first element to obtain second temporal features as the temporal features of the audio features.
Optionally, the first determining module includes:
a second determining submodule, configured to determine second classification features of the temporal features through the fully-connected network; and
a second substitution submodule, configured to substitute the elements of the second classification features into a second preset function through the fully-connected network, to obtain the probability that the class of the target audio is the preset class identified by each preset category label among the multiple preset category labels.
Optionally, the preset classifier further includes at least one of a batch normalization network and a pooling network.
Optionally, the conversion module includes:
a third determining submodule, configured to determine the Mel-frequency cepstral coefficients (MFCC) of the audio signal and generate the target audio according to the MFCC of the audio signal; or
a fourth determining submodule, configured to determine the frequency spectrum of the audio signal and generate the target audio according to the frequency spectrum of the audio signal.
Optionally, the device further includes:
an obtaining module, configured to obtain multiple training audio sets, where all training audios included in each training audio set among the multiple training audio sets correspond to the same preset category label; and
a training module, configured to train a classification model to be trained using the multiple training audio sets to obtain the preset classifier.
In one aspect, an audio classification device is provided. The device includes a processor, a memory, and program code stored on the memory and runnable on the processor, where the processor implements the above audio classification method when executing the program code.
In one aspect, a computer-readable storage medium is provided, on which instructions are stored, where the instructions, when executed by a processor, implement the steps of the above audio classification method.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are as follows:
In the embodiments of the present invention, an audio signal is first collected, and then the audio signal is truncated or padded so that its duration is adjusted to a preset duration; at this point the duration of the audio signal has been normalized to a more suitable range. The audio signal is then converted into a target audio according to the frequency information of the audio signal. Afterwards, the audio features of the target audio are extracted through a convolutional network included in a preset classifier, which reduces the dimensionality of the features of each audio segment, so that the extracted audio features have a relatively low dimensionality. Then, the temporal features of the audio features are extracted through a gated recurrent network included in the preset classifier. According to the temporal features, a fully-connected network included in the preset classifier determines the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels, and the preset class identified by the preset category label with the largest probability among the multiple preset category labels is determined as the class of the target audio. This classification process is simple and practical and has high efficiency; moreover, because the target audio does not need to be segmented, the integrity of the target audio is preserved, so the classification accuracy is also high.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow chart of an audio classification method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of another audio classification method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a collected audio signal provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a target audio provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of another target audio provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a preset classifier provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of a first audio classification device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a first extraction module provided by an embodiment of the present invention;
Fig. 9 is a structural schematic diagram of a second extraction module provided by an embodiment of the present invention;
Fig. 10 is a structural schematic diagram of a first determining module provided by an embodiment of the present invention;
Fig. 11 is a structural schematic diagram of a conversion module provided by an embodiment of the present invention;
Fig. 12 is a structural schematic diagram of a second audio classification device provided by an embodiment of the present invention;
Fig. 13 is a structural schematic diagram of a third audio classification device provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
For ease of understanding, before explaining the embodiments of the present invention in detail, the application scenarios involved in the embodiments are first introduced.
With the rapid development of electronic technology, people frequently upload audio to music applications. To make it easier for users to search for and use audio, a music application often needs to classify the massive amount of uploaded audio. Currently, audio is classified using a support vector machine (SVM) classifier. Because the features an SVM classifier can recognize are limited, classification with an SVM classifier involves relatively complicated operations and has low efficiency. In addition, usually only a sufficiently long audio can reflect its true attributes, yet when classifying with an SVM classifier the audio is often first divided into a series of segments, which destroys the integrity of the audio and leads to low classification accuracy. To this end, the present invention provides an audio classification method to improve the efficiency and accuracy of audio classification.
Next, the audio classification method provided by the embodiments of the present invention will be described in detail with reference to the drawings.
Fig. 1 is a flow chart of an audio classification method provided by an embodiment of the present invention. Referring to Fig. 1, the method includes the following steps:
Step 101: collect an audio signal.
Step 102: truncate or pad the audio signal so that the duration of the audio signal is adjusted to a preset duration.
Step 103: convert the audio signal into a target audio according to the frequency information of the audio signal.
Step 104: extract audio features of the target audio through a convolutional network included in a preset classifier.
Step 105: extract temporal features of the audio features through a gated recurrent network included in the preset classifier.
Step 106: determine, according to the temporal features and through a fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels.
Step 107: determine the preset class identified by the preset category label with the largest probability among the multiple preset category labels as the class of the target audio.
In the embodiments of the present invention, an audio signal is first collected, and then the audio signal is truncated or padded so that its duration is adjusted to a preset duration; at this point the duration of the audio signal has been normalized to a more suitable range. The audio signal is then converted into a target audio according to the frequency information of the audio signal. Afterwards, the audio features of the target audio are extracted through a convolutional network included in a preset classifier, which reduces the dimensionality of the features of each audio segment, so that the extracted audio features have a relatively low dimensionality. Then, the temporal features of the audio features are extracted through a gated recurrent network included in the preset classifier. According to the temporal features, a fully-connected network included in the preset classifier determines the probability that the class of the target audio is the preset class identified by each preset category label among multiple preset category labels, and the preset class identified by the preset category label with the largest probability among the multiple preset category labels is determined as the class of the target audio. This classification process is simple and practical and has high efficiency; moreover, because the target audio does not need to be segmented, the integrity of the target audio is preserved, so the classification accuracy is also high.
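The pipeline of steps 101–107 can be sketched end to end. The NumPy forward pass below is purely illustrative: the layer sizes, the single GRU layer, the random weights, and the use of softmax for the final probabilities are assumptions, not the patent's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

def conv_features(target_audio, kernels):
    """Step 104 (crude stand-in for a convolutional network): project each
    time step of the 2-D target audio (time x MFCC) to a feature vector."""
    return np.tanh(target_audio @ kernels)          # [T, n_feat]

def gru_features(x, p):
    """Step 105: one gated recurrent (GRU) layer; return the last hidden state."""
    h = np.zeros(p["Uz"].shape[0])
    for x_t in x:
        z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h)            # update gate
        r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h)            # reset gate
        h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h))
        h = (1 - z) * h + z * h_tilde
    return h

n_mfcc, n_feat, n_hidden, n_classes = 13, 8, 6, 3
kernels = rng.standard_normal((n_mfcc, n_feat)) * 0.1
p = {k: rng.standard_normal(s) * 0.1 for k, s in {
    "Wz": (n_hidden, n_feat), "Uz": (n_hidden, n_hidden),
    "Wr": (n_hidden, n_feat), "Ur": (n_hidden, n_hidden),
    "Wh": (n_hidden, n_feat), "Uh": (n_hidden, n_hidden)}.items()}
W_fc = rng.standard_normal((n_classes, n_hidden)) * 0.1

target_audio = rng.standard_normal((20, n_mfcc))    # 20 MFCC frames (step 103)
feats = conv_features(target_audio, kernels)        # step 104
h = gru_features(feats, p)                          # step 105
probs = softmax(W_fc @ h)                           # step 106 (fully-connected + softmax)
label = int(np.argmax(probs))                       # step 107: largest probability wins
print(label, probs.sum())  # class index, and a probability mass of ~1.0
```

Note that the whole target audio is processed in one pass, which is what lets the method avoid the segmentation that hurts SVM-based classification.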
Optionally, extracting the audio features of the target audio through the convolutional network included in the preset classifier includes:
dividing the target audio into multiple audio segments through the convolutional network;
extracting, through the convolutional network, the features of each audio segment among the multiple audio segments as one feature; and
composing, through the convolutional network, the extracted features into the audio features of the target audio.
Optionally, extracting the temporal features of the audio features through the gated recurrent network included in the preset classifier includes:
extracting first temporal features of the audio features through the gated recurrent network;
determining, through the fully-connected network, first classification features corresponding to the first temporal features;
substituting each element of the first classification features into a first preset function to obtain a weight for each element of the first classification features, where the elements of the first temporal features correspond one-to-one with the elements of the first classification features;
for any element A of the first temporal features, multiplying A by the weight of the element of the first classification features corresponding to A, to obtain a first element corresponding to A; and
replacing each element of the first temporal features with its corresponding first element to obtain second temporal features as the temporal features of the audio features.
Optionally, determining, according to the temporal features and through the fully-connected network included in the preset classifier, the probability that the class of the target audio is the preset class identified by each preset category label among the multiple preset category labels includes:
determining second classification features of the temporal features through the fully-connected network; and
substituting the elements of the second classification features into a second preset function through the fully-connected network to obtain the probability that the class of the target audio is the preset class identified by each preset category label among the multiple preset category labels.
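The "second preset function" described above maps one score per preset category label to a probability; a softmax is one natural reading of that step (the text does not name the function). A minimal sketch with invented example scores:

```python
import numpy as np

def class_probabilities(second_classification_features):
    """Map one score per preset category label to probabilities in [0, 1]
    that sum to 1, via a numerically stable softmax (assumed form of the
    'second preset function')."""
    z = np.asarray(second_classification_features, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

scores = [2.0, 1.0, 0.1]                 # hypothetical per-label scores
probs = class_probabilities(scores)
best = int(np.argmax(probs))             # label with the largest probability
print(best)  # 0
```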
Optionally, the preset classifier further includes at least one of a batch normalization network and a pooling network.
Optionally, converting the audio signal into the target audio according to the frequency information of the audio signal includes:
determining the Mel-frequency cepstral coefficients (MFCC) of the audio signal and generating the target audio according to the MFCC of the audio signal; or
determining the frequency spectrum of the audio signal and generating the target audio according to the frequency spectrum of the audio signal.
Optionally, before extracting the audio features of the target audio through the convolutional network included in the preset classifier, the method further includes:
obtaining multiple training audio sets, where all training audios included in each training audio set among the multiple training audio sets correspond to the same preset category label; and
training a classification model to be trained using the multiple training audio sets to obtain the preset classifier.
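The training setup above amounts to: each training audio set carries one preset category label shared by all of its audios. A minimal sketch of flattening such sets into (audio, label) training pairs — the set contents, file names, and label names are invented for illustration:

```python
# Hypothetical training audio sets: one preset category label per set.
training_audio_sets = {
    "music": ["song_a.wav", "song_b.wav"],
    "voice": ["talk_a.wav"],
}

def flatten_training_sets(sets):
    """Pair every audio in a set with that set's shared preset category label."""
    return [(audio, label) for label, audios in sets.items() for audio in audios]

pairs = flatten_training_sets(training_audio_sets)
print(len(pairs))  # 3
```

The resulting pairs would then feed whatever supervised training procedure produces the preset classifier.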
All of the above optional technical solutions can be combined in any manner to form optional embodiments of the present invention, which will not be repeated here one by one.
Fig. 2 is a flow chart of another audio classification method provided by an embodiment of the present invention. This embodiment expands on the embodiment shown in Fig. 1 with reference to Fig. 2. Referring to Fig. 2, the method includes the following steps:
Step 201: collect an audio signal.
In practical applications, sound is generally sampled at a fixed sampling frequency, so the collected audio signal may include multiple sampling points. The position of each sampling point represents its sampling instant, and the value of each sampling point is its amplitude. The multiple sampling points can be connected into an audio curve, which contains only amplitude information; that is, the audio signal is a one-dimensional audio signal. For example, Fig. 3 shows the audio curve of such an audio signal: the value of each point on the curve is the amplitude at that point, and the audio signal is one-dimensional.
Step 202: truncate or pad the audio signal so that the duration of the audio signal is adjusted to a preset duration.
It should be noted that the preset duration can be configured in advance according to different needs, and can be set relatively long; for example, the preset duration may be 2 minutes, 3 minutes, 4 minutes, and so on.
In addition, in the embodiment of the present invention, adjusting the duration of the audio signal to the preset duration normalizes the duration of the audio signal to a more suitable range, which can improve the accuracy of the subsequent classification of the audio signal.
Specifically, when the duration of the audio signal is greater than the preset duration, the audio signal is truncated into an audio signal of the preset duration; when the duration of the audio signal is less than the preset duration, the audio signal is padded into an audio signal of the preset duration.
When the duration of the audio signal is greater than the preset duration, truncating the audio signal into an audio signal of the preset duration can be implemented as follows: select a time point from the time points of the audio signal, and starting from the selected time point, intercept an audio signal of the preset duration.
For example, if the preset duration is 3 minutes, a time point can be selected from the time points of the audio signal. Assuming the selected time point is the 1-second mark, 3 minutes of audio signal can be intercepted starting from the 1-second time point of the audio signal.
When the duration of the audio signal is less than the preset duration, padding the audio signal into an audio signal of the preset duration can be implemented as follows: first convert the audio signal into a digital signal, then keep appending zeros at the end of the digital signal until the digital signal reaches the preset duration, and finally convert the digital signal that has reached the preset duration back into an audio signal.
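The truncate-or-pad adjustment of step 202 can be sketched directly on a sampled (digital) signal. In this sketch the truncation starts from the beginning of the signal; the patent allows any selected time point, so that offset is an assumption:

```python
import numpy as np

def fit_to_duration(signal, preset_len):
    """Truncate or zero-pad a 1-D sampled signal to exactly preset_len samples.

    Truncation here starts at sample 0 (the patent permits an arbitrary
    start point); padding appends zeros at the end, as the text describes.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) >= preset_len:
        return signal[:preset_len]                 # truncate
    pad = np.zeros(preset_len - len(signal))
    return np.concatenate([signal, pad])           # zero-pad at the end

print(len(fit_to_duration(np.ones(10), 6)))  # 6
print(len(fit_to_duration(np.ones(4), 6)))   # 6
```

With a sampling rate `sr`, a preset duration of 3 minutes corresponds to `preset_len = 180 * sr` samples.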
Step 203: convert the audio signal into a target audio according to the frequency information of the audio signal.
Specifically, the implementation of step 203 may include the following two possible implementations.
The first possible implementation: determine the MFCC (Mel-scale Frequency Cepstral Coefficients) of the audio signal, and generate the target audio according to the MFCC of the audio signal.
It should be noted that the Mel frequency is a nonlinear frequency scale based on the human ear's perception of equidistant pitch changes. It has a nonlinear correspondence with the actual frequency, and the MFCC can be calculated based on this relationship.
For example, the nonlinear correspondence between the Mel frequency and the actual frequency can be approximated by the formula Mel(f) = 2595 · log10(1 + f/700), where Mel(f) is the Mel frequency and f is the actual frequency in Hz.
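The Mel mapping just described is commonly written as Mel(f) = 2595 · log10(1 + f/700) (the formula itself was lost in the extraction, so this is the standard form of the approximation, not a verbatim quote of the patent):

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700): map a frequency in Hz to the
    Mel scale, which is roughly linear below 1 kHz and logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(round(hz_to_mel(0.0), 3))   # 0.0
print(round(hz_to_mel(1000.0)))   # about 1000 mel near 1 kHz
```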
Wherein it is determined that the realization process of the MFCC of the audio signal can be:Preemphasis is carried out to the audio signal;To pre-
Audio signal after exacerbation carries out framing;Adding window is carried out to each frame in the audio signal after framing;By the audio after adding window
Signal is converted by time domain to frequency domain, and the frequency spectrum of the audio signal is obtained;The work(of the audio signal is obtained to the frequency spectrum modulus square
Rate is composed;The power spectrum is filtered by the triangle bandpass filter of one group of Meier scale;To the triangle bandpass filter group
Output seek logarithm, obtain logarithmic energy;To the logarithmic energy carry out DCT (discrete cosine transform, from
Dissipate cosine transform) obtain MFCC.
It should be noted that preemphasis can mend the decaying of the high fdrequency component of audio signal in transmission process
It repays, audio signal can be carried out preemphasis by when practical application by a high-pass filter.
In addition, framing refers to that audio signal is divided into multiple short time intervals, each short time interval is a frame.Since audio is believed
Number stationarity is only presented in a relatively short period of time, it is therefore desirable to framing be carried out to audio signal, and to avoid dropped audio signal
Information, can there is one section of overlapping region, overlapping region to be generally the 1/2 or 1/3 of frame length between consecutive frame.
It, can be to the work(furthermore after being filtered to the power spectrum by the triangle bandpass filter of one group of Meier scale
Rate spectrum is smoothed, and harmonic carcellation interference highlights the formant of audio signal.
Applying a window to each frame of the framed audio signal may be implemented by multiplying each frame by a specified window. Multiplying each frame of the audio signal by the specified window eliminates the signal discontinuities that may arise at the two ends of each frame. The specified window may be configured in advance according to different requirements; for example, it may be a Hamming window.
Converting the windowed audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal may be implemented by applying an FFT (Fast Fourier Transform) to the windowed audio signal. Of course, the windowed audio signal may also be converted from the time domain to the frequency domain in other ways to obtain its spectrum, which is not limited in the embodiments of the present invention.
Generating the target audio according to the MFCC of the audio signal may be implemented as follows: the time points of the audio signal are determined according to its duration, and the target audio is generated with the time points of the audio signal as the horizontal axis and the MFCC of the audio signal as the vertical axis.
It should be noted that the time points of the audio signal indicate the progress of its acquisition. For example, a time point of the audio signal may be 1 second, 2 seconds, and so on; the sample at the 1-second time point is the sample obtained 1 second after acquisition started, and the sample at the 2-second time point is the sample obtained 2 seconds after acquisition started.
In addition, the target audio here contains both the time-point information and the MFCC information of the audio signal, so the target audio is a two-dimensional audio signal. For example, as shown in Fig. 4, the target audio is a two-dimensional audio signal whose horizontal axis is the time points of the audio signal and whose vertical axis is the MFCC of the audio signal.
It is worth noting that, in the first possible implementation described above, the target audio may be generated according to the MFCC of the audio signal. The target audio so generated not only contains the acoustic features of the audio signal but also has a relatively low operational dimensionality, which reduces the amount of computation when the target audio is subsequently classified.
Second possible implementation: the spectrum of the audio signal is determined, and the target audio is generated according to the spectrum of the audio signal.
Specifically, this may be implemented as follows: the audio signal is converted from the time domain to the frequency domain to obtain its spectrum; the frequencies of the audio signal are obtained from the spectrum; the time points of the audio signal are determined according to its duration; and the target audio is generated with the time points of the audio signal as the horizontal axis and the frequencies of the audio signal as the vertical axis.
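Under the same caveat as before, the two-dimensional target audio of this second implementation can be sketched as a frame-wise magnitude spectrum: one column per time point (horizontal axis), one frequency bin per row (vertical axis). The frame length and hop below are illustrative assumptions, and a naive DFT again stands in for an FFT.

```python
import math

def spectrogram(signal, frame_len=32, hop=16):
    """Sketch: a 1-D signal -> a 2-D target audio (time points x frequency bins)."""
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2 + 1):
            re = sum(x * math.cos(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            im = sum(-x * math.sin(2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            mags.append(math.hypot(re, im))
        columns.append(mags)  # one column per time point
    return columns
```

For a pure sine placed exactly on frequency bin 4, every column of the result peaks at row 4, which is the frequency information the target audio is meant to carry.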
Converting the audio signal from the time domain to the frequency domain to obtain its spectrum may be implemented by applying an FFT to the audio signal. Of course, the audio signal may also be converted from the time domain to the frequency domain in other ways to obtain its spectrum, which is not limited in the embodiments of the present invention.
It should be noted that the target audio here contains both the time-point information and the actual frequency information of the audio signal, so the target audio is a two-dimensional audio signal. For example, as shown in Fig. 5, the target audio is a two-dimensional audio signal whose horizontal axis is the time points of the audio signal and whose vertical axis is the frequencies of the audio signal.
It is worth noting that, in the second possible implementation described above, the target audio may be generated according to the spectrum of the audio signal. The target audio so generated contains the complete characteristics of the audio signal, which improves the accuracy of the subsequent classification of the target audio.
It should be noted that, after the target audio is obtained through steps 201-203 above, the category of the target audio may also be determined through a preset classifier according to the following steps 204-207.
In addition, the preset classifier is used to classify audio. In practice, after an audio is input into the preset classifier, the preset classifier can determine and output, for each of multiple pre-set category identifiers, the probability that the category of the audio is the pre-set category identified by that identifier. The preset classifier may include a convolutional network, a gated recurrent network, and a fully-connected network, and may further include at least one of a batch normalization network, a pooling network, and the like.
Step 204: Extract the audio features of the target audio through the convolutional network included in the preset classifier.
Specifically, the convolutional network divides the target audio into multiple audio segments, then extracts one feature from each of the multiple audio segments, and finally assembles the extracted features into the audio features of the target audio.
It should be noted that, because the convolutional network reduces each audio segment to a single feature, the dimensionality of the features of each audio segment is reduced and the duration of the target audio is in effect shortened, so the extracted audio features have a relatively low dimensionality. This makes it convenient for the other networks in the preset classifier to process these lower-dimensional audio features directly in the subsequent steps.
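As a toy sketch of this segment-wise reduction (the kernel and stride are illustrative assumptions, not the patent's parameters): each kernel-sized segment collapses to a single weighted sum, halving the dimensionality in this example.

```python
def conv_segment_features(target_audio, kernel):
    """Sketch: split the input into kernel-sized segments (stride == kernel size,
    a simplifying assumption) and reduce each segment to one feature."""
    k = len(kernel)
    feats = []
    for start in range(0, len(target_audio) - k + 1, k):
        segment = target_audio[start:start + k]
        # Each segment becomes one feature: a weighted sum (dot product).
        feats.append(sum(w * x for w, x in zip(kernel, segment)))
    return feats

audio = [1, 2, 3, 4, 5, 6, 7, 8]
feats = conv_segment_features(audio, [0.5, 0.5])
print(feats)  # [1.5, 3.5, 5.5, 7.5] -- 8 values reduced to 4
```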
Further, when the preset classifier also includes a batch normalization (Batch Normalization) network, after the convolutional network extracts the audio features of the target audio, the extracted audio features may also be processed by the batch normalization network so that they are distributed within a stable range, thereby improving the processing accuracy of the other networks in the preset classifier on the audio features.
Specifically, the batch normalization network may subtract from each element of the extracted audio features the mean of all elements of those audio features, so that the mean of all elements of the resulting new audio features is 0. Of course, in practice, the batch normalization network may also process the extracted audio features in other ways, which is not limited in the embodiments of the present invention.
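A minimal sketch of the mean-subtraction variant just described (illustrative only; full batch normalization also rescales by the variance, which this sketch omits):

```python
def batch_standardize(features):
    """Subtract the mean of all elements from each element, so the
    mean of the resulting feature vector is 0."""
    mean = sum(features) / len(features)
    return [x - mean for x in features]

out = batch_standardize([2.0, 4.0, 9.0])  # mean is 5.0
print(out)       # [-3.0, -1.0, 4.0]
print(sum(out))  # 0.0
```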
Further, when the preset classifier also includes a pooling network, after the convolutional network extracts the audio features of the target audio, the extracted audio features may also be processed by the pooling network to reduce their size, thereby reducing the amount of computation of the other networks in the preset classifier.
Specifically, the pooling network may divide all elements of the extracted audio features into multiple element groups, take the mean or the maximum of the elements in each group to obtain a second element corresponding to each group, and assemble the second elements corresponding to the groups into new audio features. Of course, in practice, the pooling network may also process the extracted audio features in other ways, which is not limited in the embodiments of the present invention.
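A minimal sketch of this grouping, assuming non-overlapping groups of a fixed size:

```python
def pool(features, group, mode="max"):
    """Sketch: divide elements into groups and keep one value per group,
    either the maximum or the mean."""
    out = []
    for start in range(0, len(features), group):
        g = features[start:start + group]
        out.append(max(g) if mode == "max" else sum(g) / len(g))
    return out

print(pool([1, 5, 2, 8, 3, 3], 2))         # max pooling  -> [5, 8, 3]
print(pool([1, 5, 2, 8, 3, 3], 2, "avg"))  # mean pooling -> [3.0, 5.0, 3.0]
```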
It should be noted that, in practice, the audio features extracted by the convolutional network may be passed directly to the following step 205, or they may first be processed by the batch normalization network and/or the pooling network to obtain new audio features, which are then passed to the following step 205.
Further, before the audio features of the target audio are extracted through the convolutional network included in the preset classifier, the preset classifier may first be generated. Specifically, multiple training audio sets are obtained, where all training audios included in each of the multiple training audio sets correspond to the same pre-set category identifier, and the classification model to be trained is trained using the multiple training audio sets to obtain the preset classifier.
It should be noted that each of the multiple training audio sets is provided with a sample label, which is the pre-set category identifier corresponding to that training audio set. For example, the sample labels may include a positive sample label and a negative sample label, where the positive sample label may be pre-set category identifier 1 and the negative sample label may be pre-set category identifier 0.
Training the classification model to be trained using the multiple training audio sets to obtain the preset classifier may be implemented as follows: one training audio set is selected from the multiple training audio sets, and the following processing is performed on the selected set until every one of the multiple training audio sets has been processed: the selected training audio set is taken as the input of the classification model to be trained; according to the output data of the classification model, a reference category identifier corresponding to each training audio in the selected set is determined from the multiple pre-set category identifiers; the reference category identifier corresponding to each training audio is then compared with the pre-set category identifier corresponding to that training audio, and the parameters of the classification model are adjusted according to the comparison result. The classification model whose parameter adjustment has been completed is determined as the preset classifier.
Determining, according to the output data of the classification model, the reference category identifier corresponding to each training audio in the selected training audio set may be implemented as follows: for any training audio A in the selected set, the output data of the classification model includes, for each of the multiple pre-set category identifiers, the probability that the category of training audio A is the pre-set category identified by that identifier; the pre-set category identifier with the largest probability among the multiple pre-set category identifiers is determined as the reference category identifier corresponding to training audio A.
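The selection of the reference category identifier reduces to an argmax over the model's output probabilities. A sketch, assuming (purely for illustration) that the output is an identifier-to-probability mapping:

```python
def reference_label(output_probs):
    """Sketch: pick the pre-set category identifier with the largest
    probability from a {identifier: probability} mapping."""
    return max(output_probs, key=output_probs.get)

probs = {"id_1": 0.1, "id_2": 0.7, "id_3": 0.2}
print(reference_label(probs))  # id_2
```

During training, this reference identifier would then be compared with the training audio's true pre-set category identifier to drive the parameter adjustment.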
Step 205: Extract the temporal features of the audio features through the gated recurrent network included in the preset classifier.
It should be noted that, in practice, the gated recurrent network may be a GRU (Gated Recurrent Unit), which can extract the features of the input data along the time dimension.
In addition, since an audio signal is usually correlated in time, each element of the extracted audio features may be related not only to the elements before it but also to the elements after it; that is, each element of the audio features may depend on the elements before and after it. The gated recurrent network is a bidirectional recurrent network, so it can extract the feature of each element of the audio features according to the elements before and after it, and the extracted features of all elements form the temporal features of the audio features.
Specifically, the temporal features of the audio features may be extracted directly by the gated recurrent network in the preset classifier, or they may be extracted jointly by the gated recurrent network and the fully-connected network.
When the temporal features of the audio features are extracted jointly by the gated recurrent network and the fully-connected network: the first temporal feature of the audio features is extracted by the gated recurrent network; the first classification feature corresponding to the first temporal feature is determined by the fully-connected network; each element of the first classification feature is substituted into a first preset function to obtain the weight of each element of the first classification feature, where the elements of the first temporal feature correspond one-to-one to the elements of the first classification feature; for any element A of the first temporal feature, the weight of the element corresponding to element A in the first classification feature is multiplied by element A to obtain a first element corresponding to element A; and each element of the first temporal feature is replaced with its corresponding first element to obtain a second temporal feature, which serves as the temporal features of the audio features.
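A sketch of this weighting step, assuming softmax as the first preset function (the function and variable names are illustrative, not the patent's):

```python
import math

def attention_reweight(temporal, classification):
    """Softmax over the classification feature gives one weight per element;
    each temporal element is multiplied by its corresponding weight."""
    exps = [math.exp(c) for c in classification]
    total = sum(exps)
    weights = [e / total for e in exps]  # weights sum to 1
    return [w * t for w, t in zip(weights, temporal)]

second = attention_reweight([1.0, 1.0, 1.0], [0.0, 0.0, math.log(2)])
# The last element gets weight ~0.5, the other two ~0.25 each.
```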
It should be noted that the fully-connected network includes multiple nodes, each of which is connected to all nodes of the preceding network, so the fully-connected network can integrate all elements of the previously extracted first temporal feature to obtain the first classification feature.
In addition, the first preset function may be configured in advance according to different requirements; for example, it may be a softmax function.
Determining the first classification feature corresponding to the first temporal feature by the fully-connected network may be implemented as follows: the fully-connected network multiplies the input first temporal feature by a preset parameter matrix to obtain the first classification feature corresponding to the first temporal feature. Of course, in practice, the fully-connected network may also determine the first classification feature corresponding to the first temporal feature in other ways (for example, by convolution), which is not limited by the present invention.
It should be noted that the preset parameter matrix may be configured in advance according to different requirements. Usually the first temporal feature is a 1×N row vector and the preset parameter matrix is an N×N matrix, in which case the resulting first classification feature is also a 1×N row vector, where N is a positive integer.
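The row-vector-times-matrix step can be sketched directly; the 2×2 identity matrix below is purely illustrative.

```python
def fully_connected(row, matrix):
    """Sketch: multiply a 1xN row vector by an NxN parameter matrix,
    producing another 1xN row vector."""
    n = len(row)
    return [sum(row[i] * matrix[i][j] for i in range(n)) for j in range(n)]

feature = [1.0, 2.0]
params = [[1.0, 0.0],
          [0.0, 1.0]]  # identity matrix, so the output equals the input
print(fully_connected(feature, params))  # [1.0, 2.0]
```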
It is worth noting that the embodiments of the present invention incorporate an attention mechanism, which converts the temporal feature first extracted by the gated recurrent network into a weight for each of its elements. The weight of each element describes the importance of that element within the temporal feature, so subsequent processing of the temporal feature can refer to these weights, increasing the influence of elements with larger weights on the temporal feature and reducing the interference of elements with smaller weights, which makes the extracted temporal features more accurate.
Further, after the second temporal feature is obtained as the temporal features of the audio features, the temporal features may be passed directly to the following step 206. Alternatively, the temporal features may be processed again by the gated recurrent network to obtain new temporal features, which are then passed to the following step 206. Alternatively, when the preset classifier also includes a batch normalization network, the temporal features may be processed by the gated recurrent network and the batch normalization network to obtain new temporal features, which are then passed to the following step 206; these new temporal features are distributed within a stable range, which improves the processing accuracy of the other networks in the preset classifier on the temporal features.
Step 206: According to the temporal features, determine, through the fully-connected network included in the preset classifier, the probability that the category of the target audio is the pre-set category identified by each of the multiple pre-set category identifiers.
It should be noted that the multiple pre-set category identifiers may be configured in advance according to different requirements. For example, the multiple pre-set category identifiers may be identifier 1 and identifier 0, where the category identified by identifier 1 is a high-quality category and the category identified by identifier 0 is a low-quality category; alternatively, the category identified by identifier 1 is a vocal category and the category identified by identifier 0 is an accompaniment category.
Specifically, the fully-connected network determines a second classification feature of the temporal features and substitutes the elements of the second classification feature into a second preset function to obtain, for each of the multiple pre-set category identifiers, the probability that the category of the target audio is the pre-set category identified by that identifier.
It should be noted that the second preset function may be configured in advance according to different requirements; for example, it may be a softmax function. The output of the second preset function is a vector, each element of which represents the probability that the category of the target audio is the pre-set category identified by the pre-set category identifier corresponding to that element. For example, suppose the multiple pre-set category identifiers are identifier 1, identifier 2, and identifier 3, the second preset function is a softmax function, and substituting the elements of the second classification feature into the softmax function yields the output vector (0.02, 0.08, 0.9), where element 0.02 corresponds to identifier 1, element 0.08 corresponds to identifier 2, and element 0.9 corresponds to identifier 3. Then the probability that the category of the target audio is the pre-set category identified by identifier 1 is 0.02, the probability that it is the pre-set category identified by identifier 2 is 0.08, and the probability that it is the pre-set category identified by identifier 3 is 0.9.
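A sketch of the softmax step, with a hypothetical second classification feature (the resulting probabilities will differ from the example vector above):

```python
import math

def softmax(xs):
    """Second-preset-function sketch: one probability per category identifier,
    summing to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.5, 1.9, 4.3])  # hypothetical second classification feature
best = max(range(len(probs)), key=lambda i: probs[i])
print(best)  # 2 -- the third identifier has the largest probability
```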
In practice, when the preset classifier is a binary preset classifier, step 206 may be implemented as follows: the fully-connected network determines the second classification feature corresponding to the temporal features and substitutes the elements of the second classification feature into a third preset function to obtain the probability that the category of the target audio is the pre-set category identified by a first pre-set category identifier; subtracting this probability from 1 gives the probability that the category of the target audio is the pre-set category identified by a second pre-set category identifier.
It should be noted that, when the preset classifier is a binary preset classifier, the number of the multiple pre-set category identifiers is two; that is, the multiple pre-set category identifiers are the first pre-set category identifier and the second pre-set category identifier. In addition, the third preset function may be configured in advance according to different requirements; for example, it may be a sigmoid function.
For example, suppose the multiple pre-set category identifiers are the first pre-set category identifier and the second pre-set category identifier, the third preset function is a sigmoid function, and substituting the elements of the second classification feature into the sigmoid function yields the output value 0.8. Then the probability that the category of the target audio is the pre-set category identified by the first pre-set category identifier is 0.8, and the probability that it is the pre-set category identified by the second pre-set category identifier is 0.2.
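A sketch of the binary case, assuming sigmoid as the third preset function; the input ln 4 is chosen so that the two probabilities come out near 0.8 and 0.2, matching the example above.

```python
import math

def binary_probabilities(x):
    """Third-preset-function sketch: sigmoid gives the probability of the
    first pre-set category; 1 minus that gives the second."""
    p_first = 1 / (1 + math.exp(-x))
    return p_first, 1 - p_first

p1, p2 = binary_probabilities(math.log(4))  # sigmoid(ln 4) = 4/5
# p1 is approximately 0.8, p2 approximately 0.2.
```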
Step 207: Determine the pre-set category identified by the pre-set category identifier with the largest probability among the multiple pre-set category identifiers as the category of the target audio.
For example, suppose there are two pre-set category identifiers, identifier 1 and identifier 0, where the category identified by identifier 1 is a high-quality category and the category identified by identifier 0 is a low-quality category. Assuming the probability that the category of the target audio is the high-quality category identified by identifier 1 is 0.8 and the probability that it is the low-quality category identified by identifier 0 is 0.2, the pre-set category identified by identifier 1, which has the larger probability of the two, is determined as the category of the target audio; that is, the category of the target audio is the high-quality category.
For ease of understanding, the audio classification method provided by the embodiments of the present invention is explained in detail below with reference to Fig. 6.
Referring to Fig. 6, the preset classifier includes a convolutional network C, a gated recurrent network G (bi), a fully-connected network D, a batch normalization network B, and a pooling network P.
First, the audio features of the target audio are extracted by the convolutional network C, and the extracted audio features are processed by the batch normalization network B and the pooling network P. Then, the first temporal feature of the audio features is extracted by the gated recurrent network G (bi), the extracted first temporal feature is processed by the batch normalization network B, and the weight of each element of the first classification feature corresponding to the first temporal feature is determined by the fully-connected network D. After that, for any element A of the first temporal feature, an element-wise multiply network multiplies the weight of the element corresponding to element A in the first classification feature by element A to obtain a first element corresponding to element A, and each element of the first temporal feature is replaced with its corresponding first element, yielding a second temporal feature as the temporal features of the audio features. The temporal features of the audio features are then processed by the gated recurrent network G (bi) and the batch normalization network B. Next, the temporal features are processed by the fully-connected network D to obtain, for each of the multiple pre-set category identifiers, the probability that the category of the target audio is the pre-set category identified by that identifier. Finally, the pre-set category identified by the pre-set category identifier with the largest probability among the multiple pre-set category identifiers is determined as the category of the target audio.
In the embodiments of the present invention, an audio signal is first acquired and then truncated or padded so that its duration is adjusted to a preset duration; the duration of the audio signal is thereby normalized to a more suitable range. The audio signal is then converted into a target audio according to its frequency information. After that, the audio features of the target audio are extracted through the convolutional network included in the preset classifier, which reduces the dimensionality of the features of each audio segment so that the extracted audio features have a relatively low dimensionality. The temporal features of the audio features are then extracted through the gated recurrent network included in the preset classifier. According to the temporal features, the probability that the category of the target audio is the pre-set category identified by each of the multiple pre-set category identifiers is determined through the fully-connected network included in the preset classifier, and the pre-set category identified by the pre-set category identifier with the largest probability among the multiple pre-set category identifiers is determined as the category of the target audio. This classification process is simple and practicable and its efficiency is high; and because the target audio does not need to be segmented, its integrity is preserved, so the classification accuracy is also high.
Next, the audio classification apparatus provided by the embodiments of the present invention is introduced.
Fig. 7 is a schematic structural diagram of an audio classification apparatus provided by an embodiment of the present invention. Referring to Fig. 7, the apparatus includes an acquisition module 301, an adjustment module 302, a conversion module 303, a first extraction module 304, a second extraction module 305, a first determination module 306, and a second determination module 307.
The acquisition module 301 is configured to acquire an audio signal.
The adjustment module 302 is configured to truncate or pad the audio signal so that the duration of the audio signal is adjusted to a preset duration.
The conversion module 303 is configured to convert the audio signal into a target audio according to the frequency information of the audio signal.
The first extraction module 304 is configured to extract the audio features of the target audio through the convolutional network included in the preset classifier.
The second extraction module 305 is configured to extract the temporal features of the audio features through the gated recurrent network included in the preset classifier.
The first determination module 306 is configured to determine, according to the temporal features and through the fully-connected network included in the preset classifier, the probability that the category of the target audio is the pre-set category identified by each of the multiple pre-set category identifiers.
The second determination module 307 is configured to determine the pre-set category identified by the pre-set category identifier with the largest probability among the multiple pre-set category identifiers as the category of the target audio.
Optionally, referring to Fig. 8, the first extraction module 304 includes:
a splitting submodule 3041, configured to divide the target audio into multiple audio segments through the convolutional network;
a first extraction submodule 3042, configured to extract one feature from each of the multiple audio segments through the convolutional network; and
a composition submodule 3043, configured to assemble the extracted features into the audio features of the target audio through the convolutional network.
Optionally, referring to Fig. 9, the second extraction module 305 includes:
a second extraction submodule 3051, configured to extract the first temporal feature of the audio features through the gated recurrent network;
a first determination submodule 3052, configured to determine, through the fully-connected network, the first classification feature corresponding to the first temporal feature;
a first substitution submodule 3053, configured to substitute each element of the first classification feature into the first preset function to obtain the weight of each element of the first classification feature, where the elements of the first temporal feature correspond one-to-one to the elements of the first classification feature;
a multiplication submodule 3054, configured to, for any element A of the first temporal feature, multiply the weight of the element corresponding to element A in the first classification feature by element A to obtain a first element corresponding to element A; and
a replacement submodule 3055, configured to replace each element of the first temporal feature with its corresponding first element to obtain a second temporal feature as the temporal features of the audio features.
Optionally, referring to Fig. 10, the first determination module 306 includes:
a second determination submodule 3061, configured to determine the second classification feature of the temporal features through the fully-connected network; and
a second substitution submodule 3062, configured to substitute the elements of the second classification feature into the second preset function through the fully-connected network to obtain, for each of the multiple pre-set category identifiers, the probability that the category of the target audio is the pre-set category identified by that identifier.
Optionally, the preset classifier further includes at least one of a batch normalization network and a pooling network.
Optionally, referring to Fig. 11, the conversion module 303 includes:
a third determination submodule 3031, configured to determine the mel-frequency cepstral coefficients (MFCC) of the audio signal and generate the target audio according to the MFCC of the audio signal; and
a fourth determination submodule 3032, configured to determine the spectrum of the audio signal and generate the target audio according to the spectrum of the audio signal.
Optionally, referring to Fig. 12, the apparatus further includes:
an obtaining module 308, configured to obtain multiple training audio sets, where all training audios included in each of the multiple training audio sets correspond to the same pre-set category identifier; and
a training module 309, configured to train the classification model to be trained using the multiple training audio sets to obtain the preset classifier.
In the embodiments of the present invention, an audio signal is first acquired and then truncated or padded so that its duration is adjusted to a preset duration; at this point the duration of the audio signal has been regularized to a range better suited to the model. The audio signal is then converted into a target audio according to its frequency information. Next, the audio features of the target audio are extracted by the convolutional network included in the preset classifier, which reduces the dimensionality of the features of each audio fragment, so that the extracted audio features are of relatively low dimension. The temporal features of the audio features are then extracted by the gated recurrent network included in the preset classifier. According to the temporal features, the fully-connected network included in the preset classifier determines the probability that the classification of the target audio is each of the pre-set categories identified by the multiple pre-set category identifiers, and the pre-set category identified by the identifier with the highest probability among the multiple identifiers is determined as the classification of the target audio. This classification process is simple and practicable and has high classification efficiency; and because the target audio does not need to be segmented, the integrity of the target audio is preserved, so classification accuracy is also high.
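The duration-regularization step described above (truncate when too long, supplement when too short) can be sketched as follows; the target length and zero-padding value are illustrative assumptions, since the disclosure does not fix a preset duration or a padding scheme.

```python
import numpy as np

def fix_duration(signal, target_len):
    """Truncate or zero-pad a 1-D audio signal to exactly target_len
    samples, mirroring the 'intercept or supplement' step."""
    if len(signal) >= target_len:
        return signal[:target_len]                  # truncate the excess
    pad = target_len - len(signal)                  # supplement with silence
    return np.concatenate([signal, np.zeros(pad, dtype=signal.dtype)])

short = fix_duration(np.ones(5), 8)    # padded up to 8 samples
long = fix_duration(np.ones(12), 8)    # truncated down to 8 samples
```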
It should be noted that when the audio classification device provided by the above embodiments classifies audio, the division into the above function modules is merely illustrative; in practical applications, the above functions may be allocated to different function modules as needed, that is, the internal structure of the device may be divided into different function modules to complete all or part of the functions described above. In addition, the audio classification device provided by the above embodiments belongs to the same concept as the audio classification method embodiments; for its specific implementation process, refer to the method embodiments, which will not be repeated here.
Figure 13 is a structural schematic diagram of an audio classification device provided by an embodiment of the present invention. Referring to Figure 13, the audio classification device may be a terminal 400, which may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer. The terminal 400 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 400 includes a processor 401 and a memory 402.
The processor 401 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 401 may be implemented in at least one hardware form among a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor: the main processor is the processor for handling data in the awake state, also referred to as the CPU (Central Processing Unit); the coprocessor is a low-power processor for handling data in the standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 402 may include one or more computer-readable storage media, which may be non-transitory. The memory 402 may also include high-speed random access memory and nonvolatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 402 is used to store at least one instruction, which is executed by the processor 401 to implement the audio classification method provided by the method embodiments of the present application.
In some embodiments, the terminal 400 may optionally further include a peripheral device interface 403 and at least one peripheral device. The processor 401, the memory 402, and the peripheral device interface 403 may be connected by a bus or signal line. Each peripheral device may be connected to the peripheral device interface 403 by a bus, signal line, or circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 404, a touch display screen 405, a camera assembly 406, an audio circuit 407, a positioning component 408, and a power supply 409.
The peripheral device interface 403 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 401 and the memory 402. In some embodiments, the processor 401, the memory 402, and the peripheral device interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402, and the peripheral device interface 403 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 404 is used to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 404 communicates with communication networks and other communication devices through electromagnetic signals. The radio frequency circuit 404 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 404 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 404 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes but is not limited to the World Wide Web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may also include circuits related to NFC (Near Field Communication), which is not limited in this application.
The display screen 405 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, it also has the ability to acquire touch signals on or above its surface. The touch signal may be input to the processor 401 as a control signal for processing. At this point, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, arranged on the front panel of the terminal 400; in other embodiments, there may be at least two display screens 405, arranged respectively on different surfaces of the terminal 400 or in a folding design; in still other embodiments, the display screen 405 may be a flexible display screen arranged on a curved surface or folding surface of the terminal 400. The display screen 405 may even be arranged as a non-rectangular irregular figure, that is, a shaped screen. The display screen 405 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 406 is used to capture images or video. Optionally, the camera assembly 406 includes a front camera and a rear camera. In general, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to realize a background-blurring function by fusing the main camera and the depth-of-field camera, panoramic shooting and VR (Virtual Reality) shooting functions by fusing the main camera and the wide-angle camera, or other fused shooting functions. In some embodiments, the camera assembly 406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and may be used for light compensation under different color temperatures.
The audio circuit 407 may include a microphone and a loudspeaker. The microphone is used to collect sound waves from the user and the environment and convert them into electrical signals that are input to the processor 401 for processing, or input to the radio frequency circuit 404 to realize voice communication. For stereo acquisition or noise reduction, there may be multiple microphones, arranged respectively at different parts of the terminal 400. The microphone may also be an array microphone or an omnidirectional microphone. The loudspeaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker may be a traditional diaphragm loudspeaker or a piezoelectric ceramic loudspeaker. When the loudspeaker is a piezoelectric ceramic loudspeaker, it can not only convert electrical signals into sound waves audible to humans but also convert electrical signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 407 may also include a headphone jack.
The positioning component 408 is used to determine the current geographic location of the terminal 400 to realize navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the GLONASS system of Russia.
The power supply 409 is used to supply power to the various components in the terminal 400. The power supply 409 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 409 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, the terminal 400 further includes one or more sensors 410, including but not limited to an acceleration sensor 411, a gyroscope sensor 412, a pressure sensor 413, a fingerprint sensor 414, an optical sensor 415, and a proximity sensor 416.
The acceleration sensor 411 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect the components of gravitational acceleration on the three coordinate axes. According to the gravitational acceleration signal acquired by the acceleration sensor 411, the processor 401 may control the touch display screen 405 to display the user interface in landscape view or portrait view. The acceleration sensor 411 may also be used to acquire motion data for games or of the user.
The gyroscope sensor 412 can detect the body direction and rotation angle of the terminal 400, and may cooperate with the acceleration sensor 411 to acquire the user's 3D actions on the terminal 400. According to the data acquired by the gyroscope sensor 412, the processor 401 may implement the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 413 may be arranged on the side frame of the terminal 400 and/or at the lower layer of the touch display screen 405. When the pressure sensor 413 is arranged on the side frame of the terminal 400, the user's grip signal on the terminal 400 can be detected, and the processor 401 performs left/right-hand recognition or shortcut operations according to the grip signal acquired by the pressure sensor 413. When the pressure sensor 413 is arranged at the lower layer of the touch display screen 405, the processor 401 controls the operability controls on the UI interface according to the user's pressure operation on the touch display screen 405. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 414 is used to collect the user's fingerprint, and the processor 401 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 401 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and so on. The fingerprint sensor 414 may be arranged on the front, back, or side of the terminal 400. When a physical button or a manufacturer logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical button or the manufacturer logo.
The optical sensor 415 is used to acquire the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity acquired by the optical sensor 415: when the ambient light intensity is high, the display brightness of the touch display screen 405 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity acquired by the optical sensor 415.
The proximity sensor 416, also referred to as a distance sensor, is generally arranged on the front panel of the terminal 400. The proximity sensor 416 is used to acquire the distance between the user and the front of the terminal 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front of the terminal 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the screen-on state to the screen-off state; when the proximity sensor 416 detects that the distance between the user and the front of the terminal 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the screen-off state to the screen-on state.
Those skilled in the art will understand that the structure shown in Figure 13 does not constitute a limitation on the terminal 400, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components.
In summary, the embodiments of the present invention provide not only an audio classification device as shown in Figure 13 for implementing the audio classification method described in the embodiments of Figure 1 or Figure 2, but also a computer-readable storage medium storing instructions that, when executed by a processor, implement the audio classification method described in the embodiments of Figure 1 or Figure 2 above.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by instructing relevant hardware through a program, and the program may be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (16)
1. An audio classification method, characterized in that the method comprises:
acquiring an audio signal;
truncating or padding the audio signal to adjust the duration of the audio signal to a preset duration;
converting the audio signal into a target audio according to frequency information of the audio signal;
extracting audio features of the target audio through a convolutional network comprised in a preset classifier;
extracting temporal features of the audio features through a gated recurrent network comprised in the preset classifier;
determining, according to the temporal features and through a fully-connected network comprised in the preset classifier, the probability that the classification of the target audio is each of the pre-set categories identified by multiple pre-set category identifiers; and
determining the pre-set category identified by the identifier with the highest probability among the multiple pre-set category identifiers as the classification of the target audio.
2. The method according to claim 1, characterized in that extracting the audio features of the target audio through the convolutional network comprised in the preset classifier comprises:
splitting the target audio into multiple audio fragments through the convolutional network;
extracting, through the convolutional network, the features of each audio fragment in the multiple audio fragments as one feature; and
composing the audio features of the target audio from the extracted features through the convolutional network.
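The split-then-reduce behavior described in claim 2 — one compact feature per fragment — can be imitated with a simple pooling sketch. The fragment count and the hand-picked statistics (mean and energy) are illustrative stand-ins for the learned reduction performed by the convolutional network:

```python
import numpy as np

def fragment_features(target_audio, n_fragments):
    """Split the audio into equal fragments and reduce each fragment
    to one low-dimensional feature vector (here: mean and energy),
    standing in for the convolutional network's learned reduction."""
    fragments = np.array_split(target_audio, n_fragments)
    return np.stack([np.array([f.mean(), np.sum(f ** 2)]) for f in fragments])

feats = fragment_features(np.arange(10, dtype=float), 5)
# feats has one row (feature) per audio fragment
```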
3. The method according to claim 1, characterized in that extracting the temporal features of the audio features through the gated recurrent network comprised in the preset classifier comprises:
extracting first temporal features of the audio features through the gated recurrent network;
determining, through the fully-connected network, first classification features corresponding to the first temporal features;
substituting each element in the first classification features into a first preset function to obtain a weight for each element in the first classification features, the elements in the first temporal features corresponding one-to-one with the elements in the first classification features;
for any element A in the first temporal features, multiplying the weight of the element corresponding to element A in the first classification features by element A to obtain a first element corresponding to element A; and
replacing each element in the first temporal features with its corresponding first element to obtain second temporal features as the temporal features of the audio features.
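Claim 3 describes an attention-like gating: a fully-connected layer maps the recurrent output to per-element weights via a "first preset function", and the temporal features are scaled element-wise by those weights. The disclosure does not name the function; the sketch below assumes a sigmoid, and the FC weights are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_temporal_features(first_temporal, fc_weight, fc_bias):
    """Element-wise gating as in claim 3: an FC layer produces the
    first classification features, the first preset function (assumed
    sigmoid) turns them into weights in (0, 1), and each temporal
    element is multiplied by its corresponding weight."""
    first_classification = first_temporal @ fc_weight + fc_bias
    weights = sigmoid(first_classification)     # one weight per element
    return first_temporal * weights             # second temporal features

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 3))   # 4 time steps, 3 features (illustrative)
W = np.eye(3)                     # illustrative FC weights
b = np.zeros(3)
second = gate_temporal_features(h, W, b)
```

Because each weight lies in (0, 1), the gating can only attenuate elements, which matches the re-weighting role the claim assigns to this step.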
4. The method according to claim 1, characterized in that determining, according to the temporal features and through the fully-connected network comprised in the preset classifier, the probability that the classification of the target audio is each of the pre-set categories identified by the multiple pre-set category identifiers comprises:
determining second classification features of the temporal features through the fully-connected network; and
substituting the elements in the second classification features into a second preset function through the fully-connected network to obtain the probability that the classification of the target audio is each of the pre-set categories identified by the multiple pre-set category identifiers.
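The "second preset function" of claim 4, which turns the second classification features into per-category probabilities, is most naturally read as a softmax, though the disclosure does not name it. A minimal sketch under that assumption:

```python
import numpy as np

def softmax(scores):
    """Map raw classification scores to probabilities over the
    pre-set categories (numerically stable form)."""
    shifted = scores - scores.max()   # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # illustrative FC outputs, 3 categories
probs = softmax(scores)
predicted = int(np.argmax(probs))    # identifier with the highest probability
```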
5. The method according to claim 1, characterized in that the preset classifier further comprises at least one of a batch normalization network and a pooling network.
6. The method according to claim 1, characterized in that converting the audio signal into the target audio according to the frequency information of the audio signal comprises:
determining the mel-frequency cepstral coefficients (MFCC) of the audio signal and generating the target audio according to the MFCC of the audio signal; or,
determining the frequency spectrum of the audio signal and generating the target audio according to the frequency spectrum of the audio signal.
7. The method according to claim 1, characterized in that before extracting the audio features of the target audio through the convolutional network comprised in the preset classifier, the method further comprises:
obtaining multiple training audio sets, all training audios comprised in each of the multiple training audio sets corresponding to the same pre-set category identifier; and
training a classification model to be trained using the multiple training audio sets to obtain the preset classifier.
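The training step of claim 7 — fitting the classifier on labeled training audio sets — can be illustrated with a toy gradient-descent sketch on a linear softmax classifier. The features, labels, learning rate, and model here are all illustrative assumptions; the disclosure's actual classifier is the convolutional/gated-recurrent/fully-connected network described above.

```python
import numpy as np

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Toy stand-in for 'training the classification model to be
    trained': fit a linear softmax classifier by gradient descent
    on cross-entropy loss."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((X.shape[1], n_classes)) * 0.01
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)         # gradient step
    return W

# two tiny 'training audio sets', one per pre-set category identifier
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])
W = train_softmax(X, y, 2)
pred = (X @ W).argmax(axis=1)
```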
8. An audio classification device, characterized in that the device comprises:
a collection module, configured to acquire an audio signal;
an adjustment module, configured to truncate or pad the audio signal to adjust the duration of the audio signal to a preset duration;
a conversion module, configured to convert the audio signal into a target audio according to frequency information of the audio signal;
a first extraction module, configured to extract audio features of the target audio through a convolutional network comprised in a preset classifier;
a second extraction module, configured to extract temporal features of the audio features through a gated recurrent network comprised in the preset classifier;
a first determining module, configured to determine, according to the temporal features and through a fully-connected network comprised in the preset classifier, the probability that the classification of the target audio is each of the pre-set categories identified by multiple pre-set category identifiers; and
a second determining module, configured to determine the pre-set category identified by the identifier with the highest probability among the multiple pre-set category identifiers as the classification of the target audio.
9. The device according to claim 8, characterized in that the first extraction module comprises:
a splitting submodule, configured to split the target audio into multiple audio fragments through the convolutional network;
a first extraction submodule, configured to extract, through the convolutional network, the features of each audio fragment in the multiple audio fragments as one feature; and
a composition submodule, configured to compose the audio features of the target audio from the extracted features through the convolutional network.
10. The device according to claim 8, characterized in that the second extraction module comprises:
a second extraction submodule, configured to extract first temporal features of the audio features through the gated recurrent network;
a first determination submodule, configured to determine, through the fully-connected network, first classification features corresponding to the first temporal features;
a first substitution submodule, configured to substitute each element in the first classification features into a first preset function to obtain a weight for each element in the first classification features, the elements in the first temporal features corresponding one-to-one with the elements in the first classification features;
a multiplication submodule, configured to, for any element A in the first temporal features, multiply the weight of the element corresponding to element A in the first classification features by element A to obtain a first element corresponding to element A; and
a replacement submodule, configured to replace each element in the first temporal features with its corresponding first element to obtain second temporal features as the temporal features of the audio features.
11. The device according to claim 8, characterized in that the first determining module comprises:
a second determination submodule, configured to determine second classification features of the temporal features through the fully-connected network; and
a second substitution submodule, configured to substitute the elements in the second classification features into a second preset function through the fully-connected network to obtain the probability that the classification of the target audio is each of the pre-set categories identified by the multiple pre-set category identifiers.
12. The device according to claim 8, characterized in that the preset classifier further comprises at least one of a batch normalization network and a pooling network.
13. The device according to claim 8, characterized in that the conversion module comprises:
a third determination submodule, configured to determine the mel-frequency cepstral coefficients (MFCC) of the audio signal and generate the target audio according to the MFCC of the audio signal; and
a fourth determination submodule, configured to determine the frequency spectrum of the audio signal and generate the target audio according to the frequency spectrum of the audio signal.
14. The device according to claim 8, characterized in that the device further comprises:
an acquisition module, configured to obtain multiple training audio sets, all training audios comprised in each of the multiple training audio sets corresponding to the same pre-set category identifier; and
a training module, configured to train a classification model to be trained using the multiple training audio sets to obtain the preset classifier.
15. An audio classification device, characterized in that the device comprises:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method according to any one of claims 1-7.
16. A computer-readable storage medium storing instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810332491.8A CN108538311B (en) | 2018-04-13 | 2018-04-13 | Audio classification method, device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810332491.8A CN108538311B (en) | 2018-04-13 | 2018-04-13 | Audio classification method, device and computer-readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108538311A true CN108538311A (en) | 2018-09-14 |
CN108538311B CN108538311B (en) | 2020-09-15 |
Family
ID=63480527
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810332491.8A Active CN108538311B (en) | 2018-04-13 | 2018-04-13 | Audio classification method, device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108538311B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109671425A (en) * | 2018-12-29 | 2019-04-23 | 广州酷狗计算机科技有限公司 | Audio frequency classification method, device and storage medium |
CN110136744A (en) * | 2019-05-24 | 2019-08-16 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of audio-frequency fingerprint generation method, equipment and storage medium |
CN110136696A (en) * | 2019-05-22 | 2019-08-16 | 上海声构信息科技有限公司 | The monitor processing method and system of audio data |
CN110334240A (en) * | 2019-07-08 | 2019-10-15 | 联想(北京)有限公司 | Information processing method, system and the first equipment, the second equipment |
CN110956980A (en) * | 2019-12-10 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN111046216A (en) * | 2019-12-06 | 2020-04-21 | 广州国音智能科技有限公司 | Audio information access method, device, equipment and computer readable storage medium |
CN111090758A (en) * | 2019-12-10 | 2020-05-01 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN111261174A (en) * | 2018-11-30 | 2020-06-09 | 杭州海康威视数字技术股份有限公司 | Audio classification method and device, terminal and computer readable storage medium |
CN111259189A (en) * | 2018-11-30 | 2020-06-09 | 马上消费金融股份有限公司 | Music classification method and device |
CN111613213A (en) * | 2020-04-29 | 2020-09-01 | 广州三人行壹佰教育科技有限公司 | Method, device, equipment and storage medium for audio classification |
WO2020228226A1 (en) * | 2019-05-14 | 2020-11-19 | 腾讯音乐娱乐科技(深圳)有限公司 | Instrumental music detection method and apparatus, and storage medium |
CN112185396A (en) * | 2020-09-10 | 2021-01-05 | 国家海洋局南海调查技术中心(国家海洋局南海浮标中心) | Offshore wind farm biological monitoring method and system based on passive acoustics |
CN112447187A (en) * | 2019-09-02 | 2021-03-05 | 富士通株式会社 | Device and method for recognizing sound event |
CN114827085A (en) * | 2022-06-24 | 2022-07-29 | 鹏城实验室 | Root server correctness monitoring method, device, equipment and storage medium |
CN116761114A (en) * | 2023-07-14 | 2023-09-15 | 润芯微科技(江苏)有限公司 | Method and system for adjusting playing sound of vehicle-mounted sound equipment |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102668378A (en) * | 2009-12-25 | 2012-09-12 | 佳能株式会社 | Information processing apparatus or information processing method |
CN103456301A (en) * | 2012-05-28 | 2013-12-18 | 中兴通讯股份有限公司 | Ambient sound based scene recognition method and device and mobile terminal |
CN104064212A (en) * | 2014-06-25 | 2014-09-24 | 深圳市中兴移动通信有限公司 | Sound recording method and device |
CN104091594A (en) * | 2013-08-16 | 2014-10-08 | 腾讯科技(深圳)有限公司 | Audio classifying method and device |
CN105788592A (en) * | 2016-04-28 | 2016-07-20 | 乐视控股(北京)有限公司 | Audio classification method and apparatus thereof |
CN105895110A (en) * | 2016-06-30 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for classifying audio files |
CN107408384A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | The end-to-end speech recognition of deployment |
CN107527626A (en) * | 2017-08-30 | 2017-12-29 | 北京嘉楠捷思信息技术有限公司 | Audio identification system |
CN107545890A (en) * | 2017-08-31 | 2018-01-05 | 桂林电子科技大学 | A kind of sound event recognition method |
CN107689223A (en) * | 2017-08-30 | 2018-02-13 | 北京嘉楠捷思信息技术有限公司 | Audio identification method and device |
US20180077387A1 (en) * | 2016-03-23 | 2018-03-15 | Global Tel*Link Corporation | Secure Nonscheduled Video Visitation System |
-
2018
- 2018-04-13 CN CN201810332491.8A patent/CN108538311B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102668378A (en) * | 2009-12-25 | 2012-09-12 | 佳能株式会社 | Information processing apparatus or information processing method |
CN103456301A (en) * | 2012-05-28 | 2013-12-18 | 中兴通讯股份有限公司 | Ambient sound based scene recognition method and device and mobile terminal |
CN104091594A (en) * | 2013-08-16 | 2014-10-08 | 腾讯科技(深圳)有限公司 | Audio classifying method and device |
CN104064212A (en) * | 2014-06-25 | 2014-09-24 | 深圳市中兴移动通信有限公司 | Sound recording method and device |
CN107408384A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | The end-to-end speech recognition of deployment |
US20180077387A1 (en) * | 2016-03-23 | 2018-03-15 | Global Tel*Link Corporation | Secure Nonscheduled Video Visitation System |
CN105788592A (en) * | 2016-04-28 | 2016-07-20 | 乐视控股(北京)有限公司 | Audio classification method and apparatus thereof |
CN105895110A (en) * | 2016-06-30 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for classifying audio files |
CN107527626A (en) * | 2017-08-30 | 2017-12-29 | 北京嘉楠捷思信息技术有限公司 | Audio identification system |
CN107689223A (en) * | 2017-08-30 | 2018-02-13 | 北京嘉楠捷思信息技术有限公司 | Audio identification method and device |
CN107545890A (en) * | 2017-08-31 | 2018-01-05 | 桂林电子科技大学 | Sound event recognition method |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111261174B (en) * | 2018-11-30 | 2023-02-17 | 杭州海康威视数字技术股份有限公司 | Audio classification method and device, terminal and computer readable storage medium |
CN111261174A (en) * | 2018-11-30 | 2020-06-09 | 杭州海康威视数字技术股份有限公司 | Audio classification method and device, terminal and computer readable storage medium |
CN111259189A (en) * | 2018-11-30 | 2020-06-09 | 马上消费金融股份有限公司 | Music classification method and device |
CN109671425A (en) * | 2018-12-29 | 2019-04-23 | 广州酷狗计算机科技有限公司 | Audio classification method, device and storage medium |
CN109671425B (en) * | 2018-12-29 | 2021-04-06 | 广州酷狗计算机科技有限公司 | Audio classification method, device and storage medium |
WO2020228226A1 (en) * | 2019-05-14 | 2020-11-19 | 腾讯音乐娱乐科技(深圳)有限公司 | Instrumental music detection method and apparatus, and storage medium |
CN110136696A (en) * | 2019-05-22 | 2019-08-16 | 上海声构信息科技有限公司 | The monitor processing method and system of audio data |
CN110136696B (en) * | 2019-05-22 | 2021-05-18 | 上海声构信息科技有限公司 | Audio data monitoring processing method and system |
CN110136744A (en) * | 2019-05-24 | 2019-08-16 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio fingerprint generation method, device and storage medium |
CN110136744B (en) * | 2019-05-24 | 2021-03-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio fingerprint generation method, equipment and storage medium |
CN110334240A (en) * | 2019-07-08 | 2019-10-15 | 联想(北京)有限公司 | Information processing method, system, first device and second device |
CN112447187A (en) * | 2019-09-02 | 2021-03-05 | 富士通株式会社 | Device and method for recognizing sound event |
CN111046216A (en) * | 2019-12-06 | 2020-04-21 | 广州国音智能科技有限公司 | Audio information access method, device, equipment and computer readable storage medium |
CN111046216B (en) * | 2019-12-06 | 2024-02-09 | 广州国音智能科技有限公司 | Audio information access method, device, equipment and computer readable storage medium |
CN110956980A (en) * | 2019-12-10 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN111090758A (en) * | 2019-12-10 | 2020-05-01 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN111090758B (en) * | 2019-12-10 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN110956980B (en) * | 2019-12-10 | 2024-04-09 | 腾讯科技(深圳)有限公司 | Media data processing method, device and storage medium |
CN111613213B (en) * | 2020-04-29 | 2023-07-04 | 广州欢聚时代信息科技有限公司 | Audio classification method, device, equipment and storage medium |
CN111613213A (en) * | 2020-04-29 | 2020-09-01 | 广州三人行壹佰教育科技有限公司 | Method, device, equipment and storage medium for audio classification |
CN112185396B (en) * | 2020-09-10 | 2022-03-25 | 国家海洋局南海调查技术中心(国家海洋局南海浮标中心) | Offshore wind farm biological monitoring method and system based on passive acoustics |
CN112185396A (en) * | 2020-09-10 | 2021-01-05 | 国家海洋局南海调查技术中心(国家海洋局南海浮标中心) | Offshore wind farm biological monitoring method and system based on passive acoustics |
CN114827085A (en) * | 2022-06-24 | 2022-07-29 | 鹏城实验室 | Root server correctness monitoring method, device, equipment and storage medium |
CN116761114A (en) * | 2023-07-14 | 2023-09-15 | 润芯微科技(江苏)有限公司 | Method and system for adjusting playing sound of vehicle-mounted sound equipment |
CN116761114B (en) * | 2023-07-14 | 2024-01-26 | 润芯微科技(江苏)有限公司 | Method and system for adjusting playing sound of vehicle-mounted sound equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108538311B (en) | 2020-09-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108538311A (en) | Audio frequency classification method, device and computer readable storage medium | |
CN110121118B (en) | Video clip positioning method and device, computer equipment and storage medium | |
WO2019214361A1 (en) | Method for detecting key term in speech signal, device, terminal, and storage medium | |
CN109086709A (en) | Feature Selection Model training method, device and storage medium | |
CN110110145A (en) | Description document generation method and device | |
CN110097019A (en) | Character recognition method, device, computer equipment and storage medium | |
CN109299315A (en) | Multimedia resource classification method, device, computer equipment and storage medium | |
CN108829881A (en) | Video title generation method and device | |
CN110222789A (en) | Image recognition method and storage medium | |
CN110047468B (en) | Speech recognition method, apparatus and storage medium | |
CN108304506A (en) | Search method, device and equipment | |
CN110018970A (en) | Cache prefetching method, apparatus, equipment and computer readable storage medium | |
CN109994127A (en) | Audio detection method, device, electronic equipment and storage medium | |
CN111105788B (en) | Sensitive word score detection method and device, electronic equipment and storage medium | |
CN111524501A (en) | Voice playing method and device, computer equipment and computer readable storage medium | |
CN109360222A (en) | Image segmentation method, device and storage medium | |
CN108320756A (en) | Method and apparatus for detecting whether audio is pure-music audio | |
CN108922531A (en) | Slot recognition method, device, electronic equipment and storage medium | |
CN108806670B (en) | Audio recognition method, device and storage medium | |
CN109003621A (en) | Audio processing method, device and storage medium | |
CN109065068A (en) | Audio processing method, device and storage medium | |
CN109961802B (en) | Sound quality comparison method, device, electronic equipment and storage medium | |
CN110166275A (en) | Information processing method, device and storage medium | |
CN115206305B (en) | Semantic text generation method and device, electronic equipment and storage medium | |
CN111341307A (en) | Voice recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||