CN105788592A - Audio classification method and apparatus thereof - Google Patents
- Publication number
- CN105788592A (application CN201610279778.XA)
- Authority
- CN
- China
- Prior art keywords
- audio
- audio frequency
- classification
- voice data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
Embodiments of the invention provide an audio classification method and apparatus. The method comprises the steps of: training an audio classification model based on a deep neural network according to collected training data; extracting audio features from audio data; and inputting the audio features into the audio classification model and obtaining as output a classification result for the audio data, the classification result including recorded audio, voice song-search audio and humming audio. In the prior art, the classification accuracy between humming audio and voice song-search audio is low. The embodiments of the invention solve this problem, improve the accuracy of audio classification, and in turn improve the accuracy of song search.
Description
Technical field
Embodiments of the present invention relate to the field of audio technology, and in particular to an audio classification method and apparatus.
Background art
In recent years, with the rapid development of smart television technology, more and more functions can be realized by a smart television; for instance, a song search function can be realized by a smart television.
In a specific application, a smart television can support a song search function in the following three modes. In the first mode, voice song-search audio from the user is received, for example a sentence spoken by the user such as "search for the song 'Blue and White Porcelain'", and the smart television then searches in the search engine corresponding to voice song-search audio. In the second mode, a piece of recorded audio input by the user is received, for instance a recording of background music, and the smart television searches in the search engine corresponding to recorded audio. In the third mode, a piece of humming audio input by the user is received, for instance a piece of music hummed by the user, and the smart television searches in the search engine corresponding to humming audio.
It can be seen that, before searching for a song, the smart television first needs to classify the received audio data to determine whether the audio data is voice song-search audio, recorded audio, humming audio or the like, and only then can it search in the search engine corresponding to the audio data type and return the search results to the user. However, since voice song-search audio and humming audio are generally similar, the accuracy with which existing audio data classification methods distinguish voice song-search audio from humming audio is relatively low.
Summary of the invention
The embodiments of the present invention provide an audio classification method and apparatus, so as to solve the problem of the relatively low accuracy of audio classification in the prior art.
An embodiment of the present invention provides an audio classification method, comprising:
training an audio classification model based on a deep neural network according to collected training data;
extracting audio features from audio data; and
inputting the audio features into the audio classification model and obtaining as output a classification result for the audio data, the classification result including recorded audio, voice song-search audio and humming audio.
An embodiment of the present invention further provides an audio classification apparatus, comprising:
a training module, configured to train an audio classification model based on a deep neural network according to collected training data;
a first extraction module, configured to extract audio features from audio data; and
an output module, configured to input the audio features into the audio classification model and obtain as output a classification result for the audio data, the classification result including recorded audio, voice song-search audio and humming audio.
The embodiments of the present invention can train an audio classification model based on a deep neural network according to training data, extract audio features from audio data, and classify the audio data by means of the audio classification model and the audio features. Since a deep neural network can better simulate human analytical ability, an audio classification model based on a deep neural network can distinguish audio data well; in particular, humming audio of short effective duration can be distinguished from voice song-search audio with higher accuracy. The embodiments of the present invention can therefore solve the prior-art problem of low classification accuracy between humming audio and voice song-search audio, improve the accuracy of audio classification, and in turn improve the accuracy of song search.
Brief description of the drawings
In order to illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings described below show some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 shows a flowchart of the steps of Embodiment 1 of an audio classification method of the present invention;
Fig. 2 shows a flowchart of the steps of Embodiment 2 of an audio classification method of the present invention;
Fig. 3 shows a flowchart of the steps of Embodiment 3 of an audio classification method of the present invention;
Fig. 4 shows a schematic diagram of a multilayer deep neural network model of the present invention;
Fig. 5 shows a flowchart of the steps of Embodiment 4 of an audio classification method of the present invention;
Fig. 6 shows a flowchart of the steps of Embodiment 5 of an audio classification method of the present invention;
Fig. 7 shows a structural block diagram of an embodiment of an audio classification apparatus of the present invention.
Detailed description of the invention
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Apparently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Method Embodiment 1
Referring to Fig. 1, a flowchart of the steps of Embodiment 1 of an audio classification method of the present invention is shown. The method may specifically include the following steps:
Step 101: training an audio classification model based on a deep neural network according to collected training data;
Step 102: extracting audio features from audio data;
Step 103: inputting the audio features into the audio classification model and obtaining as output a classification result for the audio data, the classification result including recorded audio, voice song-search audio and humming audio.
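The three steps above can be sketched as a minimal pipeline. The patent does not fix a network architecture, so the small feed-forward network below, its layer sizes, and its random weights are all illustrative assumptions standing in for the trained DNN of step 101; the feature extractor of step 102 is likewise a placeholder.

```python
import numpy as np

CLASSES = ["recorded", "voice_song_search", "humming"]  # classes named in the claim

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class TinyAudioClassifier:
    """Stand-in for the trained DNN of step 101; weights are random, not trained."""
    def __init__(self, n_features=42, n_hidden=16, n_classes=3, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.w2 = rng.normal(0, 0.1, (n_hidden, n_classes))

    def classify(self, features):
        """Step 103: forward pass over one 42-dim feature vector -> (label, probs)."""
        h = np.tanh(features @ self.w1)
        probs = softmax(h @ self.w2)
        return CLASSES[int(np.argmax(probs))], probs

# Step 102 placeholder: the real extractor produces 42-dim MFCC + F0 features.
features = np.zeros(42)
label, probs = TinyAudioClassifier().classify(features)
```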
The embodiments of the present invention can be used by an intelligent terminal to classify audio data. Specifically, the intelligent terminal may first train an audio classification model based on a deep neural network according to collected training data, then extract audio features from the audio data to be classified, and finally input the extracted audio features into the preset classification model; the classification model analyzes and processes the audio features and outputs the classification result corresponding to the audio data. For example, the output classification result may specifically include: recorded audio, voice song-search audio, humming audio, and the like. The intelligent terminal may specifically include terminal devices of various forms such as a smartphone, a tablet computer and a smart television; the embodiments of the present invention place no limitation on the specific form of the intelligent terminal. For ease of description, the embodiments of the present invention are all illustrated with a smart television; other application scenarios may be understood by cross-reference.
The preset classification model is trained according to collected training data of different audio types. In the embodiments of the present invention, a classification model based on a DNN (Deep Neural Network) is used as the preset audio classification model.
It should be noted that a neural network is a model that simulates the human brain. The cognitive process of the human brain is a deep, multilayer and complex process, with each additional layer adding a level of abstraction. Deep learning algorithms based on DNNs make it feasible to train neural networks with at least seven layers; since a DNN can better simulate the deep multilayer transmission among neurons of the human brain, it achieves clear breakthroughs on some difficult problems.
In an application embodiment of the present invention, the method may further comprise the following step:
searching, in the search engine corresponding to the classification result, for related resources of the audio data.
After the classification result corresponding to the audio data is obtained, the audio data may be input into the search engine corresponding to the classification result. For example, when the classification result corresponding to the audio data is determined to be recorded audio, the audio data may be searched for in a music recording recognition engine; when the classification result is determined to be voice song-search audio, the audio data may be searched for in a voice song-search engine; when the classification result is determined to be humming audio, the audio data may be searched for in a humming song-search engine, and so on, and the desired search results are returned to the user. Thus, in practical applications, a song search function for multiple audio types can be realized by providing a single record button on the intelligent terminal. For example, after the user clicks the record button, the audio data to be searched for can be recorded, such as a sentence of voice song-search audio spoken by the user, a piece of recorded audio, or a piece of humming audio hummed by the user; audio features are extracted from the recorded audio data and input into the preset model, the corresponding classification result is output, and the audio data can then be searched for in the corresponding search engine according to the classification result. The embodiments of the present invention can thereby realize an audio classification system with a unified entrance, which makes it convenient for users to search for songs by voice through an intelligent terminal and greatly improves the user experience.
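The routing described above can be sketched as a simple dispatch table. The engine identifiers below are hypothetical placeholders; the patent describes only the mapping from classification result to the corresponding engine, not any engine names.

```python
# Hypothetical engine identifiers; the patent does not name concrete engines.
ENGINES = {
    "recorded": "music_recording_recognition_engine",
    "voice_song_search": "voice_song_search_engine",
    "humming": "humming_song_search_engine",
}

def route_to_engine(classification_result):
    """Pick the search engine corresponding to the model's classification result."""
    try:
        return ENGINES[classification_result]
    except KeyError:
        raise ValueError(f"unknown classification result: {classification_result}")

engine = route_to_engine("humming")  # -> "humming_song_search_engine"
```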
In an application example of the present invention, when the intelligent terminal receives the voice song-search audio "search for the song 'Blue and White Porcelain'" from the user, the audio features corresponding to this voice song-search audio, including the mel-frequency cepstral coefficient features and the fundamental frequency features, are extracted and input into the preset classification model; the classification model outputs the corresponding classification result, namely voice song-search audio, and the song resource "Blue and White Porcelain" can then be looked up in the voice song-search engine and the search results returned to the user.
In summary, the embodiments of the present invention can train an audio classification model based on a DNN according to training data, extract audio features from audio data, classify the audio data by means of the audio classification model and the audio features, and then search, in the search engine corresponding to the classification result, for related resources of the audio data and return the search results to the user. Since a DNN can better simulate human analytical ability, an audio classification model based on a DNN can distinguish audio data well; in particular, humming audio with few valid frames can be distinguished from voice song-search audio with higher accuracy. Thus, the embodiments of the present invention can not only solve the prior-art problem of low classification accuracy between humming audio and voice song-search audio, improving the accuracy of audio classification and hence of song search, but can also return search results corresponding to the audio data to the user, improving the user experience.
Method Embodiment 2
On the basis of Embodiment 1 above, the present embodiment describes the specific process of extracting the audio features corresponding to the audio data. The audio features extracted in Embodiment 1 are mainly the MFCC features commonly used in speech recognition, together with the fundamental frequency features that can effectively distinguish humming audio with few valid frames from voice song-search audio. In the present embodiment, the extracted audio features are further dynamically extended by computing first-order and second-order differences so that the audio features become more prominent, and are finally combined into 42-dimensional audio features.
Referring to Fig. 2, a flowchart of the steps of a method embodiment for extracting audio features from audio data according to the present invention is shown. The method may specifically include:
Step 201: extracting the mel-frequency cepstral coefficient features and the fundamental frequency features corresponding to the audio data/training data;
In order to solve the prior-art problem of low classification accuracy between voice song-search audio and humming audio, the embodiments of the present invention classify audio data using the mel-frequency cepstral coefficient features and the fundamental frequency features of the audio data. In practical applications, MFCCs (Mel-scale Frequency Cepstral Coefficients) are cepstral parameters extracted in the mel-scale frequency domain, the mel being a unit of pitch; the mel scale describes the nonlinear frequency characteristics of the human ear and reflects its auditory properties, so the mel-frequency cepstral coefficient features can identify voice-like audio well. The fundamental frequency reflects pitch information at the level of human hearing. Specifically, during the production of voiced sound, airflow through the glottis causes the vocal cords to vibrate in a relaxation-oscillation manner, producing a quasi-periodic train of air pulses; this airflow excites the vocal tract and produces voiced sound, which carries most of the energy in speech, and the vibration frequency of the vocal cords is known as the fundamental frequency. The embodiments of the present invention exploit the fact that the fundamental frequency of humming audio fluctuates more gently than that of spoken audio, and can thus distinguish humming audio from voice song-search audio well. Therefore, according to audio features including the mel-frequency cepstral coefficient features and the fundamental frequency features, audio data can be distinguished well; in particular, humming audio with few valid frames and voice song-search audio can be distinguished with good accuracy.
In the embodiments of the present invention, the 13-dimensional original MFCC features and the 1-dimensional original fundamental frequency feature corresponding to each frame of the audio data/training data may specifically be extracted.
Step 202: computing first-order and second-order differences of the mel-frequency cepstral coefficient features to obtain multidimensional mel-frequency cepstral coefficient features;
It should be noted that, in order to further highlight the MFCC features of the audio data so that the audio data is more clearly distinguished from other audio data, first-order and second-order differences of the 13-dimensional original MFCC features corresponding to the audio data may be computed to obtain the multidimensional mel-frequency cepstral coefficient features corresponding to the audio data, for example 39-dimensional MFCC features.
In a practical application of the present invention, the specific process of computing first-order and second-order differences of the 13-dimensional original MFCC features corresponding to the audio data to obtain 39-dimensional MFCC features may include the following.
m'[n] = m[n] - m[n-1] (1)
First, according to the 13-dimensional original MFCC features corresponding to the audio data, the 13-dimensional first-order MFCC features corresponding to the audio data can be computed by formula (1). Taking the n-th frame of the audio data as an example, m[n] denotes the 13-dimensional original MFCC features corresponding to the n-th frame, and m'[n] denotes the computed 13-dimensional first-order MFCC features corresponding to the n-th frame.
m''[n] = m'[n] - m'[n-1] (2)
Next, according to the computed 13-dimensional first-order MFCC features, the 13-dimensional second-order MFCC features corresponding to the audio data can be computed by formula (2). Taking the n-th frame as an example, m'[n] denotes the 13-dimensional first-order MFCC features corresponding to the n-th frame, and m''[n] denotes the computed 13-dimensional second-order MFCC features corresponding to the n-th frame.
M[n] = m[n] + m'[n] + m''[n] (3)
Finally, according to the 13-dimensional original, first-order and second-order MFCC features, the 39-dimensional MFCC features corresponding to the audio data can be obtained by formula (3), where "+" denotes concatenation of the feature vectors. Taking the n-th frame as an example, M[n] denotes the 39-dimensional MFCC features corresponding to the n-th frame, m[n] the 13-dimensional original MFCC features, m'[n] the 13-dimensional first-order MFCC features, and m''[n] the 13-dimensional second-order MFCC features.
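Formulas (1) to (3) can be sketched as follows: a simple frame-to-frame difference, applied twice, with the three 13-dimensional blocks concatenated into one 39-dimensional vector per frame. The boundary handling (repeating the first frame so that m'[0] is zero) is an assumption, as the patent does not specify a rule for the first frame. The same computation applied to a 1-dimensional fundamental frequency track yields the 3-dimensional features of formulas (4) to (6).

```python
import numpy as np

def delta(feat):
    """First-order difference m'[n] = m[n] - m[n-1]; first frame padded with itself."""
    padded = np.vstack([feat[:1], feat])   # assume m[-1] == m[0] at the boundary
    return padded[1:] - padded[:-1]

def extend_features(mfcc):
    """(T, 13) original MFCCs -> (T, 39): [m[n], m'[n], m''[n]] per formula (3)."""
    d1 = delta(mfcc)    # formula (1)
    d2 = delta(d1)      # formula (2)
    return np.hstack([mfcc, d1, d2])

# Toy example: 4 frames of 3 coefficients; with 13 coefficients per frame the
# output would be (T, 39) as in the patent.
frames = np.arange(12.0).reshape(4, 3)
ext = extend_features(frames)  # shape (4, 9)
```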
Step 203: computing first-order and second-order differences of the fundamental frequency feature to obtain multidimensional fundamental frequency features;
In the embodiments of the present invention, first-order and second-order differences of the 1-dimensional original fundamental frequency feature corresponding to the audio data may be computed to obtain multidimensional fundamental frequency features, for example 3-dimensional fundamental frequency features. The specific process may include the following.
x'[n] = x[n] - x[n-1] (4)
First, the embodiments of the present invention extract the 1-dimensional original fundamental frequency feature corresponding to the audio data by the autocorrelation function method. According to this feature, the 1-dimensional first-order fundamental frequency feature corresponding to the audio data can be computed by formula (4). Taking the n-th frame of the audio data as an example, x[n] denotes the 1-dimensional original fundamental frequency feature corresponding to the n-th frame, and x'[n] denotes the computed 1-dimensional first-order fundamental frequency feature corresponding to the n-th frame.
x''[n] = x'[n] - x'[n-1] (5)
Next, according to the computed 1-dimensional first-order fundamental frequency feature, the 1-dimensional second-order fundamental frequency feature corresponding to the audio data can be computed by formula (5). Taking the n-th frame as an example, x'[n] denotes the 1-dimensional first-order fundamental frequency feature corresponding to the n-th frame, and x''[n] denotes the computed 1-dimensional second-order fundamental frequency feature corresponding to the n-th frame.
X[n] = x[n] + x'[n] + x''[n] (6)
Finally, according to the 1-dimensional original, first-order and second-order fundamental frequency features, the 3-dimensional fundamental frequency features corresponding to the audio data can be obtained by formula (6), where "+" again denotes concatenation. Taking the n-th frame as an example, X[n] denotes the 3-dimensional fundamental frequency features corresponding to the n-th frame, x[n] the 1-dimensional original fundamental frequency feature, x'[n] the 1-dimensional first-order fundamental frequency feature, and x''[n] the 1-dimensional second-order fundamental frequency feature.
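The patent states that the 1-dimensional fundamental frequency feature is extracted by the autocorrelation function method. A minimal sketch of that idea follows, under assumed frame length, sample rate and search range, none of which the patent specifies: the lag that maximizes the frame's autocorrelation within the plausible pitch range gives the period, whose inverse is the F0 estimate.

```python
import numpy as np

def f0_autocorr(frame, sr=16000, fmin=80.0, fmax=400.0):
    """Estimate F0 of one frame by picking the autocorrelation peak in [fmin, fmax].

    sr, fmin and fmax are assumed values; the patent does not specify them.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for the pitch search
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return sr / lag

sr = 16000
t = np.arange(640) / sr                  # one 40 ms frame (assumed length)
frame = np.sin(2 * np.pi * 200.0 * t)    # 200 Hz test tone
f0 = f0_autocorr(frame, sr)              # close to 200 Hz
```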
Step 204: determining, according to the multidimensional mel-frequency cepstral coefficient features and the multidimensional fundamental frequency features, the multidimensional audio features corresponding to the audio data/training data.
Specifically, the computed multidimensional mel-frequency cepstral coefficient features and multidimensional fundamental frequency features corresponding to the audio data/training data are concatenated to obtain the multidimensional audio features corresponding to the audio data/training data. For example, concatenating the 39-dimensional MFCC features and the 3-dimensional fundamental frequency features corresponding to the audio data yields the 42-dimensional audio features corresponding to the audio data.
F[n] = M[n] + X[n] (7)
As shown in formula (7), taking the n-th frame of the audio data as an example, F[n] denotes the 42-dimensional audio features corresponding to the n-th frame, M[n] the computed 39-dimensional MFCC features corresponding to the n-th frame, and X[n] the 3-dimensional fundamental frequency features corresponding to the n-th frame; "+" denotes concatenation.
Research has found that the human ear has different hearing sensitivities to sound waves of different frequencies; when a high-frequency sound and a low-frequency sound are produced simultaneously, the high-frequency sound is easily masked by the low-frequency sound, and people tend to hear the low-frequency sound more easily. Based on these auditory properties of the human ear, MFCC features can embody the sensitivity of the human ear to sounds of different frequencies; therefore, voice-like audio can be identified well according to MFCC features. At the same time, since the fundamental frequency features of audio reflect pitch information at the level of human hearing, and the fundamental frequency of humming audio generally fluctuates more gently than that of spoken audio, humming audio can be distinguished from voice song-search audio well according to the fundamental frequency features corresponding to the audio.
In the embodiments of the present invention, the generation of 42-dimensional audio features from the 13-dimensional original MFCC features and the 1-dimensional original fundamental frequency feature corresponding to the audio data/training data is described as an example; it should be understood that the embodiments of the present invention place no limitation on the dimensions of the MFCC features and fundamental frequency features corresponding to the audio data.
It can be seen that the embodiments of the present invention can compute, from the extracted 13-dimensional original MFCC features and 1-dimensional original fundamental frequency feature corresponding to the audio data/training data, the corresponding 42-dimensional audio features through first-order and second-order differencing, and then classify the audio data by the classification model according to the 42-dimensional audio features, or input the 42-dimensional audio features into the DNN-based audio classification model. Since the first-order and second-order difference computations further extend the audio features on the basis of the 13-dimensional original MFCC features and the 1-dimensional original fundamental frequency feature, making the audio data/training data more clearly distinguished from other audio data, the embodiments of the present invention can classify the audio data efficiently by the classification model according to the extended 42-dimensional audio features, so as to obtain classification results with higher accuracy; moreover, the audio classification model can be trained with the extended 42-dimensional audio features of the training data, so as to obtain a more mature audio classification model.
Method Embodiment 3
On the basis of Embodiment 1 above, the present embodiment describes the training process of the audio classification model. Referring to Fig. 3, a flowchart of the steps of an embodiment of a training method for an audio classification model of the present invention is shown. The method may specifically include:
Step 301: collecting training data, where the training data may specifically include recorded audio, voice song-search audio and humming audio;
In the embodiments of the present invention, the collected training data may specifically include recorded audio, voice song-search audio and humming audio, so as to meet common audio classification requirements. In order that the trained classification model can better avoid the interference caused by noise and silence, the embodiments of the present invention may further collect noise audio and silent audio as training data; still further, the humming audio may be split into singing audio and melody-humming audio so that the classification results are more refined, whereby more accurate classification results can be obtained and the accuracy of song search further improved.
The sources of the training data may specifically be song recordings, mp3 files, or raw audio data obtained from the network. For example, the audio data corresponding to voice song-search audio may be obtained from the recording data of a television voice assistant; the audio data corresponding to singing audio and melody-humming audio may be obtained from recordings of users singing and from a cappella audio obtained by removing the background sound from recorded music; the audio data corresponding to noise audio may be obtained from general noise recorded for speech recognition; and the audio data corresponding to silent audio may be obtained from silence data collected by speech recognition endpoint detection. It will be appreciated that the embodiments of the present invention place no limitation on the specific manner of collecting training data. In an application example of the present invention, the ratio of recorded audio, voice song-search audio, singing audio, melody-humming audio, noise audio and silent audio in the training data may specifically be 24:9:5:3:10:5.
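The 24:9:5:3:10:5 ratio above can be turned into per-class sample counts for a target corpus size. The total of 56,000 clips below is an illustrative assumption; the patent gives only the ratio.

```python
# Ratio of audio types in the training data, per the application example above.
RATIO = {
    "recorded": 24,
    "voice_song_search": 9,
    "singing": 5,
    "melody_humming": 3,
    "noise": 10,
    "silence": 5,
}

def class_counts(total_clips):
    """Split a target corpus size across classes in proportion to RATIO."""
    denom = sum(RATIO.values())  # 24+9+5+3+10+5 = 56
    return {name: total_clips * part // denom for name, part in RATIO.items()}

counts = class_counts(56000)  # e.g. 24000 recorded clips, 10000 noise clips
```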
Step 302: extract audio features from the training data.
Specifically, mel-frequency cepstral coefficient (MFCC) features and fundamental-frequency (pitch) features are extracted from the training data; these features are then dynamically expanded by computing their first-order and second-order differences, and finally combined into a 42-dimensional training feature vector.
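As a rough sketch of this feature layout, the per-frame vector can be assembled as follows; the split into 13 MFCCs plus one pitch value per frame is an assumption for illustration, since the text only states the 42-dimensional total:

```python
import numpy as np

def delta(feat, width=2):
    """Regression-style first-order difference along the time axis.
    feat: (num_frames, dim) array; edges are padded by repetition."""
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, width + 1))
    return sum(i * (padded[width + i:len(feat) + width + i] -
                    padded[width - i:len(feat) + width - i])
               for i in range(1, width + 1)) / denom

def make_training_features(mfcc, f0):
    """Combine per-frame MFCCs (num_frames, 13) and pitch (num_frames, 1)
    with their first- and second-order differences -> (num_frames, 42)."""
    base = np.hstack([mfcc, f0])        # (num_frames, 14)
    d1 = delta(base)                    # first-order difference
    d2 = delta(d1)                      # second-order difference
    return np.hstack([base, d1, d2])    # 14 * 3 = 42 dimensions

frames = make_training_features(np.random.randn(100, 13),
                                np.random.randn(100, 1))
assert frames.shape == (100, 42)
```

The same routine serves both training data and audio data to be classified, since the patent extracts identical features in both phases.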
Step 303: train an audio classification model based on a deep neural network, using the extracted audio features.
In a practical application of the embodiment of the present invention, the step of training a classification model based on a deep neural network from the extracted audio features may specifically include:
Sub-step S11: input the audio features into the deep-neural-network-based audio classification model, and apply extension processing to the audio features.
In practice, the number of extension frames n is set first, and the audio features are then input into the audio classification model, at which point the frames in the audio features undergo extension processing. Specifically, the audio features of the n frames before the current frame and of the n frames after the current frame may all be appended to the audio features of the current frame, and the result is used as the extended audio features of the current frame. For example, if the audio features of the current frame have 42 dimensions, the features of the n preceding frames and of the n following frames each contribute 42*n dimensions, so the extended features of the current frame have 42 + 42*n + 42*n = 42*(1 + 2*n) dimensions.
In general, n ranges from 5 to 10; in the embodiment of the present invention, the classification accuracy of the audio classification model is highest when n is set to 10.
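The frame-extension step can be sketched with NumPy as below; repeating the edge frame at the sequence boundaries is an assumption, since the text does not specify how the first and last n frames are handled:

```python
import numpy as np

def extend_frames(feats, n):
    """Append the features of the n preceding and n following frames to each
    frame, giving 42*(1 + 2n) dimensions per frame.  Boundary frames reuse
    the nearest edge frame (an assumption; padding is unspecified)."""
    num_frames, dim = feats.shape
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * n + 1)])

feats = np.random.randn(200, 42)
extended = extend_frames(feats, n=10)
assert extended.shape == (200, 42 * (1 + 2 * 10))        # (200, 882)
assert np.array_equal(extended[:, 10 * 42:11 * 42], feats)  # centre block
```

With n=10, each 42-dimensional frame becomes an 882-dimensional input to the network, matching the dimension formula above.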
Sub-step S12: set the relevant parameters of the deep neural network, including the number of hidden layers and the number of hidden nodes.
In the embodiment of the present invention, the relevant parameters of the DNN are set and adjusted, including the number of hidden layers m and the number of hidden nodes M, so as to build the initial DNN classification model.
It should be noted that, since the DNN adopts a layered structure similar to that of a neural network, the system comprises a multilayer network consisting of an input layer, hidden layers, and an output layer; in this multilayer network, adjacent layers are connected, while nodes within the same layer, and nodes in non-adjacent layers, are not connected. A schematic diagram of the multilayer DNN model is shown in Figure 4; in Figure 4, the number of hidden layers is m=2 (the number of hidden layers in the model) and the number of hidden nodes is M=3 (the number of nodes in each hidden layer).
In general, the number of hidden layers m ranges from 3 to 5. In the embodiment of the present invention, taking into account both the training speed and the classification accuracy of the DNN-based audio classification model, m may be set to 3 and the number of hidden nodes M to 512.
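A minimal forward pass for this topology (three hidden layers of 512 nodes and six outputs, one per classification submodel) might look like the following; the ReLU activations, softmax output, and weight initialisation are assumptions, as the text does not name them:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dnn(in_dim=882, hidden=512, m=3, out_dim=6):
    """Weights for m=3 hidden layers of M=512 nodes and a six-way output,
    one output per classification submodel."""
    dims = [in_dim] + [hidden] * m + [out_dim]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass: ReLU hidden layers, softmax over the six classes."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)          # adjacent layers connected
    W, b = params[-1]
    logits = x @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

params = init_dnn()
probs = forward(params, rng.standard_normal((4, 882)))
assert probs.shape == (4, 6)
assert np.allclose(probs.sum(axis=1), 1.0)
```

The 882-dimensional input corresponds to the extended frames described in Sub-step S11 with n=10.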
After the DNN parameters have been set, the newly built DNN-based audio classification model can be trained iteratively on the training data to obtain a mature DNN-based audio classification model.
Sub-step S13: train the DNN-based audio classification model.
During training, each audio type may correspond to one classification submodel; specifically, the audio classification model may include the following six classification submodels: a recorded-audio classification submodel, a voice song-search audio classification submodel, a singing-audio classification submodel, a hum-audio classification submodel, a noise-audio classification submodel, and a silence-audio classification submodel.
Specifically, the audio classification model can be trained by iterative computation on the DNN-based model; in the embodiment of the present invention, the objective function of the iterative computation is the mean squared error.
It should be noted that the process of taking the audio features corresponding to the audio data as input and computing the corresponding value of the objective function is called one iteration. In an actual iterative process, the audio features corresponding to the audio data are used as the input value, the corresponding output value is computed by the audio classification model, and that output value is then substituted into the objective function to obtain the objective value of the current iteration.
In a specific implementation, the number of iterations may serve as the criterion for completion of training, or training may be considered complete when the fluctuation of the objective function falls below a preset threshold. In the embodiment of the present invention, when the fluctuation of the objective function is below the preset threshold, the iterative computation stops and training of the audio classification model is complete. In addition, the embodiment also tested the models obtained at the 6th through 10th iterations; the results show that the DNN classification model produced by the 7th iteration gives the best classification performance on the audio types described.
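The stopping rule described here, halting when the objective's change between successive iterations drops below a preset threshold, can be sketched as follows; `step` and `eps` stand in for the real training iteration and threshold:

```python
def train_until_stable(step, eps=1e-4, max_iters=100):
    """Call step() (one training iteration returning the mean-squared-error
    objective) until its fluctuation falls below eps, or max_iters is hit."""
    prev = None
    for it in range(1, max_iters + 1):
        loss = step()
        if prev is not None and abs(prev - loss) < eps:
            return it, loss          # training considered complete
        prev = loss
    return max_iters, prev

# A toy objective that settles towards a fixed point:
losses = iter([1.0, 0.5, 0.3, 0.2, 0.15, 0.13, 0.12, 0.1199, 0.1199])
it, loss = train_until_stable(lambda: next(losses), eps=1e-3)
assert it == 8
```

The alternative criterion, a fixed iteration count, corresponds to simply returning after `max_iters` regardless of the fluctuation check.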
In summary, the embodiment of the present invention first extracts the corresponding audio features from a large amount of training data, then obtains a mature DNN-based audio classification model through iterative training. Since the training data of the classification model includes noise audio and silence audio, noise and silence can be accurately distinguished, avoiding the impact of noise and silence on the classification accuracy of the other audio types. At the same time, the strong processing capability of the trained audio classification model improves its classification accuracy in noisy environments, providing a better user experience.
Method Embodiment Four
This embodiment builds on Embodiment One above: the audio classification model may include at least one classification submodel, each classification submodel corresponding to one classification result. Through these classification submodels, a greater variety of audio types can be distinguished, improving the accuracy of audio classification. Referring to Fig. 5, a flow chart of the steps of Embodiment Four of an audio classification method of the present invention is shown, which may specifically include:
Step 501: input the audio features into the audio classification model.
In a preferred embodiment of the present invention, the preset audio classification model may include six classification submodels, each corresponding to one audio type: a recorded-audio classification submodel, a voice song-search audio classification submodel, a singing-audio classification submodel, a hum-audio classification submodel, a noise-audio classification submodel, and a silence-audio classification submodel. Noise audio and silence audio are added to the audio classification model as two classification submodels because they differ greatly from the other audio types; their features can therefore be used to recognise them, preventing noise audio and silence audio from degrading the classification accuracy of the other types and ensuring the accuracy of the resulting audio type.
Step 502: calculate the probability value of the audio features according to each classification submodel.
Specifically, the audio features may be used as the input parameter of each classification submodel in the audio classification model, generating the likelihood probability values of the six classification submodels for the audio data.
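For illustration, the six likelihood values can be paired with their submodels as below; the short submodel names are invented for this sketch and are not identifiers from the patent:

```python
SUBMODELS = ["recorded", "voice_song_search", "singing",
             "hum", "noise", "silence"]

def submodel_probabilities(probs):
    """Pair the six likelihood values produced by the model with the
    submodel (audio type) each one corresponds to."""
    return dict(zip(SUBMODELS, probs))

scores = submodel_probabilities([0.05, 0.55, 0.10, 0.10, 0.15, 0.05])
assert max(scores, key=scores.get) == "voice_song_search"
```

Step 503 below reduces to taking the key with the largest value in this mapping.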
It should be noted that the audio classification model may be used to classify audio data across multiple application ranges. For example, in the embodiment of the present invention, the audio data corresponding to a voice song-search instruction is classified; such audio data mainly records a song title or a song's tune, so the classification model needs to include the recorded-audio, voice song-search audio, singing-audio, hum-audio, noise-audio, and silence-audio classification submodels. When audio data corresponding to a voice movie-search instruction needs to be classified, the audio data may record not only a movie title or the movie's music but also lines from the movie; in that case the classification model may need to include a recorded-audio classification submodel, a voice movie-search audio classification submodel, a movie-lines audio classification submodel, a noise-audio classification submodel, a silence-audio classification submodel, and so on. The type and number of classification submodels included in the audio classification model can therefore be determined according to the application range of the audio data to be classified, and may differ as that application range differs. It should be understood that the embodiment of the present invention places no limitation on the type and number of classification submodels.
Step 503: output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
In a preferred embodiment of the present invention, the step of outputting the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data may specifically include:
Sub-step S21: when the effective duration of the audio data is greater than a preset effective-duration threshold, output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
In practice, since noise audio and silence audio differ greatly from the other audio types, whether the audio data belongs to either of these two types can be determined regardless of its duration.
In the embodiment of the present invention, the effective-duration threshold may be set to 2 seconds. Statistics show that for 60% of voice song-search audio the effective duration exceeds 2 seconds, while the effective duration of humming audio is relatively long, typically above 4 seconds.
Sub-step S22: when the effective duration of the audio data is less than or equal to the preset effective-duration threshold, determine the classification result of the audio data from among voice song-search audio, humming audio, and recorded audio.
It should be noted that when the effective duration of the audio data is short, the classification model's accuracy on the audio data is relatively low; the hum and singing types are therefore not subdivided in this case.
In practice, for audio with a short effective duration, determining the classification result from the probability values computed by the audio classification model alone yields relatively low accuracy. Therefore, to improve the classification accuracy for short audio, before determining the classification result of the audio data, the posterior probability value of humming audio for the audio data is computed first; that is, the sum of the probability value of the singing-audio classification submodel and the probability value of the hum-audio classification submodel may be used as the humming-audio posterior probability value of the audio data.
In a preferred embodiment of the present invention, if the humming-audio posterior probability value is greater than a preset humming-audio threshold, the audio data is determined to be humming audio; if not, the audio data is determined to be recorded audio or voice song-search audio, and specifically, the classification result of the submodel with the larger probability value may be selected as the classification result of the audio data. The humming-audio threshold may lie between 0.2 and 0.35; the embodiment of the present invention may use the value 0.35.
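Sub-steps S21 and S22 can be combined into a single decision function; this is a sketch using the thresholds quoted above (2 seconds and 0.35), with the submodel names invented for illustration:

```python
def classify(scores, effective_duration, duration_threshold=2.0,
             humming_threshold=0.35):
    """scores maps the six submodel names to their probability values."""
    if effective_duration > duration_threshold:
        # Long enough: take the submodel with the largest probability.
        return max(scores, key=scores.get)
    # Short audio: do not subdivide singing vs. hum; the humming posterior
    # is the sum of the singing and hum submodel probabilities.
    humming_posterior = scores["singing"] + scores["hum"]
    if humming_posterior > humming_threshold:
        return "humming"
    # Otherwise choose between recorded audio and voice song-search audio.
    return max(["recorded", "voice_song_search"], key=lambda k: scores[k])

scores = {"recorded": 0.10, "voice_song_search": 0.20, "singing": 0.25,
          "hum": 0.20, "noise": 0.15, "silence": 0.10}
assert classify(scores, effective_duration=1.5) == "humming"  # 0.45 > 0.35
assert classify(scores, effective_duration=3.0) == "singing"  # plain argmax
```

Note that for the short-audio branch, noise and silence are assumed to have already been ruled out, consistent with the remark above that those two types can be identified regardless of duration.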
In summary, in the embodiment of the present invention, audio data is input into the audio classification model and judged according to the probability values computed by the classification submodels. Noise audio and silence audio are added to the audio classification model as classification submodels, so that when the audio data is noise or silence it can be recognised and classified by the noise-audio and silence-audio classification submodels, preventing noise and silence from degrading the classification accuracy of the other types and improving the overall classification accuracy of the audio data. Meanwhile, for audio data with a short effective duration, classification is based not only on the probability values computed by the classification submodels but also on the humming-audio posterior probability value of the audio data, further improving the efficiency of classification.
Method Embodiment Five
Building on Embodiment One above, this embodiment describes the full process from training the classification model to using it to classify audio data. Referring to Fig. 6, a flow chart of the steps of Embodiment Five of an audio classification method of the present invention is shown, which may specifically include:
Step 601: collect training data; the training data may specifically include recorded audio, voice song-search audio, singing audio, hum audio, noise audio, and silence audio.
Step 602: extract audio features from the training data; the audio features include MFCC features and fundamental-frequency features.
Step 603: train a classification model based on a deep neural network, using the extracted audio features.
It should be noted that the classification model may specifically include six submodels, each corresponding to one audio type: a recorded-audio submodel, a voice song-search audio submodel, a singing-audio submodel, a hum-audio submodel, a noise-audio submodel, and a silence-audio submodel.
At this point training of the classification model is complete; when audio data to be classified arrives, the classification model can be used to classify it.
Step 604: extract audio features from the audio data; the audio features include MFCC features and fundamental-frequency features.
Specifically, when classifying the audio data to be classified, the audio features corresponding to the audio data must first be extracted.
Step 605: input the audio features into the audio classification model.
Step 606: calculate the probability value of the audio features according to each classification submodel.
In practice, the user's usage habits and previous classification results may also be taken into account. For audio types that the user frequently invokes by voice command, or that appear often in the history of classification results, the corresponding likelihood probability value can be increased appropriately, raising the probability that the audio type is chosen as the classification result of the audio data to be classified. Conversely, for audio types never used in the user's voice commands, or absent or rare in the history of classification results, the corresponding likelihood probability value can be decreased appropriately, lowering that probability.
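One way to realise this history-based adjustment is a multiplicative reweighting of the likelihoods followed by renormalisation; the multiplicative form and the `strength` parameter are assumptions for this sketch, since the text only says that frequent types are boosted and unused types reduced:

```python
def adjust_by_history(scores, history_counts, strength=0.1):
    """Scale each submodel's likelihood by how often that audio type has
    appeared in the user's past classification results, then renormalise.
    strength controls how much the history can shift the result."""
    total = sum(history_counts.values()) or 1
    adjusted = {k: v * (1.0 + strength * history_counts.get(k, 0) / total)
                for k, v in scores.items()}
    norm = sum(adjusted.values())
    return {k: v / norm for k, v in adjusted.items()}

scores = {"recorded": 0.3, "voice_song_search": 0.3, "humming": 0.4}
history = {"voice_song_search": 8, "humming": 2}
adjusted = adjust_by_history(scores, history)
# A tie between recorded and voice song-search is broken by the history.
assert adjusted["voice_song_search"] > adjusted["recorded"]
```

Keeping `strength` small ensures the history only nudges borderline cases rather than overriding the model's own probabilities.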
Step 607: output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
In summary, before classifying audio data to be classified, the embodiment of the present invention first trains the DNN-based audio classification model; then, according to the audio features corresponding to the audio data to be classified, it classifies the audio data with the trained model to obtain the corresponding classification result. At the same time, the likelihood probability values generated by the classification model may be corrected according to the user's usage habits and the history of classification results, improving the accuracy of the classification result.
Device Embodiment
Referring to Fig. 7, a structural block diagram of an embodiment of an audio classification device of the present invention is shown, which may specifically include:
a training module 710, configured to train an audio classification model based on a deep neural network according to collected training data;
a first extraction module 720, configured to extract audio features from audio data; and
an output module 730, configured to input the audio features into the audio classification model and output the classification result of the audio data; the classification result includes recorded audio, voice song-search audio, and humming audio.
In an optional embodiment of the present invention, the audio classification model includes at least one classification submodel, each classification submodel corresponding to a classification result; the output module 730 may specifically include:
an input submodule, configured to input the audio features into the audio classification model;
a calculation submodule, configured to calculate the probability value of the audio features according to each classification submodel;
an output submodule, configured to output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
In another optional embodiment of the present invention,
the output submodule is specifically configured to: when the effective duration of the audio data is greater than a preset effective-duration threshold, output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data;
and, when the effective duration of the audio data is less than or equal to the preset effective-duration threshold, perform the following:
if the humming-audio posterior probability value is greater than a preset humming-audio threshold, determine that the audio data is humming audio; if not, determine that the audio data is recorded audio or voice song-search audio. In a further optional embodiment of the present invention, the training module 710 may specifically include:
a collection submodule, configured to collect training data, the training data including recorded audio, voice song-search audio, and humming audio;
a first extraction submodule, configured to extract audio features from the training data;
a training submodule, configured to train an audio classification model based on a deep neural network according to the extracted audio features.
In another optional embodiment of the present invention, the training data further includes noise audio and silence audio.
In another optional embodiment of the present invention, the audio features include MFCC features and fundamental-frequency features;
the first extraction module is specifically configured to: extract the MFCC features and fundamental-frequency features corresponding to the audio data/training data; perform first-order and second-order difference calculations on the MFCC features to obtain multidimensional MFCC features; perform first-order and second-order difference calculations on the fundamental-frequency features to obtain multidimensional fundamental-frequency features; and determine the audio features corresponding to the audio data/training data from the multidimensional MFCC features and the multidimensional fundamental-frequency features.
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiments.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the part of the above technical solution that in essence contributes over the prior art can be embodied in the form of a software product; this computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, or optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the method described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (12)
1. An audio classification method, characterised in that the method comprises:
training an audio classification model based on a deep neural network according to collected training data;
extracting audio features from audio data;
inputting the audio features into the audio classification model, and outputting the classification result of the audio data; the classification result comprising: recorded audio, voice song-search audio, and humming audio.
2. The method according to claim 1, characterised in that the audio classification model comprises at least one classification submodel, each classification submodel corresponding to a classification result; the step of inputting the audio features into the audio classification model and outputting the classification result of the audio data comprises:
inputting the audio features into the audio classification model;
calculating the probability value of the audio features according to each classification submodel;
outputting the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
3. The method according to claim 2, characterised in that the step of outputting the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data comprises:
when the effective duration of the audio data is greater than a preset effective-duration threshold, outputting the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data;
when the effective duration of the audio data is less than or equal to the preset effective-duration threshold, performing the following steps:
if the humming-audio posterior probability value is greater than a preset humming-audio threshold, determining that the audio data is humming audio; if not, determining that the audio data is recorded audio or voice song-search audio.
4. The method according to claim 1, characterised in that the step of training an audio classification model based on a deep neural network according to collected training data comprises:
collecting training data, the training data comprising: recorded audio, voice song-search audio, and humming audio;
extracting audio features from the training data;
training an audio classification model based on a deep neural network according to the extracted audio features.
5. The method according to claim 4, characterised in that the training data further comprises: noise audio and silence audio.
6. The method according to any one of claims 1 to 5, characterised in that the audio features comprise: mel-frequency cepstral coefficient features and fundamental-frequency features, the audio features being extracted as follows:
extracting the mel-frequency cepstral coefficient features and fundamental-frequency features corresponding to the audio data/training data;
performing first-order and second-order difference calculations on the mel-frequency cepstral coefficient features to obtain multidimensional mel-frequency cepstral coefficient features;
performing first-order and second-order difference calculations on the fundamental-frequency features to obtain multidimensional fundamental-frequency features;
determining the audio features corresponding to the audio data/training data according to the multidimensional mel-frequency cepstral coefficient features and the multidimensional fundamental-frequency features.
7. An audio classification device, characterised by comprising:
a training module, configured to train an audio classification model based on a deep neural network according to collected training data;
a first extraction module, configured to extract audio features from audio data;
an output module, configured to input the audio features into the audio classification model and output the classification result of the audio data; the classification result comprising: recorded audio, voice song-search audio, and humming audio.
8. The device according to claim 7, characterised in that the audio classification model comprises at least one classification submodel, each classification submodel corresponding to a classification result; the output module comprises:
an input submodule, configured to input the audio features into the audio classification model;
a calculation submodule, configured to calculate the probability value of the audio features according to each classification submodel;
an output submodule, configured to output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data.
9. The device according to claim 8, characterised in that
the output submodule is specifically configured to: when the effective duration of the audio data is greater than a preset effective-duration threshold, output the classification result corresponding to the classification submodel with the largest probability value as the classification result of the audio data;
and, when the effective duration of the audio data is less than or equal to the preset effective-duration threshold, perform the following:
if the humming-audio posterior probability value is greater than a preset humming-audio threshold, determine that the audio data is humming audio; if not, determine that the audio data is recorded audio or voice song-search audio.
10. The device according to claim 7, wherein the training module comprises:
a collection sub-module, configured to collect training data, wherein the training data comprises: recorded audio, voice song-search audio, and humming audio;
a first extraction sub-module, configured to extract audio features from the training data; and
a training sub-module, configured to train and obtain the audio classification model based on a deep neural network according to the extracted audio features.
11. The device according to claim 10, wherein the training data further comprises: noise audio and silent audio.
12. The device according to any one of claims 7 to 11, wherein the audio features comprise: Mel-frequency cepstral coefficient (MFCC) features and fundamental frequency features; and
the first extraction module is specifically configured to: extract the MFCC features and fundamental frequency features corresponding to the audio data or training data; perform first-order and second-order difference calculations on the MFCC features to obtain multi-dimensional MFCC features; perform first-order and second-order difference calculations on the fundamental frequency features to obtain multi-dimensional fundamental frequency features; and determine the audio features corresponding to the audio data or training data according to the multi-dimensional MFCC features and the multi-dimensional fundamental frequency features.
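The feature construction in claim 12 — base features augmented with their first- and second-order differences — can be sketched as follows. The frame count, the use of 13 cepstral coefficients, and the simple padded `np.diff` in place of a windowed delta filter are all illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def deltas(feat):
    """First-order difference along the time axis (frames x dims),
    padded with the first frame so output length matches input."""
    return np.diff(feat, axis=0, prepend=feat[:1])

def stack_features(mfcc, f0):
    """Per claim 12: augment the MFCC and fundamental-frequency (F0)
    features with their first- and second-order differences, then
    concatenate everything per frame into one multi-dimensional vector.
    `mfcc` has shape (frames, n_mfcc); `f0` has shape (frames, 1)."""
    d1_mfcc = deltas(mfcc)
    d2_mfcc = deltas(d1_mfcc)   # second-order = difference of the first-order
    d1_f0 = deltas(f0)
    d2_f0 = deltas(d1_f0)
    return np.hstack([mfcc, d1_mfcc, d2_mfcc, f0, d1_f0, d2_f0])

mfcc = np.random.randn(100, 13)   # e.g. 13 cepstral coefficients per frame
f0 = np.random.randn(100, 1)      # one pitch value per frame
features = stack_features(mfcc, f0)
print(features.shape)  # (100, 42): 13*3 MFCC dims + 1*3 pitch dims
```

These per-frame vectors would then serve as input to the deep-neural-network classifier of claims 7 and 10; the differences add local temporal dynamics that static MFCC and pitch values alone do not capture.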
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610279778.XA CN105788592A (en) | 2016-04-28 | 2016-04-28 | Audio classification method and apparatus thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105788592A (en) | 2016-07-20 |
Family
ID=56400091
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610279778.XA Pending CN105788592A (en) | 2016-04-28 | 2016-04-28 | Audio classification method and apparatus thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105788592A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101546556A (en) * | 2008-03-28 | 2009-09-30 | 展讯通信(上海)有限公司 | Classification system for identifying audio content |
CN104239372A (en) * | 2013-06-24 | 2014-12-24 | 浙江大华技术股份有限公司 | Method and device for audio data classification |
CN103646649A (en) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | High-efficiency voice detecting method |
CN104978962A (en) * | 2014-04-14 | 2015-10-14 | 安徽科大讯飞信息科技股份有限公司 | Query by humming method and system |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106504768A (en) * | 2016-10-21 | 2017-03-15 | 百度在线网络技术(北京)有限公司 | Phone testing audio frequency classification method and device based on artificial intelligence |
CN106504768B (en) * | 2016-10-21 | 2019-05-03 | 百度在线网络技术(北京)有限公司 | Phone testing audio frequency classification method and device based on artificial intelligence |
CN106531153B (en) * | 2016-10-27 | 2019-11-05 | 天津大学 | The opera classification method extracted based on aria and the spoken parts of a Chinese opera |
CN106531153A (en) * | 2016-10-27 | 2017-03-22 | 天津大学 | Chinese opera classification method based on extraction of singing segments and spoken parts |
CN107318036A (en) * | 2017-06-01 | 2017-11-03 | 腾讯音乐娱乐(深圳)有限公司 | Song search method, intelligent television and storage medium |
CN109147771A (en) * | 2017-06-28 | 2019-01-04 | 广州视源电子科技股份有限公司 | Audio frequency splitting method and system |
CN109147771B (en) * | 2017-06-28 | 2021-07-06 | 广州视源电子科技股份有限公司 | Audio segmentation method and system |
CN107527620A (en) * | 2017-07-25 | 2017-12-29 | 平安科技(深圳)有限公司 | Electronic installation, the method for authentication and computer-readable recording medium |
CN107527620B (en) * | 2017-07-25 | 2019-03-26 | 平安科技(深圳)有限公司 | Electronic device, the method for authentication and computer readable storage medium |
CN108346428B (en) * | 2017-09-13 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Voice activity detection and model building method, device, equipment and storage medium thereof |
CN108346428A (en) * | 2017-09-13 | 2018-07-31 | 腾讯科技(深圳)有限公司 | Voice activity detection and its method for establishing model, device, equipment and storage medium |
US11393492B2 (en) | 2017-09-13 | 2022-07-19 | Tencent Technology (Shenzhen) Company Ltd | Voice activity detection method, method for establishing voice activity detection model, computer device, and storage medium |
CN107679196A (en) * | 2017-10-10 | 2018-02-09 | 中国移动通信集团公司 | A kind of multimedia recognition methods, electronic equipment and storage medium |
WO2019109787A1 (en) * | 2017-12-05 | 2019-06-13 | 腾讯科技(深圳)有限公司 | Audio classification method and apparatus, intelligent device, and storage medium |
WO2019137392A1 (en) * | 2018-01-10 | 2019-07-18 | 腾讯科技(深圳)有限公司 | File classification processing method and apparatus, terminal, server, and storage medium |
CN108538311A (en) * | 2018-04-13 | 2018-09-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio frequency classification method, device and computer readable storage medium |
CN112400325A (en) * | 2018-06-22 | 2021-02-23 | 巴博乐实验室有限责任公司 | Data-driven audio enhancement |
WO2020024396A1 (en) * | 2018-08-02 | 2020-02-06 | 平安科技(深圳)有限公司 | Music style recognition method and apparatus, computer device, and storage medium |
CN109166593A (en) * | 2018-08-17 | 2019-01-08 | 腾讯音乐娱乐科技(深圳)有限公司 | audio data processing method, device and storage medium |
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN111261190A (en) * | 2018-12-03 | 2020-06-09 | 北京嘀嘀无限科技发展有限公司 | Method, system, computer device and storage medium for recognizing sound |
WO2020177380A1 (en) * | 2019-03-06 | 2020-09-10 | 平安科技(深圳)有限公司 | Voiceprint detection method, apparatus and device based on short text, and storage medium |
CN111128229A (en) * | 2019-08-05 | 2020-05-08 | 上海海事大学 | Voice classification method and device and computer storage medium |
CN111128140A (en) * | 2019-12-30 | 2020-05-08 | 云知声智能科技股份有限公司 | Interruption method and device for voice broadcast |
CN111128140B (en) * | 2019-12-30 | 2022-08-26 | 云知声智能科技股份有限公司 | Interruption method and device for voice broadcast |
CN111613246A (en) * | 2020-05-28 | 2020-09-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio classification prompting method and related equipment |
CN111986655A (en) * | 2020-08-18 | 2020-11-24 | 北京字节跳动网络技术有限公司 | Audio content identification method, device, equipment and computer readable medium |
CN112102846A (en) * | 2020-09-04 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN113933658A (en) * | 2021-08-27 | 2022-01-14 | 国网湖南省电力有限公司 | Dry-type transformer discharge detection method and system based on audible sound analysis |
CN113933658B (en) * | 2021-08-27 | 2023-08-29 | 国网湖南省电力有限公司 | Dry-type transformer discharge detection method and system based on audible sound analysis |
CN115334349A (en) * | 2022-07-15 | 2022-11-11 | 北京达佳互联信息技术有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN115334349B (en) * | 2022-07-15 | 2024-01-02 | 北京达佳互联信息技术有限公司 | Audio processing method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105788592A (en) | Audio classification method and apparatus thereof | |
CN108305616B (en) | Audio scene recognition method and device based on long-time and short-time feature extraction | |
CN105023573B (en) | It is detected using speech syllable/vowel/phone boundary of auditory attention clue | |
CN109147807B (en) | Voice domain balancing method, device and system based on deep learning | |
CN109584904B (en) | Video-song audio-song name recognition modeling method applied to basic music video-song education | |
CN103985381A (en) | Voice frequency indexing method based on parameter fusion optimized decision | |
CN110728991B (en) | Improved recording equipment identification algorithm | |
CN108198561A (en) | A kind of pirate recordings speech detection method based on convolutional neural networks | |
CN106295717A (en) | A kind of western musical instrument sorting technique based on rarefaction representation and machine learning | |
CN106302987A (en) | A kind of audio frequency recommends method and apparatus | |
Ntalampiras | A novel holistic modeling approach for generalized sound recognition | |
CN112786057B (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
Fan et al. | Deep neural network based environment sound classification and its implementation on hearing aid app | |
Vivek et al. | Acoustic scene classification in hearing aid using deep learning | |
CN111341319A (en) | Audio scene recognition method and system based on local texture features | |
CN109243429A (en) | A kind of pronunciation modeling method and device | |
CN113823323B (en) | Audio processing method and device based on convolutional neural network and related equipment | |
Valero et al. | Narrow-band autocorrelation function features for the automatic recognition of acoustic environments | |
CN113571095A (en) | Speech emotion recognition method and system based on nested deep neural network | |
CN111402919B (en) | Method for identifying style of playing cavity based on multi-scale and multi-view | |
CN110619886B (en) | End-to-end voice enhancement method for low-resource Tujia language | |
CN105895079A (en) | Voice data processing method and device | |
CN111833842A (en) | Synthetic sound template discovery method, device and equipment | |
CN114302301B (en) | Frequency response correction method and related product | |
CN117312548A (en) | Multi-source heterogeneous disaster situation data fusion understanding method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2016-07-20