CN107481718B - Audio recognition method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN107481718B CN107481718B CN201710854125.4A CN201710854125A CN107481718B CN 107481718 B CN107481718 B CN 107481718B CN 201710854125 A CN201710854125 A CN 201710854125A CN 107481718 B CN107481718 B CN 107481718B
- Authority
- CN
- China
- Prior art keywords
- voice data
- speech
- voice
- vector sequence
- characteristic vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
- Telephonic Communication Services (AREA)
Abstract
Embodiments of the present application disclose an audio recognition method, an apparatus, a storage medium, and an electronic device. The method includes: obtaining first voice data; inputting the first voice data into a pre-built screening model for screening, and obtaining the speech segment output by the screening model from which a set phonetic feature has been filtered out; and recognizing the speech segment to obtain the corresponding text. This technical solution effectively reduces the amount of computation in the speech recognition process and improves recognition speed.
Description
Technical field
Embodiments of the present application relate to speech recognition technology, and in particular to an audio recognition method, an apparatus, a storage medium, and an electronic device.
Background technique
With the rapid development of the science and technology applied to electronic devices, electronic devices now have powerful processing capabilities and have become indispensable tools in people's daily life, entertainment, and work.
Taking a smartphone as an example: in scenarios where it is inconvenient to operate the phone through its touch screen, such as while driving a vehicle or carrying luggage, the user can still operate the phone conveniently, because most current smartphones are equipped with a voice assistant. The voice assistant converts the voice data input by the user into text. However, when current speech recognition schemes perform speech recognition, they suffer from the defects of heavy computation and slow recognition speed.
Summary of the invention
Embodiments of the present application provide an audio recognition method, an apparatus, a storage medium, and an electronic device, which can reduce the amount of computation in the speech recognition process and improve recognition speed.
In a first aspect, an embodiment of the present application provides an audio recognition method, comprising:
obtaining first voice data;
inputting the first voice data into a pre-built screening model for screening, and obtaining the speech segment output by the screening model from which a set phonetic feature has been filtered out, wherein the screening model is trained on voice data samples to which phonetic features without actual meaning have been added; and
recognizing the speech segment to obtain the corresponding text.
In a second aspect, an embodiment of the present application further provides a speech recognition apparatus, comprising:
a voice obtaining module, configured to obtain first voice data;
a voice screening module, configured to input the first voice data into a pre-built screening model for screening and to obtain the speech segment output by the screening model from which a set phonetic feature has been filtered out, wherein the screening model is trained on voice data samples to which phonetic features without actual meaning have been added; and
a speech recognition module, configured to recognize the speech segment and obtain the corresponding text.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the audio recognition method described in the embodiments of the present application is implemented.
In a fourth aspect, an embodiment of the present application further provides an electronic device, comprising a voice collector for acquiring first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the audio recognition method described in the embodiments of the present application is implemented.
The present application provides a speech recognition scheme: obtain first voice data; input the first voice data into a pre-built screening model for screening, and obtain the speech segment output by the screening model from which a set phonetic feature has been filtered out; and recognize the speech segment to obtain the corresponding text. In this technical solution, the acquired first voice data is fed into the screening model before speech recognition. Because the training samples of the screening model are voice data samples to which phonetic features without actual meaning have been added, passing the first voice data through the screening model filters out the meaningless phonemes contained in the first voice data and yields a speech segment that no longer contains them. Consequently, the data volume of the speech segment output by the screening model is smaller than that of the first voice data. Recognizing this reduced speech segment effectively lowers the amount of computation in the speech recognition process and improves recognition speed.
Detailed description of the invention
Fig. 1 is a flowchart of an audio recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the basic structure of a single neuron provided by an embodiment of the present application;
Fig. 3 is a flowchart of another audio recognition method provided by an embodiment of the present application;
Fig. 4 is a flowchart of yet another audio recognition method provided by an embodiment of the present application;
Fig. 5 is a structural block diagram of a speech recognition apparatus provided by an embodiment of the present application;
Fig. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the application rather than the entire structure.
Before the exemplary embodiments are discussed in detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as a sequence, many of the steps can be implemented in parallel, concurrently, or simultaneously, and the order of the steps can be rearranged. A process may be terminated when its operations are completed, and it may also include additional steps not shown in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
In the related art, speech recognition generally comprises endpoint detection, feature extraction, and matching. To locate the exact start and end of speech, a dual-threshold comparison algorithm is usually adopted: the voice data is examined with both the short-time zero-crossing rate and the short-time average energy, and the two measurements are combined to determine the endpoints (start time and end time) of the voice signal. The essence of feature extraction is to convert the voice data from an analog signal into a digital one and to represent it by a series of characteristic parameters that reflect its features. Because Mel-frequency cepstral coefficients (MFCC) are derived from an auditory model of the human ear, they approximate human hearing characteristics well and can markedly improve recognition performance; the feature-extraction process is therefore illustrated here with MFCC extraction. MFCC extraction comprises the following steps: using a preset window function, frame the audio signal with a fixed frame length and frame shift (for example, a 25 ms frame length and a 10 ms frame shift); convert each time-domain frame into a power spectrum via the fast Fourier transform (FFT); pass the spectrum through a bank of Mel filters to obtain the Mel spectrum; and perform cepstral analysis on the Mel spectrum (taking the logarithm followed by the discrete cosine transform) to obtain the MFCC parameters. The MFCC parameters of each voice frame serve as that frame's speech feature vector sequence. The feature vector sequence of each frame is then input into a hidden Markov model, which outputs the state matched to the frame (the probabilities of the frame matching each state are compared, and the state with the highest probability is taken as the match). Every three states are grouped in sequence into a phoneme, and the pronunciation of a word is determined from the phonemes, thereby realizing speech recognition. However, this scheme cannot distinguish phonemes with actual meaning from meaningless ones (such as the filler words "this", "that", "what can I say", and "that is" in a user's speaking habits), so the amount of computation in the speech recognition process is large and recognition is slow.
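The MFCC pipeline described above (framing, FFT power spectrum, Mel filterbank, logarithm, DCT) can be sketched as follows. This is an illustrative simplification, not the patent's implementation: the 25 ms frame length and 10 ms frame shift follow the text, while the sample rate, FFT size, filter count, and number of coefficients are hypothetical choices.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Extract MFCC parameters per the related-art pipeline described above."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms frame length
    hop = int(sr * hop_ms / 1000)           # 10 ms frame shift
    # Framing with a preset window function (Hamming here)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT: turn each time-domain frame into a power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Bank of triangular Mel filters, evenly spaced on the Mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = power @ fbank.T
    # Cepstral analysis: take the logarithm, then the discrete cosine transform
    return dct(np.log(mel_spec + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is one frame's MFCC parameters, i.e. the speech feature vector of that voice frame.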
Fig. 1 is a flowchart of an audio recognition method provided by an embodiment of the present application. The method may be executed by a speech recognition apparatus, which can be implemented in software and/or hardware and is generally integrated in an electronic device. As shown in Fig. 1, the method comprises:
Step 110: obtain first voice data.
The first voice data comprises a voice signal input by the user, for example the voice signal input when the user uses the voice input function of a short-message application, a memo application, a mail application, or an instant messaging application.
A voice collector is integrated in the electronic device, and the first voice data can be obtained through it. The voice collector includes a microphone as well as wireless headsets such as Bluetooth headsets and infrared earphones. Illustratively, taking a smartphone as an example, when the user enables the voice input function of the short-message application, the message input mode can be switched from manual typing to voice input: the user gives a voice instruction to the smartphone, and the smartphone converts the corresponding voice signal into text displayed in the short-message interface. The voice signal corresponding to the user's instruction is preprocessed to obtain the first voice data, where the preprocessing includes filtering, analog-to-digital conversion, and the like. It should be noted that, because users tend to bring colloquial expressions into their speech, the first voice data may contain meaningless filler words such as "this", "that", "what can I say", and "that is".
Step 120: input the first voice data into the pre-built screening model for screening, and obtain the speech segment output by the screening model from which the set phonetic feature has been filtered out.
The screening model is trained on voice data samples to which phonetic features without actual meaning have been added. Illustratively, taking a neural network model as the screening model, the training process includes:
Model initialization: set the number of hidden layers; the number of nodes in the input layer, the hidden layers, and the output layer; and the connection weights between layers; initialize the thresholds of the hidden and output layers; and thereby obtain the preliminary framework of the neural network model.
Forward computation: compute the output parameters of the hidden layers and of the output layer according to the formulas of the neural network model, i.e., compute the output of the model from the results of the previous layer, the connection weights between adjacent layers, and the external bias value of each node.
Error calculation: adjust the parameters of the neural network model through supervised learning. The voice data the user input by voice in sent short messages, together with the corresponding text, is obtained as training material. Because the user confirmed sending those messages, they are data that have already been corrected, contain no meaningless words, and conform to the user's statement habits, so they can serve as standard voice data samples. Correspondingly, the desired output for a voice data sample is the speech (or pronunciation) of the corresponding text. Training samples are then obtained by adding phonetic features without actual meaning into the voice data samples. The meaningless phonetic features can be obtained by analyzing the statement habits of a sample population of a set size and taking the meaningless words with the highest probability of occurrence; alternatively, the user can select the meaningless words he or she commonly uses, or a program can count them.
The actual output and the desired output of the neural network model are computed, and the error signal between them is obtained. According to this error signal, the connection weight and external bias value of each neuron in the model are updated. Fig. 2 shows the basic structure of a single neuron provided by an embodiment of the present application: in Fig. 2, ωi1 is the connection weight between neuron i and a neuron of the previous layer, which can also be understood as the weight of input x1, and θi is the external bias of the neuron. According to the prediction error of the network, the error is propagated backwards and the connection weight and external bias value of each neuron are modified. Whether the algorithm iteration has finished is then judged; if so, the construction of the screening model is complete.
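As a rough illustration of the weighted-sum-plus-bias computation of Fig. 2 and the error-driven update of connection weights and external bias described above, the following minimal sketch trains a single sigmoid neuron by gradient descent. The learning rate, activation function, and toy data are hypothetical; the patent's actual screening model is a multi-layer recurrent network, which this does not reproduce.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron i: output = f(sum_j w_j * x_j + theta), as in Fig. 2
rng = np.random.default_rng(0)
w = rng.normal(size=2)           # connection weights (omega)
theta = 0.0                      # external bias of the neuron
x = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 1.])   # desired output (here: logical OR)

for _ in range(5000):
    y = sigmoid(x @ w + theta)   # actual output
    err = d - y                  # error signal: desired minus actual
    grad = err * y * (1 - y)     # propagated back through the sigmoid
    w += 0.5 * x.T @ grad        # modify connection weights
    theta += 0.5 * grad.sum()    # modify external bias value

print(np.round(sigmoid(x @ w + theta)))  # -> [0. 1. 1. 1.]
```

The update direction follows the error signal: a positive error raises the weights feeding active inputs, a negative error lowers them, which is the same feedback principle the multi-layer model applies per neuron.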
The first voice data is input into the built screening model. For the paths corresponding to pronunciations without actual meaning in the first voice data, the connection weights are small; as the input parameters are passed between the hidden layers, or from the hidden layers to the output layer, multiplication by these small connection weights shrinks the input parameters, and after repeated computation the phonetic features (e.g., phonemes) without actual meaning in the first voice data are filtered out. The output of the screening model is the speech segment from which the meaningless phonetic features have been filtered.
Step 130: recognize the speech segment to obtain the corresponding text.
The speech segment is compared with preset reference templates by distance: for each voice frame in the segment, the pronunciation in the reference template with the shortest distance is taken as the pronunciation of that frame, and the combination of the pronunciations of all frames is the speech of the segment. Once the speech of the segment is known, a preset dictionary can be queried to determine the corresponding text.
In the technical solution of this embodiment, the acquired first voice data is input into the screening model before speech recognition. Because the training samples of the screening model are voice data samples to which phonetic features without actual meaning have been added, passing the first voice data through the screening model filters out the meaningless phonemes it contains and yields a speech segment without them. The data volume of the speech segment output by the screening model is therefore smaller than that of the first voice data, and recognizing this reduced segment effectively lowers the amount of computation in the speech recognition process and improves recognition speed.
Fig. 3 is a flowchart of another audio recognition method provided by an embodiment of the present application. As shown in Fig. 3, the method comprises:
Step 301: obtain first voice data.
Step 302: judge whether the user corresponding to the first voice data is a registered user; if so, execute step 303; otherwise, execute step 306.
When the first voice data is detected, the camera of the electronic device is turned on and at least one frame of a user image is captured. Through image processing and image recognition of the user image, it is determined whether the user who input the first voice data is a registered user. This can be done by image matching: illustratively, a user image is obtained at registration time and used as a matching template; when the first voice data is detected, a user image is captured and matched against the template, thereby determining whether the user corresponding to the first voice data is registered.
Step 303: obtain the history voice data of at least one registered user, and determine each registered user's speech rate and pause interval from the history voice data.
When the user corresponding to the first voice data is a registered user, that user's history voice data is obtained. The history voice data includes the user's history call data, history voice control data, history voice messages, and the like. By analyzing the history voice data, the average speech rate and average pause interval of each registered user can be determined, where both averages are obtained by weighted calculation. Each registered user's speech rate and pause interval under different scenes can also be determined separately.
Step 304: query a preset framing strategy set according to the speech rate and pause habit, and determine the framing strategy corresponding to the registered user.
A framing strategy comprises the choice of window function and the values of the frame length and frame shift, and framing strategies are associated with the speech habits of different users. The framing strategy set is a collection of framing strategies that stores the correspondence between speech-rate intervals and pause-interval intervals on one side and window function, frame length, and frame shift on the other. Using the speech rate and pause interval determined in the previous step, the stored intervals are queried, the interval containing those values is located, and the window function, frame length, and frame shift of that interval are taken as the framing strategy for the current voice data input by the registered user.
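The framing strategy set of steps 303-304 can be pictured as a lookup table keyed by speech-rate and pause-interval ranges. The concrete ranges, window names, and millisecond values below are purely hypothetical illustrations; the patent does not specify them.

```python
# Hypothetical framing-strategy set: (rate range in words/sec, pause range in ms)
# -> (window function, frame length in ms, frame shift in ms)
FRAMING_STRATEGIES = [
    ((0.0, 2.5), (300, 10_000), ("hamming", 30, 12)),  # slow speaker, long pauses
    ((2.5, 4.0), (150, 300),    ("hamming", 25, 10)),  # average speaker
    ((4.0, 99.0), (0, 150),     ("hann",    20, 8)),   # fast speaker, short pauses
]

def framing_strategy(rate, pause_ms, default=("hamming", 25, 10)):
    """Locate the interval containing the user's rate and pause, return its strategy."""
    for (r_lo, r_hi), (p_lo, p_hi), strategy in FRAMING_STRATEGIES:
        if r_lo <= rate < r_hi and p_lo <= pause_ms < p_hi:
            return strategy
    # Unregistered user, or no matching interval: fall back to the default strategy
    return default

print(framing_strategy(3.1, 200))  # -> ('hamming', 25, 10)
```

The fallback branch corresponds to step 306, where a default framing strategy is applied when no personalized strategy is available.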
Step 305: perform framing on the first voice data according to the registered user's framing strategy to obtain at least two pieces of second voice data, and then execute step 307.
Because voice data is stationary only over a sufficiently short time, it needs to be divided into short time segments, i.e., voice frames. Illustratively, the first voice data is processed with the window function of the framing strategy determined above, using the frame shift included in that strategy, to obtain at least two pieces of second voice data; the window length of the window function equals the frame length of the framing strategy. After at least two pieces of second voice data are obtained, step 307 is executed. The division of the first voice data thus depends on the registered user's speech rate and pause interval: the frame length of the second voice data changes with them instead of being fixed, which reduces the cases in which speech with actual meaning and speech without actual meaning are divided into the same voice frame and helps improve the efficiency of speech recognition.
Step 306: perform framing on the first voice data according to a default framing strategy to obtain at least two pieces of second voice data.
When the user corresponding to the first voice data is not a registered user, the first voice data is processed with the default window function and the default frame shift to obtain at least two pieces of second voice data, where the window length of the window function is the default frame length. With this fixed, unchanging frame length, speech with actual meaning and speech without actual meaning are more often divided into the same voice frame.
Step 307: extract the first speech feature vector sequence corresponding to the second voice data.
The first speech feature vector sequence comprises MFCC features, which are extracted from the second voice data as follows: filter the spectrogram of the second voice data through a bank of Mel filters to obtain the Mel spectrum; perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients; and use these coefficients as the feature vector input to the screening model, i.e., the first speech feature vector sequence.
Step 308: normalize the first speech feature vector sequence and input it into the pre-built recurrent neural network model for screening.
Optionally, before the first speech feature vector sequence is input into the pre-built recurrent neural network model, it can be normalized; it should be understood that this normalization step is not mandatory. Normalization maps every element of the first speech feature vector sequences to a number in [0, 1] or [-1, 1], which eliminates the influence of the unit and range differences of the input data on speech recognition and reduces recognition error. After normalization, the first speech feature vector sequence is input into the pre-built neural network model for screening, where the neural network model is a recurrent neural network model.
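The optional normalization in step 308, mapping every element of the feature vector sequence into [0, 1] or [-1, 1] to remove unit and range differences, is a standard min-max scaling. A minimal sketch, with the target interval as a parameter:

```python
import numpy as np

def minmax_normalize(features, lo=0.0, hi=1.0):
    """Map a feature-vector sequence elementwise into [lo, hi] (min-max scaling)."""
    fmin, fmax = features.min(), features.max()
    if fmax == fmin:                 # constant input: avoid division by zero
        return np.full_like(features, lo)
    scaled = (features - fmin) / (fmax - fmin)   # now in [0, 1]
    return lo + scaled * (hi - lo)               # rescale to [lo, hi]

seq = np.array([[12.0, -3.0], [7.5, 0.0]])
print(minmax_normalize(seq))             # values lie in [0, 1]
print(minmax_normalize(seq, -1.0, 1.0))  # values lie in [-1, 1]
```

Scaling here is over the whole sequence; per-dimension scaling (normalizing each feature column separately) is an equally plausible reading of the text.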
Step 309: obtain the output result of the recurrent neural network model, where the output result is the second speech feature vector sequence with the meaningless phonemes filtered out.
A phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme, and phonemes include vowels and consonants. Because the recurrent neural network model was learned and trained on samples to which meaningless phonemes had been added, its output is the speech segment with the meaningless phonemes filtered out. Therefore, after the first speech feature vector sequence is input into the recurrent neural network model, the output is the second speech feature vector sequence from which the meaningless phonemes have been removed.
Step 310: judge whether the length of the second speech feature vector sequence equals the length of the preset reference template; if so, execute step 313; otherwise, execute step 311.
The length of the second speech feature vector sequence is compared with the length of the preset reference template: if the lengths differ, step 311 is executed; if they are equal, step 313 is executed.
Step 311: calculate the frame-matching distance between the second speech feature vector sequence and the reference template using a dynamic time warping algorithm.
Dynamic time warping (DTW) is a method of measuring the similarity between two time sequences; in the field of speech recognition it is mainly used to identify whether two pieces of speech represent the same word. Illustratively, if the length of the second speech feature vector sequence differs from that of the preset reference template, the frame-matching distance matrix between the sequence and the template is computed with the DTW algorithm, and an optimal path, i.e. the path corresponding to the minimum matching distance, is found in that matrix.
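The frame-matching distance of step 311 can be computed with the classic DTW dynamic program. The sketch below is a textbook formulation rather than the patent's specific implementation: it uses Euclidean distances between frames and returns the minimum cumulative matching distance over all warping paths.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame-matching distance between two
    feature-vector sequences of (possibly) different lengths."""
    n, m = len(seq_a), len(seq_b)
    # Frame-matching distance matrix: Euclidean distance between every frame pair
    dist = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Optimal path: extend the cheapest of the three predecessors
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# Compare an input sequence against two reference templates of different lengths
query = np.array([[0.0], [1.0], [2.0], [3.0]])
ref_a = np.array([[0.0], [1.5], [3.0]])
ref_b = np.array([[3.0], [2.0], [0.0]])
print(dtw_distance(query, ref_a) < dtw_distance(query, ref_b))  # -> True
```

The template with the smallest cumulative distance then supplies the pronunciation, as described in step 312.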
Step 312: determine the pronunciation corresponding to the minimum frame-matching distance, and then execute step 314.
The endpoint corresponding to the minimum frame-matching distance identifies the speech in the reference template matched to the second speech feature vector sequence, and the speech of that reference template is taken as the pronunciation of the second speech feature vector sequence.
Step 313: directly match the second speech feature vector sequence against the reference template and determine the pronunciation corresponding to the speech segment.
If the second speech feature vector sequence and the preset reference template have the same length, they are matched directly, thereby determining the pronunciation corresponding to the speech segment.
Step 314: match the corresponding text according to the pronunciation, as the speech recognition result.
In the technical solution of this embodiment, a framing strategy is determined according to the user's speech rate and pause interval before speech recognition, and the first voice data is framed using this personalized strategy. Personalized framing effectively reduces the number of frames in which speech features with physical meaning coexist with speech features without physical meaning. Inputting the first speech feature vector sequence corresponding to the framed second voice data into the screening model can then further improve recognition efficiency.
Fig. 4 is a flowchart of another speech recognition method provided by an embodiment of the present application. As shown in Fig. 4, the method includes:
Step 401: judge whether the model update condition is met; if so, execute step 402; otherwise, execute step 408.
The model update condition may be that the system time reaches a preset time, or that a preset update cycle has elapsed. For example, if the model update condition is set to update the screening model at 12 o'clock every Friday night, then when the system time is detected to be 12 o'clock on Friday night, the model update condition is determined to be currently met. As another example, if the model update condition is set to update once every 7 days, then when the time since the last model update is detected to have reached the update cycle, the model update condition is determined to be currently met.
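The two example conditions above can be sketched as simple time checks. The weekday and hour values and the 7-day cycle are taken from the examples in the text; the function names are invented for illustration.

```python
from datetime import datetime, timedelta

def meets_update_cycle(now, last_update, cycle_days=7):
    """Condition 2 from the text: a preset update cycle has elapsed
    since the last model update."""
    return now - last_update >= timedelta(days=cycle_days)

def is_weekly_slot(now, weekday=4, hour=23):
    """Condition 1 from the text: the system time reaches a preset
    weekly time (here late Friday; weekday 4 is Friday)."""
    return now.weekday() == weekday and now.hour == hour
```

In practice the model update flow would evaluate either condition on each check and branch to step 402 or step 408 accordingly.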
Step 402: obtain sent short messages that were input by voice and/or stored memos that were input by voice.
Sent short messages and stored memos entered through voice input are obtained. Since a short message that the user has confirmed and sent can be regarded as adjusted data that contains no vocabulary without physical meaning and conforms to the user's expression habits, it can serve as a standard voice data sample. A saved memo can likewise be regarded as adjusted data without meaningless vocabulary that conforms to the user's expression habits, and can also serve as a standard voice data sample.
The speech feature vector sequence of the voice data corresponding to the body content of each short message sent through voice input is saved in advance, and the voice data dictated by the user is saved correspondingly as history voice data. For example, when sending a short message through voice input, the voice data dictated by the user may be "about this problem, what can I say, it is really hard to solve", while the short message actually sent after processing is "about this problem, it is really hard to solve". The speech feature vector sequence of the dictated voice data is stored in correspondence with the voice data of the short message actually sent.
Step 403: obtain the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo.
The speech feature vector sequence of the voice data in the body content of the sent short message is obtained. Optionally, the speech feature vector sequence of the voice data in the body content of a stored memo may also be obtained.
Step 404: obtain the history voice data of the short message and/or memo.
The content dictated by the user when composing the sent short message is obtained as history voice data. Optionally, the content dictated by the user when composing a stored memo may also be obtained as history voice data.
Step 405: determine, according to the history voice data, the personalized phonemes without physical meaning and the positions where those phonemes appear.
By analyzing the history voice data, the speech habits of a particular user can be derived, i.e., the phonemes without physical meaning and the positions where they occur. For example, a user may habitually insert the meaningless phrase "what can I say" in the middle of sentences when using voice input.
Step 406: add the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and train the screening model in a supervised learning manner with the speech feature vector sequence as the desired output.
Normalizing the training samples eliminates the influence of differences in the units and ranges of the input data on speech recognition; at the same time, it helps map the input data into the effective range of the activation function, reducing both the network training error and the training time.
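The normalization step above can be sketched with per-dimension min-max scaling, one common choice for mapping features into the effective range of an activation function. This is an illustrative sketch; the patent does not specify which normalization is used.

```python
def min_max_normalize(sequence):
    """Scale each feature dimension of a feature vector sequence into [0, 1],
    so all dimensions share a comparable range before entering the network."""
    dims = len(sequence[0])
    lo = [min(frame[d] for frame in sequence) for d in range(dims)]
    hi = [max(frame[d] for frame in sequence) for d in range(dims)]
    out = []
    for frame in sequence:
        out.append([
            (v - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
            for d, v in enumerate(frame)
        ])
    return out
```

After this transformation every dimension of every frame lies in [0, 1], so no single feature dominates the network's weighted sums purely because of its unit or range.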
Step 407: adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
The network prediction error can be determined by analyzing the training samples and the desired output. The connection weight and external bias value of each neuron are then modified according to the error, which is propagated backward through the neural network model from the output layer to the input layer.
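For a single linear neuron, the error-driven update of connection weights and external bias described above reduces to the delta rule. This is a deliberately minimal sketch of the parameter-adjustment idea, not the recurrent network used by the screening model; the learning rate is an assumption.

```python
def train_step(weights, bias, x, target, lr=0.1):
    """One supervised update: compute the prediction error at the output and
    modify each connection weight and the external bias accordingly."""
    y = sum(w * xi for w, xi in zip(weights, x)) + bias  # forward pass
    err = y - target                                     # network prediction error
    # move each parameter against the gradient of the squared error
    new_w = [w - lr * err * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * err
    return new_w, new_b, err
```

Repeating this step over the training samples drives the prediction toward the desired output; in the full model the same error signal is propagated back through every layer.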
Step 408: obtain the first voice data.
If the first voice data is obtained while the above model update process has not yet finished, the first voice data is not recognized, and the user is prompted that a screening model update is currently in progress.
Step 409: input the first voice data into the pre-constructed screening model for screening, and obtain the speech fragment, output by the screening model, from which set speech features have been filtered out.
If the first voice data is obtained while no model update operation is being executed, the first voice data is input into the screening model, which screens it and outputs the speech fragment from which the speech features without physical meaning have been filtered out.
Step 410: recognize the speech fragment to obtain the corresponding text.
Step 411: judge whether the text is command information; if so, execute step 412; otherwise, execute step 413.
A whitelist storing the associations between character combinations and command information is established in advance. When the text corresponding to the speech fragment is recognized, the whitelist is queried with the character combination of the text. If the corresponding character combination is found in the whitelist, the text corresponding to the speech fragment is determined to represent command information, and step 412 is executed. If the corresponding character combination is not found in the whitelist, the user is prompted to choose whether it is command information. If the user indicates that the text corresponding to the speech fragment represents command information, the character combination the user identified as command information is added to the whitelist, and step 412 is executed. If the user indicates that the text corresponding to the speech fragment does not represent command information, step 413 is executed.
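The whitelist lookup and user-confirmation flow above can be sketched as follows. The command names, handler identifiers, and the `confirm_as_command` parameter (standing in for the user prompt) are invented for illustration.

```python
# Hypothetical whitelist mapping character combinations to command information.
COMMAND_WHITELIST = {
    "open camera": "CAMERA_LAUNCH",
    "call mom": "DIAL_CONTACT",
}

def classify_text(text, confirm_as_command=None):
    """Return ('command', info) if the text is whitelisted; otherwise fall
    back to the user's choice and grow the whitelist on confirmation."""
    if text in COMMAND_WHITELIST:                # step 412 path
        return ("command", COMMAND_WHITELIST[text])
    if confirm_as_command:                       # user chose "this is a command"
        COMMAND_WHITELIST[text] = confirm_as_command
        return ("command", confirm_as_command)
    return ("display", text)                     # step 413: show in the UI
```

Once a combination is confirmed, subsequent lookups hit the whitelist directly without prompting the user again.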
Step 412: execute the operation corresponding to the command information.
Step 413: display the text in the user interface.
In the technical solution of this embodiment, when the update condition of the screening model is met, the sent short messages input by voice and/or the stored memos input by voice are used as training samples to train the screening model. This allows the output of the screening model to adapt to changes in the user's expression habits, effectively reducing the false recognition rate and the missed detection rate.
Fig. 5 is a structural block diagram of a speech recognition device provided by an embodiment of the present application. The device can be implemented in software and/or hardware and is typically integrated in an electronic device. As shown in Fig. 5, the device may include:
Voice obtaining module 510, configured to obtain first voice data.
Voice screening module 520, configured to input the first voice data into a pre-constructed screening model for screening and obtain the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added.
Speech recognition module 530, configured to recognize the speech fragment to obtain the corresponding text.
This embodiment of the present application provides a speech recognition device that inputs the obtained first voice data into the screening model before speech recognition. Since the training samples of the screening model are voice data samples to which speech features without physical meaning have been added, processing the first voice data through the screening model can filter out the phonemes without physical meaning contained in the first voice data, yielding a speech fragment that does not contain such phonemes. The data volume of the speech fragment output by the screening model is therefore smaller than that of the first voice data. Recognizing this reduced speech fragment can effectively reduce the amount of calculation in the speech recognition process and improve the recognition speed.
Optionally, the device further includes:
a user judgment module, configured to judge, when the first voice data is detected, whether the user corresponding to the first voice data is a registered user;
and further includes:
a framing module, configured to determine a corresponding framing strategy according to the judgment result before the first voice data is input into the pre-constructed screening model for screening, and to frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data;
wherein the framing strategy includes the selection of a window function, the value of the frame length and the value of the frame shift, and the framing strategy is associated with the speech habits of different users.
Optionally, the framing module is specifically configured to:
obtain history voice data of at least one registered user, and determine the speech rate and pause interval of each registered user according to the history voice data; and
query a preset framing strategy set according to the speech rate and pause interval to determine the framing strategy corresponding to the registered user.
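The lookup of a framing strategy from a preset set can be sketched as follows. The thresholds on speech rate (words per minute) and pause interval, and the window/frame-length/frame-shift values, are illustrative assumptions, not the patent's preset strategy set.

```python
# Hypothetical preset framing strategy set, keyed on upper bounds for
# speech rate (words/min) and average pause interval (seconds).
FRAMING_STRATEGIES = [
    # (max_rate, max_pause, window, frame_len_ms, frame_shift_ms)
    (120, 0.8, "hamming", 32, 16),   # slow speakers: longer frames
    (180, 0.5, "hamming", 25, 10),   # typical speakers
    (999, 9.9, "hanning", 20, 8),    # fast speakers: shorter frames
]

def pick_framing_strategy(rate_wpm, pause_s):
    """Return the window function, frame length and frame shift matched to a
    registered user's speech rate and pause interval."""
    for max_rate, max_pause, window, flen, fshift in FRAMING_STRATEGIES:
        if rate_wpm <= max_rate and pause_s <= max_pause:
            return {"window": window, "frame_len_ms": flen, "frame_shift_ms": fshift}
    # fall back to the last (fast-speaker) strategy
    _, _, window, flen, fshift = FRAMING_STRATEGIES[-1]
    return {"window": window, "frame_len_ms": flen, "frame_shift_ms": fshift}
```

A slower speaker with long pauses thus gets longer frames and larger frame shifts, matching the idea that the framing strategy is tied to individual speech habits.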
Optionally, the voice screening module 520 is specifically configured to:
extract the first speech feature vector sequence corresponding to the second voice data;
normalize the first speech feature vector sequence and input it into the pre-constructed recurrent neural network model for screening; and
obtain the output result of the recurrent neural network model, wherein the output result is the second speech feature vector sequence from which phonemes without physical meaning have been filtered out.
Optionally, the speech recognition module 530 is specifically configured to:
judge whether the second speech feature vector sequence and the preset reference template are equal in length;
when they are unequal, calculate the frame matching distance between the second speech feature vector sequence and the reference template using the dynamic time warping algorithm; and
determine the pronunciation corresponding to the minimum frame matching distance, and take the text matched to the pronunciation as the speech recognition result.
Optionally, the device further includes:
a text processing module, configured to judge, after the speech fragment is recognized to obtain the corresponding text, whether the text is command information;
if so, execute the operation corresponding to the command information;
if not, display the text in the user interface.
Optionally, the device further includes:
a model update module, configured to, when the model update condition is met, obtain sent short messages input by voice and/or stored memos input by voice;
obtain the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo;
obtain the history voice data of the short message and/or memo;
determine the personalized phonemes without physical meaning and the positions where those phonemes appear according to the history voice data;
add the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and train the screening model in a supervised learning manner with the speech feature vector sequence as the desired output; and
adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
An embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech recognition method comprising:
obtaining first voice data;
inputting the first voice data into a pre-constructed screening model for screening, and obtaining the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
recognizing the speech fragment to obtain the corresponding text.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media, such as CD-ROMs, floppy disks or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (e.g., a hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide the program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the present application, the computer-executable instructions are not limited to the speech recognition operations described above, and may also perform related operations in the speech recognition method provided by any embodiment of the present application.
An embodiment of the present application provides an electronic device in which the speech recognition device provided by the embodiments of the present application can be integrated. Such electronic devices include smart phones, tablet computers, handheld devices, laptops, smart watches, and the like. Fig. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 6, the electronic device may include: a memory 601, a central processing unit (CPU) 602 (also known as a processor, hereinafter referred to as CPU), a voice collector 606 and a touch screen 611. The touch screen 611 is used to convert user operations into electric signals input to the processor and to display visual output signals; the voice collector 606 is used to collect first voice data; the memory 601 is used to store a computer program; and the CPU 602 reads and executes the computer program stored in the memory 601. When executing the computer program, the CPU 602 performs the following steps: obtaining first voice data; inputting the first voice data into a pre-constructed screening model for screening, and obtaining the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and recognizing the speech fragment to obtain the corresponding text.
The electronic device further includes: a peripheral interface 603, an RF (radio frequency) circuit 605, a power management chip 608, an input/output (I/O) subsystem 609, other input/control devices 610 and an external port 604, and these components communicate through one or more communication buses or signal lines 607.
It should be understood that the illustrated electronic device 600 is only one example of an electronic device, and the electronic device 600 may have more or fewer components than shown in the drawings, may combine two or more components, or may be configured with different components. The various components shown in the drawings may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
The electronic device integrated with the speech recognition device provided in this embodiment is described in detail below, taking a mobile phone as an example.
Memory 601: the memory 601 can be accessed by the CPU 602, the peripheral interface 603, etc., and may include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices or other volatile solid-state storage components.
Peripheral interface 603: the peripheral interface 603 can connect the input and output peripherals of the device to the CPU 602 and the memory 601.
I/O subsystem 609: the I/O subsystem 609 can connect the input/output peripherals of the device, such as the touch screen 611 and other input/control devices 610, to the peripheral interface 603. The I/O subsystem 609 may include a display controller 6091 and one or more input controllers 6092 for controlling the other input/control devices 610. The one or more input controllers 6092 receive electric signals from, or send electric signals to, the other input/control devices 610, which may include physical buttons (push buttons, rocker buttons, etc.), dials, slide switches, joysticks and click wheels. It is worth noting that an input controller 6092 can be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.
The display controller 6091 in the I/O subsystem 609 receives electric signals from, or sends electric signals to, the touch screen 611. The touch screen 611 detects contact on the touch screen, and the display controller 6091 converts the detected contact into interaction with the user interface objects displayed on the touch screen 611, thereby realizing human-computer interaction. The user interface objects displayed on the touch screen 611 may be icons of running games, icons for connecting to corresponding networks, and the like. It is worth noting that the device may also include an optical mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.
The RF circuit 605 is mainly used to establish communication between the mobile phone and the wireless network (i.e., the network side) and to realize data reception and transmission between the mobile phone and the wireless network, for example sending and receiving short messages, e-mails, and so on. Specifically, the RF circuit 605 receives and sends RF signals, also called electromagnetic signals: the RF circuit 605 converts electric signals into electromagnetic signals or converts electromagnetic signals into electric signals, and communicates with communication networks and other devices through the electromagnetic signals. The RF circuit 605 may include known circuits for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC (coder-decoder) chipset, a subscriber identity module (SIM), and so on.
Voice collector 606: including a transmitter (microphone) and wireless headsets such as Bluetooth headsets and infrared headsets, it is mainly used to receive audio data and convert the audio data into an electric signal.
Power management chip 608: used for supplying power to, and managing the power of, the hardware connected to the CPU 602, the I/O subsystem and the peripheral interface.
In the electronic device provided by this embodiment of the present application, the obtained first voice data is input into the screening model before speech recognition. Since the training samples of the screening model are voice data samples to which speech features without physical meaning have been added, processing the first voice data through the screening model can filter out the phonemes without physical meaning contained in the first voice data, yielding a speech fragment that does not contain such phonemes. The data volume of the speech fragment output by the screening model is therefore smaller than that of the first voice data. Recognizing this reduced speech fragment can effectively reduce the amount of calculation in the speech recognition process and improve the recognition speed.
The speech recognition device, storage medium and electronic device provided in the above embodiments can execute the speech recognition method provided by any embodiment of the present application, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the speech recognition method provided by any embodiment of the present application.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will appreciate that the application is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the application. Therefore, although the application has been described in further detail through the above embodiments, it is not limited to them and may include many other equivalent embodiments without departing from the concept of the application, and the scope of the application is determined by the scope of the appended claims.
Claims (9)
1. A speech recognition method, comprising:
obtaining first voice data;
when the first voice data is detected, judging whether a user corresponding to the first voice data is a registered user;
determining a corresponding framing strategy according to the judgment result, and framing the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises a selection of a window function, a value of a frame length and a value of a frame shift, and the framing strategy is associated with the speech habits of different users;
extracting a first speech feature vector sequence corresponding to the second voice data, and inputting the first speech feature vector sequence into a pre-constructed screening model for screening to obtain a speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
recognizing the speech fragment to obtain the corresponding text.
2. The method according to claim 1, wherein determining a corresponding framing strategy according to the judgment result comprises:
obtaining history voice data of at least one registered user, and determining the speech rate and pause interval of each registered user according to the history voice data; and
querying a preset framing strategy set according to the speech rate and pause interval to determine the framing strategy corresponding to the registered user.
3. The method according to claim 1, wherein extracting the first speech feature vector sequence corresponding to the second voice data and inputting the first speech feature vector sequence into the pre-constructed screening model for screening comprises:
extracting the first speech feature vector sequence corresponding to the second voice data;
normalizing the first speech feature vector sequence, and inputting it into a pre-constructed recurrent neural network model for screening; and
obtaining an output result of the recurrent neural network model, wherein the output result is a second speech feature vector sequence from which phonemes without physical meaning have been filtered out.
4. The method according to claim 3, wherein recognizing the speech fragment to obtain the corresponding text comprises:
judging whether the second speech feature vector sequence and a preset reference template are equal in length;
when they are unequal, calculating the frame matching distance between the second speech feature vector sequence and the reference template using a dynamic time warping algorithm; and
determining the pronunciation corresponding to the minimum frame matching distance, and taking the text matched to the pronunciation as the speech recognition result.
5. The method according to claim 1, further comprising, after recognizing the speech fragment to obtain the corresponding text:
judging whether the text is command information;
if so, executing the operation corresponding to the command information; and
if not, displaying the text in the user interface.
6. The method according to any one of claims 1 to 5, further comprising:
when a model update condition is met, obtaining sent short messages input by voice and/or stored memos input by voice;
obtaining the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo;
obtaining the history voice data of the short message and/or memo;
determining personalized phonemes without physical meaning and the positions where those phonemes appear according to the history voice data;
adding the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and training the screening model in a supervised learning manner with the speech feature vector sequence as the desired output; and
adjusting the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
7. A speech recognition device, comprising:
a voice obtaining module, configured to obtain first voice data;
a user judgment module, configured to judge, when the first voice data is detected, whether the user corresponding to the first voice data is a registered user;
a framing module, configured to determine a corresponding framing strategy according to the judgment result and frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises a selection of a window function, a value of a frame length and a value of a frame shift, and the framing strategy is associated with the speech habits of different users;
a voice screening module, configured to extract the first speech feature vector sequence corresponding to the second voice data and input the first speech feature vector sequence into a pre-constructed screening model for screening, to obtain a speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
a speech recognition module, configured to recognize the speech fragment to obtain the corresponding text.
8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 6.
9. An electronic device, comprising a voice collector for collecting first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech recognition method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910473083.9A CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN201710854125.4A CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710854125.4A CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Division CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107481718A CN107481718A (en) | 2017-12-15 |
CN107481718B true CN107481718B (en) | 2019-07-05 |
Family
ID=60587053
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Active CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN201710854125.4A Active CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Active CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110310623B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018176387A1 (en) * | 2017-03-31 | 2018-10-04 | 深圳市红昌机电设备有限公司 | Voice control method and system for winding-type coil winder |
CN108717851B (en) * | 2018-03-28 | 2021-04-06 | 深圳市三诺数字科技有限公司 | Voice recognition method and device |
CN108847221B (en) * | 2018-06-19 | 2021-06-15 | Oppo广东移动通信有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
CN110752973B (en) * | 2018-07-24 | 2020-12-25 | Tcl科技集团股份有限公司 | Terminal equipment control method and device and terminal equipment |
CN109003619A (en) * | 2018-07-24 | 2018-12-14 | Oppo(重庆)智能科技有限公司 | Voice data generation method and relevant apparatus |
CN108922531B (en) * | 2018-07-26 | 2020-10-27 | 腾讯科技(北京)有限公司 | Slot position identification method and device, electronic equipment and storage medium |
CN110853631A (en) * | 2018-08-02 | 2020-02-28 | 珠海格力电器股份有限公司 | Voice recognition method and device for smart home |
CN109145124B (en) * | 2018-08-16 | 2022-02-25 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN109192211A (en) * | 2018-10-29 | 2019-01-11 | 珠海格力电器股份有限公司 | Method, device and equipment for recognizing voice signal |
CN109448707A (en) * | 2018-12-18 | 2019-03-08 | 北京嘉楠捷思信息技术有限公司 | Voice recognition method and device, equipment and medium |
CN109637524A (en) * | 2019-01-18 | 2019-04-16 | 徐州工业职业技术学院 | Artificial intelligence interaction method and artificial intelligence interaction device |
CN110265001B (en) * | 2019-05-06 | 2023-06-23 | 平安科技(深圳)有限公司 | Corpus screening method and device for speech recognition training and computer equipment |
CN110288988A (en) * | 2019-05-16 | 2019-09-27 | 平安科技(深圳)有限公司 | Target data screening method, device and storage medium |
CN111862946B (en) * | 2019-05-17 | 2024-04-19 | 北京嘀嘀无限科技发展有限公司 | Order processing method and device, electronic equipment and storage medium |
CN110288976B (en) * | 2019-06-21 | 2021-09-07 | 北京声智科技有限公司 | Data screening method and device and intelligent sound box |
CN112329457A (en) * | 2019-07-17 | 2021-02-05 | 北京声智科技有限公司 | Input voice recognition method and related equipment |
WO2021134546A1 (en) * | 2019-12-31 | 2021-07-08 | 李庆远 | Input method for increasing speech recognition rate |
CN111292766B (en) * | 2020-02-07 | 2023-08-08 | 抖音视界有限公司 | Method, apparatus, electronic device and medium for generating voice samples |
CN113516994B (en) * | 2021-04-07 | 2022-04-26 | 北京大学深圳研究院 | Real-time voice recognition method, device, equipment and medium |
CN113422875B (en) * | 2021-06-22 | 2022-11-25 | 中国银行股份有限公司 | Voice seat response method, device, equipment and storage medium |
CN115457961B (en) * | 2022-11-10 | 2023-04-07 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle, server, system and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645064A (en) * | 2008-12-16 | 2010-02-10 | 中国科学院声学研究所 | Shallow natural spoken language understanding system and method |
CN102543071A (en) * | 2011-12-16 | 2012-07-04 | 安徽科大讯飞信息科技股份有限公司 | Voice recognition system and method used for mobile equipment |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN105096941A (en) * | 2015-09-02 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice recognition method and device |
CN106875936A (en) * | 2017-04-18 | 2017-06-20 | 广州视源电子科技股份有限公司 | Voice recognition method and device |
CN107146605A (en) * | 2017-04-10 | 2017-09-08 | 北京猎户星空科技有限公司 | Audio recognition method, device and electronic equipment |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101604520A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | Spoken language voice recognition method based on statistical model and syntax rule |
CN103366740B (en) * | 2012-03-27 | 2016-12-14 | 联想(北京)有限公司 | Voice command identification method and device |
CN103514878A (en) * | 2012-06-27 | 2014-01-15 | 北京百度网讯科技有限公司 | Acoustic modeling method and device, and speech recognition method and device |
CN103544952A (en) * | 2012-07-12 | 2014-01-29 | 百度在线网络技术(北京)有限公司 | Voice self-adaption method, device and system |
CN103680495B (en) * | 2012-09-26 | 2017-05-03 | 中国移动通信集团公司 | Speech recognition model training method, speech recognition model training device and speech recognition terminal |
US9275638B2 (en) * | 2013-03-12 | 2016-03-01 | Google Technology Holdings LLC | Method and apparatus for training a voice recognition model database |
CN104134439B (en) * | 2014-07-31 | 2018-01-12 | 深圳市金立通信设备有限公司 | Phrase acquisition method, apparatus and system |
CN104157286B (en) * | 2014-07-31 | 2017-12-29 | 深圳市金立通信设备有限公司 | Phrase acquisition method and device |
CN106601238A (en) * | 2015-10-14 | 2017-04-26 | 阿里巴巴集团控股有限公司 | Application operation processing method and application operation processing device |
2017
- 2017-09-20 CN CN201910473083.9A patent/CN110310623B/en active Active
- 2017-09-20 CN CN201710854125.4A patent/CN107481718B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645064A (en) * | 2008-12-16 | 2010-02-10 | 中国科学院声学研究所 | Shallow natural spoken language understanding system and method |
CN102543071A (en) * | 2011-12-16 | 2012-07-04 | 安徽科大讯飞信息科技股份有限公司 | Voice recognition system and method used for mobile equipment |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN105096941A (en) * | 2015-09-02 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice recognition method and device |
CN107146605A (en) * | 2017-04-10 | 2017-09-08 | 北京猎户星空科技有限公司 | Audio recognition method, device and electronic equipment |
CN106875936A (en) * | 2017-04-18 | 2017-06-20 | 广州视源电子科技股份有限公司 | Voice recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110310623A (en) | 2019-10-08 |
CN110310623B (en) | 2021-12-28 |
CN107481718A (en) | 2017-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107481718B (en) | Audio recognition method, device, storage medium and electronic equipment | |
CN110288978B (en) | Speech recognition model training method and device | |
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
EP3877975B1 (en) | Electronic device and method for outputting a speech signal | |
CN112259106B (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
EP3410435B1 (en) | Electronic apparatus for recognizing keyword included in your utterance to change to operating state and controlling method thereof | |
CN108182937B (en) | Keyword recognition method, device, equipment and storage medium | |
WO2021135577A9 (en) | Audio signal processing method and apparatus, electronic device, and storage medium | |
CN110534099A (en) | Voice wake-up processing method, device, storage medium and electronic equipment |
US20220172737A1 (en) | Speech signal processing method and speech separation method | |
WO2016150001A1 (en) | Speech recognition method, device and computer storage medium | |
CN110853618A (en) | Language identification method, model training method, device and equipment | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN110570873B (en) | Voiceprint wake-up method and device, computer equipment and storage medium | |
WO2015171646A1 (en) | Method and system for speech input | |
CN110232933A (en) | Audio detection method, device, storage medium and electronic equipment |
CN113643693B (en) | Acoustic model conditioned on sound characteristics | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
CN109036395A (en) | Personalized speaker control method, system, intelligent sound box and storage medium | |
CN109872713A (en) | Voice wake-up method and device |
CN113393828A (en) | Training method of voice synthesis model, and voice synthesis method and device | |
CN111326152A (en) | Voice control method and device | |
CN109065026B (en) | Recording control method and device | |
CN114120979A (en) | Optimization method, training method, device and medium of voice recognition model | |
CN111816180B (en) | Method, device, equipment, system and medium for controlling elevator based on voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | | Address after: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong; Applicant after: OPPO Guangdong Mobile Communications Co., Ltd. Address before: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong; Applicant before: Guangdong OPPO Mobile Communications Co., Ltd. |
GR01 | Patent grant | ||