CN107481718B - Audio recognition method, device, storage medium and electronic equipment - Google Patents
- Publication number
- CN107481718B CN107481718B CN201710854125.4A CN201710854125A CN107481718B CN 107481718 B CN107481718 B CN 107481718B CN 201710854125 A CN201710854125 A CN 201710854125A CN 107481718 B CN107481718 B CN 107481718B
- Authority
- CN
- China
- Prior art keywords
- voice data
- speech
- voice
- vector sequence
- characteristic vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- User Interface Of Digital Computer (AREA)
- Telephonic Communication Services (AREA)
Abstract
Embodiments of the present application disclose an audio recognition method, an apparatus, a storage medium, and an electronic device. The method includes: obtaining first voice data; inputting the first voice data into a pre-built screening model for screening, and obtaining the speech segment output by the screening model from which a set phonetic feature has been filtered out; and recognizing the speech segment to obtain the corresponding text. This technical solution effectively reduces the amount of computation in the speech recognition process and improves recognition speed.
Description
Technical field
Embodiments of the present application relate to speech recognition technology, and in particular to an audio recognition method, an apparatus, a storage medium, and an electronic device.
Background technique
With the rapid development of the science and technology applied to electronic devices, electronic devices now have powerful processing capabilities and have become indispensable tools in people's daily life, entertainment, and work.
Taking a smartphone as an example: in scenarios where it is inconvenient to operate the phone through its touch screen, such as while driving a vehicle or carrying luggage, the user can still operate the phone conveniently, because most current smartphones are equipped with a voice assistant. The voice assistant converts the voice data input by the user into text. However, when current speech recognition schemes perform speech recognition, they suffer from the defects of heavy computation and slow recognition speed.
Summary of the invention
Embodiments of the present application provide an audio recognition method, an apparatus, a storage medium, and an electronic device, which can reduce the amount of computation in the speech recognition process and improve recognition speed.
In a first aspect, an embodiment of the present application provides an audio recognition method, comprising:
obtaining first voice data;
inputting the first voice data into a pre-built screening model for screening, and obtaining the speech segment output by the screening model from which a set phonetic feature has been filtered out, wherein the screening model is trained on voice data samples to which phonetic features without actual meaning have been added; and
recognizing the speech segment to obtain the corresponding text.
In a second aspect, an embodiment of the present application further provides a speech recognition apparatus, comprising:
a voice obtaining module, configured to obtain first voice data;
a voice screening module, configured to input the first voice data into a pre-built screening model for screening and to obtain the speech segment output by the screening model from which a set phonetic feature has been filtered out, wherein the screening model is trained on voice data samples to which phonetic features without actual meaning have been added; and
a speech recognition module, configured to recognize the speech segment and obtain the corresponding text.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the audio recognition method described in the embodiments of the present application is implemented.
In a fourth aspect, an embodiment of the present application further provides an electronic device, comprising a voice collector for acquiring first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the audio recognition method described in the embodiments of the present application is implemented.
The present application provides a speech recognition scheme: obtain first voice data; input the first voice data into a pre-built screening model for screening, and obtain the speech segment output by the screening model from which a set phonetic feature has been filtered out; and recognize the speech segment to obtain the corresponding text. In this technical solution, the acquired first voice data is fed into the screening model before speech recognition. Because the training samples of the screening model are voice data samples to which phonetic features without actual meaning have been added, passing the first voice data through the screening model filters out the meaningless phonemes contained in the first voice data and yields a speech segment that no longer contains them. Consequently, the data volume of the speech segment output by the screening model is smaller than that of the first voice data. Recognizing this reduced speech segment effectively lowers the amount of computation in the speech recognition process and improves recognition speed.
Detailed description of the invention
Fig. 1 is a flowchart of an audio recognition method provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of the basic structure of a single neuron provided by an embodiment of the present application;
Fig. 3 is a flowchart of another audio recognition method provided by an embodiment of the present application;
Fig. 4 is a flowchart of yet another audio recognition method provided by an embodiment of the present application;
Fig. 5 is a structural block diagram of a speech recognition apparatus provided by an embodiment of the present application;
Fig. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the application rather than the entire structure.
Before the exemplary embodiments are discussed in detail, it should be mentioned that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the steps as a sequence, many of the steps can be implemented in parallel, concurrently, or simultaneously, and the order of the steps can be rearranged. A process may be terminated when its operations are completed, and it may also include additional steps not shown in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
In the related art, speech recognition generally comprises endpoint detection, feature extraction, and matching. To locate the exact start and end of speech, a dual-threshold comparison algorithm is usually adopted: the voice data is examined with both the short-time zero-crossing rate and the short-time average energy, and the two measurements are combined to determine the endpoints (start time and end time) of the voice signal. The essence of feature extraction is to convert the voice data from an analog signal into a digital one and to represent it by a series of characteristic parameters that reflect its features. Because Mel-frequency cepstral coefficients (MFCC) are derived from an auditory model of the human ear, they approximate human hearing characteristics well and can markedly improve recognition performance; the feature-extraction process is therefore illustrated here with MFCC extraction. MFCC extraction comprises the following steps: using a preset window function, frame the audio signal with a fixed frame length and frame shift (for example, a 25 ms frame length and a 10 ms frame shift); convert each time-domain frame into a power spectrum via the fast Fourier transform (FFT); pass the spectrum through a bank of Mel filters to obtain the Mel spectrum; and perform cepstral analysis on the Mel spectrum (taking the logarithm followed by the discrete cosine transform) to obtain the MFCC parameters. The MFCC parameters of each voice frame serve as that frame's speech feature vector sequence. The feature vector sequence of each frame is then input into a hidden Markov model, which outputs the state matched to the frame (the probabilities of the frame matching each state are compared, and the state with the highest probability is taken as the match). Every three states are grouped in sequence into a phoneme, and the pronunciation of a word is determined from the phonemes, thereby realizing speech recognition. However, this scheme cannot distinguish phonemes with actual meaning from meaningless ones (such as the filler words "this", "that", "what can I say", and "that is" in a user's speaking habits), so the amount of computation in the speech recognition process is large and recognition is slow.
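The MFCC pipeline described above (framing, FFT power spectrum, Mel filterbank, logarithm, DCT) can be sketched as follows. This is an illustrative simplification, not the patent's implementation: the 25 ms frame length and 10 ms frame shift follow the text, while the sample rate, FFT size, filter count, and number of coefficients are hypothetical choices.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    """Extract MFCC parameters per the related-art pipeline described above."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms frame length
    hop = int(sr * hop_ms / 1000)           # 10 ms frame shift
    # Framing with a preset window function (Hamming here)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # FFT: turn each time-domain frame into a power spectrum
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Bank of triangular Mel filters, evenly spaced on the Mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = power @ fbank.T
    # Cepstral analysis: take the logarithm, then the discrete cosine transform
    return dct(np.log(mel_spec + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is one frame's MFCC parameters, i.e. the speech feature vector of that voice frame.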
Fig. 1 is a flowchart of an audio recognition method provided by an embodiment of the present application. The method may be executed by a speech recognition apparatus, which can be implemented in software and/or hardware and is generally integrated in an electronic device. As shown in Fig. 1, the method comprises:
Step 110: obtain first voice data.
The first voice data comprises a voice signal input by the user, for example the voice signal input when the user uses the voice input function of a short-message application, a memo application, a mail application, or an instant messaging application.
A voice collector is integrated in the electronic device, and the first voice data can be obtained through it. The voice collector includes a microphone as well as wireless headsets such as Bluetooth headsets and infrared earphones. Illustratively, taking a smartphone as an example, when the user enables the voice input function of the short-message application, the message input mode can be switched from manual typing to voice input: the user gives a voice instruction to the smartphone, and the smartphone converts the corresponding voice signal into text displayed in the short-message interface. The voice signal corresponding to the user's instruction is preprocessed to obtain the first voice data, where the preprocessing includes filtering, analog-to-digital conversion, and the like. It should be noted that, because users tend to bring colloquial expressions into their speech, the first voice data may contain meaningless filler words such as "this", "that", "what can I say", and "that is".
Step 120: input the first voice data into the pre-built screening model for screening, and obtain the speech segment output by the screening model from which the set phonetic feature has been filtered out.
The screening model is trained on voice data samples to which phonetic features without actual meaning have been added. Illustratively, taking a neural network model as the screening model, the training process includes:
Model initialization: set the number of hidden layers; the number of nodes in the input layer, the hidden layers, and the output layer; and the connection weights between layers; initialize the thresholds of the hidden and output layers; and thereby obtain the preliminary framework of the neural network model.
Forward computation: compute the output parameters of the hidden layers and of the output layer according to the formulas of the neural network model, i.e., compute the output of the model from the results of the previous layer, the connection weights between adjacent layers, and the external bias value of each node.
Error calculation: adjust the parameters of the neural network model through supervised learning. The voice data the user input by voice in sent short messages, together with the corresponding text, is obtained as training material. Because the user confirmed sending those messages, they are data that have already been corrected, contain no meaningless words, and conform to the user's statement habits, so they can serve as standard voice data samples. Correspondingly, the desired output for a voice data sample is the speech (or pronunciation) of the corresponding text. Training samples are then obtained by adding phonetic features without actual meaning into the voice data samples. The meaningless phonetic features can be obtained by analyzing the statement habits of a sample population of a set size and taking the meaningless words with the highest probability of occurrence; alternatively, the user can select the meaningless words he or she commonly uses, or a program can count them.
The actual output and the desired output of the neural network model are computed, and the error signal between them is obtained. According to this error signal, the connection weight and external bias value of each neuron in the model are updated. Fig. 2 shows the basic structure of a single neuron provided by an embodiment of the present application: in Fig. 2, ωi1 is the connection weight between neuron i and a neuron of the previous layer, which can also be understood as the weight of input x1, and θi is the external bias of the neuron. According to the prediction error of the network, the error is propagated backwards and the connection weight and external bias value of each neuron are modified. Whether the algorithm iteration has finished is then judged; if so, the construction of the screening model is complete.
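As a rough illustration of the weighted-sum-plus-bias computation of Fig. 2 and the error-driven update of connection weights and external bias described above, the following minimal sketch trains a single sigmoid neuron by gradient descent. The learning rate, activation function, and toy data are hypothetical; the patent's actual screening model is a multi-layer recurrent network, which this does not reproduce.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron i: output = f(sum_j w_j * x_j + theta), as in Fig. 2
rng = np.random.default_rng(0)
w = rng.normal(size=2)           # connection weights (omega)
theta = 0.0                      # external bias of the neuron
x = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 1.])   # desired output (here: logical OR)

for _ in range(5000):
    y = sigmoid(x @ w + theta)   # actual output
    err = d - y                  # error signal: desired minus actual
    grad = err * y * (1 - y)     # propagated back through the sigmoid
    w += 0.5 * x.T @ grad        # modify connection weights
    theta += 0.5 * grad.sum()    # modify external bias value

print(np.round(sigmoid(x @ w + theta)))  # -> [0. 1. 1. 1.]
```

The update direction follows the error signal: a positive error raises the weights feeding active inputs, a negative error lowers them, which is the same feedback principle the multi-layer model applies per neuron.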
The first voice data is input into the built screening model. For the paths corresponding to pronunciations without actual meaning in the first voice data, the connection weights are small; as the input parameters are passed between the hidden layers, or from the hidden layers to the output layer, multiplication by these small connection weights shrinks the input parameters, and after repeated computation the phonetic features (e.g., phonemes) without actual meaning in the first voice data are filtered out. The output of the screening model is the speech segment from which the meaningless phonetic features have been filtered.
Step 130: recognize the speech segment to obtain the corresponding text.
The speech segment is compared with preset reference templates by distance: for each voice frame in the segment, the pronunciation in the reference template with the shortest distance is taken as the pronunciation of that frame, and the combination of the pronunciations of all frames is the speech of the segment. Once the speech of the segment is known, a preset dictionary can be queried to determine the corresponding text.
In the technical solution of this embodiment, the acquired first voice data is input into the screening model before speech recognition. Because the training samples of the screening model are voice data samples to which phonetic features without actual meaning have been added, passing the first voice data through the screening model filters out the meaningless phonemes it contains and yields a speech segment without them. The data volume of the speech segment output by the screening model is therefore smaller than that of the first voice data, and recognizing this reduced segment effectively lowers the amount of computation in the speech recognition process and improves recognition speed.
Fig. 3 is a flowchart of another audio recognition method provided by an embodiment of the present application. As shown in Fig. 3, the method comprises:
Step 301: obtain first voice data.
Step 302: judge whether the user corresponding to the first voice data is a registered user; if so, execute step 303; otherwise, execute step 306.
When the first voice data is detected, the camera of the electronic device is turned on and at least one frame of a user image is captured. Through image processing and image recognition of the user image, it is determined whether the user who input the first voice data is a registered user. This can be done by image matching: illustratively, a user image is obtained at registration time and used as a matching template; when the first voice data is detected, a user image is captured and matched against the template, thereby determining whether the user corresponding to the first voice data is registered.
Step 303: obtain the history voice data of at least one registered user, and determine each registered user's speech rate and pause interval from the history voice data.
When the user corresponding to the first voice data is a registered user, that user's history voice data is obtained. The history voice data includes the user's history call data, history voice control data, history voice messages, and the like. By analyzing the history voice data, the average speech rate and average pause interval of each registered user can be determined, where both averages are obtained by weighted calculation. Each registered user's speech rate and pause interval under different scenes can also be determined separately.
Step 304: query a preset framing strategy set according to the speech rate and pause habit, and determine the framing strategy corresponding to the registered user.
A framing strategy comprises the choice of window function and the values of the frame length and frame shift, and framing strategies are associated with the speech habits of different users. The framing strategy set is a collection of framing strategies that stores the correspondence between speech-rate intervals and pause-interval intervals on one side and window function, frame length, and frame shift on the other. Using the speech rate and pause interval determined in the previous step, the stored intervals are queried, the interval containing those values is located, and the window function, frame length, and frame shift of that interval are taken as the framing strategy for the current voice data input by the registered user.
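The framing strategy set of steps 303-304 can be pictured as a lookup table keyed by speech-rate and pause-interval ranges. The concrete ranges, window names, and millisecond values below are purely hypothetical illustrations; the patent does not specify them.

```python
# Hypothetical framing-strategy set: (rate range in words/sec, pause range in ms)
# -> (window function, frame length in ms, frame shift in ms)
FRAMING_STRATEGIES = [
    ((0.0, 2.5), (300, 10_000), ("hamming", 30, 12)),  # slow speaker, long pauses
    ((2.5, 4.0), (150, 300),    ("hamming", 25, 10)),  # average speaker
    ((4.0, 99.0), (0, 150),     ("hann",    20, 8)),   # fast speaker, short pauses
]

def framing_strategy(rate, pause_ms, default=("hamming", 25, 10)):
    """Locate the interval containing the user's rate and pause, return its strategy."""
    for (r_lo, r_hi), (p_lo, p_hi), strategy in FRAMING_STRATEGIES:
        if r_lo <= rate < r_hi and p_lo <= pause_ms < p_hi:
            return strategy
    # Unregistered user, or no matching interval: fall back to the default strategy
    return default

print(framing_strategy(3.1, 200))  # -> ('hamming', 25, 10)
```

The fallback branch corresponds to step 306, where a default framing strategy is applied when no personalized strategy is available.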
Step 305: perform framing on the first voice data according to the registered user's framing strategy to obtain at least two pieces of second voice data, and then execute step 307.
Because voice data is stationary only over a sufficiently short time, it needs to be divided into short time segments, i.e., voice frames. Illustratively, the first voice data is processed with the window function of the framing strategy determined above, using the frame shift included in that strategy, to obtain at least two pieces of second voice data; the window length of the window function equals the frame length of the framing strategy. After at least two pieces of second voice data are obtained, step 307 is executed. The division of the first voice data thus depends on the registered user's speech rate and pause interval: the frame length of the second voice data changes with them instead of being fixed, which reduces the cases in which speech with actual meaning and speech without actual meaning are divided into the same voice frame and helps improve the efficiency of speech recognition.
Step 306: perform framing on the first voice data according to a default framing strategy to obtain at least two pieces of second voice data.
When the user corresponding to the first voice data is not a registered user, the first voice data is processed with the default window function and the default frame shift to obtain at least two pieces of second voice data, where the window length of the window function is the default frame length. With this fixed, unchanging frame length, speech with actual meaning and speech without actual meaning are more often divided into the same voice frame.
Step 307: extract the first speech feature vector sequence corresponding to the second voice data.
The first speech feature vector sequence comprises MFCC features, which are extracted from the second voice data as follows: filter the spectrogram of the second voice data through a bank of Mel filters to obtain the Mel spectrum; perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients; and use these coefficients as the feature vector input to the screening model, i.e., the first speech feature vector sequence.
Step 308: normalize the first speech feature vector sequence and input it into the pre-built recurrent neural network model for screening.
Optionally, before the first speech feature vector sequence is input into the pre-built recurrent neural network model, it can be normalized; it should be understood that this normalization step is not mandatory. Normalization maps every element of the first speech feature vector sequences to a number in [0, 1] or [-1, 1], which eliminates the influence of the unit and range differences of the input data on speech recognition and reduces recognition error. After normalization, the first speech feature vector sequence is input into the pre-built neural network model for screening, where the neural network model is a recurrent neural network model.
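The optional normalization in step 308, mapping every element of the feature vector sequence into [0, 1] or [-1, 1] to remove unit and range differences, is a standard min-max scaling. A minimal sketch, with the target interval as a parameter:

```python
import numpy as np

def minmax_normalize(features, lo=0.0, hi=1.0):
    """Map a feature-vector sequence elementwise into [lo, hi] (min-max scaling)."""
    fmin, fmax = features.min(), features.max()
    if fmax == fmin:                 # constant input: avoid division by zero
        return np.full_like(features, lo)
    scaled = (features - fmin) / (fmax - fmin)   # now in [0, 1]
    return lo + scaled * (hi - lo)               # rescale to [lo, hi]

seq = np.array([[12.0, -3.0], [7.5, 0.0]])
print(minmax_normalize(seq))             # values lie in [0, 1]
print(minmax_normalize(seq, -1.0, 1.0))  # values lie in [-1, 1]
```

Scaling here is over the whole sequence; per-dimension scaling (normalizing each feature column separately) is an equally plausible reading of the text.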
Step 309: obtain the output result of the recurrent neural network model, where the output result is the second speech feature vector sequence with the meaningless phonemes filtered out.
A phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme, and phonemes include vowels and consonants. Because the recurrent neural network model was learned and trained on samples to which meaningless phonemes had been added, its output is the speech segment with the meaningless phonemes filtered out. Therefore, after the first speech feature vector sequence is input into the recurrent neural network model, the output is the second speech feature vector sequence from which the meaningless phonemes have been removed.
Step 310: judge whether the length of the second speech feature vector sequence equals the length of the preset reference template; if so, execute step 313; otherwise, execute step 311.
The length of the second speech feature vector sequence is compared with the length of the preset reference template: if the lengths differ, step 311 is executed; if they are equal, step 313 is executed.
Step 311: calculate the frame-matching distance between the second speech feature vector sequence and the reference template using a dynamic time warping algorithm.
Dynamic time warping (DTW) is a method of measuring the similarity between two time sequences; in the field of speech recognition it is mainly used to identify whether two pieces of speech represent the same word. Illustratively, if the length of the second speech feature vector sequence differs from that of the preset reference template, the frame-matching distance matrix between the sequence and the template is computed with the DTW algorithm, and an optimal path, i.e. the path corresponding to the minimum matching distance, is found in that matrix.
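The frame-matching distance of step 311 can be computed with the classic DTW dynamic program. The sketch below is a textbook formulation rather than the patent's specific implementation: it uses Euclidean distances between frames and returns the minimum cumulative matching distance over all warping paths.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame-matching distance between two
    feature-vector sequences of (possibly) different lengths."""
    n, m = len(seq_a), len(seq_b)
    # Frame-matching distance matrix: Euclidean distance between every frame pair
    dist = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Optimal path: extend the cheapest of the three predecessors
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# Compare an input sequence against two reference templates of different lengths
query = np.array([[0.0], [1.0], [2.0], [3.0]])
ref_a = np.array([[0.0], [1.5], [3.0]])
ref_b = np.array([[3.0], [2.0], [0.0]])
print(dtw_distance(query, ref_a) < dtw_distance(query, ref_b))  # -> True
```

The template with the smallest cumulative distance then supplies the pronunciation, as described in step 312.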
Step 312: determine the pronunciation corresponding to the minimum frame-matching distance, and then execute step 314.
The endpoint corresponding to the minimum frame-matching distance identifies the speech in the reference template matched to the second speech feature vector sequence, and the speech of that reference template is taken as the pronunciation of the second speech feature vector sequence.
Step 313: directly match the second speech feature vector sequence against the reference template and determine the pronunciation corresponding to the speech segment.
If the second speech feature vector sequence and the preset reference template have the same length, they are matched directly, thereby determining the pronunciation corresponding to the speech segment.
Step 314: match the corresponding text according to the pronunciation, as the speech recognition result.
In the technical solution of this embodiment, a framing strategy is determined according to the user's speech rate and pause interval before speech recognition, and the first voice data is framed using this personalized strategy. Personalized framing effectively reduces the number of frames in which speech features with physical meaning coexist with speech features without physical meaning. Inputting the first speech feature vector sequence corresponding to the framed second voice data into the screening model can then further improve recognition efficiency.
Fig. 4 is a flowchart of another speech recognition method provided by an embodiment of the present application. As shown in Fig. 4, the method includes:
Step 401: judge whether the model update condition is met; if so, execute step 402; otherwise, execute step 408.
The model update condition may be that the system time reaches a preset time, or that a preset update cycle has elapsed. For example, if the model update condition is set to update the screening model at 12 o'clock every Friday night, then when the system time is detected to be 12 o'clock on Friday night, the model update condition is determined to be currently met. As another example, if the model update condition is set to update once every 7 days, then when the time since the last model update is detected to have reached the update cycle, the model update condition is determined to be currently met.
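The two example conditions above can be sketched as simple time checks. The weekday and hour values and the 7-day cycle are taken from the examples in the text; the function names are invented for illustration.

```python
from datetime import datetime, timedelta

def meets_update_cycle(now, last_update, cycle_days=7):
    """Condition 2 from the text: a preset update cycle has elapsed
    since the last model update."""
    return now - last_update >= timedelta(days=cycle_days)

def is_weekly_slot(now, weekday=4, hour=23):
    """Condition 1 from the text: the system time reaches a preset
    weekly time (here late Friday; weekday 4 is Friday)."""
    return now.weekday() == weekday and now.hour == hour
```

In practice the model update flow would evaluate either condition on each check and branch to step 402 or step 408 accordingly.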
Step 402: obtain sent short messages that were input by voice and/or stored memos that were input by voice.
Sent short messages and stored memos entered through voice input are obtained. Since a short message that the user has confirmed and sent can be regarded as adjusted data that contains no vocabulary without physical meaning and conforms to the user's expression habits, it can serve as a standard voice data sample. A saved memo can likewise be regarded as adjusted data without meaningless vocabulary that conforms to the user's expression habits, and can also serve as a standard voice data sample.
The speech feature vector sequence of the voice data corresponding to the body content of each short message sent through voice input is saved in advance, and the voice data dictated by the user is saved correspondingly as history voice data. For example, when sending a short message through voice input, the voice data dictated by the user may be "about this problem, what can I say, it is really hard to solve", while the short message actually sent after processing is "about this problem, it is really hard to solve". The speech feature vector sequence of the dictated voice data is stored in correspondence with the voice data of the short message actually sent.
Step 403: obtain the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo.
The speech feature vector sequence of the voice data in the body content of the sent short message is obtained. Optionally, the speech feature vector sequence of the voice data in the body content of a stored memo may also be obtained.
Step 404: obtain the history voice data of the short message and/or memo.
The content dictated by the user when composing the sent short message is obtained as history voice data. Optionally, the content dictated by the user when composing a stored memo may also be obtained as history voice data.
Step 405: determine, according to the history voice data, the personalized phonemes without physical meaning and the positions where those phonemes appear.
By analyzing the history voice data, the speech habits of a particular user can be derived, i.e., the phonemes without physical meaning and the positions where they occur. For example, a user may habitually insert the meaningless phrase "what can I say" in the middle of sentences when using voice input.
Step 406: add the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and train the screening model in a supervised learning manner with the speech feature vector sequence as the desired output.
Normalizing the training samples eliminates the influence of differences in the units and ranges of the input data on speech recognition; at the same time, it helps map the input data into the effective range of the activation function, reducing both the network training error and the training time.
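The normalization step above can be sketched with per-dimension min-max scaling, one common choice for mapping features into the effective range of an activation function. This is an illustrative sketch; the patent does not specify which normalization is used.

```python
def min_max_normalize(sequence):
    """Scale each feature dimension of a feature vector sequence into [0, 1],
    so all dimensions share a comparable range before entering the network."""
    dims = len(sequence[0])
    lo = [min(frame[d] for frame in sequence) for d in range(dims)]
    hi = [max(frame[d] for frame in sequence) for d in range(dims)]
    out = []
    for frame in sequence:
        out.append([
            (v - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
            for d, v in enumerate(frame)
        ])
    return out
```

After this transformation every dimension of every frame lies in [0, 1], so no single feature dominates the network's weighted sums purely because of its unit or range.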
Step 407: adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
The network prediction error can be determined by analyzing the training samples and the desired output. The connection weight and external bias value of each neuron are then modified according to the error, which is propagated backward through the neural network model from the output layer to the input layer.
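For a single linear neuron, the error-driven update of connection weights and external bias described above reduces to the delta rule. This is a deliberately minimal sketch of the parameter-adjustment idea, not the recurrent network used by the screening model; the learning rate is an assumption.

```python
def train_step(weights, bias, x, target, lr=0.1):
    """One supervised update: compute the prediction error at the output and
    modify each connection weight and the external bias accordingly."""
    y = sum(w * xi for w, xi in zip(weights, x)) + bias  # forward pass
    err = y - target                                     # network prediction error
    # move each parameter against the gradient of the squared error
    new_w = [w - lr * err * xi for w, xi in zip(weights, x)]
    new_b = bias - lr * err
    return new_w, new_b, err
```

Repeating this step over the training samples drives the prediction toward the desired output; in the full model the same error signal is propagated back through every layer.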
Step 408: obtain the first voice data.
If the first voice data is obtained while the above model update process has not yet finished, the first voice data is not recognized, and the user is prompted that a screening model update is currently in progress.
Step 409: input the first voice data into the pre-constructed screening model for screening, and obtain the speech fragment, output by the screening model, from which set speech features have been filtered out.
If the first voice data is obtained while no model update operation is being executed, the first voice data is input into the screening model, which screens it and outputs the speech fragment from which the speech features without physical meaning have been filtered out.
Step 410: recognize the speech fragment to obtain the corresponding text.
Step 411: judge whether the text is command information; if so, execute step 412; otherwise, execute step 413.
A whitelist storing the associations between character combinations and command information is established in advance. When the text corresponding to the speech fragment is recognized, the whitelist is queried with the character combination of the text. If the corresponding character combination is found in the whitelist, the text corresponding to the speech fragment is determined to represent command information, and step 412 is executed. If the corresponding character combination is not found in the whitelist, the user is prompted to choose whether it is command information. If the user indicates that the text corresponding to the speech fragment represents command information, the character combination the user identified as command information is added to the whitelist, and step 412 is executed. If the user indicates that the text corresponding to the speech fragment does not represent command information, step 413 is executed.
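The whitelist lookup and user-confirmation flow above can be sketched as follows. The command names, handler identifiers, and the `confirm_as_command` parameter (standing in for the user prompt) are invented for illustration.

```python
# Hypothetical whitelist mapping character combinations to command information.
COMMAND_WHITELIST = {
    "open camera": "CAMERA_LAUNCH",
    "call mom": "DIAL_CONTACT",
}

def classify_text(text, confirm_as_command=None):
    """Return ('command', info) if the text is whitelisted; otherwise fall
    back to the user's choice and grow the whitelist on confirmation."""
    if text in COMMAND_WHITELIST:                # step 412 path
        return ("command", COMMAND_WHITELIST[text])
    if confirm_as_command:                       # user chose "this is a command"
        COMMAND_WHITELIST[text] = confirm_as_command
        return ("command", confirm_as_command)
    return ("display", text)                     # step 413: show in the UI
```

Once a combination is confirmed, subsequent lookups hit the whitelist directly without prompting the user again.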
Step 412: execute the operation corresponding to the command information.
Step 413: display the text in the user interface.
In the technical solution of this embodiment, when the update condition of the screening model is met, the sent short messages input by voice and/or the stored memos input by voice are used as training samples to train the screening model. This allows the output of the screening model to adapt to changes in the user's expression habits, effectively reducing the false recognition rate and the missed detection rate.
Fig. 5 is a structural block diagram of a speech recognition device provided by an embodiment of the present application. The device can be implemented in software and/or hardware and is typically integrated in an electronic device. As shown in Fig. 5, the device may include:
Voice obtaining module 510, configured to obtain first voice data.
Voice screening module 520, configured to input the first voice data into a pre-constructed screening model for screening and obtain the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added.
Speech recognition module 530, configured to recognize the speech fragment to obtain the corresponding text.
This embodiment of the present application provides a speech recognition device that inputs the obtained first voice data into the screening model before speech recognition. Since the training samples of the screening model are voice data samples to which speech features without physical meaning have been added, processing the first voice data through the screening model can filter out the phonemes without physical meaning contained in the first voice data, yielding a speech fragment that does not contain such phonemes. The data volume of the speech fragment output by the screening model is therefore smaller than that of the first voice data. Recognizing this reduced speech fragment can effectively reduce the amount of calculation in the speech recognition process and improve the recognition speed.
Optionally, the device further includes:
a user judgment module, configured to judge, when the first voice data is detected, whether the user corresponding to the first voice data is a registered user;
and further includes:
a framing module, configured to determine a corresponding framing strategy according to the judgment result before the first voice data is input into the pre-constructed screening model for screening, and to frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data;
wherein the framing strategy includes the selection of a window function, the value of the frame length and the value of the frame shift, and the framing strategy is associated with the speech habits of different users.
Optionally, the framing module is specifically configured to:
obtain history voice data of at least one registered user, and determine the speech rate and pause interval of each registered user according to the history voice data; and
query a preset framing strategy set according to the speech rate and pause interval to determine the framing strategy corresponding to the registered user.
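The lookup of a framing strategy from a preset set can be sketched as follows. The thresholds on speech rate (words per minute) and pause interval, and the window/frame-length/frame-shift values, are illustrative assumptions, not the patent's preset strategy set.

```python
# Hypothetical preset framing strategy set, keyed on upper bounds for
# speech rate (words/min) and average pause interval (seconds).
FRAMING_STRATEGIES = [
    # (max_rate, max_pause, window, frame_len_ms, frame_shift_ms)
    (120, 0.8, "hamming", 32, 16),   # slow speakers: longer frames
    (180, 0.5, "hamming", 25, 10),   # typical speakers
    (999, 9.9, "hanning", 20, 8),    # fast speakers: shorter frames
]

def pick_framing_strategy(rate_wpm, pause_s):
    """Return the window function, frame length and frame shift matched to a
    registered user's speech rate and pause interval."""
    for max_rate, max_pause, window, flen, fshift in FRAMING_STRATEGIES:
        if rate_wpm <= max_rate and pause_s <= max_pause:
            return {"window": window, "frame_len_ms": flen, "frame_shift_ms": fshift}
    # fall back to the last (fast-speaker) strategy
    _, _, window, flen, fshift = FRAMING_STRATEGIES[-1]
    return {"window": window, "frame_len_ms": flen, "frame_shift_ms": fshift}
```

A slower speaker with long pauses thus gets longer frames and larger frame shifts, matching the idea that the framing strategy is tied to individual speech habits.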
Optionally, the voice screening module 520 is specifically configured to:
extract the first speech feature vector sequence corresponding to the second voice data;
normalize the first speech feature vector sequence and input it into the pre-constructed recurrent neural network model for screening; and
obtain the output result of the recurrent neural network model, wherein the output result is the second speech feature vector sequence from which phonemes without physical meaning have been filtered out.
Optionally, the speech recognition module 530 is specifically configured to:
judge whether the second speech feature vector sequence and the preset reference template are equal in length;
when they are unequal, calculate the frame matching distance between the second speech feature vector sequence and the reference template using the dynamic time warping algorithm; and
determine the pronunciation corresponding to the minimum frame matching distance, and take the text matched to the pronunciation as the speech recognition result.
Optionally, the device further includes:
a text processing module, configured to judge, after the speech fragment is recognized to obtain the corresponding text, whether the text is command information;
if so, execute the operation corresponding to the command information;
if not, display the text in the user interface.
Optionally, the device further includes:
a model update module, configured to, when the model update condition is met, obtain sent short messages input by voice and/or stored memos input by voice;
obtain the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo;
obtain the history voice data of the short message and/or memo;
determine the personalized phonemes without physical meaning and the positions where those phonemes appear according to the history voice data;
add the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and train the screening model in a supervised learning manner with the speech feature vector sequence as the desired output; and
adjust the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
An embodiment of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform a speech recognition method comprising:
obtaining first voice data;
inputting the first voice data into a pre-constructed screening model for screening, and obtaining the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
recognizing the speech fragment to obtain the corresponding text.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media, such as CD-ROMs, floppy disks or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory or magnetic media (e.g., a hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the first computer system in which the program is executed, or in a different, second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide the program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the present application, the computer-executable instructions are not limited to the speech recognition operations described above, and may also perform related operations in the speech recognition method provided by any embodiment of the present application.
An embodiment of the present application provides an electronic device in which the speech recognition device provided by the embodiments of the present application can be integrated. Such electronic devices include smart phones, tablet computers, handheld devices, laptops, smart watches, and the like. Fig. 6 is a structural block diagram of an electronic device provided by an embodiment of the present application. As shown in Fig. 6, the electronic device may include: a memory 601, a central processing unit (CPU) 602 (also known as a processor, hereinafter referred to as CPU), a voice collector 606 and a touch screen 611. The touch screen 611 is used to convert user operations into electric signals input to the processor and to display visual output signals; the voice collector 606 is used to collect first voice data; the memory 601 is used to store a computer program; and the CPU 602 reads and executes the computer program stored in the memory 601. When executing the computer program, the CPU 602 performs the following steps: obtaining first voice data; inputting the first voice data into a pre-constructed screening model for screening, and obtaining the speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and recognizing the speech fragment to obtain the corresponding text.
The electronic device further includes: a peripheral interface 603, an RF (radio frequency) circuit 605, a power management chip 608, an input/output (I/O) subsystem 609, other input/control devices 610 and an external port 604, and these components communicate through one or more communication buses or signal lines 607.
It should be understood that the illustrated electronic device 600 is only one example of an electronic device, and the electronic device 600 may have more or fewer components than shown in the drawings, may combine two or more components, or may be configured with different components. The various components shown in the drawings may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
The electronic device integrated with the speech recognition device provided in this embodiment is described in detail below, taking a mobile phone as an example.
Memory 601: the memory 601 can be accessed by the CPU 602, the peripheral interface 603, etc., and may include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices or other volatile solid-state storage components.
Peripheral interface 603: the peripheral interface 603 can connect the input and output peripherals of the device to the CPU 602 and the memory 601.
I/O subsystem 609: the I/O subsystem 609 can connect the input/output peripherals of the device, such as the touch screen 611 and other input/control devices 610, to the peripheral interface 603. The I/O subsystem 609 may include a display controller 6091 and one or more input controllers 6092 for controlling the other input/control devices 610. The one or more input controllers 6092 receive electric signals from, or send electric signals to, the other input/control devices 610, which may include physical buttons (push buttons, rocker buttons, etc.), dials, slide switches, joysticks and click wheels. It is worth noting that an input controller 6092 can be connected to any of the following: a keyboard, an infrared port, a USB interface, or a pointing device such as a mouse.
The display controller 6091 in the I/O subsystem 609 receives electric signals from, or sends electric signals to, the touch screen 611. The touch screen 611 detects contact on the touch screen, and the display controller 6091 converts the detected contact into interaction with the user interface objects displayed on the touch screen 611, thereby realizing human-computer interaction. The user interface objects displayed on the touch screen 611 may be icons of running games, icons for connecting to corresponding networks, and the like. It is worth noting that the device may also include an optical mouse, which is a touch-sensitive surface that does not display visual output, or an extension of the touch-sensitive surface formed by the touch screen.
The RF circuit 605 is mainly used to establish communication between the mobile phone and the wireless network (i.e., the network side) and to realize data reception and transmission between the mobile phone and the wireless network, for example sending and receiving short messages, e-mails, and so on. Specifically, the RF circuit 605 receives and sends RF signals, also called electromagnetic signals: the RF circuit 605 converts electric signals into electromagnetic signals or converts electromagnetic signals into electric signals, and communicates with communication networks and other devices through the electromagnetic signals. The RF circuit 605 may include known circuits for performing these functions, including but not limited to an antenna system, an RF transceiver, one or more amplifiers, a tuner, one or more oscillators, a digital signal processor, a CODEC (coder-decoder) chipset, a subscriber identity module (SIM), and so on.
Voice collector 606: including a transmitter (microphone) and wireless headsets such as Bluetooth headsets and infrared headsets, it is mainly used to receive audio data and convert the audio data into an electric signal.
Power management chip 608: used for supplying power to, and managing the power of, the hardware connected to the CPU 602, the I/O subsystem and the peripheral interface.
In the electronic device provided by this embodiment of the present application, the obtained first voice data is input into the screening model before speech recognition. Since the training samples of the screening model are voice data samples to which speech features without physical meaning have been added, processing the first voice data through the screening model can filter out the phonemes without physical meaning contained in the first voice data, yielding a speech fragment that does not contain such phonemes. The data volume of the speech fragment output by the screening model is therefore smaller than that of the first voice data. Recognizing this reduced speech fragment can effectively reduce the amount of calculation in the speech recognition process and improve the recognition speed.
The speech recognition device, storage medium and electronic device provided in the above embodiments can execute the speech recognition method provided by any embodiment of the present application, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the speech recognition method provided by any embodiment of the present application.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will appreciate that the application is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the application. Therefore, although the application has been described in further detail through the above embodiments, it is not limited to them and may include many other equivalent embodiments without departing from the concept of the application, and the scope of the application is determined by the scope of the appended claims.
Claims (9)
1. A speech recognition method, comprising:
obtaining first voice data;
when the first voice data is detected, judging whether a user corresponding to the first voice data is a registered user;
determining a corresponding framing strategy according to the judgment result, and framing the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises a selection of a window function, a value of a frame length and a value of a frame shift, and the framing strategy is associated with the speech habits of different users;
extracting a first speech feature vector sequence corresponding to the second voice data, and inputting the first speech feature vector sequence into a pre-constructed screening model for screening to obtain a speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
recognizing the speech fragment to obtain the corresponding text.
2. The method according to claim 1, wherein determining a corresponding framing strategy according to the judgment result comprises:
obtaining history voice data of at least one registered user, and determining the speech rate and pause interval of each registered user according to the history voice data; and
querying a preset framing strategy set according to the speech rate and pause interval to determine the framing strategy corresponding to the registered user.
3. The method according to claim 1, wherein extracting the first speech feature vector sequence corresponding to the second voice data and inputting the first speech feature vector sequence into the pre-constructed screening model for screening comprises:
extracting the first speech feature vector sequence corresponding to the second voice data;
normalizing the first speech feature vector sequence, and inputting it into a pre-constructed recurrent neural network model for screening; and
obtaining an output result of the recurrent neural network model, wherein the output result is a second speech feature vector sequence from which phonemes without physical meaning have been filtered out.
4. The method according to claim 3, wherein recognizing the speech fragment to obtain the corresponding text comprises:
judging whether the second speech feature vector sequence and a preset reference template are equal in length;
when they are unequal, calculating the frame matching distance between the second speech feature vector sequence and the reference template using a dynamic time warping algorithm; and
determining the pronunciation corresponding to the minimum frame matching distance, and taking the text matched to the pronunciation as the speech recognition result.
5. The method according to claim 1, further comprising, after recognizing the speech fragment to obtain the corresponding text:
judging whether the text is command information;
if so, executing the operation corresponding to the command information; and
if not, displaying the text in the user interface.
6. The method according to any one of claims 1 to 5, further comprising:
when a model update condition is met, obtaining sent short messages input by voice and/or stored memos input by voice;
obtaining the speech feature vector sequence of the voice data corresponding to the body content of the short message and/or memo;
obtaining the history voice data of the short message and/or memo;
determining personalized phonemes without physical meaning and the positions where those phonemes appear according to the history voice data;
adding the phonemes into the speech feature vector sequence according to the appearance positions as training samples, and training the screening model in a supervised learning manner with the speech feature vector sequence as the desired output; and
adjusting the parameters of the screening model according to the training result, the parameters including connection weights and external bias values.
7. A speech recognition device, comprising:
a voice obtaining module, configured to obtain first voice data;
a user judgment module, configured to judge, when the first voice data is detected, whether the user corresponding to the first voice data is a registered user;
a framing module, configured to determine a corresponding framing strategy according to the judgment result and frame the first voice data according to the framing strategy to obtain at least two pieces of second voice data, wherein the framing strategy comprises a selection of a window function, a value of a frame length and a value of a frame shift, and the framing strategy is associated with the speech habits of different users;
a voice screening module, configured to extract the first speech feature vector sequence corresponding to the second voice data and input the first speech feature vector sequence into a pre-constructed screening model for screening, to obtain a speech fragment, output by the screening model, from which set speech features have been filtered out, wherein the screening model is trained on voice data samples to which speech features without physical meaning have been added; and
a speech recognition module, configured to recognize the speech fragment to obtain the corresponding text.
8. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 6.
9. An electronic device, comprising a voice collector for collecting first voice data, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech recognition method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910473083.9A CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN201710854125.4A CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710854125.4A CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Division CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107481718A CN107481718A (en) | 2017-12-15 |
CN107481718B true CN107481718B (en) | 2019-07-05 |
Family
ID=60587053
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Active CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN201710854125.4A Active CN107481718B (en) | 2017-09-20 | 2017-09-20 | Audio recognition method, device, storage medium and electronic equipment |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910473083.9A Active CN110310623B (en) | 2017-09-20 | 2017-09-20 | Sample generation method, model training method, device, medium, and electronic apparatus |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110310623B (en) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018176387A1 (en) * | 2017-03-31 | 2018-10-04 | 深圳市红昌机电设备有限公司 | Voice control method and system for winding-type coil winder |
CN108717851B (en) * | 2018-03-28 | 2021-04-06 | 深圳市三诺数字科技有限公司 | Voice recognition method and device |
CN108847221B (en) * | 2018-06-19 | 2021-06-15 | Oppo广东移动通信有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
CN110752973B (en) * | 2018-07-24 | 2020-12-25 | Tcl科技集团股份有限公司 | Terminal equipment control method and device and terminal equipment |
CN109003619A (en) * | 2018-07-24 | 2018-12-14 | Oppo(重庆)智能科技有限公司 | Voice data generation method and relevant apparatus |
CN108922531B (en) * | 2018-07-26 | 2020-10-27 | 腾讯科技(北京)有限公司 | Slot position identification method and device, electronic equipment and storage medium |
CN110853631A (en) * | 2018-08-02 | 2020-02-28 | 珠海格力电器股份有限公司 | Voice recognition method and device for smart home |
CN109145124B (en) * | 2018-08-16 | 2022-02-25 | 格力电器(武汉)有限公司 | Information storage method and device, storage medium and electronic device |
CN109192211A (en) * | 2018-10-29 | 2019-01-11 | 珠海格力电器股份有限公司 | Method, device and equipment for recognizing voice signal |
CN109448707A (en) * | 2018-12-18 | 2019-03-08 | 北京嘉楠捷思信息技术有限公司 | Voice recognition method and device, equipment and medium |
CN109637524A (en) * | 2019-01-18 | 2019-04-16 | 徐州工业职业技术学院 | Artificial intelligence interaction method and artificial intelligence interaction device |
CN110265001B (en) * | 2019-05-06 | 2023-06-23 | 平安科技(深圳)有限公司 | Corpus screening method and device for speech recognition training and computer equipment |
CN110288988A (en) * | 2019-05-16 | 2019-09-27 | 平安科技(深圳)有限公司 | Target data screening method, device and storage medium |
CN111862946B (en) * | 2019-05-17 | 2024-04-19 | 北京嘀嘀无限科技发展有限公司 | Order processing method and device, electronic equipment and storage medium |
CN110288976B (en) * | 2019-06-21 | 2021-09-07 | 北京声智科技有限公司 | Data screening method and device and intelligent sound box |
CN112329457A (en) * | 2019-07-17 | 2021-02-05 | 北京声智科技有限公司 | Input voice recognition method and related equipment |
WO2021134546A1 (en) * | 2019-12-31 | 2021-07-08 | 李庆远 | Input method for increasing speech recognition rate |
CN111292766B (en) * | 2020-02-07 | 2023-08-08 | 抖音视界有限公司 | Method, apparatus, electronic device and medium for generating voice samples |
CN113516994B (en) * | 2021-04-07 | 2022-04-26 | 北京大学深圳研究院 | Real-time voice recognition method, device, equipment and medium |
CN113422875B (en) * | 2021-06-22 | 2022-11-25 | 中国银行股份有限公司 | Voice seat response method, device, equipment and storage medium |
CN115457961B (en) * | 2022-11-10 | 2023-04-07 | 广州小鹏汽车科技有限公司 | Voice interaction method, vehicle, server, system and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645064A (en) * | 2008-12-16 | 2010-02-10 | 中国科学院声学研究所 | Shallow natural spoken language understanding system and method |
CN102543071A (en) * | 2011-12-16 | 2012-07-04 | 安徽科大讯飞信息科技股份有限公司 | Voice recognition system and method used for mobile equipment |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN105096941A (en) * | 2015-09-02 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice recognition method and device |
CN106875936A (en) * | 2017-04-18 | 2017-06-20 | 广州视源电子科技股份有限公司 | Voice recognition method and device |
CN107146605A (en) * | 2017-04-10 | 2017-09-08 | 北京猎户星空科技有限公司 | Audio recognition method, device and electronic equipment |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101604520A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | Spoken language voice recognition method based on statistical model and syntax rule |
CN103366740B (en) * | 2012-03-27 | 2016-12-14 | 联想(北京)有限公司 | Voice command identification method and device |
CN103514878A (en) * | 2012-06-27 | 2014-01-15 | 北京百度网讯科技有限公司 | Acoustic modeling method and device, and speech recognition method and device |
CN103544952A (en) * | 2012-07-12 | 2014-01-29 | 百度在线网络技术(北京)有限公司 | Voice self-adaption method, device and system |
CN103680495B (en) * | 2012-09-26 | 2017-05-03 | 中国移动通信集团公司 | Speech recognition model training method, speech recognition model training device and speech recognition terminal |
US9275638B2 (en) * | 2013-03-12 | 2016-03-01 | Google Technology Holdings LLC | Method and apparatus for training a voice recognition model database |
CN104134439B (en) * | 2014-07-31 | 2018-01-12 | 深圳市金立通信设备有限公司 | Phrase acquisition method, apparatus and system |
CN104157286B (en) * | 2014-07-31 | 2017-12-29 | 深圳市金立通信设备有限公司 | Phrase acquisition method and device |
CN106601238A (en) * | 2015-10-14 | 2017-04-26 | 阿里巴巴集团控股有限公司 | Application operation processing method and application operation processing device |
2017
- 2017-09-20 CN CN201910473083.9A patent/CN110310623B/en active Active
- 2017-09-20 CN CN201710854125.4A patent/CN107481718B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101645064A (en) * | 2008-12-16 | 2010-02-10 | 中国科学院声学研究所 | Shallow natural spoken language understanding system and method |
CN102543071A (en) * | 2011-12-16 | 2012-07-04 | 安徽科大讯飞信息科技股份有限公司 | Voice recognition system and method used for mobile equipment |
CN104143326A (en) * | 2013-12-03 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Voice command recognition method and device |
CN105096941A (en) * | 2015-09-02 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice recognition method and device |
CN107146605A (en) * | 2017-04-10 | 2017-09-08 | 北京猎户星空科技有限公司 | Audio recognition method, device and electronic equipment |
CN106875936A (en) * | 2017-04-18 | 2017-06-20 | 广州视源电子科技股份有限公司 | Voice recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110310623A (en) | 2019-10-08 |
CN110310623B (en) | 2021-12-28 |
CN107481718A (en) | 2017-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107481718B (en) | Audio recognition method, device, storage medium and electronic equipment | |
CN110288978B (en) | Speech recognition model training method and device | |
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
EP3877975B1 (en) | Electronic device and method for outputting a speech signal | |
CN112259106B (en) | Voiceprint recognition method and device, storage medium and computer equipment | |
EP3410435B1 (en) | Electronic apparatus for recognizing keyword included in your utterance to change to operating state and controlling method thereof | |
CN108182937B (en) | Keyword recognition method, device, equipment and storage medium | |
WO2021135577A9 (en) | Audio signal processing method and apparatus, electronic device, and storage medium | |
CN110534099A (en) | Voice wake-up processing method, device, storage medium and electronic equipment |
US20220172737A1 (en) | Speech signal processing method and speech separation method | |
WO2016150001A1 (en) | Speech recognition method, device and computer storage medium | |
CN110853618A (en) | Language identification method, model training method, device and equipment | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN110570873B (en) | Voiceprint wake-up method and device, computer equipment and storage medium | |
WO2015171646A1 (en) | Method and system for speech input | |
CN110232933A (en) | Audio detection method, device, storage medium and electronic equipment |
CN113643693B (en) | Acoustic model conditioned on sound characteristics | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
CN109036395A (en) | Personalized speaker control method, system, intelligent sound box and storage medium | |
CN109872713A (en) | Voice wake-up method and device |
CN113393828A (en) | Training method of voice synthesis model, and voice synthesis method and device | |
CN111326152A (en) | Voice control method and device | |
CN109065026B (en) | Recording control method and device | |
CN114120979A (en) | Optimization method, training method, device and medium of voice recognition model | |
CN111816180B (en) | Method, device, equipment, system and medium for controlling elevator based on voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | | Address after: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong; Applicant after: OPPO Guangdong Mobile Communications Co., Ltd. Address before: 523860 No. 18, Wu Sha Beach Road, Changan Town, Dongguan, Guangdong; Applicant before: Guangdong OPPO Mobile Communications Co., Ltd. |
GR01 | Patent grant | ||