CN109903750A - Speech recognition method and device - Google Patents
Speech recognition method and device
- Publication number
- CN109903750A (application CN201910130555.0A)
- Authority
- CN
- China
- Prior art keywords
- result
- voice
- target
- memory body
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
This application discloses a speech recognition method and device. In the method, after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory body, where the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results; the target speech can then be recognized according to the representation information obtained from the memory body. Because the memory body stores many sample speaker representation results and/or sample speaking-environment representation results, representation information matching the speaker and/or speaking environment of the target speech can be retrieved from it, enriching the basis for recognizing the target speech and thereby improving both the effect and the efficiency of online personalized speech recognition of the target speech.
Description
Technical field
This application relates to the technical field of speech recognition, and in particular to a speech recognition method and device.
Background
With continuous breakthroughs in artificial-intelligence technology and the growing popularity of intelligent terminals, human-computer interaction occurs more and more frequently in people's daily work and life. Speech, as one of the most convenient and efficient interaction modes, has become an important link in human-computer interaction. As the number of speech users grows, differences in pronunciation habits between users become increasingly evident; in this situation, the traditional approach of performing speech recognition with a single, unified speech recognition model cannot achieve good recognition accuracy for every user.
How to build a personalized speech recognition model for each user according to that user's pronunciation habits has therefore become an important research direction in the field of speech recognition. Most existing personalized speech recognition methods build a user-specific speech recognition model from a large amount of the user's historical speech data; this approach is known as offline personalization. For a new user, offline personalization cannot be realized because historical data is lacking; for an existing user, differences between the user's current session and the user's historical data often mean that the personalized model's recognition performance fails to improve or even degrades. Another personalization approach performs personalized recognition in real time using the user's current-session data, and is known as online personalization; but because the only usable data is the user's current session, the available data is scarce and it is difficult to build a personalized recognition model for the user in real time. How to guarantee both the effect and the efficiency of online personalization is therefore a technical problem that urgently needs to be solved.
Summary of the invention
The main purpose of the embodiments of this application is to provide a speech recognition method and device that can improve the effect and efficiency of speech recognition when online personalized speech recognition is performed.
An embodiment of this application provides a speech recognition method, comprising:
obtaining a target speech to be recognized;
obtaining, from a pre-built memory body, representation information matching the target speech, wherein the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results;
recognizing the target speech according to the representation information.
Optionally, obtaining, from the pre-built memory body, the representation information matching the target speech comprises:
splitting the target speech to obtain unit speeches;
obtaining, from the memory body according to the acoustic features of a unit speech, representation information matching that unit speech.
Optionally, obtaining, from the memory body according to the acoustic features of the unit speech, the representation information matching the unit speech comprises:
taking the acoustic features of the unit speech as input to a speech recognition model, so that the network layers of the model's recognition network sequentially output initial representation results of the unit speech;
obtaining, from the memory body, representation information matching the initial representation results.
Optionally, causing the network layers of the recognition network of the speech recognition model to sequentially output the initial representation results of the unit speech comprises:
taking each network layer of the recognition network in turn as the current layer, and adjusting the current layer's initial representation result with a control parameter to obtain the current layer's target representation result of the unit speech, the control parameter serving to make the target representation result approach the actual representation result of the unit speech;
taking the target representation result as the input of the layer after the current layer, and obtaining the initial representation result output by that next layer.
Optionally, the control parameter is also used to suppress the ambient noise of the unit speech.
Optionally, the control parameter is generated according to representation information, obtained from the memory body, that matches the initial representation result output by the current layer.
Optionally, obtaining, from the memory body, the representation information matching the initial representation result comprises:
generating a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generating a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
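Generating a target representation from correlation degrees against the stored samples can be sketched as a similarity-weighted combination over the memory body. The choice of dot product as the correlation measure and softmax as the weighting are assumptions of this sketch; the application does not fix a particular measure.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_memory(query, memory_vectors):
    """Weight each stored sample representation by its correlation degree
    (here: softmax over dot-product similarity) with the query, then combine
    them into a single target representation result."""
    weights = softmax([dot(query, m) for m in memory_vectors])
    dim = len(memory_vectors[0])
    return [sum(w * m[i] for w, m in zip(weights, memory_vectors))
            for i in range(dim)]

# toy memory of three sample speaker representation vectors
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]          # an initial representation result of one unit speech
target_repr = attend_memory(query, memory)
```

The same routine, applied against the stored speaking-environment vectors, would yield the target speaking-environment representation result.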
Optionally, recognizing the target speech according to the representation information comprises:
obtaining the target representation result of each unit speech corresponding to the last layer of the recognition network;
recognizing the target speech according to the obtained target representation results of the unit speeches.
An embodiment of this application also provides a speech recognition device, comprising:
a target speech acquiring unit for obtaining a target speech to be recognized;
a representation information acquiring unit for obtaining, from a pre-built memory body, representation information matching the target speech, wherein the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results;
a target speech recognizing unit for recognizing the target speech according to the representation information.
Optionally, the representation information acquiring unit comprises:
a unit speech acquiring subunit for splitting the target speech to obtain unit speeches;
a representation information acquiring subunit for obtaining, from the memory body according to the acoustic features of a unit speech, representation information matching that unit speech.
Optionally, the representation information acquiring subunit comprises:
a first initial result acquiring subunit for taking the acoustic features of the unit speech as input to a speech recognition model, so that the network layers of the model's recognition network sequentially output initial representation results of the unit speech;
a first representation information acquiring subunit for obtaining, from the memory body, representation information matching the initial representation results.
Optionally, the first initial result acquiring subunit comprises:
a first target result acquiring subunit for taking each network layer of the recognition network of the speech recognition model in turn as the current layer and adjusting the current layer's initial representation result with a control parameter to obtain the current layer's target representation result of the unit speech, the control parameter serving to make the target representation result approach the actual representation result of the unit speech;
a second initial result acquiring subunit for taking the target representation result as the input of the layer after the current layer and obtaining the initial representation result output by that next layer.
Optionally, the control parameter is also used to suppress the ambient noise of the unit speech.
Optionally, the control parameter is generated according to representation information, obtained from the memory body, that matches the initial representation result output by the current layer.
Optionally, the first representation information acquiring subunit is specifically configured to:
generate a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generate a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
Optionally, the target speech recognizing unit comprises:
a second target result acquiring subunit for obtaining the target representation result of each unit speech corresponding to the last layer of the recognition network;
a target speech recognizing subunit for recognizing the target speech according to the obtained target representation results of the unit speeches.
An embodiment of this application also provides a speech recognition apparatus, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected by the system bus;
the memory is used to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to execute any implementation of the above speech recognition method.
An embodiment of this application also provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to execute any implementation of the above speech recognition method.
An embodiment of this application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above speech recognition method.
With the speech recognition method and device provided by the embodiments of this application, after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory body, where the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results; the target speech can then be recognized according to the representation information obtained from the memory body. Because the memory body stores many sample speaker representation results and/or sample speaking-environment representation results, representation information matching the speaker and/or speaking environment of the target speech can be retrieved from it, enriching the basis for recognizing the target speech and thereby improving the effect and efficiency of online personalized speech recognition of the target speech.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of a speech recognition method provided by an embodiment of this application;
Fig. 2 is a flow diagram, provided by an embodiment of this application, of obtaining from a pre-built memory body the representation information matching the target speech;
Fig. 3 is a flow diagram, provided by an embodiment of this application, of obtaining from the memory body, according to the acoustic features of a unit speech, the representation information matching the unit speech;
Fig. 4 is a structural schematic diagram of a speech recognition model provided by an embodiment of this application;
Fig. 5 is a flow diagram, provided by an embodiment of this application, of causing the network layers of the recognition network of the speech recognition model to sequentially output the initial representation results of a unit speech;
Fig. 6 is a flow diagram, provided by an embodiment of this application, of recognizing the target speech according to the representation information;
Fig. 7 is a composition schematic diagram of a speech recognition device provided by an embodiment of this application.
Specific embodiments
Existing personalized speech recognition methods can generally be divided into two kinds: offline personalized recognition methods and online personalized recognition methods. An offline personalized recognition method first builds a user-specific personalized speech recognition model from a large amount of the user's historical speech data, and then uses that model to perform personalized recognition of the user's speech. However, for a new user this offline method cannot build the personalized speech recognition model at all, because the new user's historical speech data is lacking, so speech recognition cannot be realized this way. Moreover, for an existing user there may be certain differences between the speech the user currently utters and the user's historical speech, so if a personalized speech recognition model built from the historical speech data is still used to recognize the current speech, recognition performance may degrade.
An online personalized recognition method performs personalized speech recognition in real time using the speech data in the user's current session. In the recognition process, the speech data in the user's current session is first received and its acoustic features are extracted; next, the speaker representation result corresponding to each frame of speech data is extracted; then the neural-network output corresponding to each frame is computed; finally, the recognition result is obtained and speech recognition is completed.
Specifically, when extracting the acoustic features of the speech data in the user's current session, the speech data must first be divided into frames to obtain the corresponding speech-frame sequence, and then the acoustic features of each speech frame are extracted. Here an acoustic feature is data characterizing the acoustic information of the corresponding speech frame; for example, it can be a Mel-scale Frequency Cepstral Coefficients (MFCC) feature or a Perceptual Linear Predictive (PLP) feature.
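The framing step that precedes MFCC or PLP extraction can be sketched as follows. The 25 ms frame length and 10 ms hop at a 16 kHz sampling rate are conventional values assumed for the illustration, not values specified in this application.

```python
def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames, as done before extracting
    per-frame acoustic features such as MFCC or PLP."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# 16 kHz audio: 25 ms frames (400 samples) with a 10 ms hop (160 samples)
signal = [0.0] * 16000          # one second of silence as a stand-in
frames = frame_signal(signal, frame_len=400, hop=160)
```

Each frame would then be passed through the chosen feature extractor to yield one acoustic-feature vector per speech frame.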
For each speech frame in the speech-frame sequence, the speaker representation result of that frame is extracted as follows: the acoustic features of all historical frames preceding the frame in the sequence are first spliced into a feature sequence; a pre-built speaker recognition model is then used to estimate the speaker representation vector of the speech frame by the maximum-likelihood criterion, and that vector is taken as the corresponding speaker representation result. The speaker recognition model is usually a total variability space model, and its construction process is as follows: first, a large amount of speech data from multiple different users is collected; then the acoustic features of these speech data are extracted; finally, the total variability space model is trained by the maximum a posteriori criterion, yielding the speaker recognition model.
Further, after the acoustic features of the speech data in the user's current session and the speaker representation result of each speech frame (i.e. the corresponding speaker representation vector) have been obtained by the above method, the two can be spliced, and the spliced vector is used as the input of the speech recognition neural network to obtain the network's output, namely the acoustic posterior probability values of each state of each phoneme in the speech data. The output values of the neural network can then be combined with a decoding algorithm (for example the Viterbi algorithm) to search the decoding network and obtain the final recognition result, completing speech recognition.
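The decoding-search step mentioned above can be sketched with a minimal Viterbi pass over log-probabilities. The two-state toy model and its transition and emission values are assumptions of this sketch; a real decoder searches a far larger decoding network built from phoneme states.

```python
import math

def viterbi(log_emits, log_trans, log_init):
    """Find the most likely state sequence given per-frame acoustic
    log-posteriors (log_emits), state-transition log-probabilities
    (log_trans), and initial-state log-probabilities (log_init)."""
    n = len(log_init)
    score = [log_init[s] + log_emits[0][s] for s in range(n)]
    backptrs = []
    for frame in log_emits[1:]:
        new_score, back = [], []
        for s in range(n):
            best = max(range(n), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best] + log_trans[best][s] + frame[s])
            back.append(best)
        score = new_score
        backptrs.append(back)
    state = max(range(n), key=lambda s: score[s])   # best final state
    path = [state]
    for back in reversed(backptrs):                 # backtrack
        state = back[state]
        path.append(state)
    path.reverse()
    return path

# two toy states; the first frame favours state 0, the second state 1
l = math.log
log_init = [l(0.5), l(0.5)]
log_trans = [[l(0.5), l(0.5)], [l(0.5), l(0.5)]]
log_emits = [[l(0.9), l(0.1)], [l(0.1), l(0.9)]]
best_path = viterbi(log_emits, log_trans, log_init)
```

Here `log_emits` plays the role of the neural network's per-frame acoustic posteriors described above.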
However, this online personalized recognition method, which uses the speech data in the user's current session to perform personalized speech recognition in real time, may yield poor personalized recognition performance. For example, in application scenarios such as speech input methods and human-machine voice interaction, each session the user enters is very short, usually only a few seconds, so the basis for recognizing the user's speech is limited; the accuracy of the speaker representation results generated online therefore declines, and the accuracy of the subsequent speech recognition results declines in turn.
To overcome these drawbacks, this application provides a speech recognition method: after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory body, where the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results; the target speech can then be recognized according to the representation information obtained from the memory body. Because the memory body stores many sample speaker representation results and/or sample speaking-environment representation results, representation information matching the target speech can be retrieved from it even when target speech data is scarce, enriching the basis for recognition. More accurate representation results of the target speech (for example its speaker representation result) can then be extracted on this enriched basis, and online personalized speech recognition of the target speech can be performed using the extracted representation results, improving the effect and efficiency of speech recognition.
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
First embodiment
Referring to Fig. 1, which is a flow diagram of the speech recognition method provided in this embodiment, the method includes the following steps:
S101: obtain a target speech to be recognized.
In this embodiment, any speech to be recognized using this embodiment is defined as the target speech. This embodiment does not limit the language of the target speech; for example, the target speech can be Chinese speech or English speech. Nor does this embodiment limit the length of the target speech; for example, the target speech can be one sentence or several sentences.
It is understood that the target speech can be obtained by recording or other means according to actual needs; for example, telephone-call speech or session recordings from daily life can serve as the target speech. After the target speech is obtained, this embodiment can be used to recognize it.
S102: obtain, from a pre-built memory body, representation information matching the target speech, wherein the memory body stores a large number of sample speaker representation results and/or sample speaking-environment representation results.
In this embodiment, after the target speech to be recognized has been obtained through step S101, and in order to prevent scarce target speech data from impairing the effect and efficiency of its recognition, representation information matching the target speech can first be obtained from the pre-built memory body; this representation information, together with the target speech data, then serves as the basis for recognition, so that effective recognition of the target speech is realized through the subsequent step S103. Here, "representation information matching the target speech" comprises all or part of the representation information of at least one representation result in the memory body: among these, all or part of the representation information of a sample speaker representation result can characterize the speaker characteristics of the speaker of the target speech, and all or part of the representation information of a sample speaking-environment representation result can characterize the environmental characteristics of the speaking environment of that speaker.
It should be noted that the memory body stores a large number of different sample speaker representation results and/or sample speaking-environment representation results. A sample speaker representation result is data characterizing personalized information of a sample speaker, such as timbre, gender, age, and region, and can be expressed as a vector or in other forms. A sample speaking-environment representation result is data characterizing personalized information of a sample speaking environment, likewise expressible as a vector or in other forms; for example, vector data characterizing noisy speaking environments such as a meeting room or a market, or vector data characterizing quiet speaking environments such as a valley or a library.
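As a concrete illustration, the memory body can be pictured as two capped stores of representation vectors. The dictionary layout, the function name, and the example vectors below are assumptions of this sketch; the application leaves the storage form open.

```python
memory_body = {
    "speaker_reprs": [],       # sample speaker representation vectors
    "environment_reprs": [],   # sample speaking-environment representation vectors
}

def store_representation(kind, vec, max_entries=1000):
    """Append a representation vector to the memory body, honouring a size
    cap such as the 1000-entry limit discussed in this description; once
    the cap is reached, the caller would need to cluster existing entries
    rather than append."""
    store = memory_body[kind]
    if len(store) >= max_entries:
        return False
    store.append(vec)
    return True

ok = store_representation("speaker_reprs", [0.12, -0.33, 0.58])
store_representation("environment_reprs", [0.91, 0.04])
```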
In practical applications, the different sample speaker representation results in the memory body can be obtained using one of the following two implementations:
In the first implementation, a pre-trained speaker recognition model can be used to generate different speaker representation vectors as the sample speaker representation results in the memory body. Specifically, first the speech data of multiple speakers is collected as training data and the speech features of these training data are extracted; then, using the training data and their speech features, the speaker recognition model is trained after parameter initialization, where the model can be a factor analysis model (such as a total variability space model) or a deep-neural-network-based model; finally, after the speaker recognition model is trained, it is used to re-recognize the speech data of each speaker in the training data, and the representation vector of each speaker is extracted and preserved as a different sample speaker representation result.
For example, if the trained model is a total variability space model, then after the model re-recognizes the speech data of each speaker in the training data, the representation vector extracted and preserved for each speaker is that speaker's acoustic-feature i-vector, which can serve as a sample speaker representation result in the memory body, with the corresponding speaker serving as a sample speaker in the memory body. If the trained model is a deep-neural-network-based model, then after the model re-recognizes the speech data of each speaker in the training data, the representation vector extracted and preserved for each speaker is the output vector of the last hidden layer of the model, which can likewise serve as a sample speaker representation result in the memory body, with the corresponding speaker serving as a sample speaker in the memory body.
In the second implementation, pre-trained speaker-adaptive speech recognition models can be used to generate different speaker representation vectors as the sample speaker representation results in the memory body. Specifically, first the speech data of multiple speakers is collected as training data, the speech features of these training data are extracted, and the speaker of each piece of speech data is labeled; then, using each speaker's training data and speech features, a pre-trained general neural-network speech recognition model is adaptively trained to obtain each speaker's corresponding adaptive speech recognition model, where the general neural-network speech recognition model is obtained by training on a large amount of speech data with an existing training method that is not described here; finally, after each speaker's adaptive speech recognition model is trained, it can be used to re-recognize the speech data of the corresponding speaker, yielding that speaker's representation vector, which serves as a sample speaker representation result in the memory body, with the corresponding speaker serving as a sample speaker in the memory body.
It should be noted that, in the second implementation above, each speaker's corresponding adaptive speech recognition model can be obtained by training. If the training data includes several pieces of speech data from the same speaker, that speaker's adaptive speech recognition model is used to recognize each of those pieces separately, and then the last-hidden-layer output vectors obtained from the individual recognitions are arithmetically averaged, that is, the elements at corresponding positions in all the output vectors are arithmetically averaged; the resulting vector then serves as that speaker's representation vector, and thus as that speaker's representation result.
For example, suppose the training data includes 5 pieces of speech data from speaker A. First, speaker A's adaptive speech recognition model is used to recognize each of the 5 pieces, and the last-hidden-layer output vectors obtained after recognition are denoted [a1, a2, ..., an], [b1, b2, ..., bn], [c1, c2, ..., cn], [d1, d2, ..., dn], and [e1, e2, ..., en]. The arithmetic average of these 5 output vectors is then [(a1+b1+c1+d1+e1)/5, (a2+b2+c2+d2+e2)/5, ..., (an+bn+cn+dn+en)/5]. This vector can serve as speaker A's representation vector, and thus as speaker A's representation result.
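The element-wise arithmetic averaging just described is straightforward to implement; the five 3-dimensional toy vectors below are invented for the illustration.

```python
def average_vectors(vectors):
    """Element-wise arithmetic mean of equally sized vectors, used to merge
    the last-hidden-layer outputs from several utterances of one speaker
    into a single speaker representation vector."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

# five toy last-hidden-layer output vectors for one speaker
outputs = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 2.0],
]
speaker_repr = average_vectors(outputs)
```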
It should also be noted that, in the second implementation above, the adaptive speech recognition models obtained by training for the different speakers can be completely independent, that is, each speaker corresponds to one individual, personalized adaptive speech recognition model. Alternatively, some model parameters of these speakers' adaptive speech recognition models can be shared (for example the input-layer and output-layer parameters), while the middle layers differ; that is, each speaker corresponds to its own personalized middle layers.
In addition, from the description of the two implementations above, the speaker representation results obtained by the second implementation are more precise than those of the first, but the time spent is also longer, possibly several times longer or more. In practical applications, therefore, the more suitable of the two implementations can be chosen to obtain the sample speaker representation results according to the precision requirement on the speaker representation results and the acquisition-time requirement. Furthermore, if higher-precision speaker representations are needed, the two implementations can also be combined: the sample speaker representation vectors obtained for the same sample speaker by the two implementations are spliced, and the spliced vector serves as that sample speaker's final sample speaker representation vector, i.e. as that sample speaker's representation result. For example, suppose the sample speaker representation vectors of the same sample speaker obtained by the two implementations have dimensions f1 and f2; the two can then be spliced into a vector of dimension f1 + f2, which serves as the final sample speaker representation vector, i.e. as that sample speaker's representation result.
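The splicing of the two vectors into one of dimension f1 + f2 amounts to simple concatenation; the vector values and names below are hypothetical.

```python
# hypothetical representation vectors for one sample speaker
ivector_repr = [0.1, 0.2, 0.3]        # dimension f1 = 3, e.g. from the first implementation
dnn_repr = [0.9, 0.8]                 # dimension f2 = 2, e.g. from the second implementation
final_repr = ivector_repr + dnn_repr  # spliced vector of dimension f1 + f2 = 5
```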
Moreover, the different sample speaking-environment representation results in the memory body can be obtained using one of the two implementations above, replacing every occurrence of "speaker" in each implementation with "speaking environment"; for details, see the related introduction of the two implementations above, which is not repeated here.
It is understood that, in order to reduce the subsequent amount of calculation over the entire memory body, the storage quantity of sample speaker representation results in the memory body can be limited within a preset range; for example, the storage quantity of sample speaker representation results can be limited to within 1000. Moreover, if the number of sample speaker representation results obtained is too large, they can be clustered by an existing or future clustering algorithm (such as the K-means algorithm), and the class-center vector of each cluster after clustering is stored into the memory body as a representative, in place of the representation results of all the speakers in that cluster, so as to meet the memory body's limit on the number of sample speaker representation results.
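As a rough sketch of this compression step (not the patent's implementation; the vectors and the deterministic initialization are assumptions for illustration), a minimal K-means pass can replace many speaker representation vectors with a few class-center vectors:

```python
import numpy as np

def kmeans_centers(vectors, k, iters=10):
    """Minimal K-means: return k class-center vectors that replace
    the individual representation results within each cluster."""
    x = np.asarray(vectors, dtype=float)
    centers = x[:k].copy()                  # deterministic init: first k vectors
    for _ in range(iters):
        # assign each vector to its nearest center
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers

# e.g. compress 6 speaker vectors down to 2 memory-body entries
vecs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
memory_entries = kmeans_centers(vecs, k=2)
```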
Similarly, in order to reduce the subsequent amount of calculation over the entire memory body, the storage quantity of sample speaking-environment representation results in the memory body can also be limited within a preset range; for example, it can likewise be limited to within 1000. Moreover, if the number of sample speaking-environment representation results obtained is too large, they can also be clustered by an existing or future clustering algorithm (such as the K-means algorithm), and the class-center vector of each cluster after clustering is stored into the memory body as a representative, in place of the representation results of all the speaking environments in that cluster, so as to meet the memory body's limit on the number of sample speaking-environment representation results.
In the present embodiment, in one optional implementation, as shown in Fig. 2, the realization process of step S102, "obtaining, from the memory body constructed in advance, the representation information matched with the target voice", can specifically include steps S201-S202:
S201: Split the target voice to obtain each unit voice.
In this implementation, in order to obtain from the memory body the representation information matched with the target voice, and thereby enrich the basis for recognition, the target voice first needs to be split to obtain each unit voice that the target voice comprises. For example, each unit voice can be a speech frame of the target voice, and each speech frame can correspond to a phoneme, or to a state within a phoneme.
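As an illustrative sketch (not from the patent), splitting the target voice into speech frames commonly uses a fixed frame length and frame shift; the 25 ms frame / 10 ms shift at a 16 kHz sampling rate assumed below is a conventional choice, not one stated by the patent:

```python
def split_into_frames(samples, frame_len=400, frame_shift=160):
    """Split a sequence of waveform samples into overlapping speech frames;
    each frame would then be one 'unit voice' of the target voice."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frames.append(samples[start:start + frame_len])
    return frames

# 1000 samples at 16 kHz -> 4 overlapping 400-sample frames
frames = split_into_frames(list(range(1000)))
```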
S202: For each unit voice, according to the acoustic feature of the unit voice, obtain from the memory body the representation information matched with the unit voice.
In this implementation, after each unit voice comprised in the target voice is obtained through step S201, feature extraction can be carried out on each unit voice separately, so as to extract the acoustic feature of each unit voice; the acoustic feature can be the MFCC feature or the PLP feature, etc., of the corresponding unit voice.
Afterwards, data processing can be carried out on the extracted acoustic feature of each unit voice together with the sample speaker representation results and/or sample speaking-environment representation results stored in the memory body, and, according to the processing results, the representation information matched with each unit voice is obtained from the memory body. In turn, the representation information matched with each unit voice can be integrated, and the integrated representation information taken as the representation information matched with the target voice.
Here, regarding "the representation information matched with the unit voice": this information includes all or part of each representation result among at least one representation result in the memory body. Among these representation results, all or part of a sample speaker representation result can characterize the speaker characteristics of the speaker to whom the unit voice belongs, and all or part of a sample speaking-environment representation result can characterize the environmental characteristics of the speaking environment in which that speaker is located.
It should be noted that the specific implementation of this step S202 will be introduced in the second embodiment.
S103: Identify the target voice according to the representation information.
In the present embodiment, after the representation information matched with the target voice is obtained from the memory body constructed in advance through step S102, the target voice can further be identified according to that representation information. Specifically, using the representation information and the acoustic feature of the target voice, the acoustic posterior probability values corresponding to each unit voice in the target voice can be obtained by prediction. For example, when a unit voice is a phoneme, the acoustic posterior probability values corresponding to the unit voice refer to the posterior probability values of the unit voice belonging to each phoneme type (each phoneme type of the language to which the unit voice belongs). Then, using these posterior probability values, the search of a decoding network is carried out by a decoding algorithm (such as the Viterbi algorithm) to obtain the recognition result of the target voice.
It should be noted that the specific implementation of this step S103 will be introduced in the second embodiment.
To sum up, in the audio recognition method provided in this embodiment, after the target voice to be identified is obtained, the representation information matched with the target voice is obtained from the memory body constructed in advance, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body; in turn, the target voice can be identified according to the representation information obtained from the memory body. As can be seen, since a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body, representation information matching the speaker and/or the speaking environment of the target voice can be obtained from the memory body, thereby enriching the basis for characterizing the target voice, so that the speech recognition effect and efficiency are improved when online personalized speech recognition is carried out on the target voice.
Second embodiment
Next, the present embodiment will introduce the specific implementation process of step S202 in the first embodiment, "according to the acoustic feature of the unit voice, obtaining from the memory body the representation information matched with the unit voice".

Referring to Fig. 3, it illustrates the flow diagram, provided in this embodiment, of obtaining from the memory body the representation information matched with the unit voice according to the acoustic feature of the unit voice; the process comprises the following steps:
S301: Take the acoustic feature of the unit voice as the input of the speech recognition model, so that each network layer of the identification network of the speech recognition model sequentially outputs the initial representation result of the unit voice.
In the present embodiment, after each unit voice comprised in the target voice is obtained through step S201, acoustic feature extraction can be carried out on each unit voice to obtain the acoustic feature corresponding to each unit voice. Then, according to these acoustic features, an initial feature vector corresponding to each unit voice can be generated by a vector generation method, to serve as the initial representation result corresponding to each unit voice. In specific implementation, the acoustic feature of each unit voice can be input as input data into the speech recognition model constructed in advance, so that each network layer of the identification network of the model sequentially outputs the initial feature vector of the unit voice as the initial representation result corresponding to that unit voice. It should be noted that, in the subsequent content, the present embodiment takes a certain unit voice of the target voice as the example to introduce how data processing is carried out on a unit voice to obtain its corresponding initial representation result; the processing of the other unit voices is similar and will not be repeated one by one.
Specifically, the speech recognition model constructed in advance in the present embodiment can be composed of a multi-layer network. As shown in Fig. 4, the model structure includes an input layer, an identification network, a memory body, a memory body coding module, a control module and an output layer.
Here, the input layer is used to input the acoustic feature of the unit voice; taking the unit voice being a speech frame as an example, the data input at the input layer is an acoustic feature of the speech frame such as its MFCC feature or PLP feature.
The identification network is used to transform the acoustic feature of the unit voice input at the input layer, and to output the feature vector obtained after transformation to the output layer. As shown in Fig. 4, the identification network can be composed of a deep neural network comprising multiple network layers, wherein each network layer successively adjusts the feature vector of the unit voice output by the network layer above it, so that the unit voice has a feature vector output at each network layer. Here, the feature vector of the unit voice output by each network layer is defined as the initial feature vector output by the corresponding network layer, serving as the initial representation result corresponding to the unit voice, which can be denoted by h. That is to say, in the present embodiment, the initial representation result h of the unit voice can be successively updated by each network layer in the identification network.
Taking the unit voice as the t-th speech frame in the target voice as an example, the initial representation result output by the l-th layer of the identification network for this speech frame can be expressed as h_t^l, with h_t^l ∈ R^{D_l}, where R denotes the real numbers and D_l denotes the dimension of the initial representation result output by the l-th layer, with l = 1, 2, ..., N and N being the total number of network layers. Meanwhile, based on this initial representation result h_t^l, a control parameter can be generated by the control module for adjusting the initial representation result h_t^l; that is, the initial representation result h_t^l output by each network layer has a corresponding adjusted target representation result ĥ_t^l, whose generation will be introduced in the subsequent step S3011. On this basis, the initial representation result output by each network layer can be generated through the network parameters of the corresponding layer, i.e., h_t^l = f(ĥ_t^{l-1}, ĥ_{t-1}^l), where f is the transforming function, ĥ_t^{l-1} denotes the target representation result output at layer l-1 for the t-th speech frame, and ĥ_{t-1}^l denotes the target representation result output at layer l for the (t-1)-th speech frame.
It should be noted that the present embodiment does not limit the structure of the deep neural network in the identification network; for example, the deep neural network can be a unidirectional or bidirectional long short-term memory model structure, or a convolutional neural network (Convolutional Neural Networks, abbreviated CNN) structure. Which network structure to adopt can be selected according to the actual situation, and the embodiments of this application do not limit this. For example, in practical applications, for a large-vocabulary speech recognition task with abundant model training data, the deep neural network in the identification network can usually adopt 5 to 10 layers of bidirectional long short-term memory neural networks, while for a restricted-domain speech recognition task with less model training data, it can usually adopt 1 to 3 layers of unidirectional long short-term memory neural networks.
Further, in order to improve the computational efficiency of the model, down-sampled layers can be selectively inserted between the multiple network layers comprised in the identification network. For example, one down-sampled layer can be inserted between every two adjacent network layers, that is, multiple down-sampled layers are inserted in total; alternatively, one down-sampled layer can be inserted between only one pair of adjacent network layers, that is, a single down-sampled layer is inserted in total.
Next, how each network layer of the identification network "sequentially outputs the initial representation result of the unit voice" is introduced.

In one optional implementation, as shown in Fig. 5, the realization process in step S301 of "making each network layer of the identification network of the speech recognition model sequentially output the initial representation result of the unit voice" can specifically include steps S3011-S3012:
S3011: Make each network layer of the identification network of the speech recognition model successively serve as the current layer, and adjust the initial representation result of the current layer using a control parameter, to obtain the target representation result of the unit voice corresponding to the current layer, wherein the control parameter is used to make the target representation result approach the actual representation result of the unit voice.
In this implementation, in order to enable each network layer of the identification network of the speech recognition model to sequentially output the initial representation result of the unit voice, that is, to realize the layer-by-layer update of the initial representation result of the unit voice, each network layer of the identification network of the speech recognition model can successively serve as the current layer, from the input layer toward the output layer. Then, the initial representation result h output by the current layer is adjusted using the control parameter (which can be denoted by g) output by the control module in the speech recognition model, and the adjusted representation result is defined as the target representation result of the unit voice corresponding to the current layer (which can be denoted by ĥ).

Here, regarding the control parameter of the current layer, its function is to adjust the initial representation result h of the unit voice output by the current layer, so that the target representation result ĥ obtained after adjustment can better approach the actual representation result of the unit voice. It should be noted that the control parameter of each network layer of the identification network is generated based on the initial representation result output by that network layer, which means the control parameters of the network layers may be identical or different.
In one possible implementation of the present embodiment, the control parameter of the current layer is generated according to the representation information, obtained from the memory body, matched with the initial representation result output by the current layer.
Specifically, as shown in Fig. 4, in the speech recognition model constructed in the present embodiment, the memory body coding module is respectively connected with each network layer of the identification network, the memory body and the control module. Thereby, through the memory body coding module, the representation information relevant to the initial representation result output by the current layer can be obtained from the memory body, and then the control parameter of the current layer is generated according to the representation information output by the memory body coding module.
Since a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body, in one optional implementation, the representation information relevant to the initial representation result output by the current layer can first be obtained from the memory body through the memory body coding module. In specific implementation, a target speaker representation result can be generated according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body, and/or a target speaking-environment representation result can be generated according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body. In this way, the generated target speaker representation result and/or target speaking-environment representation result can serve as the representation information, obtained from the memory body, relevant to the initial representation result output by the current layer.
Next, how to generate the "target speaker representation result" and the "target speaking-environment representation result" is introduced.
The memory body coding module can be used to determine the magnitude of the degree of correlation between each sample speaker representation result in the memory body and the initial representation result of the unit voice; then, according to these degrees of correlation, the sample speaker representation results in the memory body are linearly combined to generate a representation result that can characterize the voice characteristics of the speaker to whom the unit voice belongs, which is defined as the target speaker representation result. For example, taking the unit voice as the t-th speech frame in the target voice and the current layer as the l-th layer, the target speaker representation result of this speech frame generated by the memory body coding module can be expressed as s_t^l.

Here, when determining the magnitude of the degree of correlation between each sample speaker representation result in the memory body and the initial representation result of the unit voice, combination coefficients characterizing these degrees of correlation can be generated. The combination coefficients include a coefficient corresponding to each sample speaker representation result; the larger a coefficient, the higher the degree of correlation between its corresponding sample speaker representation result and the initial representation result, and conversely, the smaller the coefficient, the lower that degree of correlation.
In the present embodiment, the memory body coding module can be used to generate the above combination coefficients. Specifically, the memory body coding module can be composed of a neural network of three or more layers, which can specifically include an input layer, fully connected layers and an output layer. As shown in Fig. 4, the input layer of the memory body coding module is used to input each sample speaker representation result in the memory body together with the initial representation result output by the current layer for the unit voice; alternatively, in order to improve encoding efficiency, the arithmetic average of the initial representation results output by the current layer for the unit voice and for all the history unit voices before it can serve as input data and be input to the input layer of the memory body coding module. For example, taking the unit voice as the t-th speech frame in the target voice, and assuming that the initial representation result output by the current layer for this speech frame is its initial feature vector output by the current layer, the arithmetic average of the initial feature vectors output by the current layer for the t-th speech frame and all the history speech frames before it (the (t-1)-th frame, the (t-2)-th frame, ...) can serve as input data and be input to the input layer of the memory body coding module. One or more fully connected layers are provided after the input layer; the number of fully connected layers can be less than 3, with each layer comprising fewer than 512 nodes. After the data input at the input layer is encoded by the fully connected layers, the output layer of the memory body coding module can, based on the output result of the fully connected layers, generate the combination coefficient characterizing the magnitude of the degree of correlation between each sample speaker representation result in the memory body and the initial representation result of the unit voice, defined as α. Taking the unit voice as the t-th speech frame in the target voice and the i-th sample speaker representation result in the memory body as examples, if the current layer of the identification network is the l-th layer, the output layer of the memory body coding module outputs the coefficient α_{t,l,i} corresponding to the current layer.
The coefficient corresponding to each sample speaker representation result can be obtained in the above way; these coefficients α_{t,l,i} form the combination coefficients. Using the combination coefficients, the target speaker representation result s_t^l of the t-th speech frame can be calculated according to the following formula:

s_t^l = Σ_{i=1}^{M} α_{t,l,i} · m_i

where α_{t,l,i} denotes the degree of correlation between the initial representation result output by the l-th layer of the identification network for the t-th speech frame in the target voice and the i-th sample speaker representation result in the memory body; M denotes the total number of sample speaker representation results in the memory body; m_i denotes the i-th sample speaker representation result in the memory body; and s_t^l denotes the target speaker representation result of the t-th speech frame.
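The linear combination above behaves like an attention mechanism over the memory body. The sketch below uses a plain dot product followed by a softmax as a stand-in for the coefficients α_{t,l,i} (the patent computes them with a small fully connected network, so the scoring function here is an assumption for illustration only):

```python
import math

def target_speaker_representation(h, memory):
    """Combine the sample speaker representation results m_i in `memory`
    into s = sum_i alpha_i * m_i, where the alpha_i are softmax-normalized
    scores of each m_i against the initial representation result h."""
    scores = [sum(hj * mj for hj, mj in zip(h, m)) for m in memory]  # dot-product stand-in
    mx = max(scores)
    exps = [math.exp(sc - mx) for sc in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]       # combination coefficients, sum to 1
    dim = len(memory[0])
    return [sum(a * m[j] for a, m in zip(alphas, memory)) for j in range(dim)]

# h correlates most with the first memory entry, so the result leans toward it
s = target_speaker_representation([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```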
Similarly, through the above implementation, the target speaking-environment representation result of the unit voice can also be calculated. In the specific calculation process, it is only necessary to replace the "sample speaker representation results" in the memory body with the "sample speaking-environment representation results"; the specific calculation process can be found in the related introduction of the above implementation and is not repeated here.
As can be seen, the representation information relevant to the initial representation result output by the current layer, obtained from the memory body through the memory body coding module, can take three forms: the first is the generated target speaker representation result; the second is the generated target speaking-environment representation result; the third is the generated target speaker representation result together with the target speaking-environment representation result.
Further, after the representation information relevant to the initial representation result output by the current layer is obtained from the memory body through the memory body coding module, the control parameter can be generated from this representation information by the control module.

Specifically, after the memory body coding module obtains from the memory body the representation information relevant to the initial representation result output by the current layer, this representation information can be sent to the control module in the speech recognition model. As shown in Fig. 4, the control module in the speech recognition model is connected to the memory body coding module and the identification network; more specifically, it is connected to the memory body coding module and each network layer of the identification network.
In practical applications, the control module can be composed of a neural network of three or more layers (the neural network structure is typically a multi-layer feedforward neural network), including an input layer, intermediate layers and an output layer. Here, the input layer is used to input the representation information output by the memory body coding module, i.e., the above target speaker representation result and/or target speaking-environment representation result; the intermediate layers are multiple fully connected layers, the number of which is the same as the number of network layers of the identification network; and the output layer is composed of N parts, N being the total number of network layers comprised in the identification network, with each of the N parts corresponding to one network layer of the identification network. Through these N parts, the output layer can output the control parameter corresponding to each network layer of the identification network. Therefore, when the initial feature vector output by a network layer serves as the initial representation result, for each of the N parts of the output layer, the number of nodes comprised in that part is the same as the dimension of the initial feature vector output by its corresponding network layer of the identification network, so as to guarantee that the dimension of the control parameter vector output by each part of the output layer is the same as the dimension of the initial feature vector output by the corresponding network layer of the identification network.
It should be noted that, for the control parameter corresponding to the current layer of the identification network generated by the control module, when it is used to adjust the initial representation result of the unit voice output by the current layer to obtain the target representation result of the unit voice corresponding to the current layer, the control parameter can not only make the target representation result approach the actual representation result of the unit voice, but can also suppress the ambient noise of the unit voice, for example, suppressing the voices of surrounding speakers other than the speaker to whom the unit voice belongs, suppressing background noise, etc.
Further, for ease of calculation, a normalization operation can also be carried out on the control parameter vector corresponding to the current layer, so that its value range is controlled between zero and one. Specifically, the sigmoid function can be used to normalize the control parameter vector, with the specific calculation formula as follows:

g = sigmoid(x) = 1 / (1 + e^{-x})

where g denotes the control parameter vector after normalization, x denotes the control parameter vector before normalization, and the sigmoid function is applied element-wise.
In turn, the normalized control parameter vector g can be used to adjust the initial representation result h of the unit voice output by the current layer, to obtain the target representation result ĥ of the unit voice corresponding to the current layer. In the specific adjustment process, when the initial feature vector output by the current layer serves as the initial representation result h of the unit voice output by the current layer, the elements at corresponding positions of the control parameter vector g and the initial feature vector h can be multiplied, with the specific adjustment formula as follows:

ĥ_j = g_j · h_j

where ĥ_j denotes the j-th dimension element of the target representation result ĥ of the unit voice corresponding to the current layer; g_j denotes the j-th dimension element of the normalized control parameter vector g; and h_j denotes the j-th dimension element of the initial representation result h of the unit voice output by the current layer.
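Putting the two formulas together, the adjustment is an element-wise gate: normalize the control parameter vector with the sigmoid function, then multiply it position-by-position with the initial representation. A minimal sketch (the vector values are made up):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adjust_representation(h, x):
    """Compute the target representation: target_j = sigmoid(x_j) * h_j."""
    g = [sigmoid(v) for v in x]          # normalized control parameters in (0, 1)
    return [gj * hj for gj, hj in zip(g, h)]

# With raw control parameters of 0, each gate equals 0.5, halving h
target = adjust_representation([2.0, -4.0], [0.0, 0.0])   # -> [1.0, -2.0]
```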
S3012: Take the target representation result of the unit voice as the input of the layer next to the current layer, to obtain the initial representation result output by the next layer.
In this implementation, after the target representation result ĥ of the unit voice corresponding to the current layer is obtained through step S3011, this target representation result ĥ can serve as the input of the layer next to the current layer, and the network parameters of the next layer (for example the transforming function f introduced above) are used to transform the target representation result ĥ, so as to obtain the initial representation result h output by the next layer.
S302: For the initial representation result output by each network layer, obtain from the memory body the representation information matched with that initial representation result.

As already introduced in the above step S301, in order to obtain from the memory body the representation information matched with an initial representation result, specifically, a target speaker representation result can be generated according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body, and/or a target speaking-environment representation result can be generated according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body. That is, the generated target speaker representation result and/or target speaking-environment representation result serve as the representation information, obtained from the memory body, matched with the initial representation result.
As it can be seen that for each network layer output unit voice initial representation as a result, can be obtained from memory body
With the matched expression information of the initial representation result, in this way, corresponding one group of the per unit voice of target voice matches
Indicate information, the present embodiment can indicate that information carries out speech recognition to target voice based on these.
Specifically, after the target representation results of the unit voices corresponding to each network layer of the identification network are obtained through the above step S3011, step S103, "identifying the target voice according to the representation information", can further be realized. Referring to Fig. 6, the detailed process comprises the following steps:
S601: Obtain the target representation result of each unit voice corresponding to the last layer of the identification network.
In the present embodiment, after the target voice is split through step S201 and each unit voice is obtained, the acoustic feature of each unit voice can be sequentially input into the speech recognition model shown in Fig. 4, so that the target representation result of each unit voice corresponding to the last network layer of the identification network can be obtained through the model.
S602: Identify the target voice according to the obtained target representation result of each unit voice.
In the present embodiment, after the target representation result of each unit voice corresponding to the last layer of the identification network is obtained through step S601, it can be input into the output layer of the speech recognition model, and the output layer can carry out warping by a warping method (such as the softmax warping function) to obtain the acoustic posterior probability values corresponding to each unit voice. In turn, these posterior probability values can be used to carry out the search of a decoding network by a decoding algorithm (such as the Viterbi algorithm), so as to obtain the recognition result of the target voice.
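For illustration only, the softmax warping step can be sketched as below; the per-frame scores are made up, and the greedy best-class pick at the end is only a stand-in for the Viterbi search over a real decoding network:

```python
import math

def softmax(logits):
    """Warp raw scores into acoustic posterior probability values."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up last-layer scores for 2 speech frames over 3 phoneme classes
frame_logits = [[2.0, 0.5, 0.1], [0.2, 1.8, 0.3]]
posteriors = [softmax(row) for row in frame_logits]

# Stand-in for decoding: pick the most probable class per frame
best_path = [max(range(len(p)), key=lambda k: p[k]) for p in posteriors]   # -> [0, 1]
```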
Next, the training process of the speech recognition model in the present embodiment will be specifically introduced:

In order to train the speech recognition model, it is first necessary to collect a large quantity of voice data of multiple different users as training data; then, the acoustic features of these voice data are extracted; then, using the training data and their acoustic features, the cross-entropy function can serve as the optimization target of the model, and the model parameters are constantly updated by the error back-propagation algorithm, wherein the model parameters refer to the weight transformation matrices and corresponding biases of the connections between the layers of the identification network, the control module and the memory body coding module of the model. During updating, the model parameters can be updated by way of successive iterations; when the preset convergence target is reached (i.e., the cross-entropy function reaches a preset value), the iteration is stopped, the update of the model parameters is completed, and the trained speech recognition model is obtained.
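As a toy illustration of the cross-entropy optimization target (not the patent's actual training procedure, which back-propagates through all the weight matrices and biases above), here is gradient descent on a single 3-class softmax output whose true class is assumed to be class 0:

```python
import math

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    total = sum(e)
    return [x / total for x in e]

label, lr = 0, 1.0
logits = [0.0, 0.0, 0.0]
initial_loss = -math.log(softmax(logits)[label])     # cross-entropy = ln 3

for _ in range(50):                                  # successive iterations
    p = softmax(logits)
    # gradient of cross-entropy w.r.t. logits: p - one_hot(label)
    grad = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    logits = [z - lr * g for z, g in zip(logits, grad)]

final_loss = -math.log(softmax(logits)[label])       # approaches 0
```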
To sum up, using the speech recognition model constructed in advance, the present embodiment can obtain from the memory body the representation information matched with each unit voice according to the acoustic feature of each unit voice in the target voice, and thereby obtain the representation information matching the speaker and/or the speaking environment of the target voice, so that the obtained representation information can be used to enrich the basis for characterizing the target voice, which in turn can improve the speech recognition effect and efficiency when online personalized speech recognition is carried out on the target voice.
Third Embodiment
This embodiment introduces a speech recognition apparatus; for related content, refer to the method embodiments above.
Referring to Fig. 7, which is a schematic diagram of the composition of the speech recognition apparatus provided in this embodiment, the apparatus includes:
a target speech acquiring unit 701, configured to acquire target speech to be recognized;
a representation information acquiring unit 702, configured to obtain, from a pre-constructed memory body, representation information matching the target speech, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
a target speech recognition unit 703, configured to recognize the target speech according to the representation information.
In one implementation of this embodiment, the representation information acquiring unit 702 includes:
a unit speech acquiring subunit, configured to split the target speech to obtain each unit of speech;
a representation information acquiring subunit, configured to obtain, from the memory body, representation information matching the unit of speech according to the acoustic features of the unit of speech.
In one implementation of this embodiment, the representation information acquiring subunit includes:
a first initial result acquiring subunit, configured to take the acoustic features of the unit of speech as the input of a speech recognition model, so that each network layer of the recognition network of the speech recognition model sequentially outputs an initial representation result of the unit of speech;
a first representation information acquiring subunit, configured to obtain, from the memory body, representation information matching the initial representation result.
In one implementation of this embodiment, the first initial result acquiring subunit includes:
a first target result acquiring subunit, configured to take each network layer of the recognition network of the speech recognition model in turn as the current layer, and to adjust the initial representation result of the current layer using a control parameter so as to obtain a target representation result of the unit of speech corresponding to the current layer, the control parameter being used to make the target representation result approach the actual representation result of the unit of speech;
a second initial result acquiring subunit, configured to take the target representation result as the input of the layer following the current layer, and to obtain the initial representation result output by that next layer.
In one implementation of this embodiment, the control parameter is further used to suppress ambient noise around the unit of speech.
In one implementation of this embodiment, the control parameter is generated according to the representation information obtained from the memory body that matches the initial representation result output by the current layer.
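One way such a control parameter could act — derived from the memory-matched representation and used to pull the layer output toward the desired representation while damping mismatched (e.g. noisy) components — is an element-wise gate. The sigmoid gating, the interpolation form, and the dimensions below are assumptions for illustration, not the patent's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_adjust(initial, memory_repr, W_g, b_g):
    """Adjust a layer's initial representation using a control gate.

    The gate is computed from the representation fetched from the memory
    body and interpolates element-wise between the raw layer output and
    that representation, nudging the result toward the target.
    """
    gate = sigmoid(W_g @ memory_repr + b_g)        # control parameter in (0, 1)
    return gate * initial + (1.0 - gate) * memory_repr

rng = np.random.default_rng(1)
d = 4
initial = rng.normal(size=d)        # current layer's initial representation
mem = rng.normal(size=d)            # memory-matched representation
adjusted = gated_adjust(initial, mem, rng.normal(size=(d, d)), np.zeros(d))
```

Because the gate lies in (0, 1), each component of the adjusted result stays between the corresponding components of the two inputs.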
In one implementation of this embodiment, the first representation information acquiring subunit is specifically configured to:
generate a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generate a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
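The generation of a target representation from per-sample degrees of correlation can be sketched as attention over the memory body: score each stored vector against the initial representation, normalize the scores, and form the weighted combination. The dot-product scoring, softmax normalization, and dimensions are assumptions for illustration.

```python
import numpy as np

def attend_memory(initial, memory):
    """Generate a target representation from a memory of sample vectors.

    initial: (d,) initial representation result of a unit of speech.
    memory:  (n, d) stored sample speaker (or speaking-environment)
             representation results.
    Returns the correlation-weighted combination and the weights.
    """
    scores = memory @ initial              # correlation with each stored sample
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # normalized degrees of correlation
    return w @ memory, w                   # target representation result, weights

rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 8))                   # 5 stored speaker vectors
initial = memory[2] + 0.01 * rng.normal(size=8)    # near the third sample
target, weights = attend_memory(initial, memory)
```

The weights form a probability distribution over the stored samples, so the target representation is a convex combination of memory entries dominated by the best-matching ones.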
In one implementation of this embodiment, the target speech recognition unit 703 includes:
a second target result acquiring subunit, configured to obtain the target representation result of each unit of speech corresponding to the last layer of the recognition network;
a target speech recognition subunit, configured to recognize the target speech according to the obtained target representation result of each unit of speech.
Further, an embodiment of the present application also provides a speech recognition device, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any implementation of the above speech recognition method.
Further, an embodiment of the present application also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a terminal device, they cause the terminal device to execute any implementation of the above speech recognition method.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above speech recognition method.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps of the above embodiment methods can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present application, or the part thereof that contributes to the prior art, can essentially be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network communication device such as a media gateway, etc.) to execute the methods described in the embodiments of the present application, or in certain parts of the embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (17)
1. A speech recognition method, characterized by comprising:
acquiring target speech to be recognized;
obtaining, from a pre-constructed memory body, representation information matching the target speech, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
recognizing the target speech according to the representation information.
2. The method according to claim 1, characterized in that the obtaining, from a pre-constructed memory body, of representation information matching the target speech comprises:
splitting the target speech to obtain each unit of speech;
obtaining, from the memory body, representation information matching the unit of speech according to the acoustic features of the unit of speech.
3. The method according to claim 2, characterized in that the obtaining, from the memory body, of representation information matching the unit of speech according to the acoustic features of the unit of speech comprises:
taking the acoustic features of the unit of speech as the input of a speech recognition model, so that each network layer of the recognition network of the speech recognition model sequentially outputs an initial representation result of the unit of speech;
obtaining, from the memory body, representation information matching the initial representation result.
4. The method according to claim 3, characterized in that causing each network layer of the recognition network of the speech recognition model to sequentially output an initial representation result of the unit of speech comprises:
taking each network layer of the recognition network of the speech recognition model in turn as the current layer, and adjusting the initial representation result of the current layer using a control parameter to obtain a target representation result of the unit of speech corresponding to the current layer, wherein the control parameter is used to make the target representation result approach the actual representation result of the unit of speech;
taking the target representation result as the input of the layer following the current layer, and obtaining the initial representation result output by that next layer.
5. The method according to claim 4, characterized in that the control parameter is further used to suppress ambient noise around the unit of speech.
6. The method according to claim 4, characterized in that the control parameter is generated according to the representation information obtained from the memory body that matches the initial representation result output by the current layer.
7. The method according to any one of claims 3 to 6, characterized in that the obtaining, from the memory body, of representation information matching the initial representation result comprises:
generating a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generating a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
8. The method according to any one of claims 4 to 6, characterized in that the recognizing of the target speech according to the representation information comprises:
obtaining the target representation result of each unit of speech corresponding to the last layer of the recognition network;
recognizing the target speech according to the obtained target representation result of each unit of speech.
9. A speech recognition apparatus, characterized by comprising:
a target speech acquiring unit, configured to acquire target speech to be recognized;
a representation information acquiring unit, configured to obtain, from a pre-constructed memory body, representation information matching the target speech, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
a target speech recognition unit, configured to recognize the target speech according to the representation information.
10. The apparatus according to claim 9, characterized in that the representation information acquiring unit comprises:
a unit speech acquiring subunit, configured to split the target speech to obtain each unit of speech;
a representation information acquiring subunit, configured to obtain, from the memory body, representation information matching the unit of speech according to the acoustic features of the unit of speech.
11. The apparatus according to claim 10, characterized in that the representation information acquiring subunit comprises:
a first initial result acquiring subunit, configured to take the acoustic features of the unit of speech as the input of a speech recognition model, so that each network layer of the recognition network of the speech recognition model sequentially outputs an initial representation result of the unit of speech;
a first representation information acquiring subunit, configured to obtain, from the memory body, representation information matching the initial representation result.
12. The apparatus according to claim 11, characterized in that the first initial result acquiring subunit comprises:
a first target result acquiring subunit, configured to take each network layer of the recognition network of the speech recognition model in turn as the current layer, and to adjust the initial representation result of the current layer using a control parameter so as to obtain a target representation result of the unit of speech corresponding to the current layer, wherein the control parameter is used to make the target representation result approach the actual representation result of the unit of speech;
a second initial result acquiring subunit, configured to take the target representation result as the input of the layer following the current layer, and to obtain the initial representation result output by that next layer.
13. The apparatus according to claim 12, characterized in that the control parameter is generated according to the representation information obtained from the memory body that matches the initial representation result output by the current layer.
14. The apparatus according to any one of claims 11 to 13, characterized in that the first representation information acquiring subunit is specifically configured to:
generate a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generate a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
15. A speech recognition device, characterized by comprising: a processor, a memory, and a system bus;
the processor and the memory being connected by the system bus;
the memory being configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
16. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, they cause the terminal device to perform the method according to any one of claims 1 to 8.
17. A computer program product, characterized in that, when the computer program product is run on a terminal device, it causes the terminal device to perform the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910130555.0A CN109903750B (en) | 2019-02-21 | 2019-02-21 | Voice recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910130555.0A CN109903750B (en) | 2019-02-21 | 2019-02-21 | Voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109903750A true CN109903750A (en) | 2019-06-18 |
CN109903750B CN109903750B (en) | 2022-01-04 |
Family
ID=66945180
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910130555.0A Active CN109903750B (en) | 2019-02-21 | 2019-02-21 | Voice recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109903750B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883181A (en) * | 2020-06-30 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Audio detection method and device, storage medium and electronic device |
CN112270923A (en) * | 2020-10-22 | 2021-01-26 | 江苏峰鑫网络科技有限公司 | Semantic recognition system based on neural network |
CN112289297A (en) * | 2019-07-25 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
CN112530418A (en) * | 2019-08-28 | 2021-03-19 | 北京声智科技有限公司 | Voice wake-up method, device and related equipment |
CN112599118A (en) * | 2020-12-30 | 2021-04-02 | 科大讯飞股份有限公司 | Voice recognition method and device, electronic equipment and storage medium |
WO2021136054A1 (en) * | 2019-12-30 | 2021-07-08 | Oppo广东移动通信有限公司 | Voice wake-up method, apparatus and device, and storage medium |
WO2024053844A1 * | 2022-09-05 | 2024-03-14 | Samsung Electronics Co., Ltd. | Electronic device for updating target speaker by using voice signal included in audio signal, and target speaker updating method therefor |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714812A (en) * | 2013-12-23 | 2014-04-09 | 百度在线网络技术(北京)有限公司 | Voice identification method and voice identification device |
CN106952648A (en) * | 2017-02-17 | 2017-07-14 | 北京光年无限科技有限公司 | A kind of output intent and robot for robot |
CN107146615A (en) * | 2017-05-16 | 2017-09-08 | 南京理工大学 | Audio recognition method and system based on the secondary identification of Matching Model |
US10079022B2 (en) * | 2016-01-05 | 2018-09-18 | Electronics And Telecommunications Research Institute | Voice recognition terminal, voice recognition server, and voice recognition method for performing personalized voice recognition |
CN109272995A (en) * | 2018-09-26 | 2019-01-25 | 出门问问信息科技有限公司 | Audio recognition method, device and electronic equipment |
2019-02-21: Chinese patent application CN201910130555.0A filed; granted as CN109903750B (status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103714812A (en) * | 2013-12-23 | 2014-04-09 | 百度在线网络技术(北京)有限公司 | Voice identification method and voice identification device |
US10079022B2 (en) * | 2016-01-05 | 2018-09-18 | Electronics And Telecommunications Research Institute | Voice recognition terminal, voice recognition server, and voice recognition method for performing personalized voice recognition |
CN106952648A (en) * | 2017-02-17 | 2017-07-14 | 北京光年无限科技有限公司 | A kind of output intent and robot for robot |
CN107146615A (en) * | 2017-05-16 | 2017-09-08 | 南京理工大学 | Audio recognition method and system based on the secondary identification of Matching Model |
CN109272995A (en) * | 2018-09-26 | 2019-01-25 | 出门问问信息科技有限公司 | Audio recognition method, device and electronic equipment |
Non-Patent Citations (2)
Title |
---|
SHILIANG ZHANG ET AL.: "Feedforward Sequential Memory Networks: A New Structure to Learn Long-term Dependency", arXiv:1512.08301 * |
WANG Haikun et al.: "Research Progress and Prospects of Speech Recognition Technology", Telecommunications Science * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112289297A (en) * | 2019-07-25 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
CN112530418A (en) * | 2019-08-28 | 2021-03-19 | 北京声智科技有限公司 | Voice wake-up method, device and related equipment |
WO2021136054A1 (en) * | 2019-12-30 | 2021-07-08 | Oppo广东移动通信有限公司 | Voice wake-up method, apparatus and device, and storage medium |
CN111883181A (en) * | 2020-06-30 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Audio detection method and device, storage medium and electronic device |
CN112270923A (en) * | 2020-10-22 | 2021-01-26 | 江苏峰鑫网络科技有限公司 | Semantic recognition system based on neural network |
CN112599118A (en) * | 2020-12-30 | 2021-04-02 | 科大讯飞股份有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN112599118B (en) * | 2020-12-30 | 2024-02-13 | 中国科学技术大学 | Speech recognition method, device, electronic equipment and storage medium |
WO2024053844A1 * | 2022-09-05 | 2024-03-14 | Samsung Electronics Co., Ltd. | Electronic device for updating target speaker by using voice signal included in audio signal, and target speaker updating method therefor |
Also Published As
Publication number | Publication date |
---|---|
CN109903750B (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109903750A (en) | A kind of audio recognition method and device | |
US20220148571A1 (en) | Speech Recognition Method and Apparatus, and Computer-Readable Storage Medium | |
US11538463B2 (en) | Customizable speech recognition system | |
Deng et al. | Recognizing emotions from whispered speech based on acoustic feature transfer learning | |
CN110164476B (en) | BLSTM voice emotion recognition method based on multi-output feature fusion | |
CN107195296B (en) | Voice recognition method, device, terminal and system | |
CN108615525B (en) | Voice recognition method and device | |
CN109523616B (en) | Facial animation generation method, device, equipment and readable storage medium | |
WO2020253509A1 (en) | Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium | |
WO2018054361A1 (en) | Environment self-adaptive method of speech recognition, speech recognition device, and household appliance | |
Ravanelli et al. | A network of deep neural networks for distant speech recognition | |
Caranica et al. | Speech recognition results for voice-controlled assistive applications | |
JP2005003926A (en) | Information processor, method, and program | |
KR20210070213A (en) | Voice user interface | |
CN111081230A (en) | Speech recognition method and apparatus | |
Ault et al. | On speech recognition algorithms | |
CN109637527A (en) | The semantic analytic method and system of conversation sentence | |
CN115937369A (en) | Expression animation generation method and system, electronic equipment and storage medium | |
Song et al. | Dian: Duration informed auto-regressive network for voice cloning | |
Li et al. | Semi-supervised ensemble DNN acoustic model training | |
CN113571095A (en) | Speech emotion recognition method and system based on nested deep neural network | |
JP7469698B2 (en) | Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program | |
Paul et al. | Automated speech recognition of isolated words using neural networks | |
Ponting | Computational Models of Speech Pattern Processing | |
CN112885326A (en) | Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||