CN109903750A - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
CN109903750A
CN109903750A, CN109903750B
Authority
CN
China
Prior art keywords
result
speech
target
memory bank
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910130555.0A
Other languages
Chinese (zh)
Other versions
CN109903750B (en)
Inventor
潘嘉 (Pan Jia)
魏思 (Wei Si)
王智国 (Wang Zhiguo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201910130555.0A
Publication of CN109903750A
Application granted
Publication of CN109903750B
Legal status: Active
Anticipated expiration


Abstract

This application discloses a speech recognition method and device. After a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory bank, in which a large number of sample speaker representations and/or sample speaking-environment representations are stored; the target speech is then recognized according to the representation information retrieved from the memory bank. Because the memory bank stores a large number of sample speaker representations and/or sample speaking-environment representations, representation information matching the speaker and/or the speaking environment of the target speech can be retrieved from it, enriching the basis on which the target speech is recognized. This improves both the effectiveness and the efficiency of online personalized speech recognition of the target speech.

Description

Speech recognition method and device
Technical field
This application relates to the technical field of speech recognition, and in particular to a speech recognition method and device.
Background art
With continuous breakthroughs in artificial intelligence technology and the growing popularity of intelligent terminals, human-computer interaction occurs more and more frequently in people's daily work and life. Speech, as one of the most convenient and efficient interaction modes, has become an essential link in human-computer interaction. As the number of speech users grows, differences in pronunciation habits between users become increasingly pronounced. In this situation, the traditional approach of recognizing all users' speech with a single, unified speech recognition model can no longer deliver good recognition accuracy for every user.
How to build a personalized speech recognition model for each user according to that user's pronunciation habits has therefore become an important research direction in the field of speech recognition. Most existing personalized speech recognition methods build a user-specific model from a large amount of the user's historical speech data; this approach is known as offline personalization. For a new user, offline personalization cannot be applied at all, because no historical data exists; and for an existing user, differences between the user's current session and the user's historical data often mean that the personalized model fails to improve, or even degrades, recognition.
Another personalization approach performs personalized recognition in real time using only the data of the user's current session, and is known as online personalization. However, because the only usable data is the current session, the amount of user data is small, and it is difficult to build a personalized recognition model for the user in real time. How to guarantee both the effectiveness and the efficiency of online personalization is therefore a technical problem that urgently needs to be solved.
Summary of the invention
The main purpose of the embodiments of the present application is to provide a speech recognition method and device that improve the effectiveness and efficiency of online personalized speech recognition.
An embodiment of the present application provides a speech recognition method, comprising:
obtaining a target speech to be recognized;
obtaining, from a pre-built memory bank, representation information matching the target speech, the memory bank storing a large number of sample speaker representations and/or sample speaking-environment representations;
recognizing the target speech according to the representation information.
Optionally, obtaining the representation information matching the target speech from the pre-built memory bank comprises:
splitting the target speech to obtain individual speech units;
obtaining, from the memory bank and according to the acoustic features of each speech unit, the representation information matching that speech unit.
Optionally, obtaining the representation information matching the speech unit from the memory bank according to its acoustic features comprises:
taking the acoustic features of the speech unit as the input of a speech recognition model, so that each network layer of the model's recognition network outputs in turn an initial representation of the speech unit;
obtaining, from the memory bank, the representation information matching each initial representation.
Optionally, causing each network layer of the recognition network to output in turn an initial representation of the speech unit comprises:
taking each network layer of the recognition network in turn as the current layer, and adjusting the current layer's initial representation with a control parameter to obtain the target representation of the speech unit at the current layer, the control parameter serving to make the target representation approach the true representation of the speech unit;
taking the target representation as the input of the layer following the current layer, to obtain the initial representation output by that next layer.
Optionally, the control parameter also serves to suppress the environmental noise of the speech unit.
Optionally, the control parameter is generated from the representation information, obtained from the memory bank, that matches the initial representation output by the current layer.
Optionally, obtaining the representation information matching the initial representation from the memory bank comprises:
generating a target speaker representation according to the degree of correlation between the initial representation and each sample speaker representation in the memory bank;
and/or generating a target speaking-environment representation according to the degree of correlation between the initial representation and each sample speaking-environment representation in the memory bank.
Optionally, recognizing the target speech according to the representation information comprises:
obtaining, for each speech unit, the target representation at the last layer of the recognition network;
recognizing the target speech according to the obtained target representations of the speech units.
An embodiment of the present application further provides a speech recognition device, comprising:
a target speech obtaining unit, configured to obtain a target speech to be recognized;
a representation information obtaining unit, configured to obtain, from a pre-built memory bank, representation information matching the target speech, the memory bank storing a large number of sample speaker representations and/or sample speaking-environment representations;
a target speech recognition unit, configured to recognize the target speech according to the representation information.
Optionally, the representation information obtaining unit comprises:
a speech unit obtaining subunit, configured to split the target speech to obtain individual speech units;
a representation information obtaining subunit, configured to obtain, from the memory bank and according to the acoustic features of each speech unit, the representation information matching that speech unit.
Optionally, the representation information obtaining subunit comprises:
a first initial result obtaining subunit, configured to take the acoustic features of the speech unit as the input of a speech recognition model, so that each network layer of the model's recognition network outputs in turn an initial representation of the speech unit;
a first representation information obtaining subunit, configured to obtain, from the memory bank, the representation information matching each initial representation.
Optionally, the first initial result obtaining subunit comprises:
a first target result obtaining subunit, configured to take each network layer of the recognition network in turn as the current layer and adjust the current layer's initial representation with a control parameter, obtaining the target representation of the speech unit at the current layer, the control parameter serving to make the target representation approach the true representation of the speech unit;
a second initial result obtaining subunit, configured to take the target representation as the input of the layer following the current layer, obtaining the initial representation output by that next layer.
Optionally, the control parameter also serves to suppress the environmental noise of the speech unit.
Optionally, the control parameter is generated from the representation information, obtained from the memory bank, that matches the initial representation output by the current layer.
Optionally, the first representation information obtaining subunit is specifically configured to:
generate a target speaker representation according to the degree of correlation between the initial representation and each sample speaker representation in the memory bank;
and/or generate a target speaking-environment representation according to the degree of correlation between the initial representation and each sample speaking-environment representation in the memory bank.
Optionally, the target speech recognition unit comprises:
a second target result obtaining subunit, configured to obtain, for each speech unit, the target representation at the last layer of the recognition network;
a target speech recognition subunit, configured to recognize the target speech according to the obtained target representations of the speech units.
An embodiment of the present application further provides a speech recognition apparatus, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions that, when executed by the processor, cause the processor to perform any implementation of the above speech recognition method.
An embodiment of the present application further provides a computer-readable storage medium storing instructions that, when run on a terminal device, cause the terminal device to perform any implementation of the above speech recognition method.
An embodiment of the present application further provides a computer program product that, when run on a terminal device, causes the terminal device to perform any implementation of the above speech recognition method.
With the speech recognition method and device provided by the embodiments of the present application, after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory bank storing a large number of sample speaker representations and/or sample speaking-environment representations, and the target speech is then recognized according to that representation information. Because the memory bank stores many sample speaker and/or speaking-environment representations, representation information matching the speaker and/or speaking environment of the target speech can be retrieved from it, enriching the basis on which the target speech is recognized and thereby improving both the effectiveness and the efficiency of online personalized speech recognition.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a speech recognition method provided by an embodiment of the present application;
Fig. 2 is a flow diagram of obtaining representation information matching the target speech from a pre-built memory bank, provided by an embodiment of the present application;
Fig. 3 is a flow diagram of obtaining representation information matching a speech unit from the memory bank according to the unit's acoustic features, provided by an embodiment of the present application;
Fig. 4 is a structural diagram of a speech recognition model provided by an embodiment of the present application;
Fig. 5 is a flow diagram of causing each network layer of the recognition network of the speech recognition model to output in turn an initial representation of a speech unit, provided by an embodiment of the present application;
Fig. 6 is a flow diagram of recognizing the target speech according to the representation information, provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the composition of a speech recognition device provided by an embodiment of the present application.
Detailed description of the embodiments
Existing personalized speech recognition methods can generally be divided into two kinds: offline personalization and online personalization. Offline personalization first builds a user-specific personalized speech recognition model from a large amount of the user's historical speech data, and then uses that model to recognize the user's speech. For a new user, however, no historical speech data exists from which to build the personalized model, so offline personalization cannot perform speech recognition at all. For an existing user, the speech the user currently produces may also differ to some extent from the user's historical speech; if the personalized model built from the historical data is still used to recognize the current speech, the recognition quality may deteriorate.
Online personalization refers to performing personalized speech recognition in real time using the speech data of the user's current session. In the recognition process, the speech data of the user's current session is first received and its acoustic features are extracted; the speaker representation corresponding to each frame of speech data is then extracted; the neural network output corresponding to each frame is computed; and finally the recognition result is obtained, completing the speech recognition.
Specifically, when extracting the acoustic features of the speech data in the user's current session, the speech data must first be divided into frames to obtain a speech frame sequence, and the acoustic features of each speech frame are then extracted. Here, an acoustic feature is data characterizing the acoustic information of the corresponding speech frame, for example Mel-scale Frequency Cepstral Coefficients (MFCC) features or Perceptual Linear Predictive (PLP) features.
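As a concrete illustration of this framing-plus-feature step, the following is a minimal sketch using the librosa library. The patent names no toolkit, so the library choice, the 16 kHz sample rate, and the 25 ms window / 10 ms hop are assumptions, not the patent's specification:

```python
# Minimal sketch: frame a waveform and return one MFCC vector per frame.
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)   # 16 kHz is a common ASR rate (assumed)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),                 # 25 ms analysis window
        hop_length=int(0.010 * sr),            # 10 ms frame shift
    )
    return mfcc.T                              # shape: (num_frames, n_mfcc)
```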
For each speech frame in the speech frame sequence, to extract the speaker representation corresponding to that frame, the acoustic features of all historical frames preceding the frame in the sequence are first spliced into a feature sequence; a pre-built speaker recognition model is then used to estimate, by the maximum likelihood criterion, the speaker representation vector corresponding to the speech frame, and this vector is taken as the corresponding speaker's representation. The speaker recognition model is usually a total variability space model, built as follows: first, a large amount of speech data from multiple different users is collected; the acoustic features of these speech data are then extracted; finally, the total variability space model is trained by the maximum a posteriori criterion, yielding the speaker recognition model.
Further, after the acoustic features of the speech data in the user's current session and the speaker representation (i.e., the speaker representation vector) corresponding to each speech frame have been obtained by the above method, the two can be spliced together, and the spliced vector is fed into a speech recognition neural network to obtain the network's output, namely the acoustic posterior probability of each state of each phoneme in the speech data. The network's output values and a decoding algorithm (such as the Viterbi algorithm) can then be used to search the decoding network and obtain the final recognition result, completing the speech recognition.
However, this online personalization method, which uses the speech data of the user's current session to perform personalized speech recognition in real time, may yield poor personalization quality. For example, in application scenarios such as speech input methods and voice-based human-computer interaction, each session a user enters is very short, usually only a few seconds. The basis for recognizing that user's speech is therefore small, the accuracy of the speaker representation generated online declines, and in turn the accuracy of the subsequent speech recognition result declines.
To overcome these drawbacks, this application provides a speech recognition method: after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory bank in which a large number of sample speaker representations and/or sample speaking-environment representations are stored, and the target speech is then recognized according to that representation information. Because the memory bank stores many sample speaker and/or speaking-environment representations, representation information matching the target speech can be retrieved from it even when little target speech data is available, enriching the basis of recognition. More accurate representations of the target speech (for example its speaker representation) can then be extracted on that basis, and online personalized speech recognition of the target speech performed using those representations, improving both the effectiveness and the efficiency of the recognition.
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments in this application without creative effort fall within the protection scope of this application.
First embodiment
Referring to Fig. 1, a flow diagram of the speech recognition method provided by this embodiment, the method comprises the following steps:
S101: Obtain the target speech to be recognized.
In this embodiment, any speech recognized with this embodiment is called the target speech. This embodiment does not restrict the language of the target speech, which may for example be Chinese or English speech; nor does it restrict the length of the target speech, which may for example be one sentence or several sentences.
It should be understood that the target speech can be obtained as needed, for example by recording: telephone conversations and session recordings from daily life can all serve as target speech. Once the target speech has been obtained, this embodiment can be used to recognize it.
S102: Obtain, from a pre-built memory bank, representation information matching the target speech, the memory bank storing a large number of sample speaker representations and/or sample speaking-environment representations.
In this embodiment, after the target speech to be recognized has been obtained in step S101, and to keep scarce target speech data from degrading the effect and efficiency of recognition, representation information matching the target speech is first obtained from the pre-built memory bank, and this representation information together with the target speech data serves as the basis of recognition, so that the target speech can be effectively recognized in the subsequent step S103. Regarding the "representation information matching the target speech": it comprises all or part of at least one representation stored in the memory bank. Among these, all or part of a sample speaker representation can characterize the speaker characteristics of the speaker of the target speech, and all or part of a sample speaking-environment representation can characterize the environmental characteristics of the environment in which that speaker is speaking.
Note that the memory bank stores a large number of distinct sample speaker representations and/or sample speaking-environment representations. A sample speaker representation is data characterizing personalized information of a sample speaker, such as timbre, gender, age, and region, and may be expressed as a vector or in another form. A sample speaking-environment representation is data characterizing personalized information of a sample speaking environment, likewise expressible as a vector or in another form: for example, vector data characterizing noisy speaking environments such as meeting rooms and shopping malls, or quiet speaking environments such as valleys and libraries.
In practice, the distinct sample speaker representations in the memory bank can be obtained with either of the following two approaches.
In the first approach, a pre-trained speaker recognition model is used to generate the representation vectors of different speakers, which serve as the sample speaker representations in the memory bank. Specifically: first, speech data of multiple speakers is collected as training data and its speech features are extracted; the training data and its speech features are then used to train a parameter-initialized speaker recognition model, which may be a factor analysis model (such as a total variability space model) or a model based on a deep neural network; finally, after the speaker recognition model has been trained, it is used to re-recognize the speech data of each speaker in the training data, and the representation vector of each speaker is extracted and stored as a distinct sample speaker representation.
For example, if the trained model is a total variability space model, the representation vector extracted for each speaker after re-recognizing that speaker's training speech data is the speaker's acoustic-feature i-vector, which can serve as a sample speaker representation in the memory bank, the corresponding speaker being a sample speaker. If the trained model is based on a deep neural network, the representation vector extracted for each speaker is the output vector of the last hidden layer of the model, which can likewise serve as a sample speaker representation, the corresponding speaker again being a sample speaker.
In the second approach, pre-trained speaker-adapted speech recognition models are used to generate the representation vectors of different speakers, which serve as the sample speaker representations in the memory bank. Specifically: first, speech data of multiple speakers is collected as training data, its speech features are extracted, and each utterance is labeled with its speaker; then, using each speaker's training data and speech features, a pre-trained general-purpose neural network speech recognition model is adaptively trained for each speaker, yielding one adapted speech recognition model per speaker (the general-purpose model is trained on a large amount of speech data in the usual way, not repeated here); finally, each speaker's adapted model is used to re-recognize that speaker's speech data, and the resulting representation vector serves as a sample speaker representation in the memory bank, the corresponding speaker being a sample speaker. A sketch of this extraction step follows.
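The following sketch illustrates the extraction step of this second approach: running a speaker's utterance through an adapted network and keeping the last hidden layer's output as the speaker representation vector. The PyTorch model, its layer sizes, and the frame-level averaging into one vector are illustrative assumptions, not the patent's concrete network:

```python
# Sketch: extract a speaker representation as the last hidden layer's output.
import torch
import torch.nn as nn

class TinyRecognizer(nn.Module):
    def __init__(self, feat_dim=13, hidden=256, n_states=500):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # last hidden layer
        )
        self.out = nn.Linear(hidden, n_states)      # acoustic-state outputs

    def forward(self, x, return_hidden=False):
        h = self.hidden_layers(x)
        return h if return_hidden else self.out(h)

def speaker_vector(model, utterance_feats):
    """Average the last hidden layer over all frames of one utterance."""
    with torch.no_grad():
        h = model(utterance_feats, return_hidden=True)  # (frames, hidden)
    return h.mean(dim=0)                                # (hidden,)
```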
Note that in the second approach each speaker's adapted speech recognition model is obtained by training; if the training data contains several utterances of the same speaker, that speaker's adapted model must recognize each utterance separately, and the last hidden layer's output vectors from the individual recognitions are then arithmetically averaged, i.e., the elements at corresponding positions of all the output vectors are averaged. The resulting vector is then taken as the representation vector, and hence the representation, of that speaker.
For example, suppose the training data contains 5 utterances of speaker A. Speaker A's adapted speech recognition model first recognizes these 5 utterances separately, and the last-hidden-layer output vectors obtained are $[a_1, a_2, \ldots, a_n]$, $[b_1, b_2, \ldots, b_n]$, $[c_1, c_2, \ldots, c_n]$, $[d_1, d_2, \ldots, d_n]$, $[e_1, e_2, \ldots, e_n]$. The arithmetic mean of these 5 vectors is
$$\left[\tfrac{a_1+b_1+c_1+d_1+e_1}{5},\ \tfrac{a_2+b_2+c_2+d_2+e_2}{5},\ \ldots,\ \tfrac{a_n+b_n+c_n+d_n+e_n}{5}\right],$$
and this vector serves as the representation vector, and hence the representation, of speaker A.
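The averaging above is a plain element-wise mean, as in this minimal NumPy sketch (the 256-dimensional random vectors are placeholders for the real last-hidden-layer outputs):

```python
import numpy as np

# five per-utterance last-hidden-layer vectors for speaker A (placeholders)
utterance_vectors = [np.random.randn(256) for _ in range(5)]
# arithmetic mean of the elements at corresponding positions
speaker_a_repr = np.mean(np.stack(utterance_vectors), axis=0)
```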
It should also be noted that in the second approach the adapted speech recognition models obtained for the individual speakers can be fully independent, i.e., each speaker has a separate, personalized adapted model. Alternatively, some model parameters may be shared across the speakers' adapted models, for example shared input- and output-layer parameters, while the middle layers differ, each speaker having its own middle layers.
Comparing the two approaches: the second yields speaker representations of higher precision than the first, but takes longer, possibly several times as long. In practice, the more suitable approach can therefore be chosen according to the required precision of the speaker representations and the time available to obtain them. Further, if speaker representations of still higher precision are needed, the two approaches can be combined: the sample speaker representation vectors obtained for the same sample speaker by the two approaches are spliced, and the spliced vector serves as that sample speaker's final representation vector, i.e., as that sample speaker's representation. For example, if the representation vectors of the same sample speaker obtained by the two approaches have dimensions $f_1$ and $f_2$, splicing them yields a vector of dimension $f_1 + f_2$, which serves as the final sample speaker representation vector, i.e., as the sample speaker's representation.
The distinct sample speaking-environment representations in the memory bank can be obtained with either of the same two approaches, replacing every occurrence of "speaker" with "speaking environment"; see the descriptions of the two approaches above for details, which are not repeated here.
It should be understood that, to reduce the subsequent amount of computation over the whole memory bank, the number of sample speaker representations stored can be limited to a preset range, for example to at most 1000. If too many sample speaker representations are obtained, they can be clustered by an existing or future clustering algorithm (such as the K-means algorithm), and the post-clustering class-center vectors stored in the memory bank as representatives, in place of all the speaker representations in each cluster, so as to meet the memory bank's limit on the number of sample speaker representations.
Similarly, to reduce the subsequent computation over the whole memory bank, the number of stored sample speaking-environment representations can also be limited to a preset range, for example to at most 1000. If too many sample speaking-environment representations are obtained, they too can be clustered (e.g., by K-means), and the post-clustering class-center vectors stored in the memory bank as representatives in place of all the speaking-environment representations in each cluster, so as to meet the memory bank's limit on the number of sample speaking-environment representations. A sketch of this capping step follows.
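The following sketches the capping step, with scikit-learn's KMeans standing in for the "existing or future clustering algorithm" named above; the 1000-entry limit follows the example in the text, and the helper name is hypothetical:

```python
# Sketch: cap the memory bank at a preset number of representation vectors.
import numpy as np
from sklearn.cluster import KMeans

def build_memory(sample_vectors, max_entries=1000):
    """Return at most `max_entries` vectors to store in the memory bank."""
    if len(sample_vectors) <= max_entries:
        return np.asarray(sample_vectors)
    km = KMeans(n_clusters=max_entries, n_init=10).fit(sample_vectors)
    return km.cluster_centers_   # one class-center vector represents each cluster
```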
In this embodiment, one optional implementation of "obtaining, from the pre-built memory bank, representation information matching the target speech" in step S102 is shown in Fig. 2 and comprises steps S201-S202:
S201: Split the target speech to obtain individual speech units.
In this implementation, to obtain representation information matching the target speech from the memory bank and thereby enrich the basis of recognition, the target speech must first be split into the speech units it comprises. For example, each speech unit may be a speech frame of the target speech, and each speech frame may correspond to a phoneme or to one state within a phoneme.
S202: For each speech unit, obtain from the memory bank, according to the unit's acoustic features, the representation information matching that unit.
In this implementation, after the speech units of the target speech have been obtained in step S201, feature extraction is performed on each unit to extract its acoustic features, which may be, for example, the unit's MFCC or PLP features.
The extracted acoustic features of each speech unit and the sample speaker representations and/or sample speaking-environment representations stored in the memory bank are then processed together, and according to the processing result the representation information matching each speech unit is retrieved from the memory bank. The representation information matched to the individual units can then be combined, and the combined representation information serves as the representation information matching the target speech.
Regarding the "representation information matching a speech unit": it comprises all or part of at least one representation stored in the memory bank. Among these, all or part of a sample speaker representation can characterize the speaker characteristics of the speaker of the speech unit, and all or part of a sample speaking-environment representation can characterize the environmental characteristics of the environment in which that speaker is speaking.
Note that the specific implementation of step S202 is described in the second embodiment.
S103: Recognize the target speech according to the representation information.
In this embodiment, after the representation information matching the target speech has been retrieved from the pre-built memory bank in step S102, the target speech can be recognized according to that representation information. Specifically, the representation information and the acoustic features of the target speech are used to predict the acoustic posterior probability values corresponding to each speech unit of the target speech; for example, when the speech unit is a phoneme, its acoustic posterior probability values are the posterior probabilities that the unit belongs to each phoneme type of the language concerned. These posterior probability values are then used with a decoding algorithm (such as the Viterbi algorithm) to search the decoding network and obtain the recognition result of the target speech.
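To make the decoding step concrete, here is a minimal Viterbi search over frame-level log posteriors. A production decoder searches a full decoding network (lexicon and language model), so the dense transition matrix here is a toy assumption:

```python
# Sketch: Viterbi search over per-frame acoustic scores.
import numpy as np

def viterbi(log_post, log_trans):
    """log_post: (T, S) frame-level log posteriors; log_trans: (S, S) transitions."""
    T, S = log_post.shape
    delta = log_post[0].copy()                 # best score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # (prev_state, state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]               # backtrace the best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```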
Note that the specific implementation of step S103 is described in the second embodiment.
In summary, with the speech recognition method provided by this embodiment, after a target speech to be recognized is obtained, representation information matching the target speech is retrieved from a pre-built memory bank storing a large number of sample speaker representations and/or sample speaking-environment representations, and the target speech is then recognized according to that information. Because the memory bank stores many sample speaker and/or speaking-environment representations, representation information matching the speaker and/or speaking environment of the target speech can be retrieved from it, enriching the basis on which the target speech is recognized and thereby improving the effectiveness and efficiency of online personalized speech recognition of the target speech.
Second embodiment
This embodiment describes the specific implementation of step S202 of the first embodiment: obtaining, from the memory bank and according to the acoustic features of a speech unit, the representation information matching that unit.
Referring to Fig. 3, a flow diagram of this process provided by this embodiment, the process comprises the following steps:
S301: Take the acoustic features of the speech unit as the input of the speech recognition model, so that each network layer of the model's recognition network outputs in turn an initial representation of the unit.
In this embodiment, after the speech units of the target speech have been obtained in step S201, acoustic features are extracted from each unit. From these acoustic features, an initial feature vector can be generated for each unit by a vector-generation method and used as the unit's initial representation. In practice, the acoustic features of each speech unit are fed as input into the pre-built speech recognition model, and each network layer of the model's recognition network outputs in turn an initial feature vector of the unit, which serves as the unit's initial representation at that layer. In what follows, this embodiment takes one speech unit of the target speech as an example to describe how the unit is processed to obtain its initial representations; the other units are processed analogously and are not described one by one.
Specifically, the speech recognition model pre-built in this embodiment can consist of a multi-layer network; as shown in Fig. 4, the model comprises an input layer, a recognition network, the memory bank, a memory encoding module, a control module, and an output layer.
The input layer receives the acoustic features of a speech unit. Taking a speech frame as the speech unit, the data fed into the input layer are the frame's acoustic features, such as MFCC or PLP features.
The recognition network transforms the acoustic features received by the input layer and passes the transformed feature vector to the output layer. As shown in Fig. 4, the recognition network can be a deep neural network comprising multiple network layers, each of which adjusts in turn the feature vector of the speech unit output by the layer above it, so that the unit has a feature vector at each network layer. Here, the feature vector of the unit output by each network layer is called the initial feature vector of that layer's output and serves as the unit's initial representation at that layer, denoted h. In other words, this embodiment updates the initial representation h of the speech unit layer by layer through the recognition network.
Taking the speech unit to be the t-th speech frame of the target speech, the initial representation that the frame outputs at layer $l$ of the recognition network can be written $h_t^l$, with $h_t^l \in \mathbb{R}^{D_l}$, where $\mathbb{R}$ denotes the real numbers and $D_l$ the dimension of the initial representation output by layer $l$, for $l = 1, 2, \ldots, N$ with $N$ the total number of network layers. Based on this initial representation $h_t^l$, the control module generates a control parameter used to adjust $h_t^l$; that is, each layer's initial representation $h_t^l$ has a corresponding adjusted target representation $\hat{h}_t^l$ (the generation of the target representation is described in step S3011 below). Accordingly, the initial representation output by each network layer is generated by that layer's network parameters as
$$h_t^l = f\!\left(\hat{h}_t^{\,l-1},\ \hat{h}_{t-1}^{\,l}\right),$$
where $f$ is a transformation function, $\hat{h}_t^{\,l-1}$ is the target representation of the t-th frame output at layer $l-1$, and $\hat{h}_{t-1}^{\,l}$ is the target representation of the (t-1)-th frame output at layer $l$.
Note that this embodiment does not restrict the structure of the deep neural network in the recognition network: it may, for example, be a unidirectional or bidirectional long short-term memory (LSTM) structure, or a convolutional neural network (CNN) structure; which structure to use can be chosen according to the actual situation, and the embodiments of this application do not limit it. In practice, for large-vocabulary speech recognition tasks with abundant training data, the deep neural network in the recognition network typically uses 5 to 10 bidirectional LSTM layers, whereas for restricted-domain recognition tasks with less training data it typically uses 1 to 3 unidirectional LSTM layers.
Further, to improve the model's computational efficiency, downsampling layers can optionally be inserted between the network layers of the recognition network: for example, one downsampling layer between every two adjacent network layers (i.e., multiple downsampling layers in total), or a single downsampling layer between one pair of adjacent network layers.
Next, how each network layer of the recognition network "outputs in turn an initial representation of the speech unit" is described.
In one optional implementation, shown in Fig. 5, the process of step S301, "causing each network layer of the recognition network of the speech recognition model to output in turn an initial representation of the speech unit", comprises steps S3011-S3012:
S3011: Take each network layer of the recognition network in turn as the current layer, and adjust the current layer's initial representation with a control parameter to obtain the target representation of the speech unit at the current layer, the control parameter serving to make the target representation approach the true representation of the speech unit.
In this implementation, so that each network layer of the recognition network outputs in turn an initial representation of the speech unit, i.e., so that the unit's initial representation is updated layer by layer, the network layers of the recognition network are taken in turn as the current layer, from the input layer toward the output layer. The control parameter (denoted g) output by the control module of the speech recognition model is then used to adjust the initial representation h output by the current layer, and the adjusted representation is defined as the target representation of the speech unit at the current layer (denoted $\hat{h}$).
Regarding the current layer's control parameter: its role is to adjust the initial representation h of the speech unit output by the current layer so that the adjusted target representation $\hat{h}$ better approaches the unit's true representation. Note that the control parameter of each network layer of the recognition network is generated from the initial representation output by that layer, so the control parameters of different layers may be identical or different.
In one possible implementation of this embodiment, the current layer's control parameter is generated from the representation information, obtained from the memory bank, that matches the initial representation output by the current layer.
Specifically, as shown in Fig. 4, in the speech recognition model built in this embodiment the memory encoding module is connected to each network layer of the recognition network, to the memory bank, and to the control module. Through the memory encoding module, representation information relevant to the initial representation output by the current layer can therefore be retrieved from the memory bank, and the current layer's control parameter is then generated from the representation information output by the memory encoding module.
Since the memory bank stores a large number of sample speaker representations and/or sample speaking-environment representations, one optional implementation is to first retrieve, via the memory encoding module, the representation information relevant to the current layer's initial representation. Concretely, a target speaker representation can be generated according to the degree of correlation between the initial representation and each sample speaker representation in the memory bank, and/or a target speaking-environment representation can be generated according to the degree of correlation between the initial representation and each sample speaking-environment representation in the memory bank. The generated target speaker representation and/or target speaking-environment representation then serve as the representation information, retrieved from the memory bank, that is relevant to the initial representation output by the current layer.
Next, how the "target speaker representation" and the "target speaking-environment representation" are generated is described.
The memory encoding module can be used to determine the degree of correlation between each sample speaker representation in the memory bank and the initial representation of the speech unit; according to these degrees of correlation, the sample speaker representations in the memory bank are then linearly combined to generate a representation characterizing the speech characteristics of the unit's speaker, defined as the target speaker representation. For example, taking the speech unit to be the t-th speech frame of the target speech and the current layer to be layer $l$, the frame's target speaker representation generated by the memory encoding module can be written $S_t^l$.
When determining the degree of correlation between each sample speaker representation in the memory bank and the unit's initial representation, combination coefficients can be generated that characterize these degrees of correlation. The combination coefficients contain one coefficient per sample speaker representation: the larger the coefficient, the higher the correlation between the corresponding sample speaker representation and the initial representation; conversely, the smaller the coefficient, the lower that correlation.
In this embodiment, the network layers of the memory encoding module are used to generate the combination coefficients. Specifically, the memory encoding module can be a neural network of three or more layers, comprising an input layer, fully connected layers, and an output layer. As shown in Fig. 4, the input layer of the memory encoding module receives each sample speaker representation in the memory bank together with the initial representation of the speech unit at the current layer; alternatively, to improve encoding quality, the arithmetic mean of the current-layer initial representations of the unit and of all historical units before it can be fed in instead. For example, taking the speech unit to be the t-th speech frame of the target speech, if the frame's initial representation at the current layer is its initial feature vector at that layer, then the arithmetic mean of the current-layer initial feature vectors of the t-th frame and of all historical frames before it (frames t-1, t-2, ...) can be fed into the input layer of the memory encoding module. One or more fully connected layers follow the input layer (typically fewer than 3 layers with fewer than 512 nodes each); after they encode the input, the output layer of the memory encoding module produces, from the fully connected layers' output, the combination coefficients characterizing the degree of correlation between each sample speaker representation in the memory bank and the unit's initial representation, denoted $\alpha$. For the t-th speech frame of the target speech and the i-th sample speaker representation in the memory bank, with the current layer of the recognition network being layer $l$, the output layer of the memory encoding module outputs the coefficient $\alpha_{t,i}^l$ for the current layer.
In this way a coefficient $\alpha_{t,i}^l$ is obtained for each sample speaker representation; together these coefficients form the combination coefficients, from which the target speaker representation $S_t^l$ of the t-th speech frame can be computed as
$$S_t^l = \sum_{i=1}^{M} \alpha_{t,i}^l \, m_i,$$
where $\alpha_{t,i}^l$ is the degree of correlation between the initial representation of the t-th speech frame output at layer $l$ of the recognition network and the i-th sample speaker representation in the memory bank, $M$ is the total number of sample speaker representations in the memory bank, $m_i$ is the i-th sample speaker representation, and $S_t^l$ is the target speaker representation of the t-th speech frame.
Similarly, the target speaking-environment representation of the speech unit can be computed by the same procedure, simply replacing the "sample speaker representations" in the memory bank with the "sample speaking-environment representations"; see the description above for details, which are not repeated here. A sketch of this memory read follows.
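The following sketch implements the memory read described by the formula above: one coefficient per stored entry, then the coefficient-weighted sum $\sum_i \alpha_i m_i$. Scoring entries by a softmax over dot products is a simplifying assumption standing in for the fully connected encoder layers described in the text:

```python
# Sketch: coefficient-weighted read from the memory bank.
import numpy as np

def read_memory(h_tl, memory):
    """h_tl: (D,) initial representation at the current layer;
    memory: (M, D) stored sample representations m_1..m_M."""
    scores = memory @ h_tl                    # relevance of each entry to h_tl
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                      # combination coefficients alpha_i
    return alpha @ memory                     # S = sum_i alpha_i * m_i
```

Passing the speaker memory yields the target speaker representation; passing the speaking-environment memory yields the target speaking-environment representation.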
Thus the representation information relevant to the initial representation output by the current layer, as retrieved from the memory bank via the memory encoding module, can take three forms: (1) the generated target speaker representation; (2) the generated target speaking-environment representation; (3) both the generated target speaker representation and the generated target speaking-environment representation.
Further, after the representation information relevant to the current layer's initial representation has been retrieved from the memory bank via the memory encoding module, the control module can use this representation information to generate the control parameter.
Specifically, once the memory encoding module has retrieved from the memory bank the representation information relevant to the initial representation output by the current layer, it sends this information to the control module of the speech recognition model. As shown in Fig. 4, the control module is connected to the memory encoding module and to the recognition network, more precisely to the memory encoding module and to each network layer of the recognition network.
In practice, the control module can be a neural network of three or more layers (typically a multi-layer feed-forward network), comprising an input layer, intermediate layers, and an output layer. The input layer receives the representation information output by the memory encoding module, i.e., the target speaker representation and/or the target speaking-environment representation described above. The intermediate layers are multiple fully connected layers, their number equal to the number of layers of the recognition network. The output layer consists of N parts, where N is the total number of layers of the recognition network; each of the N parts corresponds to one network layer of the recognition network, so that through these N parts the output layer outputs the control parameter for each layer of the recognition network. When a network layer's initial feature vector serves as its initial representation, the number of nodes in each of the N parts equals the dimension of the initial feature vector output by the corresponding recognition-network layer, guaranteeing that the control parameter vector output by each part has the same dimension as the initial feature vector output by the corresponding layer of the recognition network.
It should be noted that, for the control parameters generated by the control module that correspond to the current layer of the identification network, when these parameters are used to adjust the initial representation result of the unit voice output by the current layer so as to obtain the target representation result of the unit voice corresponding to the current layer, the control parameters can not only make the target representation result approach the actual representation result of the unit voice, but can also suppress the ambient noise of the unit voice, for example, suppressing the voices of surrounding speakers other than the speaker to whom the unit voice belongs, and suppressing environmental noise.
Further, for ease of calculation, a normalization operation can also be performed on the control parameter vector corresponding to the current layer, so that its value range is controlled between 0 and 1. Specifically, the sigmoid function can be used to normalize the control parameter vector, with the specific calculation formula:

g = sigmoid(x) = 1 / (1 + e^(−x))

wherein g denotes the control parameter vector after normalization, and x denotes the control parameter vector before normalization (the sigmoid being applied element-wise).
In turn, the normalized control parameter vector g can be used to adjust the initial representation result h of the unit voice output by the current layer, so as to obtain the target representation result ĥ of the unit voice corresponding to the current layer. In the specific adjustment, when the initial feature vector output by the current layer serves as the initial representation result h of the unit voice output by the current layer, the elements of the control parameter vector g and of the initial feature vector h at corresponding positions can be multiplied, with the specific adjustment formula:

ĥ_j = g_j · h_j

wherein ĥ_j denotes the j-th dimension element of the target representation result ĥ of the unit voice corresponding to the current layer; g_j denotes the j-th dimension element of the normalized control parameter vector g; and h_j denotes the j-th dimension element of the initial representation result h of the unit voice output by the current layer.
S3012: Take the target representation result of the unit voice as the input of the layer following the current layer, and obtain the initial representation result output by that next layer.
In this implementation, after the target representation result ĥ of the unit voice corresponding to the current layer has been obtained through step S3011, this target representation result ĥ can serve as the input of the layer following the current layer, and the network parameters of that next layer (for example, the transformation function f introduced above) are used to transform the target representation result ĥ, so as to obtain the initial representation result h output by that next layer.
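Putting steps S3011 and S3012 together, one pass of a unit voice through the identification network can be sketched as the following loop; layers and control_vectors are hypothetical stand-ins for the per-layer transformation functions f and the raw control parameter vectors produced by the control module.

```python
import torch

def forward_identification_network(x, layers, control_vectors):
    """One pass through the identification network (steps S3011 and S3012).

    x:               acoustic features of one unit voice
    layers:          the per-layer transformation functions f of the network
    control_vectors: one raw control parameter vector per layer, from the
                     control module (hypothetical stand-ins)
    """
    h = x
    for layer, c in zip(layers, control_vectors):
        h = layer(h)          # initial representation result of this layer
        g = torch.sigmoid(c)  # normalized control parameter vector
        h = g * h             # target representation result (S3011)
        # h now serves as the input of the next layer (S3012)
    return h                  # last layer's target representation result
```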
S302: For the initial representation result output by each network layer, obtain the representation information matched with that initial representation result from the memory body.
As already introduced in step S301 above, in order to obtain the representation information matched with an initial representation result from the memory body, specifically, a target speaker representation result can be generated according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body, and/or a target speaking-environment representation result can be generated according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body. That is, the generated target speaker representation result and/or target speaking-environment representation result serve as the representation information, matched with the initial representation result, that is obtained from the memory body.
As can be seen, for the initial representation result of a unit voice output by each network layer, the representation information matched with that initial representation result can be obtained from the memory body; in this way, each unit voice of the target voice corresponds to one group of matched representation information, and the present embodiment can perform speech recognition on the target voice based on this representation information.
Specifically, after the target representation results of the unit voices corresponding to each network layer of the identification network have been obtained through step S3011 above, step S103 ("according to the representation information, identifying the target voice") can further be realized. Referring to Fig. 6, the detailed process includes the following steps:
S601: Obtain the target representation result of each unit voice corresponding to the last layer of the identification network.
In the present embodiment, after the target voice has been split through step S201 to obtain the unit voices, the acoustic features of the unit voices can be input in sequence into the speech recognition model shown in Figure 4, and through this model the target representation results of the unit voices corresponding to the last network layer of the identification network can be obtained.
S602: Identify the target voice according to the obtained target representation result of each phonetic unit.
In the present embodiment, after the target representation results of the unit voices corresponding to the last layer of the identification network have been obtained through step S601, they can be input into the output layer of the speech recognition model, where a warping method (such as the softmax warping function) can be applied to obtain the acoustic posterior probability values corresponding to each unit voice. These posterior probability values can then be used to search the decoding network through a decoding algorithm (such as the Viterbi algorithm), so as to obtain the recognition result of the target voice.
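As a minimal sketch of this output stage, the following NumPy code turns the last layer's target representation results into acoustic posterior probabilities via softmax and searches a toy decoding network with the Viterbi algorithm; the linear output projection out_proj and the transition matrix log_trans are assumptions, since the embodiment names only softmax warping and a decoding-network search.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode(last_layer_outputs, out_proj, log_trans):
    """last_layer_outputs: (T, d) target representation results, one per unit voice
    out_proj:            (d, K) hypothetical output projection onto K acoustic units
    log_trans:           (K, K) log transition scores of the decoding network
    Returns the best-scoring sequence of acoustic units.
    """
    log_post = np.log(softmax(last_layer_outputs @ out_proj))  # (T, K) posteriors
    T, K = log_post.shape
    score = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):                  # Viterbi forward pass
        cand = score[:, None] + log_trans  # (K, K): previous unit -> current unit
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # backtrace the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```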
Next, the training process of the speech recognition model in the present embodiment will be specifically introduced:
To train the speech recognition model, it is first necessary to collect a large amount of voice data from multiple different users as training data; then, the acoustic features of these voice data are extracted; then, using the training data and their acoustic features, the cross-entropy function can be taken as the optimization target of the model, and the model parameters are continually updated through the error backpropagation algorithm, where the model parameters refer to the weight transformation matrices and corresponding biases of the connections between the layers of the model's identification network, control module, and memory body coding module. During the update, the model parameters can be updated through multiple iterations; when the preset convergence target is reached (i.e., the cross-entropy function reaches a preset value), the iteration is stopped, the update of the model parameters is completed, and the trained speech recognition model is obtained.
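Under these assumptions, a minimal PyTorch training loop with the cross-entropy optimization target could look like the following; the model argument stands for the full speech recognition model (identification network, memory body coding module, and control module), and the data loader, the frame-level unit labels, the Adam optimizer, and the loss threshold are placeholders not specified by the embodiment.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, target_loss=0.1):
    """model:  stand-in for the full speech recognition model
    loader: yields (acoustic_features, frame_labels) batches
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()                 # cross-entropy optimization target
    for _ in range(epochs):                    # multiple update iterations
        total = 0.0
        for feats, labels in loader:
            opt.zero_grad()
            logits = model(feats)              # (batch, T, K) per-frame unit scores
            loss = ce(logits.flatten(0, 1), labels.flatten())
            loss.backward()                    # error backpropagation
            opt.step()                         # update weights and biases
            total += loss.item()
        if total / len(loader) < target_loss:  # preset convergence target reached
            break
    return model
```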
In summary, using the pre-constructed speech recognition model, the present embodiment can, according to the acoustic features of each unit voice in the target voice, obtain from the memory body the representation information matched with each unit voice, and thereby obtain the representation information matched with the speaker and/or the speaking environment of the target voice, so that the obtained representation information can be used to enrich the characterization basis of the target voice, and the speech recognition effect and efficiency can thus be improved when online personalized speech recognition is performed on the target voice.
3rd embodiment
A speech recognition device will be introduced in the present embodiment; for related content, refer to the above method embodiments.
Referring to Fig. 7, a composition schematic diagram of the speech recognition device provided in this embodiment, the device includes:
a target voice acquiring unit 701, configured to obtain a target voice to be identified;
a representation information acquisition unit 702, configured to obtain, from a pre-constructed memory body, representation information matched with the target voice, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
a target voice recognition unit 703, configured to identify the target voice according to the representation information.
In an implementation of the present embodiment, the representation information acquisition unit 702 includes:
a unit voice acquisition subunit, configured to split the target voice to obtain the unit voices;
a representation information acquisition subunit, configured to obtain, from the memory body, the representation information matched with a unit voice according to the acoustic features of the unit voice.
In an implementation of the present embodiment, the representation information acquisition subunit includes:
a first initial result acquisition subunit, configured to take the acoustic features of the unit voice as the input of a speech recognition model, so that each network layer of the identification network of the speech recognition model outputs, in sequence, the initial representation result of the unit voice;
a first representation information acquisition subunit, configured to obtain, from the memory body, the representation information matched with the initial representation result.
In an implementation of the present embodiment, the first initial result acquisition subunit includes:
a first target result acquisition subunit, configured to make each network layer of the identification network of the speech recognition model serve in turn as the current layer, and to adjust the initial representation result of the current layer using control parameters to obtain the target representation result of the unit voice corresponding to the current layer, wherein the control parameters are used to make the target representation result approach the actual representation result of the unit voice;
a second initial result acquisition subunit, configured to take the target representation result as the input of the layer following the current layer, and to obtain the initial representation result output by that next layer.
In an implementation of the present embodiment, the control parameters are also used to suppress the ambient noise of the unit voice.
In an implementation of the present embodiment, the control parameters are generated according to the representation information, obtained from the memory body, matched with the initial representation result output by the current layer.
In an implementation of the present embodiment, the first representation information acquisition subunit is specifically configured to:
generate a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generate a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
In an implementation of the present embodiment, the target voice recognition unit 703 includes:
a second target result acquisition subunit, configured to obtain the target representation result of each unit voice corresponding to the last layer of the identification network;
a target voice identification subunit, configured to identify the target voice according to the obtained target representation result of each phonetic unit.
Further, an embodiment of the present application also provides a speech recognition apparatus, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected by the system bus;
the memory is used for storing one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to execute any implementation method of the above speech recognition method.
Further, an embodiment of the present application also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a terminal device, the terminal device is caused to execute any implementation method of the above speech recognition method.
Further, an embodiment of the present application also provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation method of the above speech recognition method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be realized by means of software plus a necessary general hardware platform. Based on this understanding, the part of the technical solution of the present application that in essence contributes to the existing technology can be embodied in the form of a software product, which can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and which includes several instructions to cause a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts between the embodiments, the embodiments may refer to each other. As for the device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and for related details refer to the description of the method part.
It should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements intrinsic to such a process, method, article, or device. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.
The foregoing description of the disclosed embodiments enables those skilled in the art to realize or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be realized in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

1. A speech recognition method, characterized by comprising:
obtaining a target voice to be identified;
obtaining, from a pre-constructed memory body, representation information matched with the target voice, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
identifying the target voice according to the representation information.
2. The method according to claim 1, characterized in that the obtaining, from the pre-constructed memory body, of representation information matched with the target voice comprises:
splitting the target voice to obtain unit voices;
obtaining, from the memory body, the representation information matched with a unit voice according to the acoustic features of the unit voice.
3. The method according to claim 2, characterized in that the obtaining, from the memory body, of the representation information matched with the unit voice according to the acoustic features of the unit voice comprises:
taking the acoustic features of the unit voice as the input of a speech recognition model, so that each network layer of the identification network of the speech recognition model outputs, in sequence, an initial representation result of the unit voice;
obtaining, from the memory body, the representation information matched with the initial representation result.
4. The method according to claim 3, characterized in that making each network layer of the identification network of the speech recognition model output, in sequence, the initial representation result of the unit voice comprises:
making each network layer of the identification network of the speech recognition model serve in turn as the current layer, and adjusting the initial representation result of the current layer using control parameters to obtain a target representation result of the unit voice corresponding to the current layer, wherein the control parameters are used to make the target representation result approach the actual representation result of the unit voice;
taking the target representation result as the input of the layer following the current layer, and obtaining the initial representation result output by that next layer.
5. The method according to claim 4, characterized in that the control parameters are also used to suppress ambient noise of the unit voice.
6. The method according to claim 4, characterized in that the control parameters are generated according to the representation information, obtained from the memory body, matched with the initial representation result output by the current layer.
7. The method according to any one of claims 3 to 6, characterized in that the obtaining, from the memory body, of the representation information matched with the initial representation result comprises:
generating a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generating a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
8. The method according to any one of claims 4 to 6, characterized in that the identifying of the target voice according to the representation information comprises:
obtaining the target representation result of each unit voice corresponding to the last layer of the identification network;
identifying the target voice according to the obtained target representation result of each phonetic unit.
9. A speech recognition device, characterized by comprising:
a target voice acquiring unit, configured to obtain a target voice to be identified;
a representation information acquisition unit, configured to obtain, from a pre-constructed memory body, representation information matched with the target voice, wherein a large number of sample speaker representation results and/or sample speaking-environment representation results are stored in the memory body;
a target voice recognition unit, configured to identify the target voice according to the representation information.
10. The device according to claim 9, characterized in that the representation information acquisition unit comprises:
a unit voice acquisition subunit, configured to split the target voice to obtain unit voices;
a representation information acquisition subunit, configured to obtain, from the memory body, the representation information matched with a unit voice according to the acoustic features of the unit voice.
11. The device according to claim 10, characterized in that the representation information acquisition subunit comprises:
a first initial result acquisition subunit, configured to take the acoustic features of the unit voice as the input of a speech recognition model, so that each network layer of the identification network of the speech recognition model outputs, in sequence, an initial representation result of the unit voice;
a first representation information acquisition subunit, configured to obtain, from the memory body, the representation information matched with the initial representation result.
12. The device according to claim 11, characterized in that the first initial result acquisition subunit comprises:
a first target result acquisition subunit, configured to make each network layer of the identification network of the speech recognition model serve in turn as the current layer, and to adjust the initial representation result of the current layer using control parameters to obtain a target representation result of the unit voice corresponding to the current layer, wherein the control parameters are used to make the target representation result approach the actual representation result of the unit voice;
a second initial result acquisition subunit, configured to take the target representation result as the input of the layer following the current layer, and to obtain the initial representation result output by that next layer.
13. The device according to claim 12, characterized in that the control parameters are generated according to the representation information, obtained from the memory body, matched with the initial representation result output by the current layer.
14. The device according to any one of claims 11 to 13, characterized in that the first representation information acquisition subunit is specifically configured to:
generate a target speaker representation result according to the degree of correlation between the initial representation result and each sample speaker representation result in the memory body;
and/or generate a target speaking-environment representation result according to the degree of correlation between the initial representation result and each sample speaking-environment representation result in the memory body.
15. A speech recognition apparatus, characterized by comprising: a processor, a memory, and a system bus;
the processor and the memory being connected by the system bus;
the memory being used for storing one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1 to 8.
16. A computer-readable storage medium, characterized in that instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
17. A computer program product, characterized in that, when the computer program product is run on a terminal device, the terminal device is caused to perform the method according to any one of claims 1 to 8.
CN201910130555.0A 2019-02-21 2019-02-21 Voice recognition method and device Active CN109903750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910130555.0A 2019-02-21 2019-02-21 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910130555.0A 2019-02-21 2019-02-21 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN109903750A true CN109903750A (en) 2019-06-18
CN109903750B CN109903750B (en) 2022-01-04

Family

ID=66945180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910130555.0A Active CN109903750B (en) 2019-02-21 2019-02-21 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN109903750B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Voice identification method and voice identification device
US10079022B2 (en) * 2016-01-05 2018-09-18 Electronics And Telecommunications Research Institute Voice recognition terminal, voice recognition server, and voice recognition method for performing personalized voice recognition
CN106952648A (en) * 2017-02-17 2017-07-14 北京光年无限科技有限公司 A kind of output intent and robot for robot
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHILIANG ZHANG ET AL.: "Feedforward Sequential Memory Networks: A New Structure to Learn Long-term Dependency", arXiv:1512.08301 *
WANG Haikun et al.: "Research Progress and Prospects of Speech Recognition Technology", Telecommunications Science *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289297A (en) * 2019-07-25 2021-01-29 阿里巴巴集团控股有限公司 Speech synthesis method, device and system
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
WO2021136054A1 (en) * 2019-12-30 2021-07-08 Oppo广东移动通信有限公司 Voice wake-up method, apparatus and device, and storage medium
CN111883181A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Audio detection method and device, storage medium and electronic device
CN112270923A (en) * 2020-10-22 2021-01-26 江苏峰鑫网络科技有限公司 Semantic recognition system based on neural network
CN112599118A (en) * 2020-12-30 2021-04-02 科大讯飞股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN112599118B (en) * 2020-12-30 2024-02-13 中国科学技术大学 Speech recognition method, device, electronic equipment and storage medium
WO2024053844A1 * 2022-09-05 2024-03-14 Samsung Electronics Co., Ltd. Electronic device for updating target speaker by using voice signal included in audio signal, and target speaker updating method therefor

Also Published As

Publication number Publication date
CN109903750B (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN109903750A (en) A kind of audio recognition method and device
US20220148571A1 (en) Speech Recognition Method and Apparatus, and Computer-Readable Storage Medium
US11538463B2 (en) Customizable speech recognition system
Deng et al. Recognizing emotions from whispered speech based on acoustic feature transfer learning
CN110164476B (en) BLSTM voice emotion recognition method based on multi-output feature fusion
CN107195296B (en) Voice recognition method, device, terminal and system
CN108615525B (en) Voice recognition method and device
CN109523616B (en) Facial animation generation method, device, equipment and readable storage medium
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
WO2018054361A1 (en) Environment self-adaptive method of speech recognition, speech recognition device, and household appliance
Ravanelli et al. A network of deep neural networks for distant speech recognition
Caranica et al. Speech recognition results for voice-controlled assistive applications
JP2005003926A (en) Information processor, method, and program
KR20210070213A (en) Voice user interface
CN111081230A (en) Speech recognition method and apparatus
Ault et al. On speech recognition algorithms
CN109637527A (en) The semantic analytic method and system of conversation sentence
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
Song et al. Dian: Duration informed auto-regressive network for voice cloning
Li et al. Semi-supervised ensemble DNN acoustic model training
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
JP7469698B2 (en) Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program
Paul et al. Automated speech recognition of isolated words using neural networks
Ponting Computational Models of Speech Pattern Processing
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant