CN108711422A

CN108711422A - Speech recognition method, apparatus, computer-readable storage medium, and computer device

Info

Publication number: CN108711422A
Application number: CN201810457129.3A
Authority: CN
Inventors: 刘毅
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-05-14
Filing date: 2018-05-14
Publication date: 2018-10-26
Anticipated expiration: 2038-05-14
Also published as: CN108711422B; WO2019218818A1

Abstract

This application relates to a speech recognition method, apparatus, computer-readable storage medium, and computer device. The method includes: obtaining multiple word sequences produced by decoding voice data, together with a first score corresponding to each word sequence; extracting, from the word sequences, a preset number of word sequences with the highest first scores as candidate word sequences; identifying the domain to which the candidate word sequences belong; inputting the candidate word sequences, according to that domain, into the neural network of the corresponding domain; rescoring the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence; obtaining a final score for each candidate word sequence from its first score and second score; and taking the candidate word sequence with the highest final score as the speech recognition result for the voice data. Because each neural network is trained for a specific domain, its training time is considerably shorter, which further improves the efficiency of speech recognition.

Description

Speech recognition method, apparatus, computer-readable storage medium, and computer device

Technical field

This application relates to the technical field of speech recognition, and in particular to a speech recognition method, apparatus, computer-readable storage medium, and computer device.

Background technology

With the rapid development of computer technology, speech recognition technology has also become increasingly mature.

In conventional technology, a speech recognition solution is generally divided into two parts: front-end decoding and back-end processing. The front end is mainly responsible for receiving the input voice data and decoding it to obtain multiple possible sentences. The back end then selects one of the possible sentences obtained from the front end as the final recognition result. In conventional technology, the back end may input the possible sentences into a neural network to determine the final recognition result. This approach, however, requires a massive amount of text and a long training period before the neural network is ready for use, so this kind of speech recognition scheme is inefficient.

Invention content

In view of this, and to address the technical problem of low speech recognition efficiency described above, it is necessary to provide a speech recognition method, apparatus, computer-readable storage medium, and computer device that can improve speech recognition efficiency.

A speech recognition method, including:

obtaining multiple word sequences produced by decoding voice data, and a first score corresponding to each word sequence;

extracting, from the word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;

identifying the domain to which the candidate word sequences belong;

inputting the candidate word sequences, according to the domain to which they belong, into the neural network of the corresponding domain;

rescoring the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence;

obtaining a final score for each candidate word sequence from its corresponding first score and second score; and

taking the candidate word sequence with the highest final score as the speech recognition result for the voice data.

A speech recognition apparatus, including:

a word sequence acquisition module, configured to obtain multiple word sequences produced by decoding voice data and a first score corresponding to each word sequence;

an extraction module, configured to extract, from the word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;

a domain identification module, configured to identify the domain to which the candidate word sequences belong;

a rescoring module, configured to input the candidate word sequences, according to the domain to which they belong, into the neural network of the corresponding domain, and to rescore the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence;

a speech recognition result determination module, configured to obtain a final score for each candidate word sequence from its corresponding first score and second score, and to take the candidate word sequence with the highest final score as the speech recognition result for the voice data.

A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:

identifying the domain to which the candidate word sequences belong;

A computer-readable storage medium storing a computer program, where the following steps are implemented when the computer program is executed by a processor:

identifying the domain to which the candidate word sequences belong;

With the above speech recognition method, apparatus, computer-readable storage medium, and computer device, multiple word sequences produced by decoding voice data are obtained together with their corresponding first scores, multiple candidate word sequences are selected from the word sequences, and the domain of the candidate word sequences is determined. The candidate word sequences can then be input into the neural network of that domain. Once the neural network has rescored each candidate word sequence to produce a second score, the final score of each candidate word sequence can be determined from its first score and second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result for the voice data. Because this method identifies the domain of the candidate word sequences before rescoring them, the candidate word sequences can be rescored by the neural network of their own domain. The second scores obtained this way are more accurate, and because every domain has its own neural network, the time needed to train the neural network of a specific domain is also considerably shorter, which further improves the efficiency of speech recognition.

Description of the drawings

Fig. 1 is a diagram of the application environment of the speech recognition method in one embodiment;

Fig. 2 is a flow diagram of the speech recognition method in one embodiment;

Fig. 3 is a flow diagram of step 206 in one embodiment;

Fig. 4 is a flow diagram of the neural network training steps in one embodiment;

Fig. 5 is a schematic diagram of the input data and output data of the neural network in one embodiment;

Fig. 6 is a flow diagram of the steps after step 214 in one embodiment;

Fig. 7 is a flow diagram of the speech recognition method in another embodiment;

Fig. 8 is a schematic diagram of the training process of the neural network corresponding to each domain in one embodiment;

Fig. 9 is a flow diagram of the front end decoding voice data in one embodiment;

Fig. 10 is a schematic diagram of word sequences in one embodiment;

Fig. 11 is a flow diagram of the semantic classification model rescoring the candidate word sequences in one embodiment;

Fig. 12 is a flow diagram of the semantic classification model rescoring the candidate word sequences in another embodiment;

Fig. 13 is a structural block diagram of the speech recognition apparatus in one embodiment;

Fig. 14 is a structural block diagram of the domain identification module in another embodiment;

Fig. 15 is a structural block diagram of the computer device in one embodiment.

Specific implementation mode

To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the application, not to limit it.

Fig. 1 is a diagram of the application environment of the speech recognition method in one embodiment. Referring to Fig. 1, the speech recognition method is applied to a speech recognition system. The speech recognition system includes a terminal 110 and a server 120, which are connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers.

As shown in Fig. 2, in one embodiment, a speech recognition method is provided. This embodiment is described mainly by taking the application of the method to the server 120 in Fig. 1 as an example. Referring to Fig. 2, the speech recognition method specifically includes the following steps:

Step 202: obtain multiple word sequences produced by decoding voice data, and a first score corresponding to each word sequence.

The server can obtain the multiple word sequences produced by decoding the voice data. The decoding may be performed by the terminal: when the terminal obtains voice data, it decodes the data using a trained acoustic model, language model, pronunciation dictionary, and so on, yielding multiple word sequences. Decoding the voice data produces multiple words and multiple paths. A word sequence is the sentence formed by connecting, in order, the words along one path from its start position to its end position, so each word sequence can be regarded as one recognition result. The path of each word sequence has a corresponding score, namely the first score, which can be understood as a score computed from the probability of that path.

Step 204: extract, from the word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.

Each word sequence has a corresponding first score, and the word sequences with the highest first scores can be extracted. To do so, the word sequences are sorted by first score from high to low, so that the word sequence with the highest first score ranks first. The number of word sequences to extract can be set in advance, so a preset number of top-ranked word sequences are extracted as the candidate word sequences. This preset number can be adjusted at the technician's discretion.
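Step 204 can be illustrated with a minimal sketch. The hypothesis texts, the score values, and the names `Hypothesis` and `select_candidates` are illustrative assumptions and do not come from the application; first scores are treated here as log-probability-like values where higher is better.

```python
from typing import List, Tuple

# (word sequence, first score); the pairing is an assumed representation.
Hypothesis = Tuple[str, float]


def select_candidates(hypotheses: List[Hypothesis], preset_quantity: int) -> List[Hypothesis]:
    """Sort by first score from high to low and keep the top preset_quantity."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    return ranked[:preset_quantity]


# Made-up decoder output for demonstration only.
hypotheses = [
    ("navigate to the window of the world", -12.4),
    ("navy gate to the window of the world", -17.8),
    ("navigate to the widow of the world", -13.1),
    ("navigate two the window of the whirl", -19.5),
]
candidates = select_candidates(hypotheses, preset_quantity=2)
```

With a preset quantity of 2, only the two highest-scoring hypotheses are passed on to domain identification and rescoring.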

Step 206: identify the domain to which the candidate word sequences belong.

Step 208: according to the domain to which the candidate word sequences belong, input the candidate word sequences into the neural network of the corresponding domain.

After the preset number of candidate word sequences have been selected from the word sequences, the domain to which they belong can be identified. Domains can be defined by technicians; for example, they may be divided into navigation, music, and so on. There are multiple candidate word sequences, but a single domain may be identified for them, that is, the candidate word sequences all belong to the same domain. Once that domain has been identified, the candidate word sequences can be input into the neural network of the corresponding domain. In other words, each domain has its own pre-trained neural network, and the neural network of each domain can rescore the candidate word sequences of its own domain in a targeted way.

Step 210: rescore the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence.

Step 212: obtain a final score for each candidate word sequence from its corresponding first score and second score.

Step 214: take the candidate word sequence with the highest final score as the speech recognition result for the voice data.

Once the domain of the candidate word sequences has been identified, all the candidate word sequences are input into the corresponding neural network. For example, when the domain is identified as "navigation", all the candidate word sequences are input into the neural network of the "navigation" domain, which rescores each input candidate word sequence to obtain its second score.

Rescoring means that the neural network of the domain to which the candidate word sequences belong computes a score for each input candidate word sequence, yielding the second score of each candidate word sequence. The first score and the second score of each candidate word sequence are then combined in some way; for example, each score can be multiplied by a weight and the weighted scores added together. The final score of each candidate word sequence is obtained from its first score and second score, and the candidate word sequence with the highest final score is taken as the speech recognition result for the voice data. With this speech recognition method, multiple word sequences produced by decoding voice data are obtained together with their first scores, multiple candidate word sequences are selected from them, and the domain of the candidate word sequences is determined. The candidate word sequences are input into the neural network of that domain, the neural network rescores each of them to produce a second score, the final score of each candidate word sequence is determined from its first score and second score, and the candidate word sequence with the highest final score is selected as the speech recognition result. Because the domain of the candidate word sequences is identified before rescoring, the candidate word sequences can be rescored by the neural network of their own domain. The second scores obtained this way are more accurate, and because every domain has its own neural network, the time needed to train the neural network of a specific domain is also considerably shorter, which further improves the efficiency of speech recognition.
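Steps 202 to 214 can be sketched end to end as follows. This is only an outline under stated assumptions: the semantic classifier and the per-domain neural networks are replaced by toy callables (`toy_classifier`, `toy_models`), the weights default to 0.5 each, and every name and score value is hypothetical rather than taken from the application.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (word sequence, first score from the decoder)


def recognize(
    hypotheses: List[Hypothesis],
    preset_quantity: int,
    classify_domain: Callable[[str], str],
    domain_models: Dict[str, Callable[[str], float]],
    first_weight: float = 0.5,
    second_weight: float = 0.5,
) -> Tuple[str, str]:
    # Steps 202-204: keep the preset number of top-scoring word sequences.
    candidates = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:preset_quantity]
    # Step 206: one label per candidate, then take the most frequent label.
    labels = [classify_domain(text) for text, _ in candidates]
    domain = Counter(labels).most_common(1)[0][0]
    # Steps 208-210: rescore every candidate with that domain's model.
    rescore = domain_models[domain]
    # Step 212: weighted sum of the first score and the second score.
    finals = {
        text: first_weight * first + second_weight * rescore(text)
        for text, first in candidates
    }
    # Step 214: the candidate with the highest final score is the result.
    return max(finals, key=finals.get), domain


# Toy stand-ins (assumptions) for the classifier and the domain networks.
def toy_classifier(text: str) -> str:
    return "navigation" if "navigate" in text else "music"


toy_models = {
    "navigation": lambda text: -1.0 if text.endswith("street") else -5.0,
    "music": lambda text: -3.0,
}

hypotheses = [
    ("navigate to main st reet", -9.0),
    ("navigate to main street", -9.5),
    ("play main street", -12.0),
]
result, domain = recognize(hypotheses, 3, toy_classifier, toy_models)
```

In this toy run the decoder prefers the first hypothesis, but the navigation model's second score moves "navigate to main street" to the top of the final ranking.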

In one embodiment, as shown in Fig. 3, step 206 includes:

Step 302: input each candidate word sequence into a semantic classification model.

Step 304: classify each candidate word sequence with the semantic classification model to obtain the classification label corresponding to each candidate word sequence.

Step 306: take the domain corresponding to the classification label with the largest share among the classification labels as the domain to which the candidate word sequences belong.

After the preset number of word sequences have been extracted as candidate word sequences, each candidate word sequence can be input into a trained semantic classification model. A trained semantic classification model is one that has been trained in advance on a large number of training samples and is used to identify the domain of an input candidate word sequence. Once all the extracted candidate word sequences have been input, the trained semantic classification model classifies each one and outputs its classification label, and the domain of each candidate word sequence can be determined from its classification label.

The classification labels of the candidate word sequences may differ. That is, the trained semantic classification model may produce different classification results for different candidate word sequences, which means that it judges the candidate word sequences to belong to different domains. In that case, to determine a single domain for the candidate word sequences, the classification label of each candidate word sequence is obtained and the label with the largest share is taken as the label for all of them; the domain corresponding to that label is used as the domain of all the candidate word sequences.

For example, suppose the classification label of candidate word sequence A is 1, that of candidate word sequence B is 1, that of candidate word sequence C is 2, and that of candidate word sequence D is 1, where 1 denotes the navigation class and 2 denotes the music class. Among candidate word sequences A, B, C, and D, label 1 has the largest share, so label 1 is taken as the classification label of A, B, C, and D. The domain of A, B, C, and D is therefore the domain corresponding to label 1, namely navigation.
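The A, B, C, D example can be reproduced with a short majority-vote sketch. The mapping `DOMAIN_BY_LABEL` and the function name are illustrative assumptions. The text does not say how a tie between labels should be resolved; `Counter.most_common` simply returns the label that was seen first among equally frequent ones.

```python
from collections import Counter
from typing import Dict, List

# Label-to-domain mapping from the example: 1 is navigation, 2 is music.
DOMAIN_BY_LABEL: Dict[int, str] = {1: "navigation", 2: "music"}


def majority_domain(labels: List[int]) -> str:
    """Return the domain of the label with the largest share."""
    label, _count = Counter(labels).most_common(1)[0]
    return DOMAIN_BY_LABEL[label]


# Classification labels of candidate word sequences A, B, C, D.
labels_by_candidate = {"A": 1, "B": 1, "C": 2, "D": 1}
domain = majority_domain(list(labels_by_candidate.values()))
```

Here label 1 accounts for three of the four candidates, so all four are treated as navigation.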

The domain of the candidate word sequences is identified by the semantic classification model, and when the candidate word sequences disagree on the domain, the domain corresponding to the classification label with the largest share is used. This ensures the accuracy of the identified domain and also improves the accuracy of the subsequent rescoring of the candidate word sequences, which in turn improves the accuracy of the speech recognition result.

In one embodiment, the speech recognition method further includes: obtaining the text corresponding to each domain; converting each word in the text into a word vector; and training the neural network of each domain with the word vectors as input.

Before the neural networks are put into use, each one can be trained as needed; a trained neural network can then be put into actual use on the input candidate word sequences of its domain. Because each domain has its own neural network, training the neural network of a domain requires the text corresponding to that domain. The text may consist of sentences. After the sentences of a domain have been obtained, each sentence is segmented into words, giving the words of each sentence. The words in each sentence are converted into word vectors, which serve as the input data of the neural network. Training the neural network of each domain in this way yields a trained neural network for each domain.

Training the neural network of each domain in advance ensures the accuracy of rescoring when the neural networks are in actual use, which improves the accuracy of speech recognition and can also improve its efficiency.

In one embodiment, training the neural network of each domain with the word vectors as input includes: following the order of the words in the text, taking the word vector of each word in the text as input and the word vector of the word that follows each input word as output, and adjusting the parameters of the neural network to train it.

After the text of each domain has been obtained, the words in the text can be converted into word vectors and input into the neural network of that domain for training. Following the order of the words in the text, the word vector of each input word serves as input and the word vector of the next word serves as output. Note that the input data for each output word is not only the word input at the previous moment, but the words input at the previous moment and at all earlier moments. Adjusting the parameters of the neural network in this way trains it, yielding a trained neural network for each domain and ensuring the reliability of the scores the trained network produces when rescoring candidate word sequences.

In one embodiment, as shown in Fig. 4, the speech recognition method further includes neural network training steps, including:

Step 402: obtain the text corresponding to each domain.

Step 404: convert each word in the text into a word vector.

Step 406: following the order of the words in the text, take the word vector of each word in the text as input and the word vector of the word that follows each input word as output, and adjust the parameters of the neural network to train it.

To ensure the accuracy of the neural networks put into actual use, the neural networks can be trained in advance and put into use only after training. Each domain has its own neural network, so the neural network of each domain must be trained separately. To train the neural network of a domain, the text of that domain is obtained and the words in it are converted into word vectors; a word vector may be an N-dimensional vector. The text of each domain can be understood as the sentences of that domain. A word segmentation tool can divide each sentence into words, so each sentence corresponds to multiple words, and the words in each sentence are converted into word vectors and input into the neural network of the corresponding domain.

The word vectors of the words in each sentence serve as the input of the neural network, and, within each sentence, the word vector of the word that follows an input word serves as the output for that input word. The output data is the word vector of the next word after each input word, but for each output, the input data is not only the word vector of the most recent input word: it is that word vector together with the word vectors of all words input before it. With the input data and output data set in this way, the parameters of the neural network are adjusted continually to train it. For example, suppose sentence A is "I was late for school yesterday". A word segmentation tool can split the sentence, in its original Chinese word order, into: I / yesterday / go to school / late / 了 /, where 了 is the sentence-final particle. As shown in Fig. 5, the word vectors of the words in sentence A are input into the neural network in the order in which the words appear in the sentence, and x1, x2, x3, x4, x5, and x6 in the figure are the input word vectors. By default, a blank input is added before the input word vectors, so when the word vectors of sentence A are input, the word vector of "I" is not the first input but the second; the first input is empty by default.

Correspondingly, the word vector of the word that follows each input word is its output data. The output for the first input, which is empty by default, is the word vector of "I", and the output for the input word "I" is the word vector of "yesterday". As shown in Fig. 5, x1, x2, x3, x4, x5, and x6 correspond to the blank word, "I", "yesterday", "go to school", "late", and "了" respectively, and y1, y2, y3, y4, y5, and y6 are the output data for the respective inputs; the corresponding words are "I", "yesterday", "go to school", "late", and "了". For instance, the output for "go to school" is the word vector of its next word, "late". For the output "yesterday", the input data is only the word "I" and the default blank word. For the output "了", however, the input data is all the words input before it: "I", "yesterday", "go to school", "late", and the default blank word. In other words, each output vector depends on the input at the previous moment and on the inputs at all earlier moments.
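The pairing of inputs and outputs described for sentence A can be sketched as follows. Words stand in for their word vectors, `BLANK` is an assumed name for the default empty first input, and the last token `le` is a placeholder for the final segment of the sentence; the function name is also hypothetical. Only the five outputs that the text names are generated.

```python
from typing import List, Tuple

BLANK = "<blank>"  # assumed marker for the default empty first input


def next_word_pairs(words: List[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """Pair each target word with the full history of inputs that precede it."""
    history = [BLANK]
    pairs = []
    for word in words:
        # The target depends on every input so far, not only the latest one.
        pairs.append((tuple(history), word))
        history.append(word)
    return pairs


sentence_a = ["I", "yesterday", "go to school", "late", "le"]
pairs = next_word_pairs(sentence_a)
```

The first pair maps the blank input to "I", and the last pair maps the blank input plus the first four words to the final token, which is the dependence on all earlier moments that the text describes.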

Training a neural network is a process of repeatedly adjusting its parameters; once the parameters are fixed, the network can be considered trained. Each time the word vectors of the words in a text are input to train the neural network, its parameters are adjusted, until the technician determines that a set of parameters is optimal, at which point training is complete. Verification can be used to decide whether a set of parameters is optimal: after the parameters have been set, the neural network is verified with a large amount of verification sample data. The verification data is input into the neural network and the network's prediction accuracy on that data under the current parameters is measured. When the prediction accuracy reaches the standard value set by the technician, the parameters are considered optimal; otherwise, the parameters continue to be adjusted until the neural network passes verification.
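The stopping rule described above can be outlined as a loop that alternates a training pass with a verification check. This is a sketch under assumptions: `train_one_pass` and `validation_accuracy` are hypothetical callables standing in for a real training framework, and `max_passes` is an added safeguard that the text does not mention.

```python
from typing import Callable


def train_until_validated(
    train_one_pass: Callable[[], None],
    validation_accuracy: Callable[[], float],
    target_accuracy: float,
    max_passes: int = 100,
) -> int:
    """Adjust parameters pass by pass until accuracy reaches the standard value."""
    for passes in range(1, max_passes + 1):
        train_one_pass()  # adjusts the network parameters
        if validation_accuracy() >= target_accuracy:
            return passes  # the network passes verification
    raise RuntimeError("target accuracy not reached within max_passes")


# Toy stand-in: each pass gets 10 more of 100 verification samples right.
state = {"correct": 50}


def toy_pass() -> None:
    state["correct"] = min(100, state["correct"] + 10)


def toy_accuracy() -> float:
    return state["correct"] / 100


passes_needed = train_until_validated(toy_pass, toy_accuracy, target_accuracy=0.85)
```

In the toy run the accuracy rises from 0.6 to 0.9 over four passes, so the loop stops at the fourth pass.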

Training each neural network on a large amount of text from its own domain ensures the reliability of the trained neural network and improves the efficiency with which it processes data, which in turn improves the efficiency and accuracy of speech recognition.

In one embodiment, obtaining the final score of a candidate word sequence from its corresponding first score and second score includes: computing a weighted sum of the first score and the second score of the candidate word sequence to obtain its final score.

After obtaining voice data, the terminal can decode it to obtain multiple word sequences, each with a corresponding first score. A preset number of word sequences are selected from them as candidate word sequences, so each candidate word sequence has a corresponding first score. After each candidate word sequence has been input into the trained neural network of its domain, the neural network rescores it to produce a second score, so each candidate word sequence also has a corresponding second score. With the first score and second score of each candidate word sequence available, its final score can be obtained.

When the final score is computed from the first score and the second score, the scores can be weighted. The weights of the first score and the second score may be equal or different, depending on the technician's settings. For example, if the technician considers the first score more accurate, or more influential on the speech recognition result, the weight of the first score can be increased. After the first score and the second score have been weighted, the weighted scores are summed to give the final score of each candidate word sequence.

Computing the final score as a weighted sum makes it possible to adjust, at any time, the influence of the first score and the second score on the speech recognition result by changing their weights; that is, the balance between the speech side and the semantic side can be adjusted at any time. This further ensures the reliability of the final score and makes the selected speech recognition result more accurate.
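The effect of the weights can be seen in a small sketch. The score values, the candidate names, and the helper names are made up for illustration; the text does not prescribe particular weights.

```python
from typing import Dict, Tuple


def final_score(first: float, second: float, first_weight: float, second_weight: float) -> float:
    """Weighted sum of the decoder score and the rescoring network score."""
    return first_weight * first + second_weight * second


def best_candidate(
    scores: Dict[str, Tuple[float, float]], first_weight: float, second_weight: float
) -> str:
    """Return the candidate whose weighted final score is highest."""
    return max(
        scores,
        key=lambda name: final_score(*scores[name], first_weight, second_weight),
    )


# candidate -> (first score, second score); illustrative values only.
scores = {"A": (-10.0, -4.0), "B": (-11.0, -2.0)}

balanced = best_candidate(scores, first_weight=0.5, second_weight=0.5)
decoder_heavy = best_candidate(scores, first_weight=0.9, second_weight=0.1)
```

With equal weights, candidate B wins because of its better second score; raising the weight of the first score to 0.9 makes candidate A the result instead.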

In one embodiment, the neural network is a recurrent neural network.

A recurrent neural network used in this way is also called an RNNLM (Recurrent Neural Network Based Language Model), or recurrent neural network language model. Besides the word currently input, a recurrent neural network language model also takes into account the words input earlier, and can compute the probability of the next word from the long text formed by the previously input words. It therefore has a better "memory effect". For example, "I" and "mood" may be followed by "not bad" or by "bad"; the appearance of these words depends on the earlier "I" and "mood", and this is the "memory effect".

A recurrent neural network language model maps each input word into a compact continuous vector space, uses a relatively small set of parameters over the mapped space, and builds the model with recurrent connections, producing long-range context dependencies. A language model commonly used in large-vocabulary continuous speech recognition, the N-gram model, depends on the preceding N-1 input words, whereas a recurrent neural network language model can capture the history of all previously input words. Compared with a conventional language model, which depends only on the preceding N-1 words, a recurrent neural network language model therefore produces more accurate scores when rescoring the input candidate word sequences.
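The difference in usable context can be made concrete with a small sketch. It does not implement either language model; it only shows which words each kind of model is able to condition on when predicting the word at a given position. The sentence and the function names are illustrative assumptions.

```python
from typing import List


def ngram_context(words: List[str], position: int, n: int) -> List[str]:
    """Words an N-gram model can see: at most the N-1 words before position."""
    start = max(0, position - (n - 1))
    return words[start:position]


def full_history(words: List[str], position: int) -> List[str]:
    """Words a recurrent language model can see: everything before position."""
    return words[:position]


words = ["my", "mood", "today", "is", "really", "not", "bad"]
target_position = 6  # predicting "bad"

trigram_view = ngram_context(words, target_position, n=3)
recurrent_view = full_history(words, target_position)
```

When predicting the last word, the trigram view contains only "really" and "not", while the recurrent view still contains "my" and "mood", the words on which the prediction actually depends.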

In one embodiment, using the highest candidate word sequence of final score as the voice recognition result of voice data Later, further include：Entity extraction is carried out to voice recognition result, obtains entity word；Entity word is retrieved；Work as retrieval As a result when inconsistent with entity word, entity word is repaired.

After obtaining the first score and the second score of each candidate word sequence, each candidate word sequence can be calculated Final score, and select voice recognition result of the highest candidate word of final score as voice data.Determining voice data After corresponding voice recognition result, entity reparation can be carried out to voice recognition result.First entity is carried out to voice recognition result to carry It taking, entity on behalf plays the word of larger effect, such as " I wants to go to Window on the World " in voice recognition result this sentence, So in this sentence, " Window on the World " is then the emphasis of this sentence, and " I wants to go to " is one kind of speaker mean Expression, and it is " Window on the World " finally to need practicable place.

When entity extraction is performed on the speech recognition result, several entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved. When a retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; in this way the entity word is repaired.

In speech recognition, wrongly recognized words are generally concentrated in the special entities of a field. To address this problem, the entities in the speech recognition result can be extracted and the entity words repaired accordingly, which reduces the probability of wrong words appearing in the speech recognition result and improves the accuracy of the final speech recognition result.

In one embodiment, retrieving the entity words includes: determining the field of the speech recognition result according to the field of the candidate word sequences; and retrieving each entity word in the database corresponding to the field of the speech recognition result.

Decoding the voice data yields multiple word sequences. Several of them are selected as candidate word sequences, and one of the candidate word sequences is selected as the speech recognition result of the voice data, so the field of the speech recognition result can be determined from the field of the candidate word sequences. Once the field of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that field.

Retrieving the entity words according to the field of the speech recognition result not only reduces the amount of data to be searched but also improves the accuracy of the retrieval results. The repair of the entity words is therefore more accurate, which improves both the efficiency and the accuracy of correcting the speech recognition result.

In one embodiment, as shown in Fig. 6, the method further includes the following steps after step 214:

Step 602: perform entity extraction on the speech recognition result to obtain entity words.

Step 604: determine the field of the speech recognition result according to the field of the candidate word sequences.

Step 606: retrieve each entity word in the database corresponding to the field of the speech recognition result.

Step 608: when a retrieval result is inconsistent with an entity word, repair the entity word.

After the candidate word sequence with the highest final score has been selected from the multiple candidate word sequences as the speech recognition result of the voice data, the entities of the speech recognition result can be extracted. An entity is a word that plays a major role in the sentence forming the speech recognition result. For example, in "I want to go to Window on the World", "Window on the World" is the focus of the sentence: "I want to go to" only expresses the speaker's intention, while the place that finally has to be acted on is "Window on the World". Before the speech recognition result of the voice data is determined, the field of each candidate word sequence that might become the speech recognition result has already been determined; that is, the field of the candidate word sequences has been identified. There may be multiple candidate word sequences, but all of them generally correspond to a single field. Since the speech recognition result is the candidate word sequence with the highest final score, its field can be determined from the field of the candidate word sequences.

Once the field of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that field; that is, each field has its own database. For example, suppose the speech recognition result is "I want to go to emperor mansion". Entity extraction on this sentence yields the entity word "emperor mansion", and the field of the speech recognition result is the "navigation" field. The entity word "emperor mansion" is therefore retrieved in the database of the "navigation" field, that is, among the corresponding place names. If the retrieval result is "DiWang Building", the retrieval result is inconsistent with the entity word, so the entity word "emperor mansion" is replaced with the retrieval result "DiWang Building". The entity word in the speech recognition result has thus been repaired, giving the repaired speech recognition result "I want to go to DiWang Building". Entity repair can correct wrong words and similar errors in the speech recognition result and so produce a more accurate result.
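The retrieve-and-replace flow of steps 602 to 608 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the per-field database, the pronunciation keys and the function names `retrieve` and `repair_entities` are all hypothetical, and entity extraction is assumed to have happened already.

```python
# Toy per-field databases keyed by a pronunciation string (hypothetical data).
FIELD_DATABASES = {
    "navigation": {"di wang da sha": "DiWang Building"},
}

# Hypothetical pronunciation lookup for recognized entity words.
PRONUNCIATIONS = {
    "emperor mansion": "di wang da sha",
    "DiWang Building": "di wang da sha",
}


def retrieve(entity, field):
    """Return the database entry sharing the entity's pronunciation, or None."""
    key = PRONUNCIATIONS.get(entity)
    return FIELD_DATABASES.get(field, {}).get(key)


def repair_entities(result, entities, field):
    """Replace each entity word whose retrieval result differs from it."""
    for entity in entities:
        found = retrieve(entity, field)
        if found is not None and found != entity:
            result = result.replace(entity, found)
    return result


repaired = repair_entities(
    "I want to go to emperor mansion", ["emperor mansion"], "navigation"
)
print(repaired)
```

Restricting the lookup to the database of one field is what keeps the search small; an entity with no match in that database is left unchanged in this sketch.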

As shown in Fig. 7, in one embodiment a speech recognition method is provided. This embodiment is described mainly with the method applied to the server 120 in Fig. 1 above. Referring to Fig. 7, the speech recognition method specifically includes the following steps:

Step 702: train the neural network to obtain a trained neural network.

Before the neural network is put into practical use, it needs to be trained in advance according to the actual project. Each field has a corresponding neural network, so the neural network of each field is trained separately using the text of that field, and the trained neural networks correspond one-to-one with the fields. As shown in Fig. 8, each neural network has its own training text. In practical use, the semantic classification model identifies the field of the candidate word sequences; once the field is determined, the candidate word sequences are input in turn into the neural network of the corresponding field for rescoring. The fields of the neural networks therefore correspond to the fields of the semantic classification model. Accordingly, as shown in Fig. 8, the semantic classification model can be trained together with the neural networks, so that the fields of the neural networks match the classification fields of the semantic classification model. The neural network is mainly used to recognize speech and the semantic classification model is mainly used to analyze the semantics of the voice data, so the neural network can be regarded as belonging to the recognition side and the semantic classification model to the semantic side.

When the neural networks are trained, the text corresponding to each field is obtained and each word in each text is converted into a word vector. The word vectors serve as input data: following the order of the words in a sentence, the word vector of the word after the current input word is used as the output data for the current input word, so that each output word vector is related to the word vector input at the previous moment and to the word vectors input at all earlier moments. While the word vectors of the input words are used to train the neural network, its parameters are adjusted continuously until suitable values are reached, at which point training is complete and the trained neural network is obtained. After the neural network of each field has been trained separately, if the neural network of some field needs to be updated during practical use, only the neural network of that field has to be updated. For example, when only field n needs to be updated, only the grey parts shown in Fig. 8 need updating: the neural network corresponding to field n is retrained, and the semantic classification model of the corresponding field is retrained on the semantic side at the same time.
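The way training data is laid out per field (each word as input, the following word as target) can be sketched as below. This is only an illustration under stated assumptions: integer word ids stand in for word vectors, the field texts are made up, and `build_vocab` and `next_word_pairs` are hypothetical helper names. No actual network is trained here.

```python
def build_vocab(sentences):
    """Assign an integer id to every distinct word (a stand-in for word vectors)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def next_word_pairs(sentence, vocab):
    """Pair each input word with the word that follows it in the sentence."""
    ids = [vocab[word] for word in sentence]
    return list(zip(ids[:-1], ids[1:]))


# Hypothetical training text, kept separate for each field.
FIELD_TEXTS = {
    "music": [["play", "a", "song"]],
    "navigation": [["go", "to", "the", "airport"]],
}

# One independent training set per field, so one field can be rebuilt alone.
training_sets = {}
for field, sentences in FIELD_TEXTS.items():
    vocab = build_vocab(sentences)
    training_sets[field] = [
        pair for sentence in sentences for pair in next_word_pairs(sentence, vocab)
    ]

print(training_sets["music"])
```

Because each field owns its own training set, rebuilding the entry for one field leaves the others untouched, which mirrors the per-field update described above.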

The neural network may be a recurrent neural network, also called a recurrent network language model. The advantage of the recurrent network language model is its long memory effect: the word vector of each output word is jointly influenced by the word vectors of all previously input words. The output data corresponding to the word vector of the currently input word is the word vector of the next word. That is, the output data is the word vector of the word following each input word, but for each output the input data is not merely the word vector of the most recently input word; it is that word vector together with the word vectors of all words input before it. Each output word is thus related to the word input at the previous moment and to the words input at all earlier moments. The output data corresponding to each input word vector is set in this way, and the recurrent network language model is trained accordingly, so that a trained neural network is obtained and put into practical use.

Step 704: obtain multiple word sequences produced by decoding the voice data, and the first score corresponding to each word sequence.

Speech decoding can be performed by a front end, which may be a terminal. Decoding the voice data yields multiple word sequences and the first score corresponding to each word sequence. Specifically, as shown in Fig. 9, after obtaining the input voice data the front end performs feature extraction on it, and then uses the trained acoustic model, language model and pronunciation dictionary to search and decode the extracted features, producing multiple word sequences. The acoustic model is trained on a speech training set and the language model on a text training set; once trained, the acoustic model and the language model can be put into practical use.

The word sequences can be regarded as multiple words and multiple paths, also called a lattice. A lattice is essentially a directed acyclic graph: each node in the graph represents the end time point of a word, and each edge represents a possible word together with the acoustic score and language model score of that word. When the result of speech recognition is represented, each node stores the recognition result at the current position, including information such as the acoustic probability and the language probability. As shown in Fig. 10, starting from the leftmost start position <s> and following different arcs to the final </s> yields different word sequences, and the combination of the probabilities stored on the arcs gives the probability (score) that the input speech corresponds to a given passage of text. For example, as shown in Fig. 10, "Beijing welcomes you" and "background, which is changed, reflects you" can each be regarded as one path of the recognition result; that is, each of them is a word sequence. Every path in the graph has a probability, from which the score of the path, i.e. the first score, can be calculated, so each word sequence has a corresponding first score.
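The lattice described above can be illustrated with a small directed acyclic graph whose arcs carry words and probabilities. The sketch below is a toy, not the decoder's real data structure: the arc probabilities are invented, acoustic and language scores are merged into one number per arc, and path scores are combined as summed log-probabilities, which is one common convention rather than something the text specifies.

```python
import math

# Each arc is (next_node, word, probability). Nodes are time points;
# "<s>" and "</s>" mark the start and end positions. Values are made up.
LATTICE = {
    "<s>": [("t1", "Beijing", 0.6), ("t1", "background", 0.4)],
    "t1": [("</s>", "welcomes you", 0.7), ("</s>", "reflects you", 0.3)],
    "</s>": [],
}


def enumerate_paths(lattice, node="<s>", words=(), log_prob=0.0):
    """Yield (word list, log-probability) for every path from <s> to </s>."""
    if node == "</s>":
        yield list(words), log_prob
        return
    for next_node, word, prob in lattice[node]:
        yield from enumerate_paths(
            lattice, next_node, words + (word,), log_prob + math.log(prob)
        )


paths = list(enumerate_paths(LATTICE))
best_words, best_log_prob = max(paths, key=lambda item: item[1])
print(best_words, math.exp(best_log_prob))
```

Exhaustive enumeration is only workable for a toy graph; it is used here purely to show that each path is one word sequence with one score.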

Step 706: extract from the word sequences a preset number of top-ranked word sequences by first score as candidate word sequences.

After the multiple word sequences are obtained, a preset number of the word sequences with the highest first scores can be extracted as candidate word sequences. The optimal path calculated from the lattice, i.e. the path with the highest probability, does not necessarily match the words actually spoken, so a preset number of top-ranked word sequences by first score are extracted as candidate word sequences; this is also called n-best. The preset number can be set by the technician according to actual needs.
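Selecting the n-best candidates amounts to keeping the top n word sequences by first score. A minimal sketch follows; the first three scores are the ones quoted for Fig. 12, the fourth entry is a made-up low-scoring sequence added for illustration, and `n_best` is a hypothetical helper name.

```python
import heapq


def n_best(scored_sequences, n):
    """Return the n (word sequence, first score) pairs with the highest scores."""
    return heapq.nlargest(n, scored_sequences, key=lambda item: item[1])


word_sequences = [
    ("play Zhou Jielun's nunchakus", 94.7456),
    ("play Zhou Jielun's double-cut stick", 95.3976),
    ("play all conclusions' double-cut stick", 94.7951),
    ("a made-up low-scoring sequence", 80.0),  # hypothetical extra entry
]

candidates = n_best(word_sequences, 3)
for text, score in candidates:
    print(score, text)
```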

Step 708: input each candidate word sequence into the semantic classification model.

Step 710: classify each candidate word sequence with the semantic classification model to obtain the classification label corresponding to each candidate word sequence.

Step 712: take the field corresponding to the classification label with the largest share among the classification labels as the field of the candidate word sequences.

Step 714: according to the field of the candidate word sequences, input the candidate word sequences into the trained neural network of the corresponding field.

Step 716: rescore the candidate word sequences with the trained neural network to obtain the second score corresponding to each candidate word sequence.

Step 718: compute a weighted sum of the first score and the second score of each candidate word sequence to obtain the final score of the candidate word sequence.

Step 720: take the candidate word sequence with the highest final score as the speech recognition result of the voice data.

After the preset number of word sequences have been extracted as candidate word sequences, each candidate word sequence can be input into the trained semantic classification model. The trained semantic classification model classifies each candidate word sequence and outputs a classification label for it, and the field of each candidate word sequence can be determined from its classification label. However, the field classification results that the trained semantic classification model gives for the individual candidate word sequences may differ. To determine the field of the candidate word sequences, the classification labels of all candidate word sequences are collected, the label with the largest share is taken as the classification label of all the candidate word sequences, and the field corresponding to that label is taken as the field of all the candidate word sequences. The multiple candidate word sequences therefore end up with one and the same field; in general, the candidate word sequences of a single piece of voice data do not belong to several fields.
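Choosing the label with the largest share is a majority vote over the per-candidate labels. A minimal sketch, assuming the labels have already been produced by some classifier; the label values and the helper name `majority_field` are hypothetical, and the text does not say how ties are broken (here the label seen first wins).

```python
from collections import Counter


def majority_field(labels):
    """Return the label that occurs most often among the candidate labels."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical classifier output for three candidate word sequences.
candidate_labels = ["music", "music", "navigation"]
field = majority_field(candidate_labels)
print(field)
```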

After the multiple candidate word sequences have been extracted, i.e. after the n-best extraction step, field identification can be performed on the candidate word sequences using the semantic classification model. As shown in Fig. 11, after the n-best extraction the semantic classification model identifies the field of the candidate word sequences, and the candidate word sequences are input into the neural network of the corresponding field. Each field has its own neural network, i.e. the neural networks are divided by field, and the neural network may be an RNNLM. Compared with the NGRAM language model, the RNNLM is better at preserving long-term memory. The NGRAM language model is a language model commonly used in large-vocabulary continuous speech recognition. It is based on the assumption that the appearance of the N-th word is related only to the preceding N-1 words and to no other words, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting, in a corpus, the number of times the N words occur together.

The RNNLM is different. For example, "I" and "mood" may be followed by "good" or by "bad"; the appearance of these words depends on the "I" and "mood" that came before, which is the "memory effect". A traditional language model such as the NGRAM language model depends only on the preceding N-1 words when calculating the probability of a word; N is generally set to at most 5, and words further back are ignored. This is not entirely reasonable, because every earlier word can influence the current word. Take the sentence "Liu Dehua / is / an / outstanding / actor / and / singer / with / many / classic / works / among them / one of / his": if the earlier context is considered, the probability that the next word is "song" is bound to be higher than that of "brother", whereas if only the previous 3 or 4 words are considered and the earlier text is ignored, it cannot be determined whether the word here is "song" or "brother". The RNNLM has a good memory effect: compared with a conventional language model, it can take into account the very long text input before and calculate the probability of the next word from that text.
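The count-based N-gram estimate mentioned above can be made concrete with N = 2. The sketch below is illustrative only: the three-sentence corpus is invented, no smoothing is applied, and the sentence probability ignores the probability of the first word. `bigram_probability` and `sentence_probability` are hypothetical names.

```python
from collections import Counter


def bigram_probability(corpus, prev_word, word):
    """Estimate P(word | prev_word) by counting adjacent pairs in the corpus."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        for first, second in zip(sentence[:-1], sentence[1:]):
            pair_counts[(first, second)] += 1
            context_counts[first] += 1
    if context_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / context_counts[prev_word]


def sentence_probability(corpus, sentence):
    """Multiply the bigram probabilities of each adjacent word pair."""
    probability = 1.0
    for first, second in zip(sentence[:-1], sentence[1:]):
        probability *= bigram_probability(corpus, first, second)
    return probability


CORPUS = [
    ["my", "mood", "is", "good"],
    ["my", "mood", "is", "bad"],
    ["my", "mood", "is", "good"],
]

p_good = bigram_probability(CORPUS, "is", "good")
p_sentence = sentence_probability(CORPUS, ["my", "mood", "is", "bad"])
print(p_good, p_sentence)
```

With N = 2 the model sees only the immediately preceding word, so anything earlier in the sentence cannot affect the estimate; this is the limitation that the RNNLM is meant to avoid.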

After the candidate word sequences are input into the neural network of the corresponding field, the neural network rescores them to obtain the second score of each candidate word sequence, from which the speech recognition result of the voice data is finally obtained. Because the language model that generates the lattice cannot be accurate enough, a larger language model is needed to readjust the scores of the n-best list; this process is called rescoring.

In the example shown in Fig. 12, the candidate word sequences and their first scores are: "play Zhou Jielun's nunchakus" 94.7456, "play Zhou Jielun's double-cut stick" 95.3976, "play all conclusions' double-cut stick" 94.7951, and so on. These candidate word sequences are all input into the semantic classification model for classification. The semantic classification model identifies them as belonging to the music field, so the candidate word sequences are input into the neural network of the music field for rescoring, which yields the second score of each candidate word sequence: "play Zhou Jielun's nunchakus" 93.3381, "play Zhou Jielun's double-cut stick" 95.0925, "play all conclusions' double-cut stick" 94.1557, and so on.

After the first score and the second score of each candidate word sequence are obtained, a weighted sum of the two scores gives the final score of each candidate word sequence. Specifically, final score of a candidate word sequence = first score * first weight + second score * second weight. The first weight and the second weight may or may not be equal; they represent the proportions given to the first score and the second score respectively. When the first weight is greater than the second weight, the technician has raised the proportion of the first score, meaning that the technician considers the first score to have the greater influence on the speech recognition result. The first weight and the second weight can be determined from a large amount of test data. For example, using a large number of candidate word sequences, the values of the two weights are adjusted repeatedly until the candidate word sequence with the highest final score calculated from them is indeed the most accurate speech recognition result. The actual values of the first weight and the second weight are then fixed and used as the calculation parameters of the final score in subsequent practical use.
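The weight adjustment described above can be pictured as a simple search over weight values against labeled test utterances. The sketch below is one possible reading, not the procedure the text prescribes: it assumes the two weights sum to 1, steps the first weight in increments of 0.1, and uses two invented utterances with invented scores. `pick` and `tune_weights` are hypothetical names.

```python
def pick(candidates, w1, w2):
    """Return the text of the candidate with the highest weighted final score."""
    return max(candidates, key=lambda c: c[1] * w1 + c[2] * w2)[0]


def tune_weights(dataset, steps=10):
    """Try w1 = 0, 1/steps, ..., 1 (with w2 = 1 - w1) and keep the most accurate."""
    best_w1, best_correct = 0.0, -1
    for i in range(steps + 1):
        w1 = i / steps
        w2 = 1 - w1
        correct = sum(pick(cands, w1, w2) == truth for cands, truth in dataset)
        if correct > best_correct:  # strict: the first best setting is kept
            best_w1, best_correct = w1, correct
    return best_w1, 1 - best_w1, best_correct / len(dataset)


# Hypothetical test data: (text, first score, second score) lists with the
# known correct transcript for each utterance.
DATASET = [
    ([("a", 95.0, 90.0), ("b", 94.0, 96.0)], "b"),
    ([("c", 96.0, 93.0), ("d", 92.0, 94.5)], "c"),
]

w1, w2, accuracy = tune_weights(DATASET)
print(w1, w2, accuracy)
```

In practice the search would run over far more utterances, and nothing in the text requires the weights to sum to 1; that constraint is used here only to reduce the search to one dimension.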

For example, when the first weight is 0.4 and the second weight is 0.6, the final score of the candidate word sequence "play Zhou Jielun's double-cut stick" is 95.3976*0.4 + 95.0925*0.6 = 95.21454. After the final scores of all candidate word sequences have been calculated, the candidate word sequence "play Zhou Jielun's double-cut stick" has the highest score, so it is taken as the speech recognition result. After the final score is calculated, a further calculation step can be added, for example taking the logarithm of the final score as the selection score. For the candidate word sequence "play Zhou Jielun's double-cut stick" above, whose final score is 95.21454, the logarithm of 95.21454 is taken, for example log₁₀95.21454. After the logarithm of the final score of each candidate word sequence has been calculated, it is used as the selection score of that candidate word sequence, and the candidate word sequence with the highest selection score is chosen as the speech recognition result. The calculation can take many forms, and other calculations may also be used; the specific choice is left to the technician.
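The arithmetic of this example can be reproduced directly. The sketch below uses the first and second scores quoted for Fig. 12 and the weights 0.4 and 0.6 from the text; the function name `final_score` and the data layout are illustrative.

```python
import math

# (candidate text, first score, second score), as quoted for Fig. 12.
CANDIDATES = [
    ("play Zhou Jielun's nunchakus", 94.7456, 93.3381),
    ("play Zhou Jielun's double-cut stick", 95.3976, 95.0925),
    ("play all conclusions' double-cut stick", 94.7951, 94.1557),
]


def final_score(first, second, w1=0.4, w2=0.6):
    """Weighted sum of the first score and the second score."""
    return first * w1 + second * w2


final_scores = {text: final_score(first, second) for text, first, second in CANDIDATES}
best = max(final_scores, key=final_scores.get)

# Optional extra step from the text: use log10 of the final score for selection.
best_by_log = max(final_scores, key=lambda text: math.log10(final_scores[text]))

print(best, round(final_scores[best], 5))
```

Because the logarithm is monotonically increasing, ranking by the logarithm of the final score selects the same candidate as ranking by the final score itself.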

Step 722: perform entity extraction on the speech recognition result to obtain entity words.

Step 724: determine the field of the speech recognition result according to the field of the candidate word sequences.

Step 726: retrieve each entity word in the database corresponding to the field of the speech recognition result.

Step 728: when a retrieval result is inconsistent with an entity word, repair the entity word.

In speech recognition, recognition errors are generally concentrated in the particular entities of a field. To address this problem, entity repair can be performed on the speech recognition result once the speech recognition result of the voice data has been obtained. Entity extraction is first performed on the speech recognition result. An entity is a word that plays a major role in the sentence forming the speech recognition result. For example, in "I want to go to emperor mansion", "emperor mansion" is the focus of the sentence: "I want to go to" only expresses the speaker's intention, while the place that finally has to be acted on is "emperor mansion".

When entity extraction is performed on the speech recognition result, several entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved. Decoding the voice data yields multiple word sequences, several of which are selected as candidate word sequences, and one candidate word sequence is selected as the speech recognition result of the voice data, so the field of the speech recognition result can be determined from the field of the candidate word sequences. Once the field of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that field. When a retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; in this way the entity word is repaired. For example, when the speech recognition result is "I want to go to emperor mansion", the entity word "emperor mansion" is extracted and then retrieved according to the navigation field to which the speech recognition result belongs, that is, against geographical location information. The retrieval shows that the actual result should be "DiWang Building", so the whole sentence is corrected to "I want to go to DiWang Building". The entity repair function can thus repair specific recognition errors quickly and accurately.

The speech recognition method in this embodiment adopts a field-specific rescoring mechanism, and rescoring after the field has been determined significantly improves recognition accuracy. Experiments show a relative improvement in speech recognition accuracy of 4% to 16% in different fields, and an overall improvement in recognition accuracy of 6%. In addition, because each field has its own neural network, the cost of training the neural networks is significantly reduced. Traditional speech recognition schemes need a massive amount of text to train a single language model. To guarantee the rescoring effect, that language model is very large, its training cycle is very long, and its overall effect is difficult to evaluate; training an RNNLM on 8.2G of text takes about 100 hours. After the fields are separated, the training time of the model for a single field can be shortened to within 24 hours, and the time needed to update the model of a specific field is greatly reduced. Moreover, the language model in the traditional technology cannot cover all data comprehensively. Once a problem is found in the language model, the entire language model has to be updated, and each update incurs a huge cost. In this embodiment, only the neural network of the field that needs to be handled or updated has to be retrained, which reduces the training cost and shortens the time required to update the neural network.

Fig. 2 to Fig. 12 are flow diagrams of the speech recognition method in the respective embodiments. It should be understood that although the steps in the flowcharts are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each figure may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another; they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in Fig. 13, a speech recognition apparatus is provided, including:

A word sequence acquisition module 1302, configured to obtain multiple word sequences produced by decoding voice data, and the first score corresponding to each word sequence.

An extraction module 1304, configured to extract from the word sequences a preset number of top-ranked word sequences by first score as candidate word sequences.

A field identification module 1306, configured to identify the field of the candidate word sequences.

A rescoring module 1308, configured to input the candidate word sequences into the neural network of the corresponding field according to the field of the candidate word sequences, and to rescore the candidate word sequences with the neural network to obtain the second score corresponding to each candidate word sequence.

A speech recognition result determining module 1310, configured to obtain the final score of each candidate word sequence from its first score and second score, and to take the candidate word sequence with the highest final score as the speech recognition result of the voice data.

In one embodiment, as shown in Fig. 14, the field identification module 1306 includes:

An input module 1306A, configured to input each candidate word sequence into the trained semantic classification model.

A classification module 1306B, configured to classify each candidate word sequence with the semantic classification model to obtain the classification label corresponding to each candidate word sequence, and to take the field corresponding to the classification label with the largest share among the classification labels as the field of the candidate word sequences.

In one embodiment, the apparatus further includes a training module (not shown), configured to obtain the text corresponding to each field, convert each word in the text into a word vector, and train the neural network corresponding to each field with the word vectors as input.

In one embodiment, the training module is further configured to take, in the order of the words in the text, the word vector of each word in the text as input and the word vector of the word following each input word as output, and to train the neural network by adjusting its parameters.

In one embodiment, the rescoring module 1308 is further configured to compute a weighted sum of the first score and the second score of each candidate word sequence to obtain the final score of the candidate word sequence.

In one embodiment, the neural network is a recurrent neural network.

In one embodiment, the apparatus further includes an entity repair module (not shown), configured to perform entity extraction on the speech recognition result to obtain multiple entity words, retrieve the entity words, and repair an entity word when the retrieval result is inconsistent with it.

In one embodiment, the entity repair module is further configured to determine the field of the speech recognition result according to the field of the candidate word sequences, and to retrieve each entity word in the database corresponding to the field of the speech recognition result.

Fig. 15 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in Fig. 1. As shown in Fig. 15, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech recognition method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech recognition method.

Those skilled in the art will understand that the structure shown in Fig. 15 is only a block diagram of the part of the structure relevant to the solution of the present application, and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the speech recognition apparatus provided by the present application can be implemented in the form of a computer program, and the computer program can run on a computer device such as that shown in Fig. 15. The memory of the computer device can store the program modules that make up the speech recognition apparatus, for example the word sequence acquisition module, extraction module, field identification module, rescoring module and speech recognition result determining module shown in Fig. 13. The computer program formed by these program modules causes the processor to execute the steps of the speech recognition method of the embodiments of the present application described in this specification.

For example, the computer device shown in Fig. 15 can, through the word sequence acquisition module of the speech recognition apparatus shown in Fig. 13, obtain the multiple word sequences produced by decoding the voice data and the first score corresponding to each word sequence. Through the extraction module, the computer device can extract from the word sequences a preset number of top-ranked word sequences by first score as candidate word sequences. Through the field identification module, it can identify the field of the candidate word sequences. Through the rescoring module, it can input the candidate word sequences into the neural network of the corresponding field according to their field, and rescore them with the neural network to obtain the second score corresponding to each candidate word sequence. Through the speech recognition result determining module, it can obtain the final score of each candidate word sequence from its first score and second score, and take the candidate word sequence with the highest final score as the speech recognition result of the voice data.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, implements the steps of the speech recognition method provided in any embodiment of this application.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the steps of the speech recognition method provided in any embodiment of this application.

Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not every possible combination of the technical features in the above embodiments is described. However, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A speech recognition method, comprising:

identifying the domain of the candidate word sequences;

inputting the candidate word sequences into the neural network of the corresponding domain according to the domain of the candidate word sequences;

rescoring the candidate word sequences through the neural network to obtain a second score corresponding to each candidate word sequence;

obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to the candidate word sequence;

2. The method according to claim 1, characterized in that identifying the domain of the candidate word sequences comprises:

inputting each candidate word sequence into a semantic classification model;

classifying each candidate word sequence through the semantic classification model to obtain a classification label corresponding to each candidate word sequence;

taking the domain corresponding to the classification label with the largest share among the classification labels as the domain of the candidate word sequences.

3. The method according to claim 1, characterized in that the method further comprises:

obtaining text corresponding to each domain;

converting each word in the text into a word vector;

training the neural network corresponding to each domain with the word vectors as input.

4. The method according to claim 3, characterized in that training the neural network corresponding to each domain with the word vectors as input comprises:

following the order of the words in the text, taking the word vector corresponding to each word in the text as input and the word vector corresponding to the word that follows each input word as output, so as to adjust the parameters of the neural network and thereby train the neural network.
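The input/output pairing described in this claim can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `next_word_pairs` and the 2-dimensional word vectors are hypothetical, and the sketch only builds the training pairs; the network and its parameter updates are not shown.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def next_word_pairs(words: List[str], embed: Dict[str, Vector]) -> List[Tuple[Vector, Vector]]:
    """Pair each word's vector (input) with the next word's vector (target), in text order."""
    vectors = [embed[w] for w in words]
    return list(zip(vectors[:-1], vectors[1:]))


# Hypothetical 2-dimensional word vectors for a tiny music-domain text.
embed = {"play": [1.0, 0.0], "a": [0.0, 1.0], "song": [1.0, 1.0]}
pairs = next_word_pairs(["play", "a", "song"], embed)
```

A text of three words yields two pairs, since the last word has no successor to serve as a target.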

5. The method according to claim 1, characterized in that obtaining the final score of each candidate word sequence according to the first score and the second score corresponding to the candidate word sequence comprises:

computing a weighted sum of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
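As an illustration of the weighted summation, the final score of a candidate word sequence $W$ can be written as below. The symbols $\alpha$ and $\beta$ are assumed names for the two weights; this claim does not specify their values.

```latex
S_{\text{final}}(W) = \alpha \, S_1(W) + \beta \, S_2(W)
```

For example, with assumed weights $\alpha = 0.4$ and $\beta = 0.6$, a candidate with first score $S_1 = 0.60$ and second score $S_2 = 0.90$ would receive $S_{\text{final}} = 0.4 \times 0.60 + 0.6 \times 0.90 = 0.78$.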

6. The method according to claim 1, characterized in that the neural network is a recurrent neural network.

7. The method according to claim 1, characterized in that, after taking the candidate word sequence with the highest final score as the speech recognition result of the voice data, the method further comprises:

performing entity extraction on the speech recognition result to obtain entity words;

retrieving the entity words;

repairing an entity word when the retrieval result is inconsistent with the entity word.

8. The method according to claim 7, characterized in that retrieving the entity words comprises:

determining the domain of the speech recognition result according to the domain of the candidate word sequences;

retrieving each entity word in the database corresponding to the domain of the speech recognition result.
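The retrieval and repair steps of claims 7 and 8 can be sketched as follows. This is only an illustration: the claims do not state how retrieval is performed, so approximate string matching with Python's `difflib.get_close_matches` is swapped in as an assumed retrieval method, and the function name, the `cutoff` value, and the toy database are hypothetical. Entity extraction is assumed to have been done already.

```python
import difflib
from typing import Dict, List


def repair_entities(
    entities: List[str],
    domain: str,
    databases: Dict[str, List[str]],
    cutoff: float = 0.6,  # assumed similarity threshold
) -> List[str]:
    """Look up each entity word in the database of the given domain and repair mismatches."""
    repaired = []
    for word in entities:
        # Retrieval: closest database entry, if any is similar enough.
        matches = difflib.get_close_matches(word, databases[domain], n=1, cutoff=cutoff)
        if matches and matches[0] != word:
            repaired.append(matches[0])  # retrieval result differs: repair the entity word
        else:
            repaired.append(word)  # consistent, or nothing retrieved: keep as recognized
    return repaired


# Hypothetical music-domain database and extracted entity words.
databases = {"music": ["jay chou", "faye wong"]}
fixed = repair_entities(["jay chow", "faye wong"], "music", databases)
```

Restricting the lookup to the database of the identified domain keeps the search space small, which is the point of determining the domain before retrieval.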

9. A speech recognition apparatus, characterized in that the apparatus comprises:

a word sequence acquisition module, configured to obtain multiple word sequences produced by decoding voice data, and a first score corresponding to each word sequence;

an extraction module, configured to extract, from the word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;

a rescoring module, configured to input the candidate word sequences into the neural network of the corresponding domain according to the domain of the candidate word sequences, and to rescore the candidate word sequences through the neural network to obtain a second score corresponding to each candidate word sequence;

a speech recognition result determination module, configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to the candidate word sequence, and to take the candidate word sequence with the highest final score as the speech recognition result of the voice data.

10. The apparatus according to claim 9, characterized in that the domain identification module comprises:

an input module, configured to input each candidate word sequence into a trained semantic classification model;

a classification module, configured to classify each candidate word sequence through the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to take the domain corresponding to the classification label with the largest share among the classification labels as the domain of the candidate word sequences.

11. The apparatus according to claim 9, characterized in that the apparatus further comprises a training module, configured to obtain text corresponding to each domain, convert each word in the text into a word vector, and train the neural network corresponding to each domain with the word vectors as input.

12. The apparatus according to claim 11, characterized in that the training module is further configured to, following the order of the words in the text, take the word vector corresponding to each word in the text as input and the word vector corresponding to the word that follows each input word as output, so as to adjust the parameters of the neural network and thereby train the neural network.

13. The apparatus according to claim 9, characterized in that the apparatus further comprises an entity repair module, configured to perform entity extraction on the speech recognition result to obtain multiple entity words, retrieve the entity words, and repair an entity word when the retrieval result is inconsistent with the entity word.

14. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the steps of the method according to any one of claims 1 to 8.

15. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the steps of the method according to any one of claims 1 to 8.