WO2019218818A1 - Speech recognition method and apparatus, and computer readable storage medium and computer device

Info

Publication number
WO2019218818A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate word
score
word sequence
word
neural network
Application number
PCT/CN2019/082300
Other languages
French (fr)
Chinese (zh)
Inventor
刘毅 (Liu Yi)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2019218818A1 publication Critical patent/WO2019218818A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, computer readable storage medium, and computer device.
  • the technical solution of speech recognition is generally divided into two parts: front-end decoding and back-end processing.
  • the front end is mainly responsible for receiving input speech data and decoding the speech data to obtain multiple possible sentences, and the back end determines one of the multiple possible sentences obtained by the front end as the final speech recognition result.
  • the back end can input multiple possible sentences into the neural network to determine the final speech recognition result.
  • however, training the neural network that can eventually be put into use requires a large amount of text and takes a long time, so this speech recognition scheme is less efficient.
  • a speech recognition method is applied to a computer device, and the method includes: acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences; identifying the domain to which the extracted candidate word sequences belong; inputting each candidate word sequence into a neural network of the corresponding domain according to the identified domain; re-scoring each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence; and obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • a speech recognition device comprising:
  • a word sequence obtaining module configured to acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence
  • an extraction module configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
  • a domain identification module configured to identify the domain to which the extracted candidate word sequences belong;
  • a re-scoring module configured to input each candidate word sequence into a neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence;
  • a speech recognition result determining module configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the operations of the speech recognition method described above;
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the operations of the speech recognition method described above;
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • the above speech recognition method, apparatus, computer readable storage medium and computer device acquire a plurality of word sequences obtained by decoding the speech data, together with the corresponding first scores, and select a plurality of candidate word sequences from the word sequences.
  • each candidate word sequence can be input into the neural network corresponding to its domain, and after the second score obtained by re-scoring each candidate word sequence with that neural network is acquired, the final score of each candidate word sequence can be determined according to the first score and the second score; the candidate word sequence with the highest final score can then be selected as the speech recognition result of the speech data.
  • this speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so the candidate word sequences can be re-scored using the neural network corresponding to the domain to which they belong, and the second score obtained in this way is more accurate. Moreover, since each domain has its own corresponding neural network, each domain trains its neural network with its own text rather than with all the text, so the training time for a domain-specific neural network is greatly shortened, further improving the efficiency of speech recognition.
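  • To make the flow above concrete, here is a minimal, runnable sketch of the pipeline in Python. All names and toy values (decode_to_word_sequences, classify_domain, DOMAIN_RNNLMS, the 0.5 weights) are illustrative assumptions, not components defined by the patent.

```python
from collections import Counter

# Toy stand-ins; a real system would use a decoder, a trained semantic
# classifier, and trained domain RNNLMs.
def decode_to_word_sequences(speech_data):
    return [(["play", "song", "A"], 95.4), (["play", "song", "B"], 94.7)]

def classify_domain(candidates):
    labels = ["music" for _ in candidates]        # pretend classifier output
    return Counter(labels).most_common(1)[0][0]   # majority label -> domain

class ToyRNNLM:
    def score(self, words):
        return 90.0 + len(words)                  # placeholder second score

DOMAIN_RNNLMS = {"music": ToyRNNLM(), "navigation": ToyRNNLM()}

def recognize(speech_data, n_best=10, w1=0.5, w2=0.5):
    sequences = decode_to_word_sequences(speech_data)                          # decode
    candidates = sorted(sequences, key=lambda s: s[1], reverse=True)[:n_best]  # n-best
    domain = classify_domain([w for w, _ in candidates])                       # domain ID
    rnnlm = DOMAIN_RNNLMS[domain]                                              # pick network
    rescored = [(w, s1, rnnlm.score(w)) for w, s1 in candidates]               # second score
    best = max(rescored, key=lambda c: w1 * c[1] + w2 * c[2])                  # final score
    return " ".join(best[0])

print(recognize(b"raw audio bytes"))
```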
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment
  • FIG. 2 is a schematic flow chart of a voice recognition method in an embodiment
  • FIG. 3 is a schematic flow chart of step 206 in an embodiment
  • FIG. 4 is a schematic flow chart of a training step of a neural network in an embodiment
  • FIG. 5 is a schematic diagram of input data and output data of a neural network in an embodiment
  • FIG. 6 is a schematic flow chart after step 214 in an embodiment
  • FIG. 7 is a schematic flow chart of a voice recognition method in another embodiment
  • FIG. 8 is a schematic diagram of a neural network training process corresponding to each field in an embodiment
  • FIG. 9 is a schematic flow chart of decoding a voice data by a front end in an embodiment
  • Figure 10 is a schematic illustration of a sequence of words in one embodiment
  • FIG. 11 is a schematic flow chart of re-scoring a candidate word sequence by a semantic classification model in an embodiment
  • FIG. 12 is a schematic flow chart of re-scoring a candidate word sequence by a semantic classification model in another embodiment
  • Figure 13 is a block diagram showing the structure of a voice recognition apparatus in an embodiment
  • FIG. 14 is a structural block diagram of a domain identification module in another embodiment
  • Figure 15 is a block diagram showing the structure of a computer device in an embodiment.
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment.
  • the speech recognition method is applied to a speech recognition system.
  • the speech recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically include at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 2, the voice recognition method specifically includes the following steps:
  • Step 202 Acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the server can obtain a plurality of word sequences obtained by decoding the voice data, and the decoding operation can be performed by the terminal.
  • the terminal can use the trained acoustic model, the language model, and the pronunciation dictionary to decode the obtained speech data, so that a plurality of word sequences can be obtained.
  • the decoding operation can also be performed by the server.
  • a word sequence refers to a plurality of words obtained by decoding the voice data together with the paths connecting those words; the sentence obtained by connecting the corresponding words of a path in order, from its start position to its end position, can be regarded as a recognition result, that is, each word sequence can be regarded as a recognition result.
  • each path of each word sequence has a corresponding score, that is, the first score.
  • the first score can be thought of as a score calculated from the probability of occurrence of each path.
  • Step 204 Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • each word sequence has a corresponding first score, and the word sequences ranked highest by the first score can be extracted.
  • for example, the word sequences may be sorted from the highest first score to the lowest, so that the word sequence with the highest first score is ranked first.
  • extracting the top-ranked word sequences then means taking word sequences from the front of this ordering; the number of word sequences to extract can be preset, so that a preset number of word sequences with the highest first scores are extracted as candidate word sequences. The preset number can be adjusted at the technician's discretion.
  • alternatively, the word sequences may be sorted from the lowest first score to the highest, so that the word sequence with the highest first score is ranked last.
  • in that case, extracting the preset number of word sequences with the highest first scores means starting from the last position and extracting the preset number of word sequences in reverse order.
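  • A minimal sketch of this extraction step, assuming each decoded word sequence is a (words, first_score) pair; the data values are illustrative.

```python
import heapq

def extract_candidates(word_sequences, n_best):
    # Equivalent to sorting by first score from high to low and taking the
    # first n_best entries (or sorting low to high and taking the last ones).
    return heapq.nlargest(n_best, word_sequences, key=lambda ws: ws[1])

sequences = [(["play", "A"], 94.7456), (["play", "B"], 95.3976), (["play", "C"], 94.7795)]
print(extract_candidates(sequences, 2))  # the two highest-scoring sequences
```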
  • Step 206 Identify the domain in which the extracted candidate word sequence is located.
  • Step 208 Input each candidate word sequence into the neural network of the corresponding domain according to the identified domain.
  • the field in which the predetermined number of candidate word sequences are located can be identified.
  • the field can be customized by the technician, such as dividing the field into navigation, music, and so on.
  • there are multiple candidate word sequences, but the identified domain may be a single one, that is, the multiple candidate word sequences belong to the same domain. After the domain corresponding to the candidate word sequences is identified, the multiple candidate word sequences may each be input into the neural network of the corresponding domain.
  • that is, each domain has a corresponding neural network, and the neural network of each domain can re-score the candidate word sequences of its own domain in a targeted manner.
  • Step 210 Re-score each candidate word sequence by a neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 212 Obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
  • Step 214 the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • for example, the neural network of the "navigation" domain can re-score each input candidate word sequence to obtain the second score of each candidate word sequence.
  • re-scoring means that the neural network corresponding to the domain of the candidate word sequences performs a score calculation on each input candidate word sequence to obtain the second score of each candidate word sequence. The first score and the second score of each candidate word sequence are then combined in a certain manner; for example, the first score and the second score may each be weighted, and the weighted first score and second score added together. The final score of each candidate word sequence is thus obtained according to the first score and the second score, and the candidate word sequence with the highest final score can be used as the speech recognition result of the speech data.
  • each candidate word sequence may be input into the neural network of the corresponding domain, and after the second score obtained by re-scoring each candidate word sequence with that neural network is acquired, the final score of each candidate word sequence may be determined according to the first score and the second score; the candidate word sequence with the highest final score can then be selected as the speech recognition result of the speech data.
  • this speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so each candidate word sequence can be re-scored using the neural network corresponding to its domain, and the second score obtained in this way is more accurate. Moreover, each domain has its own corresponding neural network and trains it with its own text rather than with all the text, so the training time for a domain-specific neural network is greatly shortened, further improving the efficiency of speech recognition.
  • step 206 includes:
  • Step 302 Input each candidate word sequence into a semantic classification model.
  • Step 304 Classify each candidate word sequence by the semantic classification model, and obtain a classification label corresponding to each candidate word sequence.
  • Step 306 Acquire the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • each candidate word sequence can be input into the trained semantic classification model.
  • the trained semantic classification model refers to a semantic classification model trained in advance on a large number of training samples, and is used to identify the domain of the input candidate word sequences.
  • the trained semantic classification model can classify each candidate word sequence and output the classification label obtained for each candidate word sequence; the domain corresponding to each candidate word sequence can then be determined according to its classification label.
  • the trained semantic classification model may produce different classification results for different candidate word sequences, which means the model judges the candidate word sequences to belong to different domains.
  • in order to determine the domain of the candidate word sequences, the classification label of each candidate word sequence can be obtained, and the classification label with the largest proportion is used as the classification label corresponding to all the candidate word sequences; that is, the domain corresponding to the classification label with the largest proportion is the domain of all the candidate word sequences.
  • for example, the classification label of candidate word sequence A is 1, the classification label of candidate word sequence B is 1, the classification label of candidate word sequence C is 2, and the classification label of candidate word sequence D is 1, where 1 represents the navigation class and 2 represents the music class. Among the candidate word sequences A, B, C, and D, label 1 has the largest proportion, so label 1 can be used as the classification label of A, B, C, and D, and the domain of A, B, C, and D is considered to be the navigation class.
  • the domain of the extracted candidate word sequences is identified by the semantic classification model, and when the candidate word sequences diverge in domain, the domain corresponding to the classification label with the largest proportion is used as the domain of the candidate word sequences.
  • this ensures the accuracy of the candidate word sequences' domain and also improves the accuracy of the subsequent re-scoring of the candidate word sequences, thereby improving the accuracy of the speech recognition result.
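  • A minimal sketch of steps 302 to 306, with a stand-in classifier; predict_label and the label-to-domain mapping are illustrative assumptions standing in for a trained semantic classification model.

```python
from collections import Counter

def predict_label(candidate):                      # hypothetical stand-in classifier
    return 1 if "go" in candidate else 2

def identify_domain(candidates, label_to_domain):
    labels = [predict_label(c) for c in candidates]
    majority_label, _ = Counter(labels).most_common(1)[0]   # label with largest share
    return label_to_domain[majority_label]

domains = {1: "navigation", 2: "music"}
cands = [["I", "want", "to", "go"], ["I", "want", "to", "go", "too"], ["play", "music"]]
print(identify_domain(cands, domains))  # -> "navigation" (label 1 is the majority)
```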
  • in an embodiment, the voice recognition method further includes: acquiring the text corresponding to each domain; converting each word in the text corresponding to each domain into a word vector; and taking the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
  • before the neural networks are actually used, each neural network can be trained according to actual needs, and the trained neural networks can then be put into actual use to re-score the input candidate word sequences. Since each domain has its own corresponding neural network, the text corresponding to each domain needs to be obtained when training the neural network of that domain.
  • the text can be sentences. After the sentences of each domain are obtained, the sentences of a domain can be processed by word segmentation to obtain the multiple words of each sentence; the words contained in each sentence are converted into word vectors and used as the input data of the domain's neural network, and the neural network of each domain is trained in this way. After training, the trained neural network of each domain is obtained.
  • pre-training the neural network of each domain ensures the accuracy of re-scoring candidate word sequences when the neural networks are actually used, thereby improving both the accuracy and the efficiency of speech recognition.
  • taking the word vectors as input and training the neural network corresponding to each domain includes: for each domain, according to the order of the words in the domain's text, taking the word vector corresponding to each word as input and the word vector corresponding to the next word after each input word as output, and adjusting the parameters of the domain's neural network to train it.
  • the words contained in the text can be converted into word vectors, the word vectors corresponding to each domain's text are input into that domain's neural network, and the neural network is trained.
  • the word vector of each word may be input in the order of the words in the text, with the word vector corresponding to the next word after each input word as the output.
  • the input data corresponding to each output word is therefore not only the word input at the previous moment, but the words input at the previous moment and at all earlier moments.
  • the parameters of the neural network are adjusted to train it, so that a trained neural network corresponding to each domain is obtained, and the score obtained when the trained neural network re-scores candidate word sequences is reliable.
  • the voice recognition method further includes a training step of the neural network, including:
  • Step 402 Acquire text corresponding to each field.
  • Step 404 Convert each word in the text corresponding to each field into a word vector.
  • Step 406 For each domain, according to the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the neural network can be trained in advance, and the trained neural network can be obtained and then applied.
  • Each domain has its own corresponding neural network, that is, it needs to train the corresponding neural network in each field separately.
  • a word vector can be an N-dimensional vector, where N is a positive integer.
  • the text corresponding to each domain can be regarded as the sentences of that domain. A word segmentation tool can be used to split each sentence into words, that is, each sentence yields multiple words, and the words contained in each sentence can be converted into word vectors and input into the neural network of the corresponding domain.
  • the word vectors corresponding to the words contained in each sentence are used as the input of the neural network, and for each input word, the word vector of the next word in the sentence is used as the corresponding output.
  • the output data is thus the word vector corresponding to the next word after each input word; however, for each output, the input is not only the word vector of the most recently input word, but the word vectors of the most recently input word and of all the words input before it.
  • the input data and output data of the neural network are set, and the parameters of the neural network are continuously adjusted to train the neural network.
  • for example, sentence A is "I was late for school yesterday" (a Chinese sentence in the original example); the word segmentation tool can split the sentence into: I / yesterday / going to school / late / 了 (a sentence-final particle).
  • the word vectors corresponding to the words contained in sentence A are input into the neural network in order, where x1, x2, x3, x4, x5, and x6 are the word vectors of the input words, that is, the input data.
  • a blank input is added before the first word vector; that is, when the word vectors corresponding to the words of sentence A are input into the neural network, the word vector corresponding to "I" is not the first input but the second, and the first input is empty by default.
  • the word vector corresponding to the next word after each input word is used as the corresponding output data, that is, the output corresponding to the first, default empty input is the word vector of the word "I", and the output corresponding to the input word "I" is the word vector of the word "yesterday".
  • x1, x2, x3, x4, x5, and x6 thus correspond to the blank word, "I", "yesterday", "going to school", "late", and "了", and y1, y2, y3, y4, y5, and y6 are the output data corresponding to these input word vectors, whose words are "I", "yesterday", "going to school", "late", and "了"; that is, the output data corresponding to the word "going to school" is the word vector of its next word, "late".
  • for the output "yesterday", the input data consists only of the word "I" and the default blank word, but for the output "了", the input data is all the words input before it, that is, "I", "yesterday", "going to school", "late", and the default blank word. In other words, each output word vector is related to the input of the previous moment and the inputs of all earlier moments; that is, each output word is related to the word input at the previous moment and the words input at all earlier times.
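  • A minimal sketch of building the (input, output) training pairs just described: prepend a blank token and shift by one so that each input word predicts the next word. The token strings and the end-of-sentence marker for the last output (which the text leaves unspecified) are assumptions.

```python
def make_training_pairs(sentence_words, blank="<blank>", eos="<eos>"):
    inputs = [blank] + sentence_words      # x1..x6: blank, I, yesterday, ...
    outputs = sentence_words + [eos]       # y1..y6: I, yesterday, ..., end marker
    return list(zip(inputs, outputs))      # one (x_t, y_t) pair per time step

words = ["I", "yesterday", "going to school", "late", "了"]
for x, y in make_training_pairs(words):
    print(f"input: {x!r:20} -> output: {y!r}")
```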
  • the training process of the neural network is a process of constantly adjusting the parameters of the neural network.
  • when suitable parameters are found, the neural network can be considered trained; that is, the parameters of the neural network are adjusted until the technician determines that a certain set of parameters meets the requirements, at which point the neural network is trained.
  • a verification approach may also be adopted, that is, after the parameters of the neural network are set, the neural network is verified using a large amount of verification sample data.
  • the verification data is input into the neural network, and the prediction accuracy of the neural network on the verification data under the current parameters is measured.
  • when the prediction accuracy of the neural network reaches the standard value set by the technician, the parameters can be considered to meet the requirements; otherwise, the parameters of the neural network need to be adjusted further until the neural network passes verification.
  • training the neural network of each domain on a large amount of that domain's text ensures the reliability of the trained neural network and improves the efficiency of its data processing, thereby improving the efficiency and accuracy of speech recognition.
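  • For illustration, a minimal next-word-prediction training loop in PyTorch, as one possible realization of steps 402 to 406; the tiny vocabulary, dimensions, and single training sentence are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

vocab = ["<blank>", "I", "yesterday", "going to school", "late", "了", "<eos>"]
idx = {w: i for i, w in enumerate(vocab)}

class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)       # word -> word vector
        self.rnn = nn.RNN(dim, dim, batch_first=True)  # carries all earlier inputs
        self.out = nn.Linear(dim, vocab_size)          # hidden state -> next-word logits

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

# Inputs are blank + words; targets are the same words shifted by one.
xs = torch.tensor([[idx[w] for w in ["<blank>", "I", "yesterday", "going to school", "late", "了"]]])
ys = torch.tensor([[idx[w] for w in ["I", "yesterday", "going to school", "late", "了", "<eos>"]]])

model = TinyRNNLM(len(vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                        # "continuously adjust the parameters"
    logits = model(xs)                         # (1, seq_len, vocab_size)
    loss = loss_fn(logits.view(-1, len(vocab)), ys.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```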
  • the final score of a candidate word sequence is obtained according to the first score and the second score corresponding to the candidate word sequence by performing a weighted summation of the first score and the second score to obtain the final score of the candidate word sequence.
  • the terminal may perform the decoding operation on the voice data to obtain a plurality of word sequences, each word sequence having a corresponding first score, and a preset number of word sequences are selected from the plurality of word sequences as candidate word sequences; that is, each candidate word sequence has a corresponding first score.
  • the trained neural network then re-scores each candidate word sequence to obtain a second score, so each candidate word sequence also has its own corresponding second score. After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be calculated.
  • a weighted calculation may be performed; the weights of the first score and the second score may be the same or different, depending on the technician's setting. For example, if the technician considers the first score to be more accurate, or to have a greater impact on the speech recognition result, the weight of the first score may be increased. After both the first score and the second score are weighted, the weighted first score and second score are summed to obtain the final score of each candidate word sequence.
  • the influence of the first score and the second score on the speech recognition result may be adjusted at any time by adjusting their weights, that is, the proportions of the speech side and the semantic side can be adjusted at any time, which further ensures the reliability of the final score and makes the selected speech recognition result more accurate.
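  • Written out, the combination just described is a weighted sum followed by selecting the maximum; the symbols alpha and beta below are assumed names for the two weights, which the text leaves to the technician:

```latex
\[
  \mathrm{final}_i = \alpha \, s^{(1)}_i + \beta \, s^{(2)}_i ,
  \qquad
  \hat{i} = \arg\max_i \, \mathrm{final}_i ,
\]
% where $s^{(1)}_i$ and $s^{(2)}_i$ are the first and second scores of
% candidate word sequence $i$, and $\alpha, \beta$ are the technician-set weights.
```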
  • the neural network is a recurrent neural network.
  • the recurrent neural network, in the form of an RNNLM (Recurrent Neural Network Based Language Model), can also be called a recurrent network language model.
  • the recurrent network language model can take into account multiple previously input words and can calculate the probability of the next word based on the long text composed of the previously input words, so the recurrent network language model has a better "memory effect". For example, after "my" and "mood", the words "good" or "bad" may appear; the appearance of these words depends on the earlier appearance of "my" and "mood", and this is the "memory effect".
  • the recurrent network language model maps each input word into a compact continuous vector space, uses a relatively small set of parameters for that space, and uses recurrent connections to build the corresponding model, so that long-distance context dependencies can be captured.
  • a language model commonly used in large-vocabulary continuous speech recognition, the NGRAM model (a statistical language model), relies only on the first N-1 preceding words of the input, where N is a positive integer, whereas the recurrent network language model can capture the historical information of all previously input words. Therefore, compared with the traditional language model, which depends only on the features of the preceding N-1 words, the recurrent network language model scores the input candidate word sequences more accurately.
  • in an embodiment, the method further includes: performing entity extraction on the speech recognition result to obtain entity words; retrieving the entity words; and, when a retrieval result is inconsistent with an entity word, repairing the entity word.
  • the final score of each candidate word sequence can be calculated, and the candidate word sequence with the highest final score is selected as the speech recognition result of the speech data.
  • after the speech recognition result is obtained, entity repair may be performed on it; first, entity extraction is performed on the speech recognition result.
  • an entity is a word that plays a key role in the sentence of the speech recognition result. For example, in "I want to go to the Window of the World", the "Window of the World" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place actually acted upon is the "Window of the World".
  • when entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved, and when a retrieval result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; the entity words are repaired in this way.
  • recognized typos are generally concentrated in the special entities of a domain; therefore, the entities in the speech recognition result can be extracted to address this problem, and the entity words repaired accordingly, which reduces the probability of typos appearing in the speech recognition result and improves the accuracy of the resulting speech recognition result.
  • retrieving the entity words comprises: determining the domain of the speech recognition result based on the domain of the candidate word sequences; and retrieving each entity word in a database corresponding to the domain of the speech recognition result.
  • when the voice data is decoded, a plurality of word sequences can be obtained, several word sequences are extracted from them as candidate word sequences, and one of the candidate word sequences is selected as the speech recognition result of the speech data; the domain of the speech recognition result can therefore be determined according to the domain of the extracted candidate word sequences.
  • each entity word can then be retrieved in the database corresponding to that domain.
  • in an embodiment, after step 214 the following steps are further included:
  • Step 602 Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 604 Determine the domain of the speech recognition result according to the domain of the extracted candidate word sequences.
  • Step 606 Search each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 608 When the search result is inconsistent with the entity word, repair the entity word.
  • the entities of the speech recognition result can be extracted; an entity is a word that plays a key role in the sentence of the speech recognition result, such as the "Window of the World" in the earlier example.
  • there may be multiple candidate word sequences, but the domain corresponding to all of them is generally a single one, that is, the multiple candidate word sequences correspond to the same domain; since the speech recognition result is the candidate word sequence with the highest final score, the domain of the speech recognition result can be determined according to the domain of the extracted candidate word sequences.
  • each entity word can be retrieved in the database corresponding to the domain of the speech recognition result, that is, each domain has its own corresponding database.
  • for example, the speech recognition result is "I want to go to the Emperor Building", and entity extraction on this sentence yields the entity word "Emperor Building"; the domain of the speech recognition result is the "navigation" domain, so the entity word "Emperor Building" is searched in the database corresponding to the "navigation" domain, that is, among the corresponding place names.
  • if the search result is "Diwang Building", the search result is inconsistent with the entity word, so the entity word "Emperor Building" can be replaced with the search result "Diwang Building"; the entity word in the speech recognition result is thus repaired, giving the corrected speech recognition result "I want to go to the Diwang Building". In other words, the entity repair operation corrects typos and similar errors in the speech recognition result and yields a more accurate speech recognition result.
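  • A minimal sketch of steps 602 to 608: extract entity words, look each one up in the domain's database, and replace the entity when the lookup disagrees. extract_entities and the database contents are illustrative assumptions; a real system would use an entity recognizer and a full place-name database.

```python
def extract_entities(sentence):
    known = ["Emperor Building", "Window of the World"]   # stand-in entity list
    return [e for e in known if e in sentence]

NAVIGATION_DB = {"Emperor Building": "Diwang Building"}   # canonical place names

def repair_entities(result, domain_db):
    for entity in extract_entities(result):
        canonical = domain_db.get(entity)
        if canonical and canonical != entity:             # search result differs
            result = result.replace(entity, canonical)
    return result

print(repair_entities("I want to go to the Emperor Building", NAVIGATION_DB))
# -> "I want to go to the Diwang Building"
```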
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 7, the voice recognition method specifically includes the following steps:
  • Step 702 Train the neural networks to obtain trained neural networks.
  • each domain has its own corresponding neural network; therefore, the neural network of each domain is trained separately using the text of that domain.
  • the trained neural networks correspond to the domains one by one, as shown in FIG. 8.
  • Each neural network has its own training text.
  • the semantic classification model can identify the domain of the extracted candidate word sequences. After the domain of the extracted candidate word sequences is determined by the semantic classification model, each candidate word sequence is input into the neural network of the corresponding domain for re-scoring, so the domains of the neural networks correspond to the domains of the semantic classification model. Therefore, during training, as shown in FIG. 8, the semantic classification model can be trained together with the neural networks so that the domains of the neural networks correspond to the classification domains of the semantic classification model.
  • the neural networks are used for recognizing the speech, while the semantic classification model is used to analyze the semantics of the speech data; therefore, the neural networks belong to the recognition side, and the semantic classification model belongs to the semantic side.
  • when training the neural networks, the text corresponding to each domain can be obtained.
  • after each word in each text is converted into a word vector, the word vectors can be used as input data and, following the order of the words in the sentence, the word vector of the next word after each input word is used as the output data of the current input word; that is, each output word vector is related to the word vector input at the previous moment and the word vectors input at all earlier moments.
  • as the neural network is trained on the word vectors of the input words, its parameters can be adjusted continuously until suitable parameters are found; the neural network is then trained, and the trained neural network is obtained.
  • the neural network can be a recurrent neural network, also known as a recurrent network language model.
  • the advantage of the recurrent network language model is its long memory effect, that is, the word vector of each output word is influenced by the word vectors of all previously input words.
  • the output data corresponding to the word vector of the currently input word is the word vector corresponding to the next word.
  • the output data is the word vector corresponding to the next word after each input word, but for each output, the input data is not only the word vector of the most recently input word, but the word vectors of the most recently input word and of all the words input before it; that is, each output word is related to the word input at the previous moment and the words input at all earlier times.
  • the output data corresponding to each input word vector is set in this way, and the recurrent network language model is trained to obtain the trained neural network for practical use.
  • Step 704 Acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the speech decoding process can be performed by the front end, which can be the terminal; the speech data can be decoded to obtain a plurality of word sequences and a first score corresponding to each word sequence.
  • after receiving the voice data, the front end extracts feature data and uses the trained acoustic model, the language model, and the pronunciation dictionary to search and decode the extracted features, obtaining multiple word sequences.
  • the acoustic model can be trained on a speech training set, the language model is trained on a text training set, and the trained acoustic model and language model can then be put into practical use.
  • a word sequence can be thought of as multiple words and multiple paths; it can also be called a lattice.
  • a lattice is essentially a directed acyclic graph: each node on the graph represents the end time of a word, and each edge represents a possible word, together with the acoustic score and the language model score of that word.
  • each node stores the result of the speech recognition at the current position, including the acoustic probability, the language probability, and the like, as shown in FIG. 10.
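  • A minimal sketch of such a lattice as a directed acyclic graph whose nodes mark word end times and whose edges carry a word plus its acoustic and language model scores; the field names and toy values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Edge:
    word: str
    to_node: int              # index of the node where this word ends
    acoustic_score: float
    lm_score: float

@dataclass
class Node:
    end_time: float                                    # end time of the word arriving here
    out_edges: List[Edge] = field(default_factory=list)

# Two competing one-word paths out of node 0, as in an n-best lattice.
lattice = [Node(0.0), Node(0.42), Node(0.45)]
lattice[0].out_edges.append(Edge("play", 1, -12.3, -2.1))
lattice[0].out_edges.append(Edge("pray", 2, -13.0, -4.8))
for e in lattice[0].out_edges:
    print(e.word, e.acoustic_score + e.lm_score)       # combined edge score
```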
  • Step 706 Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • after the plurality of word sequences are obtained, a preset number of word sequences with the highest first scores may be extracted from them as candidate word sequences. The optimal path calculated from the lattice, that is, the path with the highest probability, does not necessarily match the actual word sequence, so the preset number of top-scoring word sequences are extracted as candidate word sequences, which can also be called the n-best (the n best recognition hypotheses); the preset number can be set by the technician according to actual needs.
  • Step 708 Input each candidate word sequence into the semantic classification model.
  • Step 710 Classify each candidate word sequence by the semantic classification model, and obtain a classification label corresponding to each candidate word sequence.
  • Step 712 Acquire the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • Step 714 Input the candidate word sequence into the trained neural network of the corresponding domain according to the domain in which the extracted candidate word sequence is located.
  • Step 716 Re-score each candidate word sequence by the trained neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 718 weighting and summing the first score and the second score corresponding to each candidate word sequence to obtain a final score of each candidate word sequence.
  • Step 720 Use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • after a preset number of word sequences are extracted as candidate word sequences, each candidate word sequence can be input into the trained semantic classification model, which classifies each candidate word sequence and outputs the classification label obtained for it; the domain corresponding to each candidate word sequence can then be determined according to its classification label.
  • the trained semantic classification model may produce different domain classification results for different candidate word sequences.
  • in that case, the classification label of each candidate word sequence can be obtained, and the label with the largest proportion is taken as the classification label corresponding to all the candidate word sequences, that is, the domain corresponding to the most common label is the domain of all the candidate word sequences. The domains corresponding to the multiple candidate word sequences are therefore ultimately the same; in general, the multiple candidate word sequences corresponding to the same voice data do not belong to multiple domains.
  • after the plurality of candidate word sequences are extracted, that is, after the n-best extraction step is performed, domain identification can be carried out on the candidate word sequences, and the semantic classification model can be used to identify their domain. As shown in FIG. 11, after the n-best extraction is performed, the domain of the candidate word sequences is identified by the semantic classification model, and the candidate word sequences are input into the neural network of the corresponding domain.
  • each domain has its own corresponding neural network, that is, the neural networks are divided by domain.
  • the neural network can be an RNNLM; compared with the NGRAM language model, the RNNLM is better at preserving long-term memory.
  • the NGRAM language model is a commonly used language model in large vocabulary continuous speech recognition.
  • the model is based on the assumption that the appearance of the Nth word is related only to the preceding N-1 words and is unrelated to any other words.
  • the probability of a whole sentence is the product of the probabilities of occurrence of its words, and these probabilities can be obtained by counting the number of simultaneous occurrences of N words in a corpus.
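  • In symbols, the n-gram assumption just described factorizes the sentence probability as follows; this is the standard formulation, written here for clarity rather than quoted from the patent:

```latex
\[
  P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P\bigl(w_t \mid w_{t-N+1}, \ldots, w_{t-1}\bigr),
  \qquad
  P\bigl(w_t \mid w_{t-N+1}, \ldots, w_{t-1}\bigr) \approx
  \frac{\mathrm{count}(w_{t-N+1} \cdots w_t)}{\mathrm{count}(w_{t-N+1} \cdots w_{t-1})}.
\]
```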
  • the RNNLM is different. For example, after "my" and "mood", "good" or "bad" may appear; the appearance of these words depends on the earlier appearance of "my" and "mood", that is, the "memory effect".
  • traditional language models, such as the NGRAM language model, rely on the preceding N-1 words when calculating the probability of a word appearing; N is usually set to at most 5, and earlier words are ignored.
  • after the candidate word sequences are input into the neural network of the corresponding domain, the candidate word sequences can be re-scored by the neural network to obtain the second score of each candidate word sequence, and finally the speech recognition result of the speech data is obtained.
  • re-scoring can also be called rescoring: since the language model used to generate the lattice is not accurate enough, a larger language model is used to re-adjust the scores of the n-best; this process is called rescoring.
  • for example, the plurality of candidate word sequences and their corresponding first scores are: "play Jay Chou's Nunchaku" 94.7456, "play Jay Chou's Nunchaku" 95.3976, and "play the week conclusion's Nunchaku" 94.7795, ... (the candidates differ in their homophone spellings in the original Chinese example); these candidate word sequences are input into the semantic classification model for classification.
  • the semantic classification model recognizes that the candidate word sequences belong to the music domain, so each candidate word sequence can be input into the neural network of the music domain for re-scoring, and the second score of each candidate word sequence is obtained, specifically: "play Jay Chou's Nunchaku" 93.3381, "play Jay Chou's Nunchaku" 95.0925, "play the week conclusion's Nunchaku" 94.1557, ....
  • the first score and the second score may be weighted and summed to obtain a final score for each candidate word sequence.
  • the final score of each candidate word sequence = first score × first weight + second score × second weight.
  • the first weight and the second weight may be equal or unequal; they represent the respective proportions of the first score and the second score. When the first weight is greater than the second weight, it can be considered that the technician has increased the proportion of the first score, that is, the technician considers the first score to have a greater influence on the speech recognition result.
  • the first weight and the second weight may be determined from a large amount of test data, for example by using a large number of candidate word sequences and continuously adjusting the values of the first weight and the second weight until the candidate word sequence with the highest final score computed with those weights is indeed the most accurate speech recognition result; the actual values of the first weight and the second weight are thus determined and are used in the actual application as the calculation parameters of the final score.
  • a further calculation step can also be added, such as taking the logarithm of the final score as the selection score: if the final score of the candidate sequence "play Jay Chou's Nunchaku" is 95.21454, its selection score is log10(95.21454). After the logarithm of the final score of each candidate word sequence is calculated, the logarithm is taken as the selection score of each candidate word sequence, and the candidate word sequence with the highest selection score is selected as the speech recognition result.
  • the calculation method can vary, and other calculation methods set by the technician can be adopted.
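  • As a check on the numbers above: the quoted final score 95.21454 for the second candidate is reproduced exactly by weights of 0.4 and 0.6 applied to its first score 95.3976 and second score 95.0925. These weight values are inferred from the example, not stated in the text; a short sketch:

```python
import math

# Candidates as (first_score, second_score); values taken from the example above.
candidates = {
    "play Jay Chou's Nunchaku (variant 1)": (94.7456, 93.3381),
    "play Jay Chou's Nunchaku (variant 2)": (95.3976, 95.0925),
    "play the week conclusion's Nunchaku":  (94.7795, 94.1557),
}
w1, w2 = 0.4, 0.6   # assumed weights; they reproduce the quoted 95.21454

for name, (s1, s2) in candidates.items():
    final = w1 * s1 + w2 * s2
    print(f"{name}: final={final:.5f}, selection=log10(final)={math.log10(final):.6f}")
# The second candidate has the highest final score (95.21454) and is selected.
```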
  • Step 722 Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 724 Determine the domain of the speech recognition result according to the domain of the extracted candidate word sequences.
  • Step 726 Search each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 728 When the search result is inconsistent with the entity word, the entity word is repaired.
  • after the speech recognition result of the speech data is obtained, entity repair can be performed on the speech recognition result: first, entity extraction is performed on the speech recognition result.
  • an entity is a word that plays a key role in the sentence of the speech recognition result. For example, in "I want to go to the Emperor Building", the "Emperor Building" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place actually acted upon is the "Emperor Building".
  • when entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved. When the voice data is decoded, multiple word sequences are obtained, several of them are selected as candidate word sequences, and one candidate word sequence is selected as the speech recognition result of the speech data, so the domain of the speech recognition result can be determined according to the domain of the candidate word sequences.
  • each entity word may then be searched in the database corresponding to that domain, and when the retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the search result; the entity words are repaired in this way.
  • for example, if the speech recognition result is "I want to go to the Emperor Building", the "Emperor Building" can be searched according to the navigation domain to which the speech recognition result belongs, that is, among geographical location information; if the search reveals that the actual result should be the "Diwang Building", the whole sentence can be corrected to "I want to go to the Diwang Building". The entity repair function can thus quickly and accurately repair such specific recognition errors.
  • the domain-specific re-scoring operation can greatly improve recognition accuracy. Experiments show that the speech recognition accuracy in different domains achieves a relative increase of 4% to 16%, and the overall recognition accuracy increases by 6%. In addition, since each domain has its own corresponding neural network, the cost of training the neural networks can be greatly reduced.
  • a traditional speech recognition scheme needs a large amount of text to train its language model; to ensure the effect of re-scoring, the language model is very large, the model training period is very long, and evaluating the overall effect of the model is difficult.
  • for example, training an RNNLM on 8.2 GB of text takes about 100 hours; after dividing by domain, the training time of a single domain's model can be shortened to 24 hours, greatly shortening the update time of a domain-specific model.
  • moreover, the language model in the traditional technique cannot cover all the data comprehensively: once there is a problem in the language model, the entire language model needs to be updated, and each update incurs a huge overhead. In this embodiment, however, only the neural network corresponding to the domain that needs to be processed or updated has to be retrained, which reduces the training cost and shortens the time required for a neural network update.
  • FIGS. 2 to 12 are schematic flowcharts of a voice recognition method in various embodiments. It should be understood that although the various steps in the flowcharts of the various figures are sequentially displayed as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Except as explicitly stated herein, the execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the various figures may comprise a plurality of sub-steps or stages, which are not necessarily performed at the same time, but may be performed at different times, the execution of these sub-steps or stages The order is also not necessarily sequential, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of the other steps.
  • a voice recognition apparatus including:
  • the word sequence obtaining module 1302 is configured to acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the extraction module 1304 is configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.
  • the domain identification module 1306 is configured to identify the domain of the extracted candidate word sequences.
  • the re-scoring module 1308 is configured to input each candidate word sequence into the neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence.
  • the speech recognition result determining module 1310 is configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • the domain identification module 1306 includes:
  • the input module 1306A is configured to input each candidate word sequence into the trained semantic classification model.
  • the classification module 1306B is configured to classify each candidate word sequence by the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to obtain the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • in an embodiment, the apparatus further includes a training module (not shown) configured to acquire the text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and take the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
  • the training module is further configured, for each domain, to take the word vector corresponding to each word in the text as input according to the order of the words in the domain's text, take the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the re-scoring module 1308 is further configured to perform weighted summation on the first score and the second score corresponding to the candidate word sequence to obtain a final score of the candidate word sequence.
  • the neural network is a recurrent neural network.
  • in an embodiment, the apparatus further includes an entity repair module (not shown) configured to perform entity extraction on the speech recognition result to obtain a plurality of entity words, retrieve the entity words, and repair an entity word when the search result is inconsistent with that entity word.
  • the entity repair module is further configured to determine the domain of the voice recognition result according to the domain of the extracted candidate word sequences, and to retrieve each entity word in a database corresponding to the domain of the voice recognition result.
  • Figure 15 is a diagram showing the internal structure of a computer device in one embodiment.
  • the computer device may specifically be the server 120 of FIG. 1.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement a speech recognition method.
  • the internal memory can also store a computer program that, when executed by the processor, causes the processor to perform the speech recognition method described above.
  • those skilled in the art can understand that the structure shown in FIG. 15 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the speech recognition apparatus can be implemented in the form of a computer program that can be run on a computer device as shown in FIG. 15.
  • the program modules constituting the speech recognition apparatus, such as the word sequence acquisition module, the extraction module, the domain identification module, the re-scoring module, and the speech recognition result determination module shown in FIG. 13, may be stored in the memory of the computer device.
  • the computer program constituted by the program modules causes the processor to perform the steps in the speech recognition method of the various embodiments of the present application described in this specification.
  • the computer device shown in FIG. 15 can, through the word sequence acquisition module in the speech recognition apparatus shown in FIG. 13, acquire a plurality of word sequences obtained by decoding the speech data and a first score corresponding to each word sequence.
  • the computer device can, through the extraction module, extract from the plurality of word sequences a preset number of word sequences whose first scores rank highest as candidate word sequences.
  • the computer device can, through the domain identification module, identify the domain in which the extracted candidate word sequences are located.
  • the computer device can, through the re-scoring module, input each candidate word sequence into the neural network of the corresponding domain according to the domain in which the extracted candidate word sequences are located, and re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
  • the computer device can, through the speech recognition result determination module, obtain the final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer apparatus comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a speech recognition method and apparatus, a computer readable storage medium, and a computer device. The method comprises: obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences; identifying the domain in which the extracted candidate word sequences are located; inputting each candidate word sequence into a neural network of the corresponding domain according to that domain; re-scoring each candidate word sequence by means of the neural network to obtain a second score; obtaining a final score according to the first score and second score corresponding to each candidate word sequence; and taking the candidate word sequence with the highest final score as the speech recognition result of the speech data. In this way, the time required to train the neural network of a specific domain is greatly shortened, and the efficiency of speech recognition is further improved.

Description

Speech recognition method, apparatus, computer readable storage medium, and computer device
The present application claims priority to Chinese Patent Application No. 201810457129.3, entitled "Speech Recognition Method, Apparatus, Computer Readable Storage Medium, and Computer Device", filed on May 14, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, computer readable storage medium, and computer device.
Background
With the rapid development of computer technology, technologies in the field of speech recognition have become increasingly mature.
In the conventional technology, a speech recognition solution is generally divided into two parts: front-end decoding and back-end processing. The front end is mainly responsible for receiving input speech data and decoding the speech data to obtain multiple candidate sentences, and the back end determines one of the multiple candidate sentences obtained by the front end as the final speech recognition result. In the conventional technology, the back end may input the multiple candidate sentences into a neural network to determine the final speech recognition result. However, in this manner, a massive amount of text is needed, and it takes a long time to train a neural network that can eventually be put into use, so this speech recognition solution is inefficient.
Summary
Based on this, it is necessary to provide a speech recognition method, apparatus, computer readable storage medium, and computer device capable of improving speech recognition efficiency, in view of the above technical problem of low speech recognition efficiency.
A speech recognition method, applied to a computer device, the method including:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A speech recognition apparatus, the apparatus including:
a word sequence acquisition module, configured to obtain a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
an extraction module, configured to extract, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
a domain identification module, configured to identify the domain in which the extracted candidate word sequences are located;
a re-scoring module, configured to input each candidate word sequence into a neural network of the corresponding domain according to the domain, and to re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence; and
a speech recognition result determination module, configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A computer device, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following operations:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A computer readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following operations:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
According to the above speech recognition method, apparatus, computer readable storage medium, and computer device, a plurality of word sequences obtained by decoding speech data and the corresponding first scores are obtained, a plurality of candidate word sequences are selected from the plurality of word sequences, and the domain of the candidate word sequences is determined; each candidate word sequence can then be input into the neural network corresponding to that domain, and after the second score obtained by re-scoring each candidate word sequence through the neural network is obtained, the final score of each candidate word sequence can be determined according to the first score and the second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result of the speech data. In this speech recognition method, the domain of the candidate word sequences is identified before the candidate word sequences are re-scored, so that the candidate word sequences can be re-scored using the neural network corresponding to the domain to which they belong. The second score obtained in this way is more accurate; moreover, since each domain has its own corresponding neural network and each domain uses its own text to train its corresponding neural network instead of training with all the text, the time required to train the neural network of a specific domain is also greatly shortened, further improving the efficiency of speech recognition.
Brief Description of the Drawings
FIG. 1 is a diagram of an application environment of a speech recognition method in an embodiment;
FIG. 2 is a schematic flowchart of a speech recognition method in an embodiment;
FIG. 3 is a schematic flowchart of step 206 in an embodiment;
FIG. 4 is a schematic flowchart of a training step of a neural network in an embodiment;
FIG. 5 is a schematic diagram of input data and output data of a neural network in an embodiment;
FIG. 6 is a schematic flowchart following step 214 in an embodiment;
FIG. 7 is a schematic flowchart of a speech recognition method in another embodiment;
FIG. 8 is a schematic diagram of a training process of the neural networks corresponding to the respective domains in an embodiment;
FIG. 9 is a schematic flowchart of decoding speech data by a front end in an embodiment;
FIG. 10 is a schematic diagram of word sequences in an embodiment;
FIG. 11 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in an embodiment;
FIG. 12 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in another embodiment;
FIG. 13 is a structural block diagram of a speech recognition apparatus in an embodiment;
FIG. 14 is a structural block diagram of a domain identification module in another embodiment;
FIG. 15 is a structural block diagram of a computer device in an embodiment.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely used to explain the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment of a speech recognition method in an embodiment. Referring to FIG. 1, the speech recognition method is applied to a speech recognition system. The speech recognition system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically include at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented by an independent server or a server cluster composed of a plurality of servers.
As shown in FIG. 2, in an embodiment, a speech recognition method is provided. This embodiment is illustrated by applying the method to the server 120 in FIG. 1. Referring to FIG. 2, the speech recognition method specifically includes the following steps:
Step 202: obtain a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence.
The server may obtain a plurality of word sequences obtained by decoding the speech data. The decoding operation may be performed by the terminal: when obtaining the speech data, the terminal may decode the obtained speech data using a trained acoustic model, a language model, a pronunciation dictionary, and the like, so that a plurality of word sequences can be obtained. Of course, the decoding operation may also be performed by the server.
A word sequence refers to a plurality of words obtained by decoding the speech data together with the paths corresponding to those words; the sentence obtained by connecting the corresponding words in order from the start position to the end position of a path can be regarded as one recognition result, that is, each word sequence can be regarded as one recognition result. Each path of each word sequence has a corresponding score, that is, the first score. The first score can be regarded as a score calculated according to the probability of occurrence of each path.
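As a minimal illustration of how a path's first score can be derived from path probabilities, the following Python sketch sums log-probabilities along each path of a hypothetical toy lattice (the words and probabilities are placeholders, not part of the original disclosure):

```python
import math

# Hypothetical toy lattice: each path is a list of (word, probability) edges.
paths = [
    [("我", 0.9), ("想去", 0.8), ("地王大厦", 0.6)],
    [("我", 0.9), ("想去", 0.8), ("帝王大厦", 0.3)],
]

def first_score(path):
    # Score a path as the sum of log edge probabilities, i.e. the log
    # of the path's occurrence probability.
    return sum(math.log(p) for _, p in path)

for path in paths:
    sentence = "".join(word for word, _ in path)
    print(sentence, round(first_score(path), 3))
```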
Step 204: extract, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences.
Each word sequence has a corresponding first score, and the word sequences whose first scores rank highest can be extracted. Ranking highest by first score means sorting the word sequences in descending order of the first score, so that the word sequence with the highest first score is ranked first. Extracting the word sequences whose first scores rank highest means extracting the word sequences with higher first score values. The number of word sequences to be extracted can be preset, so a preset number of word sequences whose first scores rank highest can be extracted as candidate word sequences, and the preset number can be adjusted according to the consideration of the technician.
In another possible implementation, the word sequences are sorted in ascending order of the first score, so that the word sequence with the highest first score is ranked last. Extracting the preset number of word sequences whose first scores rank highest then means extracting the preset number of word sequences starting from the last position, according to the order of the word sequences.
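A minimal Python sketch of this candidate extraction step, assuming hypothetical decoder outputs (the word sequences and scores below are placeholders, not from the original disclosure):

```python
# Hypothetical (word sequence, first score) pairs produced by the decoder.
word_sequences = [
    ("我想去地王大厦", -4.2),
    ("我想去帝王大厦", -5.1),
    ("我想去地王大夏", -7.8),
    ("我想去帝王大夏", -9.3),
]

PRESET_NUMBER = 3  # adjustable by the technician

# Sort in descending order of the first score and keep the top entries
# as candidate word sequences.
candidates = sorted(word_sequences, key=lambda item: item[1], reverse=True)
candidates = candidates[:PRESET_NUMBER]
print(candidates)
```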
Step 206: identify the domain in which the extracted candidate word sequences are located.
Step 208: input each candidate word sequence into the neural network of the corresponding domain according to the identified domain.
After a preset number of candidate word sequences are selected from the plurality of word sequences, the domain in which the preset number of candidate word sequences are located can be identified. The domains can be customized by the technician; for example, the domains may be divided into navigation, music, and so on. There are multiple candidate word sequences, and the identified domain may be one, that is, the multiple candidate word sequences all belong to the same domain. After the domain corresponding to the candidate word sequences is identified, the multiple candidate word sequences can be separately input into the neural network of the corresponding domain. That is, each domain has a corresponding neural network, and the neural network of each domain can re-score the candidate word sequences of its own domain in a targeted manner.
Step 210: re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
Step 212: obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
Step 214: take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
After the domain of the candidate word sequences is identified, all the candidate word sequences are input into the corresponding neural network. For example, when the identified domain of the candidate word sequences is the "navigation" domain, all the candidate word sequences are input into the neural network corresponding to the "navigation" domain, and the neural network of the "navigation" domain re-scores each input candidate word sequence to obtain the second score of each candidate word sequence.
Re-scoring means that the neural network corresponding to the domain of the candidate word sequences performs a score calculation on each input candidate word sequence to obtain the second score of each candidate word sequence, and the first score and the second score of each candidate word sequence are then combined in a certain manner; for example, the first score and the second score may each be weighted, and the weighted first score and second score may then be added. The final score of each candidate word sequence can thus be obtained according to the first score and the second score, and the candidate word sequence with the highest final score can be taken as the speech recognition result of the speech data.
According to the above speech recognition method, a plurality of word sequences obtained by decoding speech data and the corresponding first scores are obtained, a plurality of candidate word sequences are selected from the plurality of word sequences, and the domain in which the plurality of candidate word sequences are located is determined; each candidate word sequence can then be input into the neural network corresponding to that domain, and after the second score obtained by re-scoring each candidate word sequence through the neural network is obtained, the final score of each candidate word sequence can be determined according to the first score and the second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result of the speech data.
In this speech recognition method, the domain in which the candidate word sequences are located is identified before the candidate word sequences are re-scored, so that the candidate word sequences can be re-scored using the neural network corresponding to that domain. The second score obtained by re-scoring the candidate word sequences with the neural network of the corresponding domain is more accurate; moreover, since each domain has its own corresponding neural network and each domain uses its own text to train its corresponding neural network instead of training with all the text, the time required to train the neural network of a specific domain is also greatly shortened, further improving the efficiency of speech recognition.
In an embodiment, as shown in FIG. 3, step 206 includes:
Step 302: input each candidate word sequence into a semantic classification model.
Step 304: classify each candidate word sequence by the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
Step 306: take the domain corresponding to the classification label with the largest proportion among the classification labels as the domain in which the extracted candidate word sequences are located.
After a preset number of word sequences are extracted as candidate word sequences, each candidate word sequence can be input into a trained semantic classification model. The trained semantic classification model refers to a semantic classification model trained in advance with a large number of training samples, which is used to identify the domain in which the input candidate word sequences are located. After the extracted preset number of candidate word sequences are input into the trained semantic classification model, the trained semantic classification model can classify each candidate word sequence and output the classification label obtained for each candidate word sequence, and the domain corresponding to each candidate word sequence can be determined according to its classification label.
The classification labels corresponding to the candidate word sequences may differ, that is, the classification results of the trained semantic classification model for the candidate word sequences may differ, which means that the trained semantic classification model judges that the candidate word sequences belong to different domains. In this case, in order to determine the domain of the candidate word sequences, the classification label of each candidate word sequence can be obtained, and the classification label with the largest proportion is taken as the classification label corresponding to all the candidate word sequences, that is, the domain corresponding to the classification label with the largest proportion is taken as the domain in which all the candidate word sequences are located.
For example, if the classification label of candidate word sequence A is 1, the classification label of candidate word sequence B is 1, the classification label of candidate word sequence C is 2, and the classification label of candidate word sequence D is 1, where 1 represents the navigation class and 2 represents the music class, then among the candidate word sequences A, B, C, and D, the classification label 1 has the largest proportion; the classification label 1 can therefore be taken as the classification label of the candidate word sequences A, B, C, and D, that is, the domain of the candidate word sequences A, B, C, and D is the domain corresponding to that classification label, namely the navigation class.
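A minimal sketch of this majority vote over classification labels, assuming the labels have already been produced by the semantic classification model (the label values mirror the example above; the label-to-domain map is a hypothetical placeholder):

```python
from collections import Counter

# Classification labels output by the semantic classification model
# for candidate word sequences A, B, C, D (1 = navigation, 2 = music).
labels = [1, 1, 2, 1]
DOMAINS = {1: "navigation", 2: "music"}  # hypothetical label-to-domain map

# The label with the largest proportion determines the shared domain
# of all candidate word sequences.
majority_label, _ = Counter(labels).most_common(1)[0]
print(DOMAINS[majority_label])  # -> "navigation"
```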
Identifying the domain in which the extracted candidate word sequences are located through the semantic classification model, and taking the domain corresponding to the classification label with the largest proportion as the domain of the candidate word sequences when the domains of the candidate word sequences diverge, ensures the accuracy of the domain to which the candidate word sequences belong and also improves the accuracy of the subsequent re-scoring of the candidate word sequences, thereby improving the accuracy of the speech recognition result.
In an embodiment, the above speech recognition method further includes: acquiring text corresponding to each domain; converting each word in the text corresponding to each domain into a word vector; and taking the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
Before a neural network is actually used, each neural network can first be trained according to actual needs, and only a trained neural network can be put into actual use on the input candidate word sequences. Since each domain has its own corresponding neural network, when training the neural network of each domain, the text corresponding to each domain needs to be acquired. The text may be sentences; after the sentences of each domain are obtained, for each domain, word segmentation can be performed on each sentence corresponding to that domain to obtain the multiple words corresponding to each sentence. The words contained in each sentence are converted into word vectors as the input data of the neural network corresponding to that domain, and the neural network corresponding to each domain is trained in this manner; after training, the trained neural network corresponding to each domain is obtained.
Training the neural networks of the respective domains in advance ensures the accuracy of the neural networks when re-scoring candidate word sequences in actual use, thereby improving the accuracy of speech recognition as well as the efficiency of speech recognition.
In an embodiment, taking the word vectors as input and training the neural network corresponding to each domain includes: for each domain, according to the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and taking the word vector corresponding to the next word of each input word as output, so as to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
After the text of each domain is obtained, the words contained in the text can be converted into word vectors, and the word vectors corresponding to the text of each domain are input into the neural network corresponding to that domain to train the corresponding neural network. After the word vectors corresponding to the text are input into the neural network, the word vector of each input word can be taken as input according to the order of the words in the text, and the word vector corresponding to the next word of each input word can be taken as output, so as to adjust the parameters of the corresponding neural network. It should be noted that the input data corresponding to each output word is not only the word input at the previous moment, but the words input at the previous moment and at all previous moments. Adjusting the parameters of the neural network in this manner trains the neural network, so that the trained neural network corresponding to each domain can be obtained, ensuring the reliability of the scores obtained when the trained neural network re-scores candidate word sequences.
In an embodiment, as shown in FIG. 4, the above speech recognition method further includes a training step of the neural network, including:
Step 402: acquire text corresponding to each domain.
Step 404: convert each word in the text corresponding to each domain into a word vector.
Step 406: for each domain, according to the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and take the word vector corresponding to the next word of each input word as output, so as to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
In order to ensure the accuracy of the neural networks put into actual use, the neural networks can be trained in advance, and the trained neural networks are then put into actual use. Each domain has its own corresponding neural network, that is, the neural network corresponding to each domain needs to be trained separately. When training the neural network of each domain, the text corresponding to each domain needs to be acquired, and the words in the text of each domain can be converted into word vectors; a word vector may be an N-dimensional vector, where N is a positive integer. The text corresponding to each domain can be regarded as the sentences corresponding to that domain, and a word segmentation tool can be used to divide each sentence into words, that is, each sentence corresponds to multiple words, and the words contained in each sentence can be converted into word vectors and input into the neural network corresponding to the respective domain.
The word vectors corresponding to the words contained in each sentence are taken as the input of the neural network, and in each sentence, the next word of an input word determines the output for that input. The output data is the word vector corresponding to the next word of each input word, but for each output, the input data is not only the word vector of the previous input word, but the word vectors of the previous input word and of all previously input words. The input data and output data of the neural network are thus set, and the parameters of the neural network are continuously adjusted to train the neural network.
For example, if sentence A is "我昨天上学迟到了" (I was late for school yesterday), a word segmentation tool can be used to split the sentence into: 我/昨天/上学/迟到/了/. As shown in FIG. 5, the word vectors corresponding to the words contained in sentence A are input into the neural network in the original order of the words in sentence A; in the figure, x1, x2, x3, x4, x5, and x6 are the word vectors corresponding to the respective words, that is, the input data. By default, a blank input is added before the input word vectors; that is, after the word vectors corresponding to the words of sentence A are input into the neural network, the word vector corresponding to "我" is not the first input but becomes the second input, and the first input is blank by default.
Correspondingly, the word vector corresponding to the next word of each input word is taken as the corresponding output data, that is, the output data corresponding to the first, blank input is the word vector of the word "我", and the output data corresponding to the input word "我" is the word vector of the word "昨天". As shown in FIG. 5, x1, x2, x3, x4, x5, and x6 correspond to the blank word, "我", "昨天", "上学", "迟到", and "了" respectively, and y1, y2, y3, y4, y5, and y6 are the output data corresponding to the respective input word vectors, corresponding to the words "我", "昨天", "上学", "迟到", and "了"; that is, the output data for the word "上学" is the word vector corresponding to its next word, "迟到". For the output "昨天", the input data is only the word "我" and the default blank word, but for the output "了", the input data is all the words input before it, that is, "我", "昨天", "上学", "迟到", and the default blank word are all input data for the word "了". In other words, each output word vector is related to the input at the previous moment and the inputs at all previous moments, that is, each output word is related to the input word at the previous moment and the words input at all previous moments.
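A minimal Python sketch of this input/output pairing for sentence A (the blank token name is a hypothetical placeholder for the default blank first input):

```python
# Words of sentence A after word segmentation.
tokens = ["我", "昨天", "上学", "迟到", "了"]

BLANK = "<blank>"  # hypothetical marker for the default blank first input

# Inputs are the blank token followed by the sentence; the target at each
# step is the next word of the sentence.
inputs = [BLANK] + tokens[:-1]
targets = tokens

for x, y in zip(inputs, targets):
    print(f"input: {x!r} -> output: {y!r}")
```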
The training process of the neural network is a process of continuously adjusting the parameters of the neural network; when the parameters of the neural network are determined, the neural network can be considered trained. Each time the neural network is trained with the word vectors corresponding to the words in an input text, the parameters of the neural network are adjusted, until the technician determines that a certain set of parameters satisfies the requirements, at which point training is complete. When determining whether the set parameters satisfy the requirements, a validation approach can be adopted: after the parameters of the neural network are set, the neural network is validated with a large amount of validation sample data. The validation data is input into the neural network, and the prediction accuracy of the neural network on the validation data under those parameters is measured; when the prediction accuracy of the neural network reaches the standard value set by the technician, the parameters can be considered to satisfy the requirements; otherwise, the parameters of the neural network need to be further adjusted until the neural network passes the validation.
Training the neural network corresponding to each domain with a large amount of text corresponding to that domain ensures the reliability of the trained neural network and also improves the efficiency of the neural network in data processing, which in turn can improve the efficiency and accuracy of speech recognition.
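As an illustrative sketch only (the patent does not specify a framework), the next-word training objective and the validation check described above might look as follows in PyTorch, with all sizes, data, and the accuracy threshold chosen arbitrarily:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 1000, 64, 128  # hypothetical sizes

class DomainLM(nn.Module):
    """A minimal recurrent language model for one domain."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.RNN(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))   # hidden state carries all past inputs
        return self.out(h)             # logits over the next word at each step

model = DomainLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy training/validation batches: inputs and their next-word targets.
x_train = torch.randint(0, VOCAB, (32, 10))
y_train = torch.randint(0, VOCAB, (32, 10))
x_val = torch.randint(0, VOCAB, (8, 10))
y_val = torch.randint(0, VOCAB, (8, 10))

STANDARD_ACCURACY = 0.5  # hypothetical standard value set by the technician

for epoch in range(100):
    # Adjust parameters on the domain text (next-word prediction).
    optimizer.zero_grad()
    logits = model(x_train)
    loss = criterion(logits.reshape(-1, VOCAB), y_train.reshape(-1))
    loss.backward()
    optimizer.step()

    # Validate: stop once next-word prediction accuracy reaches the standard.
    with torch.no_grad():
        pred = model(x_val).argmax(dim=-1)
        accuracy = (pred == y_val).float().mean().item()
    if accuracy >= STANDARD_ACCURACY:
        break
```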
In an embodiment, obtaining the final score of a candidate word sequence according to the first score and the second score corresponding to the candidate word sequence includes: performing a weighted summation of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
After obtaining the speech data, the terminal can decode the speech data to obtain a plurality of word sequences, each with its own corresponding first score, and a preset number of word sequences are selected from the plurality of word sequences as candidate word sequences, that is, each candidate word sequence has its own corresponding first score. After each candidate word sequence is input into the trained neural network corresponding to its domain, the trained neural network re-scores each candidate word sequence to obtain a second score, so each candidate word sequence also has its own corresponding second score. After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be obtained.
When calculating the final score from the first score and the second score, a weighted calculation can be performed; the weights of the first score and the second score may be the same or different, depending on the settings of the technician. For example, if the technician considers the first score to be more accurate, or to have a greater influence on the speech recognition result, the weight of the first score can be increased. After both the first score and the second score are weighted, the weighted first score and second score can be summed to obtain the final score of each candidate word sequence.
Calculating the first score and the second score with weights makes it possible to adjust, at any time, the influence of the first score and the second score on the speech recognition result by adjusting their weights, that is, the proportions of the speech side and the semantic side can be adjusted at any time, which further ensures the reliability of the final score, so that the selected speech recognition result is more accurate.
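A minimal sketch of the weighted summation and the final selection (the candidates, scores, and weights below are hypothetical):

```python
# Hypothetical (candidate, first score, second score) triples.
candidates = [
    ("我想去地王大厦", -4.2, -3.1),
    ("我想去帝王大厦", -5.1, -2.7),
]

W1, W2 = 0.6, 0.4  # weights set by the technician; may be equal or not

def final_score(first, second):
    # Weighted summation of the decoder score and the re-scoring score.
    return W1 * first + W2 * second

# The candidate word sequence with the highest final score becomes
# the speech recognition result.
result = max(candidates, key=lambda c: final_score(c[1], c[2]))
print(result[0])
```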
In an embodiment, the neural network is a recurrent neural network.
The recurrent neural network here takes the form of an RNNLM (Recurrent Neural Network Based Language Model), which may also be called a recurrent network language model. In addition to the currently input word, the recurrent network language model can consider the multiple words input before it, and can calculate the probability of the next word based on the long text composed of the previously input words; the recurrent network language model therefore has a "better memory effect". For example, "不错" (good) or "不好" (bad) may appear after "我的" (my) "心情" (mood); the appearance of these words depends on the earlier appearance of "我的" and "心情", and this is the "memory effect".
The recurrent network language model maps each input word into a compact continuous vector space; the mapped continuous vector space uses a relatively small parameter set and uses recurrent connections to build the corresponding model, producing long-distance context dependencies. A language model commonly used in large-vocabulary continuous speech recognition, the N-gram model (a statistical language model), depends only on the previous N-1 input words, where N is a positive integer, whereas the recurrent network language model can capture the history information of all previously input words. Therefore, compared with the traditional language model, which depends only on the previously input N-1 words, the recurrent network language model produces more accurate scores when re-scoring the input candidate word sequences.
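To make the scoring concrete, the following hedged sketch computes the second score of a candidate word sequence as a sum of log next-word probabilities; the probability function is a toy stand-in (not the patent's model) and only inspects the last two words, whereas a real recurrent network language model would condition on the entire history:

```python
import math

def next_word_prob(history, word):
    # Toy stand-in for a trained RNNLM: a real model would condition on
    # the entire history rather than the last two words only.
    toy = {("我的", "心情"): {"不错": 0.4, "不好": 0.3}}
    return toy.get(tuple(history[-2:]), {}).get(word, 0.05)

def second_score(words):
    # Second score of a candidate word sequence: sum of log-probabilities
    # of each word given all the words before it.
    score = 0.0
    for i, word in enumerate(words):
        score += math.log(next_word_prob(words[:i], word))
    return score

print(second_score(["我的", "心情", "不错"]))
```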
In an embodiment, after the candidate word sequence with the highest final score is taken as the speech recognition result of the speech data, the method further includes: performing entity extraction on the speech recognition result to obtain entity words; retrieving the entity words; and when the retrieval result is inconsistent with an entity word, repairing the entity word.
After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be calculated, and the candidate with the highest final score is selected as the speech recognition result of the speech data. After the speech recognition result corresponding to the speech data is determined, entity repair can be performed on the speech recognition result. Entity extraction is first performed on the speech recognition result; an entity represents a word that plays a major role in the sentence that is the speech recognition result. For example, in "我想去世界之窗" (I want to go to Window of the World), "世界之窗" (Window of the World) is the key of the sentence, "我想去" (I want to go) is merely an expression of the speaker's intention, and the place that ultimately needs to be acted on is "世界之窗".
When entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one entity word may be obtained. After the entity words are extracted, each entity word can be retrieved, and when the retrieved result is inconsistent with the entity word extracted from the speech recognition result, the entity word extracted from the speech recognition result is replaced with the retrieval result; the entity words are repaired in this manner.
In speech recognition, recognition errors are generally concentrated on the special entities within a domain; therefore, for this problem, the entities in the speech recognition result can be extracted and the entity words correspondingly repaired, which reduces the probability of wrongly written words appearing in the speech recognition result and improves the accuracy of the final speech recognition result.
In an embodiment, retrieving the entity words includes: determining the domain of the speech recognition result according to the domain in which the candidate word sequences are located; and retrieving each entity word in a database corresponding to the domain of the speech recognition result.
After the speech data is decoded, a plurality of word sequences can be obtained, a plurality of word sequences are extracted from them as candidate word sequences, and one of the candidate word sequences is selected as the speech recognition result of the speech data; therefore, the domain corresponding to the speech recognition result can be determined according to the domain in which the extracted candidate word sequences are located. After the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that domain.
Retrieving the entity words according to the domain of the speech recognition result not only reduces the amount of data to be searched, but also improves the accuracy of the retrieval results; therefore, when the entity words are repaired, the repair results can be more accurate, improving the efficiency and accuracy of correcting the speech recognition result.
In an embodiment, as shown in FIG. 6, the following steps are further included after step 214:
Step 602: perform entity extraction on the speech recognition result to obtain entity words.
Step 604: determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
Step 606: retrieve each entity word in a database corresponding to the domain of the speech recognition result.
Step 608: when the retrieval result is inconsistent with an entity word, repair the entity word.
After the candidate word sequence with the highest final score is selected from the plurality of candidate word sequences as the speech recognition result of the speech data, the entities of the speech recognition result can be extracted; an entity represents a word that plays a major role in the sentence that is the speech recognition result. For example, in "我想去世界之窗" (I want to go to Window of the World), "世界之窗" is the key of the sentence, "我想去" is merely an expression of the speaker's intention, and the place that ultimately needs to be acted on is "世界之窗". Before the speech recognition result of the speech data is determined, the domain of each candidate word sequence that may become the speech recognition result has been determined, that is, the domain in which the candidate word sequences are located has been identified. There may be multiple candidate word sequences, and all candidate word sequences generally correspond to one domain, that is, multiple candidate word sequences correspond to the same domain; since the speech recognition result is the candidate word sequence with the highest final score, the domain of the speech recognition result can be determined according to the domain in which the extracted candidate word sequences are located.
After the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to the domain of the speech recognition result, that is, each domain has its own corresponding database. For example, if the speech recognition result is "我想去帝王大厦" (I want to go to Emperor Building), entity extraction on this sentence yields the entity word "帝王大厦", and the domain of the speech recognition result is the "navigation" domain; the entity word "帝王大厦" is then retrieved in the database corresponding to the "navigation" domain, that is, the corresponding place names are searched. When the retrieval result is "地王大厦" (Diwang Building), the retrieval result is inconsistent with the entity word, so the entity word "帝王大厦" can be replaced with the retrieval result "地王大厦"; the entity word in the speech recognition result is thus repaired, and the repaired speech recognition result "我想去地王大厦" is obtained. That is, through the entity repair operation, correction operations such as fixing wrongly written characters can be performed on the speech recognition result, yielding a more accurate speech recognition result.
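A minimal sketch of this repair step, using fuzzy retrieval against a hypothetical navigation database (difflib here stands in for whatever retrieval the deployed system uses; the database entries are placeholders):

```python
import difflib

# Hypothetical database of place names for the "navigation" domain.
NAVIGATION_DB = ["地王大厦", "世界之窗", "京基100"]

def repair_entity(entity, database):
    # Retrieve the closest entry; if it differs from the extracted
    # entity word, replace the entity word with the retrieval result.
    matches = difflib.get_close_matches(entity, database, n=1, cutoff=0.5)
    return matches[0] if matches and matches[0] != entity else entity

result = "我想去帝王大厦"
entity = "帝王大厦"  # obtained by entity extraction (hypothetical)
repaired = result.replace(entity, repair_entity(entity, NAVIGATION_DB))
print(repaired)  # -> 我想去地王大厦
```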
As shown in FIG. 7, in an embodiment, a speech recognition method is provided. This embodiment is illustrated by applying the method to the server 120 in FIG. 1. Referring to FIG. 7, the speech recognition method specifically includes the following steps:
Step 702: train the neural networks to obtain trained neural networks.
Before the neural networks are actually used, they need to be trained in advance according to the actual project. Each domain has its own corresponding neural network, so the neural network corresponding to each domain is trained separately with the text corresponding to that domain, and the trained neural networks correspond one-to-one with the domains; as shown in FIG. 8, each neural network has its own training text. In actual use, the semantic classification model identifies the domain in which the extracted candidate word sequences are located, and after the domain of the extracted candidate word sequences is determined by the semantic classification model, each candidate word sequence is input in turn into the neural network of the corresponding domain for re-scoring; the domains of the neural networks therefore correspond to the domains of the semantic classification model. Accordingly, during training, as shown in FIG. 8, the semantic classification model can be trained together with the neural networks so that the domains of the neural networks correspond to the classification domains of the semantic classification model. The neural networks are used to recognize speech, while the semantic classification model is used to analyze the semantics of the speech data; the neural networks can therefore be considered to belong to the recognition side, and the semantic classification model to the semantic side.
在对神经网络进行训练时,可获取到各个领域对应的文本,将每个文本中的每个词语转换成词向量后,可将词向量作为输入数据,并按照句子中的词语的顺序,将输入词语的下一个词语的词向量作为当前输入词语的输出数据,即每个输出的词向量都与前一个时刻输入的词向量以及之前所有时刻输入的词向量相关。在输入词语对应的词向量对神经网络进行训练时,可对神经网络的参数进行不断的调整,直至调整到一个合适的参数,则代表神经网络训练完毕,得到了训练好的神经网络。在分别对每个领域的神经网络进行训练后,如果在实际运用过程中需要对某一个领域对应的神经网络进行更新时,只需要更新该领域对应的神经网络即可,比如,只需要更新领域n时,则可只需要更新如图8中所示的灰色部分,即对领域n对应的神经网络进行重新训练,并同时在语义侧重新训练对应领域的语义分类模型即可。When training the neural network, the corresponding texts of each field can be obtained. After converting each word in each text into a word vector, the word vector can be used as input data, and according to the order of the words in the sentence, The word vector of the next word of the input word is used as the output data of the current input word, that is, each output word vector is related to the word vector input at the previous moment and the word vector input at all previous times. When the neural network is trained by the word vector corresponding to the input word, the parameters of the neural network can be continuously adjusted until the appropriate parameters are adjusted, and the neural network is trained, and the trained neural network is obtained. After training the neural network in each field separately, if it is necessary to update the neural network corresponding to a certain domain in the actual application process, only the corresponding neural network of the domain needs to be updated, for example, only the domain needs to be updated. In the case of n, it is only necessary to update the gray portion as shown in FIG. 8, that is, to retrain the neural network corresponding to the field n, and at the same time retrain the semantic classification model of the corresponding domain on the semantic side.
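The input/output arrangement just described — each word's vector as input, the following word as its target — can be sketched as follows; the tokenized domain texts are invented for illustration:

```python
# Sketch: building per-domain (input word -> next word) training pairs,
# as described above. The tokenized sentences are illustrative assumptions.
def next_word_pairs(tokens):
    """Pair every word with the word that follows it in the sentence."""
    return list(zip(tokens[:-1], tokens[1:]))

domain_texts = {
    "music": [["播放", "周杰伦", "的", "双截棍"]],
    "navigation": [["我", "想", "去", "地王大厦"]],
}
for domain, sentences in domain_texts.items():
    pairs = [p for s in sentences for p in next_word_pairs(s)]
    print(domain, pairs)  # each pair: (input word, expected next word)
```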
The neural network may be a recurrent neural network, also called a recurrent-network language model (RNNLM). The advantage of a recurrent-network language model is its long memory effect: the word vector of each output word is jointly influenced by the word vectors of all previously input words, and the output data corresponding to the currently input word's vector is the word vector of the next word. For each output, however, the input is not merely the word vector of the most recent word; it is the word vector of the most recent word together with the word vectors of all words input before it. That is, each output word depends on the word input at the previous moment and on the words input at all earlier moments. In this manner, the output data corresponding to each input word vector is set and the recurrent-network language model is trained, so that a trained neural network can be put into actual use.
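As a minimal sketch of scoring a sentence with such a recurrent language model, the toy below sums log P(next word | history), with a hidden state that carries the entire prefix; the vocabulary and the randomly initialized weights are assumptions standing in for a trained RNNLM:

```python
# Toy RNN language model: the hidden state h is updated from every earlier
# word, so each prediction depends on the whole history, not just N-1 words.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "我", "想", "去", "地王大厦", "</s>"]
V, H = len(vocab), 8
E = rng.normal(size=(V, H))   # word embeddings (untrained, for illustration)
W = rng.normal(size=(H, H))   # recurrent weights
U = rng.normal(size=(H, V))   # output projection

def sentence_log_prob(words):
    """Sum of log P(w_t | w_1 .. w_{t-1}) under the toy model."""
    h = np.zeros(H)
    ids = [vocab.index(w) for w in words]
    total = 0.0
    for prev, cur in zip(ids, ids[1:]):
        h = np.tanh(E[prev] + h @ W)          # history accumulates in h
        logits = h @ U
        logits -= logits.max()                # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        total += np.log(probs[cur])
    return total

print(sentence_log_prob(["<s>", "我", "想", "去", "地王大厦", "</s>"]))
```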
Step 704: acquire a plurality of word sequences obtained by decoding the speech data, and a first score corresponding to each word sequence.
Speech decoding can be performed by the front end, which may be a terminal; it decodes the speech data into multiple word sequences, each with a corresponding first score. Specifically, as shown in FIG. 9, after receiving the input speech data the front end extracts features from it, then searches and decodes the feature-extracted speech data using a trained acoustic model, a language model, and a pronunciation dictionary to obtain multiple word sequences. The acoustic model is trained on a speech training set and the language model on a text training set; only the trained acoustic model and language model can be put into actual use.
A word sequence can be viewed as a set of words and paths, also called a lattice. A lattice is essentially a directed acyclic graph: each node represents the end time of a word, and each edge represents a possible word together with that word's acoustic score and language-model score. When the recognition results are represented, each node stores the recognition result at the current position, including information such as acoustic probability and language probability. As shown in FIG. 10, starting from the leftmost position <s> and following different arcs to the final </s> yields different word sequences, and combining the probabilities stored on the arcs gives the probability (score) that the input speech corresponds to a particular piece of text. For example, in FIG. 10 both "北京欢迎你" ("Beijing welcomes you") and the near-homophone misrecognition "背景换映你" can be regarded as paths through the recognition result; each is a word sequence. Every path in the figure corresponds to a probability, from which the score of each path, the first score, can be computed; each word sequence therefore has its own corresponding first score.
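The path-and-score structure can be illustrated with a toy lattice (all node names and scores below are invented): walking every arc sequence from <s> to </s> enumerates the word sequences and accumulates each path's first score:

```python
# Toy lattice in the spirit of FIG. 10: a directed acyclic graph whose edges
# carry a word and a log-score. Every <s> -> </s> path is one word sequence.
def all_paths(graph, node, words, score, out):
    if node == "</s>":
        out.append(("".join(words), score))
        return
    for word, nxt, edge_score in graph.get(node, []):
        all_paths(graph, nxt, words + [word], score + edge_score, out)

lattice = {
    "<s>": [("北京", "n1", -1.2), ("背景", "n1", -1.5)],
    "n1":  [("欢迎", "n2", -0.8), ("换映", "n2", -1.9)],
    "n2":  [("你", "</s>", -0.3)],
}
paths = []
all_paths(lattice, "<s>", [], 0.0, paths)
for sentence, first_score in sorted(paths, key=lambda p: -p[1]):
    print(sentence, first_score)
```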
Step 706: extract, from the multiple word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.
Once multiple word sequences have been obtained, the preset number of sequences with the highest first scores can be extracted from them as candidate word sequences. The optimal path computed over the lattice, i.e., the path with the highest probability, does not necessarily match the actual word sequence, so the preset number of top-scoring word sequences is extracted as candidates, also called the n-best list (the n optimal sequences). The preset number can be set by engineers according to actual needs.
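A sketch of the n-best extraction itself, assuming the scored sequences from the toy lattice above and a preset number N of 2:

```python
# Keep the preset number of word sequences with the highest first scores.
import heapq

scored = [("北京欢迎你", -2.3), ("背景欢迎你", -2.6),
          ("北京换映你", -3.4), ("背景换映你", -3.7)]
N = 2  # the engineer-chosen preset number
n_best = heapq.nlargest(N, scored, key=lambda item: item[1])
print(n_best)  # the two candidate word sequences
```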
Step 708: input each candidate word sequence into a semantic classification model.
Step 710: classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
Step 712: take the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
Step 714: input the candidate word sequences into the trained neural network of the corresponding domain, according to the domain in which the extracted candidate word sequences are located.
Step 716: rescore each candidate word sequence with the trained neural network to obtain a second score corresponding to each candidate word sequence.
Step 718: compute a weighted sum of the first score and the second score corresponding to each candidate word sequence to obtain the final score of each candidate word sequence.
Step 720: take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
After the preset number of word sequences has been extracted as candidates, each candidate word sequence can be input into the trained semantic classification model, which classifies each candidate and outputs the resulting classification label; the domain corresponding to each candidate can then be determined from its label. However, the trained semantic classification model may classify different candidates into different domains. To determine a single domain for the extracted candidates, the classification label of every candidate is obtained and the label with the largest share is taken as the label of all candidates; that is, the domain corresponding to the majority label becomes the domain of all candidate word sequences. The multiple candidates thus end up in the same domain, and in general the multiple candidates derived from one piece of speech data will not belong to multiple domains.
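The majority-label rule can be sketched in a few lines; the labels below are invented:

```python
# Each candidate receives a label from the semantic classification model;
# the label with the largest share decides the single domain for all of them.
from collections import Counter

labels = ["music", "music", "navigation", "music"]  # one label per candidate
domain = Counter(labels).most_common(1)[0][0]
print(domain)  # music
```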
After the multiple candidate word sequences have been extracted, i.e., after the n-best extraction step, domain identification can be performed on them, using the semantic classification model to identify the domain in which the candidates are located. As shown in FIG. 11, after n-best extraction the semantic classification model identifies the candidates' domain, and the candidates are input into the neural network of that domain. Each domain has its own corresponding neural network, i.e., the networks are per-domain, and the network may be an RNNLM. Compared with an n-gram language model, an RNNLM preserves long-term memory better. The n-gram language model is commonly used in large-vocabulary continuous speech recognition; it is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words, and these probabilities can be obtained by directly counting how often N words co-occur in a corpus.
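For contrast with the RNNLM discussed next, here is the count-based estimate the n-gram assumption implies, on a two-sentence toy corpus (N = 2):

```python
# Bigram probabilities estimated from raw co-occurrence counts; a sentence's
# probability is the product of these conditionals. Corpus is a toy.
from collections import Counter

corpus = "我 的 心情 不错 。 我 的 心情 不好 。".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """P(word | prev) from counts."""
    return bigram[(prev, word)] / unigram[prev]

print(p("不错", "心情"))  # 0.5: '心情' is followed once by '不错', once by '不好'
```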
An RNNLM is different. For example, "我的" ("my") and "心情" ("mood") may be followed by "不错" ("good") or by "不好" ("bad"); which of these words appears depends on the earlier occurrence of "我的" and "心情" — the "memory effect". A traditional language model such as an n-gram, when computing the probability of a word's occurrence, depends only on the preceding N-1 words, with N generally set to at most 5; earlier words are ignored. This is unreasonable, because words that appeared earlier all influence the current word. For example, in "刘德华/是/一个/优秀/的/演员/以及/歌手/有着/很多/经典/作品/他/其中/的/一个" ("Andy Lau / is / an / excellent / actor / and / singer / with / many / classic / works / … / one / of / them"), given the earlier context the probability of "歌" ("song") appearing next must be higher than that of its homophone "哥" ("brother"); but if only the previous 3 or 4 words are considered and the earlier text is ignored, there is no way to decide between "歌" and "哥". An RNNLM's memory effect is good precisely because, unlike a traditional language model, it takes into account the long text input before and computes the probability of the next word from that long history; it therefore has a better memory effect.
After the candidate word sequences are input into the neural network of the corresponding domain, the network rescores them to obtain the second score of each candidate, and the speech recognition result of the speech data is finally obtained. Because the language model used to generate the lattice cannot be made precise enough, a larger language model must be used to readjust the n-best scores; this process is called rescoring.
In the example shown in FIG. 12, the candidate word sequences and their corresponding first scores are: "播放周杰伦的双节棍" 94.7456, "播放周杰伦的双截棍" 95.3976, "播放周结论的双截棍" 94.7951, and so on. These candidates are all input into the semantic classification model for classification; the model identifies them as belonging to the music domain, so each candidate is input into the neural network of the music domain for rescoring, yielding the second score of each candidate: "播放周杰伦的双节棍" 93.3381, "播放周杰伦的双截棍" 95.0925, "播放周结论的双截棍" 94.1557, and so on.
Once the first and second scores of each candidate are available, a weighted sum of the two gives the final score of each candidate word sequence. Specifically, final score of each candidate = first score × first weight + second score × second weight. The first and second weights may or may not be equal; they represent the shares of the first and second scores respectively. When the first weight exceeds the second, the engineers have increased the share of the first score, judging that it should carry more weight than the second score, i.e., that the first score has a greater influence on the recognition result. The two weights can be determined from large amounts of test data: using many candidate word sequences, the values of the first and second weights are adjusted repeatedly until the candidate with the highest final score computed from them is indeed the most accurate recognition result; the actual values of the two weights are then fixed and used as the final-score calculation parameters in subsequent actual use.
For example, when the first weight is 0.4 and the second weight is 0.6, the final score of the candidate "播放周杰伦的双截棍" is 95.3976 × 0.4 + 95.0925 × 0.6 = 95.21454. After the final scores of all candidates have been computed, it can be seen that the candidate "播放周杰伦的双截棍" has the highest score, so "播放周杰伦的双截棍" ("play Jay Chou's nunchucks") is taken as the speech recognition result. After the final score is computed, a further calculation step may be added, for instance taking the logarithm of the final score as the selection score: the final score of "播放周杰伦的双截棍" above is 95.21454, and its logarithm, e.g. log₁₀ 95.21454, is computed; after the logarithm of every candidate's final score has been obtained, the logarithms serve as the selection scores of the candidates, and the candidate with the highest selection score is chosen as the speech recognition result. This computation can be varied, and other calculation schemes may be adopted, as set by the engineers.
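The worked example can be checked directly; the snippet below reproduces the weighted sum and the optional logarithm step with the scores from FIG. 12:

```python
# final = first * 0.4 + second * 0.6; pick the highest-scoring candidate,
# optionally via the base-10 logarithm of its final score.
import math

candidates = {
    "播放周杰伦的双节棍": (94.7456, 93.3381),
    "播放周杰伦的双截棍": (95.3976, 95.0925),
    "播放周结论的双截棍": (94.7951, 94.1557),
}
w1, w2 = 0.4, 0.6
final = {s: w1 * first + w2 * second for s, (first, second) in candidates.items()}
best = max(final, key=final.get)
print(best, final[best], math.log10(final[best]))
# -> 播放周杰伦的双截棍 95.21454 ...
```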
Step 722: perform entity extraction on the speech recognition result to obtain entity words.
Step 724: determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
Step 726: retrieve each entity word in the database corresponding to the domain of the speech recognition result.
Step 728: when a retrieval result is inconsistent with the entity word, repair the entity word.
In speech recognition, recognition errors are generally concentrated in domain-specific entities. To address this, an entity-repair operation can be performed on the speech recognition result once it has been obtained. First, entity extraction is performed on the result; an entity represents a word that plays a major role in the sentence of the recognition result. For example, in "我想去帝王大厦" ("I want to go to the 帝王 Building"), "帝王大厦" is the focus of the sentence, while "我想去" ("I want to go") merely expresses the speaker's intention; the place that ultimately matters is "帝王大厦".
Entity extraction on a speech recognition result may yield several entity words or only one. After the entities are extracted, each entity word can be retrieved. Decoding the speech data yields multiple word sequences, from which several are selected as candidate word sequences and one of which is chosen as the speech recognition result; the domain of the result can therefore be determined from the domain in which the candidate word sequences are located. Once the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that domain; when the retrieved result is inconsistent with the entity word extracted from the recognition result, the extracted entity word is replaced with the retrieval result, repairing it in this way. For example, when the recognition result is "我想去帝王大厦" and the entity word "帝王大厦" has been extracted, "帝王大厦" can be retrieved in the navigation domain to which the result belongs, i.e., geographic-location information is searched; the search reveals that the actual result should be "地王大厦", so the whole sentence can be corrected to "我想去地王大厦". The entity-repair function can fix specific recognition errors quickly and accurately.
With the speech recognition method of this embodiment, because a per-domain rescoring mechanism is adopted, the rescoring operation after domain division substantially improves recognition accuracy: experiments show relative improvements of 4% to 16% in different domains and an overall accuracy improvement of 6%. In addition, because each domain has its own neural network, the cost of training can be greatly reduced. A traditional speech recognition scheme trains a single language model on a massive amount of text; to guarantee the rescoring effect the model must be very large, the training cycle is long, and evaluating the model's overall effect is difficult — training an RNNLM on 8.2 GB of text takes about 100 hours. After domain division, training a single domain's model can be shortened to within 24 hours, greatly reducing the update time for a domain-specific model. Moreover, the language model of the traditional technique cannot comprehensively cover all data: once a problem is found in it, the entire language model must be updated, and every update incurs enormous overhead. In this embodiment, only the neural network of the domain that needs processing or updating has to be retrained, which reduces training cost and shortens the time consumed by network updates.
FIG. 2 to FIG. 12 are flowcharts of the speech recognition method in various embodiments. It should be understood that although the steps in the flowcharts are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the figures may comprise several sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is not necessarily sequential either, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 13, a speech recognition apparatus is provided, including:
a word sequence acquisition module 1302, configured to acquire multiple word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
an extraction module 1304, configured to extract, from the multiple word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
a domain identification module 1306, configured to identify the domain in which the extracted candidate word sequences are located;
a rescoring module 1308, configured to input each candidate word sequence into the neural network of the corresponding domain according to the domain, and to rescore each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence; and
a speech recognition result determination module 1310, configured to obtain the final score of each candidate word sequence from the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
In one embodiment, as shown in FIG. 14, the domain identification module 1306 includes:
an input module 1306A, configured to input each candidate word sequence into the trained semantic classification model; and
a classification module 1306B, configured to classify each candidate word sequence with the semantic classification model to obtain the classification label corresponding to each candidate word sequence, and to take the domain corresponding to the label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
In one embodiment, the apparatus further includes a training module (not shown), configured to obtain the text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and train the neural network corresponding to each domain using the word vectors of that domain as input.
In one embodiment, the training module is further configured to, for each domain and following the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
In one embodiment, the rescoring module 1308 is further configured to compute a weighted sum of the first score and the second score corresponding to a candidate word sequence to obtain the final score of the candidate word sequence.
In one embodiment, the neural network is a recurrent neural network.
In one embodiment, the apparatus further includes an entity repair module (not shown), configured to perform entity extraction on the speech recognition result to obtain multiple entity words, to retrieve the entity words, and to repair an entity word when a retrieval result is inconsistent with it.
In one embodiment, the entity repair module is further configured to determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located, and to retrieve each entity word in the database corresponding to the domain of the speech recognition result.
FIG. 15 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in FIG. 1. As shown in FIG. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech recognition method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech recognition method described above.
Those skilled in the art will understand that the structure shown in FIG. 15 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the speech recognition apparatus provided in the present application may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 15. The memory of the computer device may store the program modules constituting the speech recognition apparatus, for example the word sequence acquisition module, the extraction module, the domain identification module, the rescoring module, and the speech recognition result determination module shown in FIG. 13. The computer program composed of these program modules causes the processor to perform the steps of the speech recognition method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 15 may, through the word sequence acquisition module of the speech recognition apparatus shown in FIG. 13, acquire multiple word sequences obtained by decoding speech data together with the first score corresponding to each; through the extraction module, extract the preset number of word sequences with the highest first scores as candidate word sequences; through the domain identification module, identify the domain in which the extracted candidates are located; through the rescoring module, input each candidate into the neural network of the corresponding domain according to that domain and rescore it with the network to obtain its second score; and through the speech recognition result determination module, obtain each candidate's final score from its first and second scores and take the candidate with the highest final score as the speech recognition result of the speech data.
In one embodiment, a computer device is provided, including a memory and a processor; the memory stores a computer program which, when executed by the processor, implements the steps of the speech recognition method provided in any embodiment of the present application.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the speech recognition method provided in any embodiment of the present application.
Those of ordinary skill in the art will understand that all or part of the processes of the methods of the above embodiments can be accomplished by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features contains no contradiction, it should be regarded as within the scope described in this specification.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but they should not be construed as limiting the scope of this patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (22)

  1. A speech recognition method, applied to a computer device, the method comprising:
    acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    identifying a domain in which the extracted candidate word sequences are located;
    inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
    rescoring each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence;
    obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
    taking the candidate word sequence with the highest final score as a speech recognition result of the speech data.
  2. The method according to claim 1, wherein identifying the domain in which the extracted candidate word sequences are located comprises:
    inputting each candidate word sequence into a semantic classification model;
    classifying each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence; and
    taking the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  3. The method according to claim 1, further comprising:
    acquiring text corresponding to each domain;
    converting each word in the text corresponding to each domain into a word vector; and
    training the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  4. The method according to claim 3, wherein training the neural network corresponding to each domain using the word vectors corresponding to that domain as input comprises:
    for each domain, following the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and training the neural network by adjusting parameters of the neural network corresponding to the domain.
  5. The method according to claim 1, wherein obtaining the final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence comprises:
    computing a weighted sum of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
  6. The method according to claim 1, wherein the neural network is a recurrent neural network.
  7. The method according to claim 1, further comprising, after taking the candidate word sequence with the highest final score as the speech recognition result of the speech data:
    performing entity extraction on the speech recognition result to obtain entity words;
    retrieving the entity words; and
    repairing an entity word when a retrieval result is inconsistent with the entity word.
  8. The method according to claim 7, wherein retrieving the entity words comprises:
    determining a domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located; and
    retrieving the entity words in a database corresponding to the domain of the speech recognition result.
  9. A speech recognition apparatus, comprising:
    a word sequence acquisition module, configured to acquire a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    an extraction module, configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    a domain identification module, configured to identify a domain in which the extracted candidate word sequences are located;
    a rescoring module, configured to input each candidate word sequence into a neural network of the corresponding domain according to the domain, and to rescore each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence; and
    a speech recognition result determination module, configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as a speech recognition result of the speech data.
  10. The apparatus according to claim 9, wherein the domain identification module comprises:
    an input module, configured to input each candidate word sequence into a trained semantic classification model; and
    a classification module, configured to classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to take the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  11. The apparatus according to claim 9, further comprising a training module, configured to acquire text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and train the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  12. The apparatus according to claim 11, wherein the training module is further configured to, for each domain and following the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and train the neural network by adjusting parameters of the neural network corresponding to the domain.
  13. The apparatus according to claim 9, further comprising an entity repair module, configured to perform entity extraction on the speech recognition result to obtain a plurality of entity words, to retrieve the entity words, and to repair an entity word when a retrieval result is inconsistent with the entity word.
  14. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following operations:
    acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    identifying a domain in which the extracted candidate word sequences are located;
    inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
    rescoring the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence;
    obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
    taking the candidate word sequence with the highest final score among the extracted candidate word sequences as a speech recognition result of the speech data.
  16. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    inputting each candidate word sequence into a semantic classification model;
    classifying each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence; and
    taking the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  17. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    acquiring text corresponding to each domain;
    converting each word in the text corresponding to each domain into a word vector; and
    training the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  18. The computer device according to claim 17, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    for each domain, following the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and training the neural network by adjusting parameters of the neural network corresponding to the domain.
  19. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    computing a weighted sum of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
  20. The computer device according to claim 15, wherein the neural network is a recurrent neural network.
  21. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    performing entity extraction on the speech recognition result to obtain entity words;
    retrieving the entity words; and
    repairing an entity word when a retrieval result is inconsistent with the entity word.
  22. The computer device according to claim 21, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    determining a domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located; and
    retrieving the entity words in a database corresponding to the domain of the speech recognition result.