WO2019218818A1 - Speech recognition method and apparatus, and computer readable storage medium and computer device

Info

Publication number
WO2019218818A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate word
score
word sequence
word
neural network
Application number
PCT/CN2019/082300
Other languages
French (fr)
Chinese (zh)
Inventor
刘毅 (Liu Yi)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2019218818A1 publication Critical patent/WO2019218818A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, computer readable storage medium, and computer device.
  • the technical solution of speech recognition is generally divided into two parts: front-end decoding and back-end processing.
  • the front end is mainly responsible for receiving input speech data and decoding the speech data to obtain multiple possible sentences, and the back end determines one of the multiple possible sentences obtained by the front end as the final speech recognition result.
  • the back end can input multiple possible sentences into the neural network to determine the final speech recognition result.
  • however, training the neural network that can eventually be put into use requires a large amount of text and takes a long time, so this speech recognition scheme is less efficient.
  • a speech recognition method is applied to a computer device, and the method includes: acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences; identifying the domain to which the extracted candidate word sequences belong; inputting each candidate word sequence into a neural network of the corresponding domain according to the identified domain; re-scoring each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence; and obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • a speech recognition device comprising:
  • a word sequence obtaining module configured to acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence
  • an extraction module configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
  • a domain identification module configured to identify the domain to which the extracted candidate word sequences belong;
  • a re-scoring module configured to input each candidate word sequence into a neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence;
  • a speech recognition result determining module configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the operations of the speech recognition method described above;
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the operations of the speech recognition method described above;
  • the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • the above speech recognition method, apparatus, computer readable storage medium and computer device acquire a plurality of word sequences obtained by decoding the speech data, together with the corresponding first scores, and select a plurality of candidate word sequences from the word sequences.
  • each candidate word sequence can be input into the neural network corresponding to its domain, and after the second score obtained by re-scoring each candidate word sequence with that neural network is acquired, the final score of each candidate word sequence can be determined according to the first score and the second score; the candidate word sequence with the highest final score can then be selected as the speech recognition result of the speech data.
  • this speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so the candidate word sequences can be re-scored using the neural network corresponding to the domain to which they belong, and the second score obtained in this way is more accurate. Moreover, since each domain has its own corresponding neural network, each domain trains its neural network with its own text rather than with all the text, so the training time for a domain-specific neural network is greatly shortened, further improving the efficiency of speech recognition.
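  • To make the flow above concrete, here is a minimal, runnable sketch of the pipeline in Python. All names and toy values (decode_to_word_sequences, classify_domain, DOMAIN_RNNLMS, the 0.5 weights) are illustrative assumptions, not components defined by the patent.

```python
from collections import Counter

# Toy stand-ins; a real system would use a decoder, a trained semantic
# classifier, and trained domain RNNLMs.
def decode_to_word_sequences(speech_data):
    return [(["play", "song", "A"], 95.4), (["play", "song", "B"], 94.7)]

def classify_domain(candidates):
    labels = ["music" for _ in candidates]        # pretend classifier output
    return Counter(labels).most_common(1)[0][0]   # majority label -> domain

class ToyRNNLM:
    def score(self, words):
        return 90.0 + len(words)                  # placeholder second score

DOMAIN_RNNLMS = {"music": ToyRNNLM(), "navigation": ToyRNNLM()}

def recognize(speech_data, n_best=10, w1=0.5, w2=0.5):
    sequences = decode_to_word_sequences(speech_data)                          # decode
    candidates = sorted(sequences, key=lambda s: s[1], reverse=True)[:n_best]  # n-best
    domain = classify_domain([w for w, _ in candidates])                       # domain ID
    rnnlm = DOMAIN_RNNLMS[domain]                                              # pick network
    rescored = [(w, s1, rnnlm.score(w)) for w, s1 in candidates]               # second score
    best = max(rescored, key=lambda c: w1 * c[1] + w2 * c[2])                  # final score
    return " ".join(best[0])

print(recognize(b"raw audio bytes"))
```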
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment
  • FIG. 2 is a schematic flow chart of a voice recognition method in an embodiment
  • FIG. 3 is a schematic flow chart of step 206 in an embodiment
  • FIG. 4 is a schematic flow chart of a training step of a neural network in an embodiment
  • FIG. 5 is a schematic diagram of input data and output data of a neural network in an embodiment
  • FIG. 6 is a schematic flow chart after step 214 in an embodiment
  • FIG. 7 is a schematic flow chart of a voice recognition method in another embodiment
  • FIG. 8 is a schematic diagram of a neural network training process corresponding to each field in an embodiment
  • FIG. 9 is a schematic flow chart of decoding a voice data by a front end in an embodiment
  • Figure 10 is a schematic illustration of a sequence of words in one embodiment
  • FIG. 11 is a schematic flow chart of re-scoring a candidate word sequence by a semantic classification model in an embodiment
  • FIG. 12 is a schematic flow chart of re-scoring a candidate word sequence by a semantic classification model in another embodiment
  • Figure 13 is a block diagram showing the structure of a voice recognition apparatus in an embodiment
  • FIG. 14 is a structural block diagram of a domain identification module in another embodiment
  • Figure 15 is a block diagram showing the structure of a computer device in an embodiment.
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment.
  • the speech recognition method is applied to a speech recognition system.
  • the speech recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically include at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by a stand-alone server or a server cluster composed of a plurality of servers.
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 2, the voice recognition method specifically includes the following steps:
  • Step 202 Acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the server can obtain a plurality of word sequences obtained by decoding the voice data, and the decoding operation can be performed by the terminal.
  • the terminal can use the trained acoustic model, the language model, and the pronunciation dictionary to decode the obtained speech data, so that a plurality of word sequences can be obtained.
  • the decoding operation can also be performed by the server.
  • a word sequence refers to a plurality of words obtained by decoding the voice data together with the paths connecting those words; the sentence obtained by connecting the corresponding words of a path in order, from its start position to its end position, can be regarded as a recognition result, that is, each word sequence can be regarded as a recognition result.
  • each path of each word sequence has a corresponding score, that is, the first score.
  • the first score can be thought of as a score calculated from the probability of occurrence of each path.
  • Step 204 Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • each word sequence has a corresponding first score, and the word sequences ranked highest by the first score can be extracted.
  • for example, the word sequences may be sorted from the highest first score to the lowest, so that the word sequence with the highest first score is ranked first.
  • extracting the top-ranked word sequences then means taking word sequences from the front of this ordering; the number of word sequences to extract can be preset, so that a preset number of word sequences with the highest first scores are extracted as candidate word sequences. The preset number can be adjusted at the technician's discretion.
  • alternatively, the word sequences may be sorted from the lowest first score to the highest, so that the word sequence with the highest first score is ranked last.
  • in that case, extracting the preset number of word sequences with the highest first scores means starting from the last position and extracting the preset number of word sequences in reverse order.
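  • A minimal sketch of this extraction step, assuming each decoded word sequence is a (words, first_score) pair; the data values are illustrative.

```python
import heapq

def extract_candidates(word_sequences, n_best):
    # Equivalent to sorting by first score from high to low and taking the
    # first n_best entries (or sorting low to high and taking the last ones).
    return heapq.nlargest(n_best, word_sequences, key=lambda ws: ws[1])

sequences = [(["play", "A"], 94.7456), (["play", "B"], 95.3976), (["play", "C"], 94.7795)]
print(extract_candidates(sequences, 2))  # the two highest-scoring sequences
```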
  • Step 206 Identify the domain in which the extracted candidate word sequence is located.
  • Step 208 Input each candidate word sequence into the neural network of the corresponding domain according to the identified domain.
  • the field in which the predetermined number of candidate word sequences are located can be identified.
  • the field can be customized by the technician, such as dividing the field into navigation, music, and so on.
  • there are multiple candidate word sequences, but the identified domain may be a single one, that is, the multiple candidate word sequences belong to the same domain. After the domain corresponding to the candidate word sequences is identified, the multiple candidate word sequences may each be input into the neural network of the corresponding domain.
  • that is, each domain has a corresponding neural network, and the neural network of each domain can re-score the candidate word sequences of its own domain in a targeted manner.
  • Step 210 Re-score each candidate word sequence by a neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 212 Obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
  • Step 214 the candidate word sequence with the highest final score is used as the speech recognition result of the speech data.
  • for example, the neural network of the "navigation" domain can re-score each input candidate word sequence to obtain the second score of each candidate word sequence.
  • re-scoring means that the neural network corresponding to the domain of the candidate word sequences performs a score calculation on each input candidate word sequence to obtain the second score of each candidate word sequence. The first score and the second score of each candidate word sequence are then combined in a certain manner; for example, the first score and the second score may each be weighted, and the weighted first score and second score added together. The final score of each candidate word sequence is thus obtained according to the first score and the second score, and the candidate word sequence with the highest final score can be used as the speech recognition result of the speech data.
  • each candidate word sequence may be input into the neural network of the corresponding domain, and after the second score obtained by re-scoring each candidate word sequence with that neural network is acquired, the final score of each candidate word sequence may be determined according to the first score and the second score; the candidate word sequence with the highest final score can then be selected as the speech recognition result of the speech data.
  • this speech recognition method identifies the domain of the candidate word sequences before re-scoring them, so each candidate word sequence can be re-scored using the neural network corresponding to its domain, and the second score obtained in this way is more accurate. Moreover, each domain has its own corresponding neural network and trains it with its own text rather than with all the text, so the training time for a domain-specific neural network is greatly shortened, further improving the efficiency of speech recognition.
  • step 206 includes:
  • Step 302 Input each candidate word sequence into a semantic classification model.
  • Step 304 Classify each candidate word sequence by the semantic classification model, and obtain a classification label corresponding to each candidate word sequence.
  • Step 306 Acquire the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • each candidate word sequence can be input into the trained semantic classification model.
  • the trained semantic classification model refers to a semantic classification model trained in advance on a large number of training samples, and is used to identify the domain of the input candidate word sequences.
  • the trained semantic classification model can classify each candidate word sequence and output the classification label obtained for each candidate word sequence; the domain corresponding to each candidate word sequence can then be determined according to its classification label.
  • the trained semantic classification model may produce different classification results for different candidate word sequences, which means the model judges the candidate word sequences to belong to different domains.
  • in order to determine the domain of the candidate word sequences, the classification label of each candidate word sequence can be obtained, and the classification label with the largest proportion is used as the classification label corresponding to all the candidate word sequences; that is, the domain corresponding to the classification label with the largest proportion is the domain of all the candidate word sequences.
  • for example, the classification label of candidate word sequence A is 1, the classification label of candidate word sequence B is 1, the classification label of candidate word sequence C is 2, and the classification label of candidate word sequence D is 1, where 1 represents the navigation class and 2 represents the music class. Among the candidate word sequences A, B, C, and D, label 1 has the largest proportion, so label 1 can be used as the classification label of A, B, C, and D, and the domain of A, B, C, and D is considered to be the navigation class.
  • the domain of the extracted candidate word sequences is identified by the semantic classification model, and when the candidate word sequences diverge in domain, the domain corresponding to the classification label with the largest proportion is used as the domain of the candidate word sequences.
  • this ensures the accuracy of the candidate word sequences' domain and also improves the accuracy of the subsequent re-scoring of the candidate word sequences, thereby improving the accuracy of the speech recognition result.
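  • A minimal sketch of steps 302 to 306, with a stand-in classifier; predict_label and the label-to-domain mapping are illustrative assumptions standing in for a trained semantic classification model.

```python
from collections import Counter

def predict_label(candidate):                      # hypothetical stand-in classifier
    return 1 if "go" in candidate else 2

def identify_domain(candidates, label_to_domain):
    labels = [predict_label(c) for c in candidates]
    majority_label, _ = Counter(labels).most_common(1)[0]   # label with largest share
    return label_to_domain[majority_label]

domains = {1: "navigation", 2: "music"}
cands = [["I", "want", "to", "go"], ["I", "want", "to", "go", "too"], ["play", "music"]]
print(identify_domain(cands, domains))  # -> "navigation" (label 1 is the majority)
```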
  • in an embodiment, the voice recognition method further includes: acquiring the text corresponding to each domain; converting each word in the text corresponding to each domain into a word vector; and taking the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
  • before the neural networks are actually used, each neural network can be trained according to actual needs, and the trained neural networks can then be put into actual use to re-score the input candidate word sequences. Since each domain has its own corresponding neural network, the text corresponding to each domain needs to be obtained when training the neural network of that domain.
  • the text can be sentences. After the sentences of each domain are obtained, the sentences of a domain can be processed by word segmentation to obtain the multiple words of each sentence; the words contained in each sentence are converted into word vectors and used as the input data of the domain's neural network, and the neural network of each domain is trained in this way. After training, the trained neural network of each domain is obtained.
  • pre-training the neural network of each domain ensures the accuracy of re-scoring candidate word sequences when the neural networks are actually used, thereby improving both the accuracy and the efficiency of speech recognition.
  • taking the word vectors as input and training the neural network corresponding to each domain includes: for each domain, according to the order of the words in the domain's text, taking the word vector corresponding to each word as input and the word vector corresponding to the next word after each input word as output, and adjusting the parameters of the domain's neural network to train it.
  • the words contained in the text can be converted into word vectors, the word vectors corresponding to each domain's text are input into that domain's neural network, and the neural network is trained.
  • the word vector of each word may be input in the order of the words in the text, with the word vector corresponding to the next word after each input word as the output.
  • the input data corresponding to each output word is therefore not only the word input at the previous moment, but the words input at the previous moment and at all earlier moments.
  • the parameters of the neural network are adjusted to train it, so that a trained neural network corresponding to each domain is obtained, and the score obtained when the trained neural network re-scores candidate word sequences is reliable.
  • the voice recognition method further includes a training step of the neural network, including:
  • Step 402 Acquire text corresponding to each field.
  • Step 404 Convert each word in the text corresponding to each field into a word vector.
  • Step 406 For each domain, according to the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the neural network can be trained in advance, and the trained neural network can be obtained and then applied.
  • Each domain has its own corresponding neural network, that is, it needs to train the corresponding neural network in each field separately.
  • a word vector can be an N-dimensional vector, where N is a positive integer.
  • the text corresponding to each domain can be regarded as the sentences of that domain. A word segmentation tool can be used to split each sentence into words, that is, each sentence yields multiple words, and the words contained in each sentence can be converted into word vectors and input into the neural network of the corresponding domain.
  • the word vectors corresponding to the words contained in each sentence are used as the input of the neural network, and for each input word, the word vector of the next word in the sentence is used as the corresponding output.
  • the output data is thus the word vector corresponding to the next word after each input word; however, for each output, the input is not only the word vector of the most recently input word, but the word vectors of the most recently input word and of all the words input before it.
  • the input data and output data of the neural network are set, and the parameters of the neural network are continuously adjusted to train the neural network.
  • for example, sentence A is "I was late for school yesterday" (a Chinese sentence in the original example); the word segmentation tool can split the sentence into: I / yesterday / going to school / late / 了 (a sentence-final particle).
  • the word vectors corresponding to the words contained in sentence A are input into the neural network in order, where x1, x2, x3, x4, x5, and x6 are the word vectors of the input words, that is, the input data.
  • a blank input is added before the first word vector; that is, when the word vectors corresponding to the words of sentence A are input into the neural network, the word vector corresponding to "I" is not the first input but the second, and the first input is empty by default.
  • the word vector corresponding to the next word after each input word is used as the corresponding output data, that is, the output corresponding to the first, default empty input is the word vector of the word "I", and the output corresponding to the input word "I" is the word vector of the word "yesterday".
  • x1, x2, x3, x4, x5, and x6 thus correspond to the blank word, "I", "yesterday", "going to school", "late", and "了", and y1, y2, y3, y4, y5, and y6 are the output data corresponding to these input word vectors, whose words are "I", "yesterday", "going to school", "late", and "了"; that is, the output data corresponding to the word "going to school" is the word vector of its next word, "late".
  • for the output "yesterday", the input data consists only of the word "I" and the default blank word, but for the output "了", the input data is all the words input before it, that is, "I", "yesterday", "going to school", "late", and the default blank word. In other words, each output word vector is related to the input of the previous moment and the inputs of all earlier moments; that is, each output word is related to the word input at the previous moment and the words input at all earlier times.
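  • A minimal sketch of building the (input, output) training pairs just described: prepend a blank token and shift by one so that each input word predicts the next word. The token strings and the end-of-sentence marker for the last output (which the text leaves unspecified) are assumptions.

```python
def make_training_pairs(sentence_words, blank="<blank>", eos="<eos>"):
    inputs = [blank] + sentence_words      # x1..x6: blank, I, yesterday, ...
    outputs = sentence_words + [eos]       # y1..y6: I, yesterday, ..., end marker
    return list(zip(inputs, outputs))      # one (x_t, y_t) pair per time step

words = ["I", "yesterday", "going to school", "late", "了"]
for x, y in make_training_pairs(words):
    print(f"input: {x!r:20} -> output: {y!r}")
```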
  • the training process of the neural network is a process of constantly adjusting the parameters of the neural network.
  • when suitable parameters are found, the neural network can be considered trained; that is, the parameters of the neural network are adjusted until the technician determines that a certain set of parameters meets the requirements, at which point the neural network is trained.
  • a verification approach may also be adopted, that is, after the parameters of the neural network are set, the neural network is verified using a large amount of verification sample data.
  • the verification data is input into the neural network, and the prediction accuracy of the neural network on the verification data under the current parameters is measured.
  • when the prediction accuracy of the neural network reaches the standard value set by the technician, the parameters can be considered to meet the requirements; otherwise, the parameters of the neural network need to be adjusted further until the neural network passes verification.
  • training the neural network of each domain on a large amount of that domain's text ensures the reliability of the trained neural network and improves the efficiency of its data processing, thereby improving the efficiency and accuracy of speech recognition.
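  • For illustration, a minimal next-word-prediction training loop in PyTorch, as one possible realization of steps 402 to 406; the tiny vocabulary, dimensions, and single training sentence are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

vocab = ["<blank>", "I", "yesterday", "going to school", "late", "了", "<eos>"]
idx = {w: i for i, w in enumerate(vocab)}

class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)       # word -> word vector
        self.rnn = nn.RNN(dim, dim, batch_first=True)  # carries all earlier inputs
        self.out = nn.Linear(dim, vocab_size)          # hidden state -> next-word logits

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

# Inputs are blank + words; targets are the same words shifted by one.
xs = torch.tensor([[idx[w] for w in ["<blank>", "I", "yesterday", "going to school", "late", "了"]]])
ys = torch.tensor([[idx[w] for w in ["I", "yesterday", "going to school", "late", "了", "<eos>"]]])

model = TinyRNNLM(len(vocab))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                        # "continuously adjust the parameters"
    logits = model(xs)                         # (1, seq_len, vocab_size)
    loss = loss_fn(logits.view(-1, len(vocab)), ys.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.4f}")
```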
  • the final score of a candidate word sequence is obtained according to the first score and the second score corresponding to the candidate word sequence by performing a weighted summation of the first score and the second score to obtain the final score of the candidate word sequence.
  • the terminal may perform the decoding operation on the voice data to obtain a plurality of word sequences, each word sequence having a corresponding first score, and a preset number of word sequences are selected from the plurality of word sequences as candidate word sequences; that is, each candidate word sequence has a corresponding first score.
  • the trained neural network then re-scores each candidate word sequence to obtain a second score, so each candidate word sequence also has its own corresponding second score. After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be calculated.
  • a weighted calculation may be performed; the weights of the first score and the second score may be the same or different, depending on the technician's setting. For example, if the technician considers the first score to be more accurate, or to have a greater impact on the speech recognition result, the weight of the first score may be increased. After both the first score and the second score are weighted, the weighted first score and second score are summed to obtain the final score of each candidate word sequence.
  • the influence of the first score and the second score on the speech recognition result may be adjusted at any time by adjusting their weights, that is, the proportions of the speech side and the semantic side can be adjusted at any time, which further ensures the reliability of the final score and makes the selected speech recognition result more accurate.
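  • Written out, the combination just described is a weighted sum followed by selecting the maximum; the symbols alpha and beta below are assumed names for the two weights, which the text leaves to the technician:

```latex
\[
  \mathrm{final}_i = \alpha \, s^{(1)}_i + \beta \, s^{(2)}_i ,
  \qquad
  \hat{i} = \arg\max_i \, \mathrm{final}_i ,
\]
% where $s^{(1)}_i$ and $s^{(2)}_i$ are the first and second scores of
% candidate word sequence $i$, and $\alpha, \beta$ are the technician-set weights.
```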
  • the neural network is a recurrent neural network.
  • the recurrent neural network, in the form of an RNNLM (Recurrent Neural Network Based Language Model), can also be called a recurrent network language model.
  • the recurrent network language model can take into account multiple previously input words and can calculate the probability of the next word based on the long text composed of the previously input words, so the recurrent network language model has a better "memory effect". For example, after "my" and "mood", the words "good" or "bad" may appear; the appearance of these words depends on the earlier appearance of "my" and "mood", and this is the "memory effect".
  • the recurrent network language model maps each input word into a compact continuous vector space, uses a relatively small set of parameters for that space, and uses recurrent connections to build the corresponding model, so that long-distance context dependencies can be captured.
  • a language model commonly used in large-vocabulary continuous speech recognition, the NGRAM model (a statistical language model), relies only on the first N-1 preceding words of the input, where N is a positive integer, whereas the recurrent network language model can capture the historical information of all previously input words. Therefore, compared with the traditional language model, which depends only on the features of the preceding N-1 words, the recurrent network language model scores the input candidate word sequences more accurately.
  • in an embodiment, the method further includes: performing entity extraction on the speech recognition result to obtain entity words; retrieving the entity words; and, when a retrieval result is inconsistent with an entity word, repairing the entity word.
  • the final score of each candidate word sequence can be calculated, and the candidate word sequence with the highest final score is selected as the speech recognition result of the speech data.
  • after the speech recognition result is obtained, entity repair may be performed on it; first, entity extraction is performed on the speech recognition result.
  • an entity is a word that plays a key role in the sentence of the speech recognition result. For example, in "I want to go to the Window of the World", the "Window of the World" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place actually acted upon is the "Window of the World".
  • when entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved, and when a retrieval result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the retrieval result; the entity words are repaired in this way.
  • recognized typos are generally concentrated in the special entities of a domain; therefore, the entities in the speech recognition result can be extracted to address this problem, and the entity words repaired accordingly, which reduces the probability of typos appearing in the speech recognition result and improves the accuracy of the resulting speech recognition result.
  • retrieving the entity words comprises: determining the domain of the speech recognition result based on the domain of the candidate word sequences; and retrieving each entity word in a database corresponding to the domain of the speech recognition result.
  • when the voice data is decoded, a plurality of word sequences can be obtained, several word sequences are extracted from them as candidate word sequences, and one of the candidate word sequences is selected as the speech recognition result of the speech data; the domain of the speech recognition result can therefore be determined according to the domain of the extracted candidate word sequences.
  • each entity word can then be retrieved in the database corresponding to that domain.
  • in an embodiment, after step 214 the following steps are further included:
  • Step 602 Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 604 Determine the domain of the speech recognition result according to the domain of the extracted candidate word sequences.
  • Step 606 Search each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 608 When the search result is inconsistent with the entity word, repair the entity word.
  • the entities of the speech recognition result can be extracted; an entity is a word that plays a key role in the sentence of the speech recognition result, such as the "Window of the World" in the earlier example.
  • there may be multiple candidate word sequences, but the domain corresponding to all of them is generally a single one, that is, the multiple candidate word sequences correspond to the same domain; since the speech recognition result is the candidate word sequence with the highest final score, the domain of the speech recognition result can be determined according to the domain of the extracted candidate word sequences.
  • each entity word can be retrieved in the database corresponding to the domain of the speech recognition result, that is, each domain has its own corresponding database.
  • for example, the speech recognition result is "I want to go to the Emperor Building", and entity extraction on this sentence yields the entity word "Emperor Building"; the domain of the speech recognition result is the "navigation" domain, so the entity word "Emperor Building" is searched in the database corresponding to the "navigation" domain, that is, among the corresponding place names.
  • if the search result is "Diwang Building", the search result is inconsistent with the entity word, so the entity word "Emperor Building" can be replaced with the search result "Diwang Building"; the entity word in the speech recognition result is thus repaired, giving the corrected speech recognition result "I want to go to the Diwang Building". In other words, the entity repair operation corrects typos and similar errors in the speech recognition result and yields a more accurate speech recognition result.
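  • A minimal sketch of steps 602 to 608: extract entity words, look each one up in the domain's database, and replace the entity when the lookup disagrees. extract_entities and the database contents are illustrative assumptions; a real system would use an entity recognizer and a full place-name database.

```python
def extract_entities(sentence):
    known = ["Emperor Building", "Window of the World"]   # stand-in entity list
    return [e for e in known if e in sentence]

NAVIGATION_DB = {"Emperor Building": "Diwang Building"}   # canonical place names

def repair_entities(result, domain_db):
    for entity in extract_entities(result):
        canonical = domain_db.get(entity)
        if canonical and canonical != entity:             # search result differs
            result = result.replace(entity, canonical)
    return result

print(repair_entities("I want to go to the Emperor Building", NAVIGATION_DB))
# -> "I want to go to the Diwang Building"
```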
  • a speech recognition method is provided. This embodiment is exemplified by applying the method to the server 120 in FIG. 1 described above. Referring to FIG. 7, the voice recognition method specifically includes the following steps:
  • Step 702 Train the neural networks to obtain trained neural networks.
  • each domain has its own corresponding neural network; therefore, the neural network of each domain is trained separately using the text of that domain.
  • the trained neural networks correspond to the domains one by one, as shown in FIG. 8.
  • Each neural network has its own training text.
  • the semantic classification model can identify the domain of the extracted candidate word sequences. After the domain of the extracted candidate word sequences is determined by the semantic classification model, each candidate word sequence is input into the neural network of the corresponding domain for re-scoring, so the domains of the neural networks correspond to the domains of the semantic classification model. Therefore, during training, as shown in FIG. 8, the semantic classification model can be trained together with the neural networks so that the domains of the neural networks correspond to the classification domains of the semantic classification model.
  • the neural networks are used for recognizing the speech, while the semantic classification model is used to analyze the semantics of the speech data; therefore, the neural networks belong to the recognition side, and the semantic classification model belongs to the semantic side.
  • when training the neural networks, the text corresponding to each domain can be obtained.
  • after each word in each text is converted into a word vector, the word vectors can be used as input data and, following the order of the words in the sentence, the word vector of the next word after each input word is used as the output data of the current input word; that is, each output word vector is related to the word vector input at the previous moment and the word vectors input at all earlier moments.
  • as the neural network is trained on the word vectors of the input words, its parameters can be adjusted continuously until suitable parameters are found; the neural network is then trained, and the trained neural network is obtained.
  • the neural network can be a recurrent neural network, also known as a recurrent network language model.
  • the advantage of the recurrent network language model is its long memory effect, that is, the word vector of each output word is influenced by the word vectors of all previously input words.
  • the output data corresponding to the word vector of the currently input word is the word vector corresponding to the next word.
  • the output data is the word vector corresponding to the next word after each input word, but for each output, the input data is not only the word vector of the most recently input word, but the word vectors of the most recently input word and of all the words input before it; that is, each output word is related to the word input at the previous moment and the words input at all earlier times.
  • the output data corresponding to each input word vector is set in this way, and the recurrent network language model is trained to obtain the trained neural network for practical use.
  • Step 704 Acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the speech decoding process can be performed by the front end, which can be the terminal; the speech data can be decoded to obtain a plurality of word sequences and a first score corresponding to each word sequence.
  • after receiving the voice data, the front end extracts feature data and uses the trained acoustic model, the language model, and the pronunciation dictionary to search and decode the extracted features, obtaining multiple word sequences.
  • the acoustic model can be trained on a speech training set, the language model is trained on a text training set, and the trained acoustic model and language model can then be put into practical use.
  • a word sequence can be thought of as multiple words and multiple paths; it can also be called a lattice.
  • a lattice is essentially a directed acyclic graph: each node on the graph represents the end time of a word, and each edge represents a possible word, together with the acoustic score and the language model score of that word.
  • each node stores the result of the speech recognition at the current position, including the acoustic probability, the language probability, and the like, as shown in FIG. 10.
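  • A minimal sketch of such a lattice as a directed acyclic graph whose nodes mark word end times and whose edges carry a word plus its acoustic and language model scores; the field names and toy values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Edge:
    word: str
    to_node: int              # index of the node where this word ends
    acoustic_score: float
    lm_score: float

@dataclass
class Node:
    end_time: float                                    # end time of the word arriving here
    out_edges: List[Edge] = field(default_factory=list)

# Two competing one-word paths out of node 0, as in an n-best lattice.
lattice = [Node(0.0), Node(0.42), Node(0.45)]
lattice[0].out_edges.append(Edge("play", 1, -12.3, -2.1))
lattice[0].out_edges.append(Edge("pray", 2, -13.0, -4.8))
for e in lattice[0].out_edges:
    print(e.word, e.acoustic_score + e.lm_score)       # combined edge score
```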
  • Step 706 Extract a preset number of word sequences with the highest first scores from the plurality of word sequences as candidate word sequences.
  • after the plurality of word sequences are obtained, a preset number of word sequences with the highest first scores may be extracted from them as candidate word sequences. The optimal path calculated from the lattice, that is, the path with the highest probability, does not necessarily match the actual word sequence, so the preset number of top-scoring word sequences are extracted as candidate word sequences, which can also be called the n-best (the n best recognition hypotheses); the preset number can be set by the technician according to actual needs.
  • Step 708 Input each candidate word sequence into the semantic classification model.
  • Step 710 Classify each candidate word sequence by the semantic classification model, and obtain a classification label corresponding to each candidate word sequence.
  • Step 712 Acquire the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • Step 714 Input the candidate word sequence into the trained neural network of the corresponding domain according to the domain in which the extracted candidate word sequence is located.
  • Step 716 Re-score each candidate word sequence by the trained neural network to obtain a second score corresponding to each candidate word sequence.
  • Step 718 weighting and summing the first score and the second score corresponding to each candidate word sequence to obtain a final score of each candidate word sequence.
  • Step 720 Use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • after a preset number of word sequences are extracted as candidate word sequences, each candidate word sequence can be input into the trained semantic classification model, which classifies each candidate word sequence and outputs the classification label obtained for it; the domain corresponding to each candidate word sequence can then be determined according to its classification label.
  • the trained semantic classification model may produce different domain classification results for different candidate word sequences.
  • in that case, the classification label of each candidate word sequence can be obtained, and the label with the largest proportion is taken as the classification label corresponding to all the candidate word sequences, that is, the domain corresponding to the most common label is the domain of all the candidate word sequences. The domains corresponding to the multiple candidate word sequences are therefore ultimately the same; in general, the multiple candidate word sequences corresponding to the same voice data do not belong to multiple domains.
  • after the plurality of candidate word sequences are extracted, that is, after the n-best extraction step is performed, domain identification can be carried out on the candidate word sequences, and the semantic classification model can be used to identify their domain. As shown in FIG. 11, after the n-best extraction is performed, the domain of the candidate word sequences is identified by the semantic classification model, and the candidate word sequences are input into the neural network of the corresponding domain.
  • each domain has its own corresponding neural network, that is, the neural networks are divided by domain.
  • the neural network can be an RNNLM; compared with the NGRAM language model, the RNNLM is better at preserving long-term memory.
  • the NGRAM language model is a commonly used language model in large vocabulary continuous speech recognition.
  • the model is based on the assumption that the appearance of the Nth word is related only to the preceding N-1 words and is unrelated to any other words.
  • the probability of a whole sentence is the product of the probabilities of occurrence of its words, and these probabilities can be obtained by counting the number of simultaneous occurrences of N words in a corpus.
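  • In symbols, the n-gram assumption just described factorizes the sentence probability as follows; this is the standard formulation, written here for clarity rather than quoted from the patent:

```latex
\[
  P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P\bigl(w_t \mid w_{t-N+1}, \ldots, w_{t-1}\bigr),
  \qquad
  P\bigl(w_t \mid w_{t-N+1}, \ldots, w_{t-1}\bigr) \approx
  \frac{\mathrm{count}(w_{t-N+1} \cdots w_t)}{\mathrm{count}(w_{t-N+1} \cdots w_{t-1})}.
\]
```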
  • the RNNLM is different. For example, after "my" and "mood", "good" or "bad" may appear; the appearance of these words depends on the earlier appearance of "my" and "mood", that is, the "memory effect".
  • traditional language models, such as the NGRAM language model, rely on the preceding N-1 words when calculating the probability of a word appearing; N is usually set to at most 5, and earlier words are ignored.
  • after the candidate word sequences are input into the neural network of the corresponding domain, the candidate word sequences can be re-scored by the neural network to obtain the second score of each candidate word sequence, and finally the speech recognition result of the speech data is obtained.
  • re-scoring can also be called rescoring: since the language model used to generate the lattice is not accurate enough, a larger language model is used to re-adjust the scores of the n-best; this process is called rescoring.
  • for example, the plurality of candidate word sequences and their corresponding first scores are: "play Jay Chou's Nunchaku" 94.7456, "play Jay Chou's Nunchaku" 95.3976, and "play the week conclusion's Nunchaku" 94.7795, ... (the candidates differ in their homophone spellings in the original Chinese example); these candidate word sequences are input into the semantic classification model for classification.
  • the semantic classification model recognizes that the candidate word sequences belong to the music domain, so each candidate word sequence can be input into the neural network of the music domain for re-scoring, and the second score of each candidate word sequence is obtained, specifically: "play Jay Chou's Nunchaku" 93.3381, "play Jay Chou's Nunchaku" 95.0925, "play the week conclusion's Nunchaku" 94.1557, ....
  • the first score and the second score may be weighted and summed to obtain a final score for each candidate word sequence.
  • the final score of each candidate word sequence = first score × first weight + second score × second weight.
  • the first weight and the second weight may be equal or unequal; they represent the respective proportions of the first score and the second score. When the first weight is greater than the second weight, it can be considered that the technician has increased the proportion of the first score, that is, the technician considers the first score to have a greater influence on the speech recognition result.
  • the first weight and the second weight may be determined from a large amount of test data, for example by using a large number of candidate word sequences and continuously adjusting the values of the first weight and the second weight until the candidate word sequence with the highest final score computed with those weights is indeed the most accurate speech recognition result; the actual values of the first weight and the second weight are thus determined and are used in the actual application as the calculation parameters of the final score.
  • a further calculation step can also be added, such as taking the logarithm of the final score as the selection score: if the final score of the candidate sequence "play Jay Chou's Nunchaku" is 95.21454, its selection score is log10(95.21454). After the logarithm of the final score of each candidate word sequence is calculated, the logarithm is taken as the selection score of each candidate word sequence, and the candidate word sequence with the highest selection score is selected as the speech recognition result.
  • the calculation method can vary, and other calculation methods set by the technician can be adopted.
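  • As a check on the numbers above: the quoted final score 95.21454 for the second candidate is reproduced exactly by weights of 0.4 and 0.6 applied to its first score 95.3976 and second score 95.0925. These weight values are inferred from the example, not stated in the text; a short sketch:

```python
import math

# Candidates as (first_score, second_score); values taken from the example above.
candidates = {
    "play Jay Chou's Nunchaku (variant 1)": (94.7456, 93.3381),
    "play Jay Chou's Nunchaku (variant 2)": (95.3976, 95.0925),
    "play the week conclusion's Nunchaku":  (94.7795, 94.1557),
}
w1, w2 = 0.4, 0.6   # assumed weights; they reproduce the quoted 95.21454

for name, (s1, s2) in candidates.items():
    final = w1 * s1 + w2 * s2
    print(f"{name}: final={final:.5f}, selection=log10(final)={math.log10(final):.6f}")
# The second candidate has the highest final score (95.21454) and is selected.
```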
  • Step 722 Perform entity extraction on the speech recognition result to obtain entity words.
  • Step 724 Determine the domain of the speech recognition result according to the domain of the extracted candidate word sequences.
  • Step 726 Search each entity word in a database corresponding to the domain of the speech recognition result.
  • Step 728 When the search result is inconsistent with the entity word, the entity word is repaired.
  • after the speech recognition result of the speech data is obtained, entity repair can be performed on the speech recognition result: first, entity extraction is performed on the speech recognition result.
  • an entity is a word that plays a key role in the sentence of the speech recognition result. For example, in "I want to go to the Emperor Building", the "Emperor Building" is the focus of the sentence, while "I want to go" merely expresses the speaker's intention; the place actually acted upon is the "Emperor Building".
  • when entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one. After the entity words are extracted, each entity word can be retrieved. When the voice data is decoded, multiple word sequences are obtained, several of them are selected as candidate word sequences, and one candidate word sequence is selected as the speech recognition result of the speech data, so the domain of the speech recognition result can be determined according to the domain of the candidate word sequences.
  • each entity word may then be searched in the database corresponding to that domain, and when the retrieved result is inconsistent with the entity word extracted from the speech recognition result, the extracted entity word is replaced with the search result; the entity words are repaired in this way.
  • for example, if the speech recognition result is "I want to go to the Emperor Building", the "Emperor Building" can be searched according to the navigation domain to which the speech recognition result belongs, that is, among geographical location information; if the search reveals that the actual result should be the "Diwang Building", the whole sentence can be corrected to "I want to go to the Diwang Building". The entity repair function can thus quickly and accurately repair such specific recognition errors.
  • the domain-specific re-scoring operation can greatly improve recognition accuracy. Experiments show that the speech recognition accuracy in different domains achieves a relative increase of 4% to 16%, and the overall recognition accuracy increases by 6%. In addition, since each domain has its own corresponding neural network, the cost of training the neural networks can be greatly reduced.
  • a traditional speech recognition scheme needs a large amount of text to train its language model; to ensure the effect of re-scoring, the language model is very large, the model training period is very long, and evaluating the overall effect of the model is difficult.
  • for example, training an RNNLM on 8.2 GB of text takes about 100 hours; after dividing by domain, the training time of a single domain's model can be shortened to 24 hours, greatly shortening the update time of a domain-specific model.
  • moreover, the language model in the traditional technique cannot cover all the data comprehensively: once there is a problem in the language model, the entire language model needs to be updated, and each update incurs a huge overhead. In this embodiment, however, only the neural network corresponding to the domain that needs to be processed or updated has to be retrained, which reduces the training cost and shortens the time required for a neural network update.
  • FIGS. 2 to 12 are schematic flowcharts of a voice recognition method in various embodiments. It should be understood that although the various steps in the flowcharts of the various figures are sequentially displayed as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Except as explicitly stated herein, the execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the various figures may comprise a plurality of sub-steps or stages, which are not necessarily performed at the same time, but may be performed at different times, the execution of these sub-steps or stages The order is also not necessarily sequential, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of the other steps.
  • a voice recognition apparatus including:
  • the word sequence obtaining module 1302 is configured to acquire a plurality of word sequences obtained by decoding the voice data, and a first score corresponding to each word sequence.
  • the extraction module 1304 is configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.
  • the domain identification module 1306 is configured to identify the domain of the extracted candidate word sequences.
  • the re-scoring module 1308 is configured to input each candidate word sequence into the neural network of the corresponding domain according to the identified domain, and to re-score each candidate word sequence by the neural network to obtain a second score corresponding to each candidate word sequence.
  • the speech recognition result determining module 1310 is configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and use the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • the domain identification module 1306 includes:
  • the input module 1306A is configured to input each candidate word sequence into the trained semantic classification model.
  • the classification module 1306B is configured to classify each candidate word sequence by the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to obtain the domain corresponding to the classification label with the largest proportion among the classification labels as the domain of the extracted candidate word sequences.
  • in an embodiment, the apparatus further includes a training module (not shown) configured to acquire the text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and take the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
  • the training module is further configured, for each domain, to take the word vector corresponding to each word in the text as input according to the order of the words in the domain's text, take the word vector corresponding to the next word after each input word as output, and train the neural network by adjusting the parameters of the domain's neural network.
  • the re-scoring module 1308 is further configured to perform weighted summation on the first score and the second score corresponding to the candidate word sequence to obtain a final score of the candidate word sequence.
  • the neural network is a recurrent neural network.
  • in an embodiment, the apparatus further includes an entity repair module (not shown) configured to perform entity extraction on the speech recognition result to obtain a plurality of entity words, retrieve the entity words, and repair an entity word when the search result is inconsistent with that entity word.
  • the entity repair module is further configured to determine the domain of the voice recognition result according to the domain of the extracted candidate word sequences, and to retrieve each entity word in a database corresponding to the domain of the voice recognition result.
  • Figure 15 is a diagram showing the internal structure of a computer device in one embodiment.
  • the computer device may specifically be the server 120 of FIG. 1.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and may also store a computer program that, when executed by the processor, causes the processor to implement a speech recognition method.
  • the internal memory can also store a computer program that, when executed by the processor, causes the processor to perform the speech recognition method described above.
  • those skilled in the art can understand that the structure shown in FIG. 15 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the speech recognition apparatus can be implemented in the form of a computer program that can be run on a computer device as shown in FIG. 15.
  • the program modules constituting the speech recognition apparatus, such as the word sequence acquisition module, the extraction module, the domain identification module, the re-scoring module, and the speech recognition result determination module shown in FIG. 13, may be stored in the memory of the computer device.
  • the computer program constituted by the program modules causes the processor to perform the steps in the speech recognition method of the various embodiments of the present application described in this specification.
  • the computer device shown in FIG. 15 can, through the word sequence acquisition module in the speech recognition apparatus shown in FIG. 13, acquire a plurality of word sequences obtained by decoding the speech data and a first score corresponding to each word sequence.
  • the computer device can, through the extraction module, extract from the plurality of word sequences a preset number of word sequences whose first scores rank highest as candidate word sequences.
  • the computer device can, through the domain identification module, identify the domain in which the extracted candidate word sequences are located.
  • the computer device can, through the re-scoring module, input each candidate word sequence into the neural network of the corresponding domain according to the domain in which the extracted candidate word sequences are located, and re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
  • the computer device can, through the speech recognition result determination module, obtain the final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
  • a computer apparatus comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the speech recognition method provided in any one of the embodiments of the present application.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a speech recognition method and apparatus, a computer readable storage medium, and a computer device. The method comprises: obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence; extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences; identifying the domain in which the extracted candidate word sequences are located; inputting each candidate word sequence into a neural network of the corresponding domain according to that domain; re-scoring each candidate word sequence by means of the neural network to obtain a second score; obtaining a final score according to the first score and second score corresponding to each candidate word sequence; and taking the candidate word sequence with the highest final score as the speech recognition result of the speech data. In this way, the time required to train the neural network of a specific domain is greatly shortened, and the efficiency of speech recognition is further improved.

Description

Speech recognition method, apparatus, computer readable storage medium, and computer device
The present application claims priority to Chinese Patent Application No. 201810457129.3, entitled "Speech Recognition Method, Apparatus, Computer Readable Storage Medium, and Computer Device", filed on May 14, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, computer readable storage medium, and computer device.
Background
With the rapid development of computer technology, technologies in the field of speech recognition have become increasingly mature.
In the conventional technology, a speech recognition solution is generally divided into two parts: front-end decoding and back-end processing. The front end is mainly responsible for receiving input speech data and decoding the speech data to obtain multiple candidate sentences, and the back end determines one of the multiple candidate sentences obtained by the front end as the final speech recognition result. In the conventional technology, the back end may input the multiple candidate sentences into a neural network to determine the final speech recognition result. However, in this manner, a massive amount of text is needed, and it takes a long time to train a neural network that can eventually be put into use, so this speech recognition solution is inefficient.
Summary
Based on this, it is necessary to provide a speech recognition method, apparatus, computer readable storage medium, and computer device capable of improving speech recognition efficiency, in view of the above technical problem of low speech recognition efficiency.
A speech recognition method, applied to a computer device, the method including:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A speech recognition apparatus, the apparatus including:
a word sequence acquisition module, configured to obtain a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
an extraction module, configured to extract, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
a domain identification module, configured to identify the domain in which the extracted candidate word sequences are located;
a re-scoring module, configured to input each candidate word sequence into a neural network of the corresponding domain according to the domain, and to re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence; and
a speech recognition result determination module, configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A computer device, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following operations:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
A computer readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following operations:
obtaining a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
extracting, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences;
identifying the domain in which the extracted candidate word sequences are located;
inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
re-scoring each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence;
obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
taking the candidate word sequence with the highest final score as the speech recognition result of the speech data.
According to the above speech recognition method, apparatus, computer readable storage medium, and computer device, a plurality of word sequences obtained by decoding speech data and the corresponding first scores are obtained, a plurality of candidate word sequences are selected from the plurality of word sequences, and the domain of the candidate word sequences is determined; each candidate word sequence can then be input into the neural network corresponding to that domain, and after the second score obtained by re-scoring each candidate word sequence through the neural network is obtained, the final score of each candidate word sequence can be determined according to the first score and the second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result of the speech data. In this speech recognition method, the domain of the candidate word sequences is identified before the candidate word sequences are re-scored, so that the candidate word sequences can be re-scored using the neural network corresponding to the domain to which they belong. The second score obtained in this way is more accurate; moreover, since each domain has its own corresponding neural network and each domain uses its own text to train its corresponding neural network instead of training with all the text, the time required to train the neural network of a specific domain is also greatly shortened, further improving the efficiency of speech recognition.
Brief Description of the Drawings
FIG. 1 is a diagram of an application environment of a speech recognition method in an embodiment;
FIG. 2 is a schematic flowchart of a speech recognition method in an embodiment;
FIG. 3 is a schematic flowchart of step 206 in an embodiment;
FIG. 4 is a schematic flowchart of a training step of a neural network in an embodiment;
FIG. 5 is a schematic diagram of input data and output data of a neural network in an embodiment;
FIG. 6 is a schematic flowchart following step 214 in an embodiment;
FIG. 7 is a schematic flowchart of a speech recognition method in another embodiment;
FIG. 8 is a schematic diagram of a training process of the neural networks corresponding to the respective domains in an embodiment;
FIG. 9 is a schematic flowchart of decoding speech data by a front end in an embodiment;
FIG. 10 is a schematic diagram of word sequences in an embodiment;
FIG. 11 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in an embodiment;
FIG. 12 is a schematic flowchart of re-scoring candidate word sequences with a semantic classification model in another embodiment;
FIG. 13 is a structural block diagram of a speech recognition apparatus in an embodiment;
FIG. 14 is a structural block diagram of a domain identification module in another embodiment;
FIG. 15 is a structural block diagram of a computer device in an embodiment.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely used to explain the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment of a speech recognition method in an embodiment. Referring to FIG. 1, the speech recognition method is applied to a speech recognition system. The speech recognition system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically include at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented by an independent server or a server cluster composed of a plurality of servers.
As shown in FIG. 2, in an embodiment, a speech recognition method is provided. This embodiment is illustrated by applying the method to the server 120 in FIG. 1. Referring to FIG. 2, the speech recognition method specifically includes the following steps:
Step 202: obtain a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence.
The server may obtain a plurality of word sequences obtained by decoding the speech data. The decoding operation may be performed by the terminal: when obtaining the speech data, the terminal may decode the obtained speech data using a trained acoustic model, a language model, a pronunciation dictionary, and the like, so that a plurality of word sequences can be obtained. Of course, the decoding operation may also be performed by the server.
A word sequence refers to a plurality of words obtained by decoding the speech data together with the paths corresponding to those words; the sentence obtained by connecting the corresponding words in order from the start position to the end position of a path can be regarded as one recognition result, that is, each word sequence can be regarded as one recognition result. Each path of each word sequence has a corresponding score, that is, the first score. The first score can be regarded as a score calculated according to the probability of occurrence of each path.
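As a minimal illustration of how a path's first score can be derived from path probabilities, the following Python sketch sums log-probabilities along each path of a hypothetical toy lattice (the words and probabilities are placeholders, not part of the original disclosure):

```python
import math

# Hypothetical toy lattice: each path is a list of (word, probability) edges.
paths = [
    [("我", 0.9), ("想去", 0.8), ("地王大厦", 0.6)],
    [("我", 0.9), ("想去", 0.8), ("帝王大厦", 0.3)],
]

def first_score(path):
    # Score a path as the sum of log edge probabilities, i.e. the log
    # of the path's occurrence probability.
    return sum(math.log(p) for _, p in path)

for path in paths:
    sentence = "".join(word for word, _ in path)
    print(sentence, round(first_score(path), 3))
```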
Step 204: extract, from the plurality of word sequences, a preset number of word sequences whose first scores rank highest as candidate word sequences.
Each word sequence has a corresponding first score, and the word sequences whose first scores rank highest can be extracted. Ranking highest by first score means sorting the word sequences in descending order of the first score, so that the word sequence with the highest first score is ranked first. Extracting the word sequences whose first scores rank highest means extracting the word sequences with higher first score values. The number of word sequences to be extracted can be preset, so a preset number of word sequences whose first scores rank highest can be extracted as candidate word sequences, and the preset number can be adjusted according to the consideration of the technician.
In another possible implementation, the word sequences are sorted in ascending order of the first score, so that the word sequence with the highest first score is ranked last. Extracting the preset number of word sequences whose first scores rank highest then means extracting the preset number of word sequences starting from the last position, according to the order of the word sequences.
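A minimal Python sketch of this candidate extraction step, assuming hypothetical decoder outputs (the word sequences and scores below are placeholders, not from the original disclosure):

```python
# Hypothetical (word sequence, first score) pairs produced by the decoder.
word_sequences = [
    ("我想去地王大厦", -4.2),
    ("我想去帝王大厦", -5.1),
    ("我想去地王大夏", -7.8),
    ("我想去帝王大夏", -9.3),
]

PRESET_NUMBER = 3  # adjustable by the technician

# Sort in descending order of the first score and keep the top entries
# as candidate word sequences.
candidates = sorted(word_sequences, key=lambda item: item[1], reverse=True)
candidates = candidates[:PRESET_NUMBER]
print(candidates)
```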
Step 206: identify the domain in which the extracted candidate word sequences are located.
Step 208: input each candidate word sequence into the neural network of the corresponding domain according to the identified domain.
After a preset number of candidate word sequences are selected from the plurality of word sequences, the domain in which the preset number of candidate word sequences are located can be identified. The domains can be customized by the technician; for example, the domains may be divided into navigation, music, and so on. There are multiple candidate word sequences, and the identified domain may be one, that is, the multiple candidate word sequences all belong to the same domain. After the domain corresponding to the candidate word sequences is identified, the multiple candidate word sequences can be separately input into the neural network of the corresponding domain. That is, each domain has a corresponding neural network, and the neural network of each domain can re-score the candidate word sequences of its own domain in a targeted manner.
Step 210: re-score each candidate word sequence through the neural network to obtain a second score corresponding to each candidate word sequence.
Step 212: obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence.
Step 214: take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
After the domain of the candidate word sequences is identified, all the candidate word sequences are input into the corresponding neural network. For example, when the identified domain of the candidate word sequences is the "navigation" domain, all the candidate word sequences are input into the neural network corresponding to the "navigation" domain, and the neural network of the "navigation" domain re-scores each input candidate word sequence to obtain the second score of each candidate word sequence.
Re-scoring means that the neural network corresponding to the domain of the candidate word sequences performs a score calculation on each input candidate word sequence to obtain the second score of each candidate word sequence, and the first score and the second score of each candidate word sequence are then combined in a certain manner; for example, the first score and the second score may each be weighted, and the weighted first score and second score may then be added. The final score of each candidate word sequence can thus be obtained according to the first score and the second score, and the candidate word sequence with the highest final score can be taken as the speech recognition result of the speech data.
According to the above speech recognition method, a plurality of word sequences obtained by decoding speech data and the corresponding first scores are obtained, a plurality of candidate word sequences are selected from the plurality of word sequences, and the domain in which the plurality of candidate word sequences are located is determined; each candidate word sequence can then be input into the neural network corresponding to that domain, and after the second score obtained by re-scoring each candidate word sequence through the neural network is obtained, the final score of each candidate word sequence can be determined according to the first score and the second score, and the candidate word sequence with the highest final score can be selected as the speech recognition result of the speech data.
In this speech recognition method, the domain in which the candidate word sequences are located is identified before the candidate word sequences are re-scored, so that the candidate word sequences can be re-scored using the neural network corresponding to that domain. The second score obtained by re-scoring the candidate word sequences with the neural network of the corresponding domain is more accurate; moreover, since each domain has its own corresponding neural network and each domain uses its own text to train its corresponding neural network instead of training with all the text, the time required to train the neural network of a specific domain is also greatly shortened, further improving the efficiency of speech recognition.
In an embodiment, as shown in FIG. 3, step 206 includes:
Step 302: input each candidate word sequence into a semantic classification model.
Step 304: classify each candidate word sequence by the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
Step 306: take the domain corresponding to the classification label with the largest proportion among the classification labels as the domain in which the extracted candidate word sequences are located.
After a preset number of word sequences are extracted as candidate word sequences, each candidate word sequence can be input into a trained semantic classification model. The trained semantic classification model refers to a semantic classification model trained in advance with a large number of training samples, which is used to identify the domain in which the input candidate word sequences are located. After the extracted preset number of candidate word sequences are input into the trained semantic classification model, the trained semantic classification model can classify each candidate word sequence and output the classification label obtained for each candidate word sequence, and the domain corresponding to each candidate word sequence can be determined according to its classification label.
The classification labels corresponding to the candidate word sequences may differ, that is, the classification results of the trained semantic classification model for the candidate word sequences may differ, which means that the trained semantic classification model judges that the candidate word sequences belong to different domains. In this case, in order to determine the domain of the candidate word sequences, the classification label of each candidate word sequence can be obtained, and the classification label with the largest proportion is taken as the classification label corresponding to all the candidate word sequences, that is, the domain corresponding to the classification label with the largest proportion is taken as the domain in which all the candidate word sequences are located.
For example, if the classification label of candidate word sequence A is 1, the classification label of candidate word sequence B is 1, the classification label of candidate word sequence C is 2, and the classification label of candidate word sequence D is 1, where 1 represents the navigation class and 2 represents the music class, then among the candidate word sequences A, B, C, and D, the classification label 1 has the largest proportion; the classification label 1 can therefore be taken as the classification label of the candidate word sequences A, B, C, and D, that is, the domain of the candidate word sequences A, B, C, and D is the domain corresponding to that classification label, namely the navigation class.
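A minimal sketch of this majority vote over classification labels, assuming the labels have already been produced by the semantic classification model (the label values mirror the example above; the label-to-domain map is a hypothetical placeholder):

```python
from collections import Counter

# Classification labels output by the semantic classification model
# for candidate word sequences A, B, C, D (1 = navigation, 2 = music).
labels = [1, 1, 2, 1]
DOMAINS = {1: "navigation", 2: "music"}  # hypothetical label-to-domain map

# The label with the largest proportion determines the shared domain
# of all candidate word sequences.
majority_label, _ = Counter(labels).most_common(1)[0]
print(DOMAINS[majority_label])  # -> "navigation"
```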
Identifying the domain in which the extracted candidate word sequences are located through the semantic classification model, and taking the domain corresponding to the classification label with the largest proportion as the domain of the candidate word sequences when the domains of the candidate word sequences diverge, ensures the accuracy of the domain to which the candidate word sequences belong and also improves the accuracy of the subsequent re-scoring of the candidate word sequences, thereby improving the accuracy of the speech recognition result.
In an embodiment, the above speech recognition method further includes: acquiring text corresponding to each domain; converting each word in the text corresponding to each domain into a word vector; and taking the word vectors corresponding to each domain as input to train the neural network corresponding to that domain.
Before a neural network is actually used, each neural network can first be trained according to actual needs, and only a trained neural network can be put into actual use on the input candidate word sequences. Since each domain has its own corresponding neural network, when training the neural network of each domain, the text corresponding to each domain needs to be acquired. The text may be sentences; after the sentences of each domain are obtained, for each domain, word segmentation can be performed on each sentence corresponding to that domain to obtain the multiple words corresponding to each sentence. The words contained in each sentence are converted into word vectors as the input data of the neural network corresponding to that domain, and the neural network corresponding to each domain is trained in this manner; after training, the trained neural network corresponding to each domain is obtained.
Training the neural networks of the respective domains in advance ensures the accuracy of the neural networks when re-scoring candidate word sequences in actual use, thereby improving the accuracy of speech recognition as well as the efficiency of speech recognition.
In an embodiment, taking the word vectors as input and training the neural network corresponding to each domain includes: for each domain, according to the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and taking the word vector corresponding to the next word of each input word as output, so as to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
After the text of each domain is obtained, the words contained in the text can be converted into word vectors, and the word vectors corresponding to the text of each domain are input into the neural network corresponding to that domain to train the corresponding neural network. After the word vectors corresponding to the text are input into the neural network, the word vector of each input word can be taken as input according to the order of the words in the text, and the word vector corresponding to the next word of each input word can be taken as output, so as to adjust the parameters of the corresponding neural network. It should be noted that the input data corresponding to each output word is not only the word input at the previous moment, but the words input at the previous moment and at all previous moments. Adjusting the parameters of the neural network in this manner trains the neural network, so that the trained neural network corresponding to each domain can be obtained, ensuring the reliability of the scores obtained when the trained neural network re-scores candidate word sequences.
In an embodiment, as shown in FIG. 4, the above speech recognition method further includes a training step of the neural network, including:
Step 402: acquire text corresponding to each domain.
Step 404: convert each word in the text corresponding to each domain into a word vector.
Step 406: for each domain, according to the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and take the word vector corresponding to the next word of each input word as output, so as to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
In order to ensure the accuracy of the neural networks put into actual use, the neural networks can be trained in advance, and the trained neural networks are then put into actual use. Each domain has its own corresponding neural network, that is, the neural network corresponding to each domain needs to be trained separately. When training the neural network of each domain, the text corresponding to each domain needs to be acquired, and the words in the text of each domain can be converted into word vectors; a word vector may be an N-dimensional vector, where N is a positive integer. The text corresponding to each domain can be regarded as the sentences corresponding to that domain, and a word segmentation tool can be used to divide each sentence into words, that is, each sentence corresponds to multiple words, and the words contained in each sentence can be converted into word vectors and input into the neural network corresponding to the respective domain.
The word vectors corresponding to the words contained in each sentence are taken as the input of the neural network, and in each sentence, the next word of an input word determines the output for that input. The output data is the word vector corresponding to the next word of each input word, but for each output, the input data is not only the word vector of the previous input word, but the word vectors of the previous input word and of all previously input words. The input data and output data of the neural network are thus set, and the parameters of the neural network are continuously adjusted to train the neural network.
For example, if sentence A is "我昨天上学迟到了" (I was late for school yesterday), a word segmentation tool can be used to split the sentence into: 我/昨天/上学/迟到/了/. As shown in FIG. 5, the word vectors corresponding to the words contained in sentence A are input into the neural network in the original order of the words in sentence A; in the figure, x1, x2, x3, x4, x5, and x6 are the word vectors corresponding to the respective words, that is, the input data. By default, a blank input is added before the input word vectors; that is, after the word vectors corresponding to the words of sentence A are input into the neural network, the word vector corresponding to "我" is not the first input but becomes the second input, and the first input is blank by default.
Correspondingly, the word vector corresponding to the next word of each input word is taken as the corresponding output data, that is, the output data corresponding to the first, blank input is the word vector of the word "我", and the output data corresponding to the input word "我" is the word vector of the word "昨天". As shown in FIG. 5, x1, x2, x3, x4, x5, and x6 correspond to the blank word, "我", "昨天", "上学", "迟到", and "了" respectively, and y1, y2, y3, y4, y5, and y6 are the output data corresponding to the respective input word vectors, corresponding to the words "我", "昨天", "上学", "迟到", and "了"; that is, the output data for the word "上学" is the word vector corresponding to its next word, "迟到". For the output "昨天", the input data is only the word "我" and the default blank word, but for the output "了", the input data is all the words input before it, that is, "我", "昨天", "上学", "迟到", and the default blank word are all input data for the word "了". In other words, each output word vector is related to the input at the previous moment and the inputs at all previous moments, that is, each output word is related to the input word at the previous moment and the words input at all previous moments.
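A minimal Python sketch of this input/output pairing for sentence A (the blank token name is a hypothetical placeholder for the default blank first input):

```python
# Words of sentence A after word segmentation.
tokens = ["我", "昨天", "上学", "迟到", "了"]

BLANK = "<blank>"  # hypothetical marker for the default blank first input

# Inputs are the blank token followed by the sentence; the target at each
# step is the next word of the sentence.
inputs = [BLANK] + tokens[:-1]
targets = tokens

for x, y in zip(inputs, targets):
    print(f"input: {x!r} -> output: {y!r}")
```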
The training process of the neural network is a process of continuously adjusting the parameters of the neural network; when the parameters of the neural network are determined, the neural network can be considered trained. Each time the neural network is trained with the word vectors corresponding to the words in an input text, the parameters of the neural network are adjusted, until the technician determines that a certain set of parameters satisfies the requirements, at which point training is complete. When determining whether the set parameters satisfy the requirements, a validation approach can be adopted: after the parameters of the neural network are set, the neural network is validated with a large amount of validation sample data. The validation data is input into the neural network, and the prediction accuracy of the neural network on the validation data under those parameters is measured; when the prediction accuracy of the neural network reaches the standard value set by the technician, the parameters can be considered to satisfy the requirements; otherwise, the parameters of the neural network need to be further adjusted until the neural network passes the validation.
Training the neural network corresponding to each domain with a large amount of text corresponding to that domain ensures the reliability of the trained neural network and also improves the efficiency of the neural network in data processing, which in turn can improve the efficiency and accuracy of speech recognition.
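As an illustrative sketch only (the patent does not specify a framework), the next-word training objective and the validation check described above might look as follows in PyTorch, with all sizes, data, and the accuracy threshold chosen arbitrarily:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 1000, 64, 128  # hypothetical sizes

class DomainLM(nn.Module):
    """A minimal recurrent language model for one domain."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.RNN(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))   # hidden state carries all past inputs
        return self.out(h)             # logits over the next word at each step

model = DomainLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy training/validation batches: inputs and their next-word targets.
x_train = torch.randint(0, VOCAB, (32, 10))
y_train = torch.randint(0, VOCAB, (32, 10))
x_val = torch.randint(0, VOCAB, (8, 10))
y_val = torch.randint(0, VOCAB, (8, 10))

STANDARD_ACCURACY = 0.5  # hypothetical standard value set by the technician

for epoch in range(100):
    # Adjust parameters on the domain text (next-word prediction).
    optimizer.zero_grad()
    logits = model(x_train)
    loss = criterion(logits.reshape(-1, VOCAB), y_train.reshape(-1))
    loss.backward()
    optimizer.step()

    # Validate: stop once next-word prediction accuracy reaches the standard.
    with torch.no_grad():
        pred = model(x_val).argmax(dim=-1)
        accuracy = (pred == y_val).float().mean().item()
    if accuracy >= STANDARD_ACCURACY:
        break
```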
In an embodiment, obtaining the final score of a candidate word sequence according to the first score and the second score corresponding to the candidate word sequence includes: performing a weighted summation of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
After obtaining the speech data, the terminal can decode the speech data to obtain a plurality of word sequences, each with its own corresponding first score, and a preset number of word sequences are selected from the plurality of word sequences as candidate word sequences, that is, each candidate word sequence has its own corresponding first score. After each candidate word sequence is input into the trained neural network corresponding to its domain, the trained neural network re-scores each candidate word sequence to obtain a second score, so each candidate word sequence also has its own corresponding second score. After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be obtained.
When calculating the final score from the first score and the second score, a weighted calculation can be performed; the weights of the first score and the second score may be the same or different, depending on the settings of the technician. For example, if the technician considers the first score to be more accurate, or to have a greater influence on the speech recognition result, the weight of the first score can be increased. After both the first score and the second score are weighted, the weighted first score and second score can be summed to obtain the final score of each candidate word sequence.
Calculating the first score and the second score with weights makes it possible to adjust, at any time, the influence of the first score and the second score on the speech recognition result by adjusting their weights, that is, the proportions of the speech side and the semantic side can be adjusted at any time, which further ensures the reliability of the final score, so that the selected speech recognition result is more accurate.
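A minimal sketch of the weighted summation and the final selection (the candidates, scores, and weights below are hypothetical):

```python
# Hypothetical (candidate, first score, second score) triples.
candidates = [
    ("我想去地王大厦", -4.2, -3.1),
    ("我想去帝王大厦", -5.1, -2.7),
]

W1, W2 = 0.6, 0.4  # weights set by the technician; may be equal or not

def final_score(first, second):
    # Weighted summation of the decoder score and the re-scoring score.
    return W1 * first + W2 * second

# The candidate word sequence with the highest final score becomes
# the speech recognition result.
result = max(candidates, key=lambda c: final_score(c[1], c[2]))
print(result[0])
```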
In an embodiment, the neural network is a recurrent neural network.
The recurrent neural network here takes the form of an RNNLM (Recurrent Neural Network Based Language Model), which may also be called a recurrent network language model. In addition to the currently input word, the recurrent network language model can consider the multiple words input before it, and can calculate the probability of the next word based on the long text composed of the previously input words; the recurrent network language model therefore has a "better memory effect". For example, "不错" (good) or "不好" (bad) may appear after "我的" (my) "心情" (mood); the appearance of these words depends on the earlier appearance of "我的" and "心情", and this is the "memory effect".
The recurrent network language model maps each input word into a compact continuous vector space; the mapped continuous vector space uses a relatively small parameter set and uses recurrent connections to build the corresponding model, producing long-distance context dependencies. A language model commonly used in large-vocabulary continuous speech recognition, the N-gram model (a statistical language model), depends only on the previous N-1 input words, where N is a positive integer, whereas the recurrent network language model can capture the history information of all previously input words. Therefore, compared with the traditional language model, which depends only on the previously input N-1 words, the recurrent network language model produces more accurate scores when re-scoring the input candidate word sequences.
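To make the scoring concrete, the following hedged sketch computes the second score of a candidate word sequence as a sum of log next-word probabilities; the probability function is a toy stand-in (not the patent's model) and only inspects the last two words, whereas a real recurrent network language model would condition on the entire history:

```python
import math

def next_word_prob(history, word):
    # Toy stand-in for a trained RNNLM: a real model would condition on
    # the entire history rather than the last two words only.
    toy = {("我的", "心情"): {"不错": 0.4, "不好": 0.3}}
    return toy.get(tuple(history[-2:]), {}).get(word, 0.05)

def second_score(words):
    # Second score of a candidate word sequence: sum of log-probabilities
    # of each word given all the words before it.
    score = 0.0
    for i, word in enumerate(words):
        score += math.log(next_word_prob(words[:i], word))
    return score

print(second_score(["我的", "心情", "不错"]))
```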
In an embodiment, after the candidate word sequence with the highest final score is taken as the speech recognition result of the speech data, the method further includes: performing entity extraction on the speech recognition result to obtain entity words; retrieving the entity words; and when the retrieval result is inconsistent with an entity word, repairing the entity word.
After the first score and the second score of each candidate word sequence are obtained, the final score of each candidate word sequence can be calculated, and the candidate with the highest final score is selected as the speech recognition result of the speech data. After the speech recognition result corresponding to the speech data is determined, entity repair can be performed on the speech recognition result. Entity extraction is first performed on the speech recognition result; an entity represents a word that plays a major role in the sentence that is the speech recognition result. For example, in "我想去世界之窗" (I want to go to Window of the World), "世界之窗" (Window of the World) is the key of the sentence, "我想去" (I want to go) is merely an expression of the speaker's intention, and the place that ultimately needs to be acted on is "世界之窗".
When entity extraction is performed on the speech recognition result, multiple entity words may be obtained, or only one entity word may be obtained. After the entity words are extracted, each entity word can be retrieved, and when the retrieved result is inconsistent with the entity word extracted from the speech recognition result, the entity word extracted from the speech recognition result is replaced with the retrieval result; the entity words are repaired in this manner.
In speech recognition, recognition errors are generally concentrated on the special entities within a domain; therefore, for this problem, the entities in the speech recognition result can be extracted and the entity words correspondingly repaired, which reduces the probability of wrongly written words appearing in the speech recognition result and improves the accuracy of the final speech recognition result.
In an embodiment, retrieving the entity words includes: determining the domain of the speech recognition result according to the domain in which the candidate word sequences are located; and retrieving each entity word in a database corresponding to the domain of the speech recognition result.
After the speech data is decoded, a plurality of word sequences can be obtained, a plurality of word sequences are extracted from them as candidate word sequences, and one of the candidate word sequences is selected as the speech recognition result of the speech data; therefore, the domain corresponding to the speech recognition result can be determined according to the domain in which the extracted candidate word sequences are located. After the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that domain.
Retrieving the entity words according to the domain of the speech recognition result not only reduces the amount of data to be searched, but also improves the accuracy of the retrieval results; therefore, when the entity words are repaired, the repair results can be more accurate, improving the efficiency and accuracy of correcting the speech recognition result.
In an embodiment, as shown in FIG. 6, the following steps are further included after step 214:
Step 602: perform entity extraction on the speech recognition result to obtain entity words.
Step 604: determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
Step 606: retrieve each entity word in a database corresponding to the domain of the speech recognition result.
Step 608: when the retrieval result is inconsistent with an entity word, repair the entity word.
After the candidate word sequence with the highest final score is selected from the plurality of candidate word sequences as the speech recognition result of the speech data, the entities of the speech recognition result can be extracted; an entity represents a word that plays a major role in the sentence that is the speech recognition result. For example, in "我想去世界之窗" (I want to go to Window of the World), "世界之窗" is the key of the sentence, "我想去" is merely an expression of the speaker's intention, and the place that ultimately needs to be acted on is "世界之窗". Before the speech recognition result of the speech data is determined, the domain of each candidate word sequence that may become the speech recognition result has been determined, that is, the domain in which the candidate word sequences are located has been identified. There may be multiple candidate word sequences, and all candidate word sequences generally correspond to one domain, that is, multiple candidate word sequences correspond to the same domain; since the speech recognition result is the candidate word sequence with the highest final score, the domain of the speech recognition result can be determined according to the domain in which the extracted candidate word sequences are located.
After the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to the domain of the speech recognition result, that is, each domain has its own corresponding database. For example, if the speech recognition result is "我想去帝王大厦" (I want to go to Emperor Building), entity extraction on this sentence yields the entity word "帝王大厦", and the domain of the speech recognition result is the "navigation" domain; the entity word "帝王大厦" is then retrieved in the database corresponding to the "navigation" domain, that is, the corresponding place names are searched. When the retrieval result is "地王大厦" (Diwang Building), the retrieval result is inconsistent with the entity word, so the entity word "帝王大厦" can be replaced with the retrieval result "地王大厦"; the entity word in the speech recognition result is thus repaired, and the repaired speech recognition result "我想去地王大厦" is obtained. That is, through the entity repair operation, correction operations such as fixing wrongly written characters can be performed on the speech recognition result, yielding a more accurate speech recognition result.
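A minimal sketch of this repair step, using fuzzy retrieval against a hypothetical navigation database (difflib here stands in for whatever retrieval the deployed system uses; the database entries are placeholders):

```python
import difflib

# Hypothetical database of place names for the "navigation" domain.
NAVIGATION_DB = ["地王大厦", "世界之窗", "京基100"]

def repair_entity(entity, database):
    # Retrieve the closest entry; if it differs from the extracted
    # entity word, replace the entity word with the retrieval result.
    matches = difflib.get_close_matches(entity, database, n=1, cutoff=0.5)
    return matches[0] if matches and matches[0] != entity else entity

result = "我想去帝王大厦"
entity = "帝王大厦"  # obtained by entity extraction (hypothetical)
repaired = result.replace(entity, repair_entity(entity, NAVIGATION_DB))
print(repaired)  # -> 我想去地王大厦
```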
As shown in FIG. 7, in an embodiment, a speech recognition method is provided. This embodiment is illustrated by applying the method to the server 120 in FIG. 1. Referring to FIG. 7, the speech recognition method specifically includes the following steps:
Step 702: train the neural networks to obtain trained neural networks.
Before the neural networks are actually used, they need to be trained in advance according to the actual project. Each domain has its own corresponding neural network, so the neural network corresponding to each domain is trained separately with the text corresponding to that domain, and the trained neural networks correspond one-to-one with the domains; as shown in FIG. 8, each neural network has its own training text. In actual use, the semantic classification model identifies the domain in which the extracted candidate word sequences are located, and after the domain of the extracted candidate word sequences is determined by the semantic classification model, each candidate word sequence is input in turn into the neural network of the corresponding domain for re-scoring; the domains of the neural networks therefore correspond to the domains of the semantic classification model. Accordingly, during training, as shown in FIG. 8, the semantic classification model can be trained together with the neural networks so that the domains of the neural networks correspond to the classification domains of the semantic classification model. The neural networks are used to recognize speech, while the semantic classification model is used to analyze the semantics of the speech data; the neural networks can therefore be considered to belong to the recognition side, and the semantic classification model to the semantic side.
在对神经网络进行训练时,可获取到各个领域对应的文本,将每个文本中的每个词语转换成词向量后,可将词向量作为输入数据,并按照句子中的词语的顺序,将输入词语的下一个词语的词向量作为当前输入词语的输出数据,即每个输出的词向量都与前一个时刻输入的词向量以及之前所有时刻输入的词向量相关。在输入词语对应的词向量对神经网络进行训练时,可对神经网络的参数进行不断的调整,直至调整到一个合适的参数,则代表神经网络训练完毕,得到了训练好的神经网络。在分别对每个领域的神经网络进行训练后,如果在实际运用过程中需要对某一个领域对应的神经网络进行更新时,只需要更新该领域对应的神经网络即可,比如,只需要更新领域n时,则可只需要更新如图8中所示的灰色部分,即对领域n对应的神经网络进行重新训练,并同时在语义侧重新训练对应领域的语义分类模型即可。When training the neural network, the corresponding texts of each field can be obtained. After converting each word in each text into a word vector, the word vector can be used as input data, and according to the order of the words in the sentence, The word vector of the next word of the input word is used as the output data of the current input word, that is, each output word vector is related to the word vector input at the previous moment and the word vector input at all previous times. When the neural network is trained by the word vector corresponding to the input word, the parameters of the neural network can be continuously adjusted until the appropriate parameters are adjusted, and the neural network is trained, and the trained neural network is obtained. After training the neural network in each field separately, if it is necessary to update the neural network corresponding to a certain domain in the actual application process, only the corresponding neural network of the domain needs to be updated, for example, only the domain needs to be updated. In the case of n, it is only necessary to update the gray portion as shown in FIG. 8, that is, to retrain the neural network corresponding to the field n, and at the same time retrain the semantic classification model of the corresponding domain on the semantic side.
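The input/output arrangement just described — each word's vector as input, the following word as its target — can be sketched as follows; the tokenized domain texts are invented for illustration:

```python
# Sketch: building per-domain (input word -> next word) training pairs,
# as described above. The tokenized sentences are illustrative assumptions.
def next_word_pairs(tokens):
    """Pair every word with the word that follows it in the sentence."""
    return list(zip(tokens[:-1], tokens[1:]))

domain_texts = {
    "music": [["播放", "周杰伦", "的", "双截棍"]],
    "navigation": [["我", "想", "去", "地王大厦"]],
}
for domain, sentences in domain_texts.items():
    pairs = [p for s in sentences for p in next_word_pairs(s)]
    print(domain, pairs)  # each pair: (input word, expected next word)
```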
The neural network may be a recurrent neural network, also called a recurrent-network language model (RNNLM). The advantage of a recurrent-network language model is its long memory effect: the word vector of each output word is jointly influenced by the word vectors of all previously input words, and the output data corresponding to the currently input word's vector is the word vector of the next word. For each output, however, the input is not merely the word vector of the most recent word; it is the word vector of the most recent word together with the word vectors of all words input before it. That is, each output word depends on the word input at the previous moment and on the words input at all earlier moments. In this manner, the output data corresponding to each input word vector is set and the recurrent-network language model is trained, so that a trained neural network can be put into actual use.
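As a minimal sketch of scoring a sentence with such a recurrent language model, the toy below sums log P(next word | history), with a hidden state that carries the entire prefix; the vocabulary and the randomly initialized weights are assumptions standing in for a trained RNNLM:

```python
# Toy RNN language model: the hidden state h is updated from every earlier
# word, so each prediction depends on the whole history, not just N-1 words.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "我", "想", "去", "地王大厦", "</s>"]
V, H = len(vocab), 8
E = rng.normal(size=(V, H))   # word embeddings (untrained, for illustration)
W = rng.normal(size=(H, H))   # recurrent weights
U = rng.normal(size=(H, V))   # output projection

def sentence_log_prob(words):
    """Sum of log P(w_t | w_1 .. w_{t-1}) under the toy model."""
    h = np.zeros(H)
    ids = [vocab.index(w) for w in words]
    total = 0.0
    for prev, cur in zip(ids, ids[1:]):
        h = np.tanh(E[prev] + h @ W)          # history accumulates in h
        logits = h @ U
        logits -= logits.max()                # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        total += np.log(probs[cur])
    return total

print(sentence_log_prob(["<s>", "我", "想", "去", "地王大厦", "</s>"]))
```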
Step 704: acquire a plurality of word sequences obtained by decoding the speech data, and a first score corresponding to each word sequence.
Speech decoding can be performed by the front end, which may be a terminal; it decodes the speech data into multiple word sequences, each with a corresponding first score. Specifically, as shown in FIG. 9, after receiving the input speech data the front end extracts features from it, then searches and decodes the feature-extracted speech data using a trained acoustic model, a language model, and a pronunciation dictionary to obtain multiple word sequences. The acoustic model is trained on a speech training set and the language model on a text training set; only the trained acoustic model and language model can be put into actual use.
A word sequence can be viewed as a set of words and paths, also called a lattice. A lattice is essentially a directed acyclic graph: each node represents the end time of a word, and each edge represents a possible word together with that word's acoustic score and language-model score. When the recognition results are represented, each node stores the recognition result at the current position, including information such as acoustic probability and language probability. As shown in FIG. 10, starting from the leftmost position <s> and following different arcs to the final </s> yields different word sequences, and combining the probabilities stored on the arcs gives the probability (score) that the input speech corresponds to a particular piece of text. For example, in FIG. 10 both "北京欢迎你" ("Beijing welcomes you") and the near-homophone misrecognition "背景换映你" can be regarded as paths through the recognition result; each is a word sequence. Every path in the figure corresponds to a probability, from which the score of each path, the first score, can be computed; each word sequence therefore has its own corresponding first score.
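The path-and-score structure can be illustrated with a toy lattice (all node names and scores below are invented): walking every arc sequence from <s> to </s> enumerates the word sequences and accumulates each path's first score:

```python
# Toy lattice in the spirit of FIG. 10: a directed acyclic graph whose edges
# carry a word and a log-score. Every <s> -> </s> path is one word sequence.
def all_paths(graph, node, words, score, out):
    if node == "</s>":
        out.append(("".join(words), score))
        return
    for word, nxt, edge_score in graph.get(node, []):
        all_paths(graph, nxt, words + [word], score + edge_score, out)

lattice = {
    "<s>": [("北京", "n1", -1.2), ("背景", "n1", -1.5)],
    "n1":  [("欢迎", "n2", -0.8), ("换映", "n2", -1.9)],
    "n2":  [("你", "</s>", -0.3)],
}
paths = []
all_paths(lattice, "<s>", [], 0.0, paths)
for sentence, first_score in sorted(paths, key=lambda p: -p[1]):
    print(sentence, first_score)
```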
Step 706: extract, from the multiple word sequences, a preset number of word sequences with the highest first scores as candidate word sequences.
Once multiple word sequences have been obtained, the preset number of sequences with the highest first scores can be extracted from them as candidate word sequences. The optimal path computed over the lattice, i.e., the path with the highest probability, does not necessarily match the actual word sequence, so the preset number of top-scoring word sequences is extracted as candidates, also called the n-best list (the n optimal sequences). The preset number can be set by engineers according to actual needs.
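A sketch of the n-best extraction itself, assuming the scored sequences from the toy lattice above and a preset number N of 2:

```python
# Keep the preset number of word sequences with the highest first scores.
import heapq

scored = [("北京欢迎你", -2.3), ("背景欢迎你", -2.6),
          ("北京换映你", -3.4), ("背景换映你", -3.7)]
N = 2  # the engineer-chosen preset number
n_best = heapq.nlargest(N, scored, key=lambda item: item[1])
print(n_best)  # the two candidate word sequences
```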
Step 708: input each candidate word sequence into a semantic classification model.
Step 710: classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence.
Step 712: take the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
Step 714: input the candidate word sequences into the trained neural network of the corresponding domain, according to the domain in which the extracted candidate word sequences are located.
Step 716: rescore each candidate word sequence with the trained neural network to obtain a second score corresponding to each candidate word sequence.
Step 718: compute a weighted sum of the first score and the second score corresponding to each candidate word sequence to obtain the final score of each candidate word sequence.
Step 720: take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
After the preset number of word sequences has been extracted as candidates, each candidate word sequence can be input into the trained semantic classification model, which classifies each candidate and outputs the resulting classification label; the domain corresponding to each candidate can then be determined from its label. However, the trained semantic classification model may classify different candidates into different domains. To determine a single domain for the extracted candidates, the classification label of every candidate is obtained and the label with the largest share is taken as the label of all candidates; that is, the domain corresponding to the majority label becomes the domain of all candidate word sequences. The multiple candidates thus end up in the same domain, and in general the multiple candidates derived from one piece of speech data will not belong to multiple domains.
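The majority-label rule can be sketched in a few lines; the labels below are invented:

```python
# Each candidate receives a label from the semantic classification model;
# the label with the largest share decides the single domain for all of them.
from collections import Counter

labels = ["music", "music", "navigation", "music"]  # one label per candidate
domain = Counter(labels).most_common(1)[0][0]
print(domain)  # music
```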
After the multiple candidate word sequences have been extracted, i.e., after the n-best extraction step, domain identification can be performed on them, using the semantic classification model to identify the domain in which the candidates are located. As shown in FIG. 11, after n-best extraction the semantic classification model identifies the candidates' domain, and the candidates are input into the neural network of that domain. Each domain has its own corresponding neural network, i.e., the networks are per-domain, and the network may be an RNNLM. Compared with an n-gram language model, an RNNLM preserves long-term memory better. The n-gram language model is commonly used in large-vocabulary continuous speech recognition; it is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words, and these probabilities can be obtained by directly counting how often N words co-occur in a corpus.
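For contrast with the RNNLM discussed next, here is the count-based estimate the n-gram assumption implies, on a two-sentence toy corpus (N = 2):

```python
# Bigram probabilities estimated from raw co-occurrence counts; a sentence's
# probability is the product of these conditionals. Corpus is a toy.
from collections import Counter

corpus = "我 的 心情 不错 。 我 的 心情 不好 。".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """P(word | prev) from counts."""
    return bigram[(prev, word)] / unigram[prev]

print(p("不错", "心情"))  # 0.5: '心情' is followed once by '不错', once by '不好'
```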
An RNNLM is different. For example, "我的" ("my") and "心情" ("mood") may be followed by "不错" ("good") or by "不好" ("bad"); which of these words appears depends on the earlier occurrence of "我的" and "心情" — the "memory effect". A traditional language model such as an n-gram, when computing the probability of a word's occurrence, depends only on the preceding N-1 words, with N generally set to at most 5; earlier words are ignored. This is unreasonable, because words that appeared earlier all influence the current word. For example, in "刘德华/是/一个/优秀/的/演员/以及/歌手/有着/很多/经典/作品/他/其中/的/一个" ("Andy Lau / is / an / excellent / actor / and / singer / with / many / classic / works / … / one / of / them"), given the earlier context the probability of "歌" ("song") appearing next must be higher than that of its homophone "哥" ("brother"); but if only the previous 3 or 4 words are considered and the earlier text is ignored, there is no way to decide between "歌" and "哥". An RNNLM's memory effect is good precisely because, unlike a traditional language model, it takes into account the long text input before and computes the probability of the next word from that long history; it therefore has a better memory effect.
After the candidate word sequences are input into the neural network of the corresponding domain, the network rescores them to obtain the second score of each candidate, and the speech recognition result of the speech data is finally obtained. Because the language model used to generate the lattice cannot be made precise enough, a larger language model must be used to readjust the n-best scores; this process is called rescoring.
In the example shown in FIG. 12, the candidate word sequences and their corresponding first scores are: "播放周杰伦的双节棍" 94.7456, "播放周杰伦的双截棍" 95.3976, "播放周结论的双截棍" 94.7951, and so on. These candidates are all input into the semantic classification model for classification; the model identifies them as belonging to the music domain, so each candidate is input into the neural network of the music domain for rescoring, yielding the second score of each candidate: "播放周杰伦的双节棍" 93.3381, "播放周杰伦的双截棍" 95.0925, "播放周结论的双截棍" 94.1557, and so on.
Once the first and second scores of each candidate are available, a weighted sum of the two gives the final score of each candidate word sequence. Specifically, final score of each candidate = first score × first weight + second score × second weight. The first and second weights may or may not be equal; they represent the shares of the first and second scores respectively. When the first weight exceeds the second, the engineers have increased the share of the first score, judging that it should carry more weight than the second score, i.e., that the first score has a greater influence on the recognition result. The two weights can be determined from large amounts of test data: using many candidate word sequences, the values of the first and second weights are adjusted repeatedly until the candidate with the highest final score computed from them is indeed the most accurate recognition result; the actual values of the two weights are then fixed and used as the final-score calculation parameters in subsequent actual use.
For example, when the first weight is 0.4 and the second weight is 0.6, the final score of the candidate "播放周杰伦的双截棍" is 95.3976 × 0.4 + 95.0925 × 0.6 = 95.21454. After the final scores of all candidates have been computed, it can be seen that the candidate "播放周杰伦的双截棍" has the highest score, so "播放周杰伦的双截棍" ("play Jay Chou's nunchucks") is taken as the speech recognition result. After the final score is computed, a further calculation step may be added, for instance taking the logarithm of the final score as the selection score: the final score of "播放周杰伦的双截棍" above is 95.21454, and its logarithm, e.g. log₁₀ 95.21454, is computed; after the logarithm of every candidate's final score has been obtained, the logarithms serve as the selection scores of the candidates, and the candidate with the highest selection score is chosen as the speech recognition result. This computation can be varied, and other calculation schemes may be adopted, as set by the engineers.
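The worked example can be checked directly; the snippet below reproduces the weighted sum and the optional logarithm step with the scores from FIG. 12:

```python
# final = first * 0.4 + second * 0.6; pick the highest-scoring candidate,
# optionally via the base-10 logarithm of its final score.
import math

candidates = {
    "播放周杰伦的双节棍": (94.7456, 93.3381),
    "播放周杰伦的双截棍": (95.3976, 95.0925),
    "播放周结论的双截棍": (94.7951, 94.1557),
}
w1, w2 = 0.4, 0.6
final = {s: w1 * first + w2 * second for s, (first, second) in candidates.items()}
best = max(final, key=final.get)
print(best, final[best], math.log10(final[best]))
# -> 播放周杰伦的双截棍 95.21454 ...
```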
Step 722: perform entity extraction on the speech recognition result to obtain entity words.
Step 724: determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located.
Step 726: retrieve each entity word in the database corresponding to the domain of the speech recognition result.
Step 728: when a retrieval result is inconsistent with the entity word, repair the entity word.
In speech recognition, recognition errors are generally concentrated in domain-specific entities. To address this, an entity-repair operation can be performed on the speech recognition result once it has been obtained. First, entity extraction is performed on the result; an entity represents a word that plays a major role in the sentence of the recognition result. For example, in "我想去帝王大厦" ("I want to go to the 帝王 Building"), "帝王大厦" is the focus of the sentence, while "我想去" ("I want to go") merely expresses the speaker's intention; the place that ultimately matters is "帝王大厦".
Entity extraction on a speech recognition result may yield several entity words or only one. After the entities are extracted, each entity word can be retrieved. Decoding the speech data yields multiple word sequences, from which several are selected as candidate word sequences and one of which is chosen as the speech recognition result; the domain of the result can therefore be determined from the domain in which the candidate word sequences are located. Once the domain of the speech recognition result is determined, each entity word can be retrieved in the database corresponding to that domain; when the retrieved result is inconsistent with the entity word extracted from the recognition result, the extracted entity word is replaced with the retrieval result, repairing it in this way. For example, when the recognition result is "我想去帝王大厦" and the entity word "帝王大厦" has been extracted, "帝王大厦" can be retrieved in the navigation domain to which the result belongs, i.e., geographic-location information is searched; the search reveals that the actual result should be "地王大厦", so the whole sentence can be corrected to "我想去地王大厦". The entity-repair function can fix specific recognition errors quickly and accurately.
With the speech recognition method of this embodiment, because a per-domain rescoring mechanism is adopted, the rescoring operation after domain division substantially improves recognition accuracy: experiments show relative improvements of 4% to 16% in different domains and an overall accuracy improvement of 6%. In addition, because each domain has its own neural network, the cost of training can be greatly reduced. A traditional speech recognition scheme trains a single language model on a massive amount of text; to guarantee the rescoring effect the model must be very large, the training cycle is long, and evaluating the model's overall effect is difficult — training an RNNLM on 8.2 GB of text takes about 100 hours. After domain division, training a single domain's model can be shortened to within 24 hours, greatly reducing the update time for a domain-specific model. Moreover, the language model of the traditional technique cannot comprehensively cover all data: once a problem is found in it, the entire language model must be updated, and every update incurs enormous overhead. In this embodiment, only the neural network of the domain that needs processing or updating has to be retrained, which reduces training cost and shortens the time consumed by network updates.
FIG. 2 to FIG. 12 are flowcharts of the speech recognition method in various embodiments. It should be understood that although the steps in the flowcharts are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the figures may comprise several sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is not necessarily sequential either, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 13, a speech recognition apparatus is provided, including:
a word sequence acquisition module 1302, configured to acquire multiple word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
an extraction module 1304, configured to extract, from the multiple word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
a domain identification module 1306, configured to identify the domain in which the extracted candidate word sequences are located;
a rescoring module 1308, configured to input each candidate word sequence into the neural network of the corresponding domain according to the domain, and to rescore each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence; and
a speech recognition result determination module 1310, configured to obtain the final score of each candidate word sequence from the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as the speech recognition result of the speech data.
In one embodiment, as shown in FIG. 14, the domain identification module 1306 includes:
an input module 1306A, configured to input each candidate word sequence into the trained semantic classification model; and
a classification module 1306B, configured to classify each candidate word sequence with the semantic classification model to obtain the classification label corresponding to each candidate word sequence, and to take the domain corresponding to the label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
In one embodiment, the apparatus further includes a training module (not shown), configured to obtain the text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and train the neural network corresponding to each domain using the word vectors of that domain as input.
In one embodiment, the training module is further configured to, for each domain and following the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and to train the neural network by adjusting the parameters of the neural network corresponding to the domain.
In one embodiment, the rescoring module 1308 is further configured to compute a weighted sum of the first score and the second score corresponding to a candidate word sequence to obtain the final score of the candidate word sequence.
In one embodiment, the neural network is a recurrent neural network.
In one embodiment, the apparatus further includes an entity repair module (not shown), configured to perform entity extraction on the speech recognition result to obtain multiple entity words, to retrieve the entity words, and to repair an entity word when a retrieval result is inconsistent with it.
In one embodiment, the entity repair module is further configured to determine the domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located, and to retrieve each entity word in the database corresponding to the domain of the speech recognition result.
FIG. 15 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in FIG. 1. As shown in FIG. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech recognition method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech recognition method described above.
Those skilled in the art will understand that the structure shown in FIG. 15 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the speech recognition apparatus provided in the present application may be implemented in the form of a computer program, which can run on a computer device as shown in FIG. 15. The memory of the computer device may store the program modules constituting the speech recognition apparatus, for example the word sequence acquisition module, the extraction module, the domain identification module, the rescoring module, and the speech recognition result determination module shown in FIG. 13. The computer program composed of these program modules causes the processor to perform the steps of the speech recognition method of the embodiments of the present application described in this specification.
For example, the computer device shown in FIG. 15 may, through the word sequence acquisition module of the speech recognition apparatus shown in FIG. 13, acquire multiple word sequences obtained by decoding speech data together with the first score corresponding to each; through the extraction module, extract the preset number of word sequences with the highest first scores as candidate word sequences; through the domain identification module, identify the domain in which the extracted candidates are located; through the rescoring module, input each candidate into the neural network of the corresponding domain according to that domain and rescore it with the network to obtain its second score; and through the speech recognition result determination module, obtain each candidate's final score from its first and second scores and take the candidate with the highest final score as the speech recognition result of the speech data.
In one embodiment, a computer device is provided, including a memory and a processor; the memory stores a computer program which, when executed by the processor, implements the steps of the speech recognition method provided in any embodiment of the present application.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the speech recognition method provided in any embodiment of the present application.
Those of ordinary skill in the art will understand that all or part of the processes of the methods of the above embodiments can be accomplished by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features contains no contradiction, it should be regarded as within the scope described in this specification.
The above embodiments express only several implementations of the present application; their description is relatively specific and detailed, but they should not be construed as limiting the scope of this patent. It should be noted that those of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. The protection scope of this patent shall therefore be subject to the appended claims.

Claims (22)

  1. A speech recognition method, applied to a computer device, the method comprising:
    acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    identifying a domain in which the extracted candidate word sequences are located;
    inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
    rescoring each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence;
    obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
    taking the candidate word sequence with the highest final score as a speech recognition result of the speech data.
  2. The method according to claim 1, wherein identifying the domain in which the extracted candidate word sequences are located comprises:
    inputting each candidate word sequence into a semantic classification model;
    classifying each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence; and
    taking the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  3. The method according to claim 1, further comprising:
    acquiring text corresponding to each domain;
    converting each word in the text corresponding to each domain into a word vector; and
    training the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  4. The method according to claim 3, wherein training the neural network corresponding to each domain using the word vectors corresponding to that domain as input comprises:
    for each domain, following the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and training the neural network by adjusting parameters of the neural network corresponding to the domain.
  5. The method according to claim 1, wherein obtaining the final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence comprises:
    computing a weighted sum of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
  6. The method according to claim 1, wherein the neural network is a recurrent neural network.
  7. The method according to claim 1, further comprising, after taking the candidate word sequence with the highest final score as the speech recognition result of the speech data:
    performing entity extraction on the speech recognition result to obtain entity words;
    retrieving the entity words; and
    repairing an entity word when a retrieval result is inconsistent with the entity word.
  8. The method according to claim 7, wherein retrieving the entity words comprises:
    determining a domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located; and
    retrieving the entity words in a database corresponding to the domain of the speech recognition result.
  9. A speech recognition apparatus, comprising:
    a word sequence acquisition module, configured to acquire a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    an extraction module, configured to extract, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    a domain identification module, configured to identify a domain in which the extracted candidate word sequences are located;
    a rescoring module, configured to input each candidate word sequence into a neural network of the corresponding domain according to the domain, and to rescore each candidate word sequence with the neural network to obtain a second score corresponding to each candidate word sequence; and
    a speech recognition result determination module, configured to obtain a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence, and to take the candidate word sequence with the highest final score as a speech recognition result of the speech data.
  10. The apparatus according to claim 9, wherein the domain identification module comprises:
    an input module, configured to input each candidate word sequence into a trained semantic classification model; and
    a classification module, configured to classify each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence, and to take the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  11. The apparatus according to claim 9, further comprising a training module, configured to acquire text corresponding to each domain, convert each word in the text corresponding to each domain into a word vector, and train the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  12. The apparatus according to claim 11, wherein the training module is further configured to, for each domain and following the order of the words in the text corresponding to the domain, take the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and train the neural network by adjusting parameters of the neural network corresponding to the domain.
  13. The apparatus according to claim 9, further comprising an entity repair module, configured to perform entity extraction on the speech recognition result to obtain a plurality of entity words, to retrieve the entity words, and to repair an entity word when a retrieval result is inconsistent with the entity word.
  14. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
  15. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following operations:
    acquiring a plurality of word sequences obtained by decoding speech data, and a first score corresponding to each word sequence;
    extracting, from the plurality of word sequences, a preset number of word sequences with the highest first scores as candidate word sequences;
    identifying a domain in which the extracted candidate word sequences are located;
    inputting each candidate word sequence into a neural network of the corresponding domain according to the domain;
    rescoring the candidate word sequences with the neural network to obtain a second score corresponding to each candidate word sequence;
    obtaining a final score of each candidate word sequence according to the first score and the second score corresponding to each candidate word sequence; and
    taking the candidate word sequence with the highest final score among the extracted candidate word sequences as a speech recognition result of the speech data.
  16. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    inputting each candidate word sequence into a semantic classification model;
    classifying each candidate word sequence with the semantic classification model to obtain a classification label corresponding to each candidate word sequence; and
    taking the domain corresponding to the classification label with the largest share among the classification labels as the domain in which the extracted candidate word sequences are located.
  17. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    acquiring text corresponding to each domain;
    converting each word in the text corresponding to each domain into a word vector; and
    training the neural network corresponding to each domain using the word vectors corresponding to that domain as input.
  18. The computer device according to claim 17, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    for each domain, following the order of the words in the text corresponding to the domain, taking the word vector corresponding to each word in the text as input and the word vector corresponding to the word following each input word as output, and training the neural network by adjusting parameters of the neural network corresponding to the domain.
  19. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    computing a weighted sum of the first score and the second score corresponding to the candidate word sequence to obtain the final score of the candidate word sequence.
  20. The computer device according to claim 15, wherein the neural network is a recurrent neural network.
  21. The computer device according to claim 15, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    performing entity extraction on the speech recognition result to obtain entity words;
    retrieving the entity words; and
    repairing an entity word when a retrieval result is inconsistent with the entity word.
  22. The computer device according to claim 21, wherein the computer program, when executed by the processor, causes the processor to perform the following operations:
    determining a domain of the speech recognition result according to the domain in which the extracted candidate word sequences are located; and
    retrieving the entity words in a database corresponding to the domain of the speech recognition result.