CN111696531A - Recognition method for improving speech recognition accuracy by using jargon sentences - Google Patents


Info

Publication number
CN111696531A
CN111696531A
Authority
CN
China
Prior art keywords: word, language model, sequence, occurrence, probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010467020.5A
Other languages
Chinese (zh)
Inventor
高洋洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shengzhi Information Technology Nanjing Co ltd
Original Assignee
Shengzhi Information Technology Nanjing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengzhi Information Technology Nanjing Co ltd filed Critical Shengzhi Information Technology Nanjing Co ltd
Priority to CN202010467020.5A priority Critical patent/CN111696531A/en
Publication of CN111696531A publication Critical patent/CN111696531A/en
Priority to PCT/CN2021/094080 priority patent/WO2021238700A1/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses a recognition method that improves speech recognition accuracy by using dialogue-script sentences, and relates to the technical field of speech recognition. The method improves speech recognition accuracy by dynamically updating a language model with the sentences configured in the dialogue script. When the speech recognition system is built, a first language model is trained on universal text resources. After the dialogue robot's script is customized, a second language model is trained on the script sentence texts. The final language model fuses the first language model and the second language model, so that the speech recognition system achieves better accuracy on speech in the user-defined scenario.

Description

Recognition method for improving speech recognition accuracy by using jargon sentences
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a recognition method that improves speech recognition accuracy by using dialogue-script sentences.
Background
Advances in speech recognition, semantic understanding, and speech synthesis have brought intelligent voice dialogue robots into daily life, providing users with increasingly convenient intelligent voice dialogue services. Users can write customized dialogue scripts for their own scenarios and create intelligent voice dialogue robots that meet their needs.
Speech recognition converts the user's speech into the corresponding text; semantic understanding then judges the user's intent against the user-defined script sentences and generates a response text; finally, speech synthesis converts the response text into speech that is played back to the user.
Speech recognition in existing intelligent voice dialogue robot systems is general-purpose: it can be used in a variety of scenarios and is independent of the robot's type, application field, and configured interaction script. To be usable across many scenarios, the speech recognition system must balance its accuracy over all of them, which means its accuracy in any particular scenario is not especially high.
In a practical intelligent voice dialogue robot, the user presets the robot's dialogue scenario and the semantic range of its speech, an assumption that general-purpose speech recognition does not make. Enhancing the speech recognition system with the candidate corpus configured in the robot's dialogue script is therefore of real significance for improving speech recognition accuracy and human-machine dialogue quality.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention provides a recognition method that improves speech recognition accuracy by dynamically updating a language model with the sentences configured in the dialogue script.
The invention adopts the following technical scheme to solve the technical problem:
The recognition method provided by the invention for improving speech recognition accuracy by using dialogue-script sentences comprises the following steps:
Step 1: train a first language model on universal text. The first language model is trained as follows:
Let i be the length of the word sequence being counted, where i is an integer with i ≥ 1.
When i = 1, first count C(w_1), the number of occurrences of the first word w_1 of the word sequence, and then count Σ_w C(w_1, w), the total number of times any word w immediately follows w_1.
When i > 1, first count C(w_1, w_2, ..., w_i), the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the universal text, and then count Σ_w C(w_1, w_2, ..., w_{i-1}, w), the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1}; here w_s is the s-th word of the sequence, and s is an integer with 0 < s < i + 1.
For a sentence composed of the word sequence w_1, w_2, ..., w_n, where n is the number of words in the sentence, the sequence probability P_general is calculated by the following formula:
P_general(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}) is the conditional probability that the i-th word appears, P(w_1) is the probability that the 1st word appears, and P(w_2|w_1) is the conditional probability that the 2nd word appears;
P(w_i|w_1, w_2, ..., w_{i-1}) = C(w_1, w_2, ..., w_i) / Σ_w C(w_1, w_2, ..., w_{i-1}, w)
where C(w_1, w_2, ..., w_i) is the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the text, and Σ_w C(w_1, w_2, ..., w_{i-1}, w) is the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1} in the text.
Step 2: define the dialogue robot's script, and train a language model on the script sentences to obtain a second language model.
The second language model gives the sequence probability of a script sentence. Specifically, for a script sentence composed of the word sequence w_1, w_2, ..., w_n, the sequence probability P_dialogue is calculated by the following formula:
P_dialogue(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}) is the conditional probability that the i-th word appears, P(w_1) is the probability that the 1st word appears, and P(w_2|w_1) is the conditional probability that the 2nd word appears.
Step 3: fuse the first language model and the second language model to generate the final language model.
The final language model is defined as follows: for a sentence composed of the word sequence w_1, w_2, ..., w_n, the sequence probability P_final(w_1 w_2 ... w_n) is calculated by the following formula:
P_final(w_1 w_2 ... w_n) = λ_1 · P_general(w_1 w_2 ... w_n) + λ_2 · P_dialogue(w_1 w_2 ... w_n)
where λ_1 and λ_2 are interpolation coefficients that adjust the weights of the first language model and the second language model in P_final(w_1 w_2 ... w_n).
Step 4: build a speech recognition system with the final language model; this system improves the accuracy of speech recognition.
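The four steps above can be sketched in code. The following is a minimal illustration, not the patent's implementation: it substitutes a toy unigram language model for the n-gram models described later, and the corpora are hypothetical. Step 4 (combining the fused language model with an acoustic model into a recognizer) is outside the scope of the sketch.

```python
from collections import Counter

def train_lm(sentences):
    """Toy unigram LM: P(sentence) = product of relative word frequencies."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    def prob(sentence):
        p = 1.0
        for w in sentence.split():
            p *= counts[w] / total
        return p
    return prob

def build_final_lm(universal_corpus, script_sentences, lam1=0.7, lam2=0.3):
    """Steps 1-3: train two language models and fuse them by interpolation."""
    p_general = train_lm(universal_corpus)    # step 1: universal text
    p_dialogue = train_lm(script_sentences)   # step 2: script sentences
    # step 3: P_final(s) = lam1 * P_general(s) + lam2 * P_dialogue(s)
    return lambda s: lam1 * p_general(s) + lam2 * p_dialogue(s)

p_final = build_final_lm(["hello world"], ["confirm order"])
print(p_final("hello"))  # 0.7 * 0.5 + 0.3 * 0.0 = 0.35
```

The interpolation weights lam1 and lam2 here are illustrative defaults; as the description notes, they would be tuned per dialogue script.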
Compared with the prior art, the above technical scheme has the following technical effects:
The invention improves speech recognition accuracy by dynamically updating a language model with the sentences configured in the dialogue script. When the speech recognition system is built, a first language model is trained on universal text resources. After the dialogue robot's script is customized, a second language model is trained on the script sentence texts. The final language model fuses the first language model and the second language model, so that the speech recognition system achieves better accuracy on speech in the user-defined scenario.
Detailed Description
The technical scheme of the invention is explained in further detail below.
The language models used in current speech recognition systems are mainly statistical language models and neural network language models. Note that the proposed method applies not only to statistical language models but also to neural network language models.
1. Training the first language model on universal text
To be able to adapt to a variety of scenarios, speech recognition systems typically train their language models on a large text corpus drawn from many domains. This corpus is independent of any specific dialogue system and is referred to here as universal text.
The following describes the training and calculation steps of the first language model, taking the n-gram language model, the most common statistical language model, as an example.
Here i is the order of the n-gram model, an integer greater than 1; in a practical speech recognition system, i is typically set to 3 or 4. When i = 3 the model is called a 3-gram language model, and when i = 4 a 4-gram language model.
First count C(w_1, w_2, ..., w_i), the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the universal text, and then count Σ_w C(w_1, w_2, ..., w_{i-1}, w), the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1} in the text.
For a sentence w_1, w_2, ..., w_n, the sequence probability is calculated by the following formula:
P_general(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}), the conditional probability of each word, is obtained from the counts above:
P(w_i|w_1, w_2, ..., w_{i-1}) = C(w_1, w_2, ..., w_i) / Σ_w C(w_1, w_2, ..., w_{i-1}, w)
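The counting and probability formulas above can be sketched as follows for a 2-gram (bigram) model. This is an illustrative sketch, not the patent's implementation: the corpus and sentences are hypothetical, and start-of-sentence handling (the P(w_1) term) is omitted for brevity.

```python
from collections import Counter

def train_ngram_counts(sentences, n=2):
    """Count n-gram occurrences C(w_1..w_n) and prefix totals sum_w C(w_1..w_{n-1}, w)."""
    ngram_counts = Counter()
    prefix_counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for j in range(len(words) - n + 1):
            ngram = tuple(words[j:j + n])
            ngram_counts[ngram] += 1
            prefix_counts[ngram[:-1]] += 1  # any word w may follow this prefix
    return ngram_counts, prefix_counts

def conditional_prob(ngram_counts, prefix_counts, ngram):
    """P(w_i | history) = C(history, w_i) / sum_w C(history, w)."""
    prefix = ngram[:-1]
    if prefix_counts[prefix] == 0:
        return 0.0
    return ngram_counts[ngram] / prefix_counts[prefix]

def sentence_prob(ngram_counts, prefix_counts, sentence, n=2):
    """Sequence probability as the product of conditional word probabilities."""
    words = sentence.split()
    p = 1.0
    for j in range(n - 1, len(words)):
        p *= conditional_prob(ngram_counts, prefix_counts,
                              tuple(words[j - n + 1:j + 1]))
    return p

# Hypothetical toy corpus standing in for the universal text
corpus = ["please confirm your order", "please confirm your address"]
counts, prefixes = train_ngram_counts(corpus, n=2)
print(conditional_prob(counts, prefixes, ("please", "confirm")))  # 1.0
print(sentence_prob(counts, prefixes, "please confirm your order"))  # 0.5
```

A real system would also apply smoothing to unseen n-grams; that is omitted here to keep the counting formulas visible.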
2. Training the second language model on the user sentences configured in the dialogue script
First count C(w_1, w_2, ..., w_i), the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the script sentence text, and then count Σ_w C(w_1, w_2, ..., w_{i-1}, w), the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1} in that text.
For a script sentence w_1, w_2, ..., w_n, the sequence probability is calculated by the following formula:
P_dialogue(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}), the conditional probability of each word, is obtained from the counts above:
P(w_i|w_1, w_2, ..., w_{i-1}) = C(w_1, w_2, ..., w_i) / Σ_w C(w_1, w_2, ..., w_{i-1}, w)
3. Fusing the first language model and the second language model
The final language model is obtained by fusing the first language model and the second language model. Specifically, for a sentence w_1, w_2, ..., w_n, the sequence probability is calculated by the following formula:
P_final(w_1 w_2 ... w_n) = λ_1 · P_general(w_1 w_2 ... w_n) + λ_2 · P_dialogue(w_1 w_2 ... w_n)
where λ_1 and λ_2 are interpolation coefficients that adjust the weights of the general language model and the dialogue language model in P_final(w_1 w_2 ... w_n). In a specific implementation, the values of λ_1 and λ_2 vary from dialogue script to dialogue script.
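As an illustrative sketch of how the fused score might be used in recognition (the candidate sentences, probability values, and weights below are hypothetical, not from the patent), interpolation lets the script-trained model break ties between acoustically similar hypotheses:

```python
def fuse(p_general, p_dialogue, lam1=0.7, lam2=0.3):
    """P_final(s) = lam1 * P_general(s) + lam2 * P_dialogue(s)."""
    return lambda s: lam1 * p_general(s) + lam2 * p_dialogue(s)

# Hypothetical probabilities for two competing recognition hypotheses:
# the general model barely distinguishes them, but the script model
# strongly prefers the sentence that appears in the configured script.
general_probs = {"bill due date": 0.004, "bill do date": 0.005}
script_probs = {"bill due date": 0.060, "bill do date": 0.000}

p_final = fuse(lambda s: general_probs.get(s, 0.0),
               lambda s: script_probs.get(s, 0.0))

candidates = ["bill due date", "bill do date"]
best = max(candidates, key=p_final)
print(best)  # bill due date
```

Without the script model, the slightly higher general-model score would pick the wrong hypothesis; the fused score recovers the in-scenario sentence.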
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (1)

1. A recognition method for improving speech recognition accuracy by using dialogue-script sentences, characterized by comprising the following steps:
Step 1: train a first language model on universal text. The first language model is trained as follows:
Let i be the length of the word sequence being counted, where i is an integer with i ≥ 1.
When i = 1, first count C(w_1), the number of occurrences of the first word w_1 of the word sequence, and then count Σ_w C(w_1, w), the total number of times any word w immediately follows w_1.
When i > 1, first count C(w_1, w_2, ..., w_i), the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the universal text, and then count Σ_w C(w_1, w_2, ..., w_{i-1}, w), the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1}; here w_s is the s-th word of the sequence, and s is an integer with 0 < s < i + 1.
For a sentence composed of the word sequence w_1, w_2, ..., w_n, where n is the number of words in the sentence, the sequence probability P_general is calculated by the following formula:
P_general(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}) is the conditional probability that the i-th word appears, P(w_1) is the probability that the 1st word appears, and P(w_2|w_1) is the conditional probability that the 2nd word appears;
P(w_i|w_1, w_2, ..., w_{i-1}) = C(w_1, w_2, ..., w_i) / Σ_w C(w_1, w_2, ..., w_{i-1}, w)
where C(w_1, w_2, ..., w_i) is the number of times the word sequence w_1, w_2, ..., w_i occurs in order in the text, and Σ_w C(w_1, w_2, ..., w_{i-1}, w) is the total number of times any word w immediately follows the sequence w_1, w_2, ..., w_{i-1} in the text.
Step 2: define the dialogue robot's script, and train a language model on the script sentences to obtain a second language model.
The second language model gives the sequence probability of a script sentence. Specifically, for a script sentence composed of the word sequence w_1, w_2, ..., w_n, the sequence probability P_dialogue is calculated by the following formula:
P_dialogue(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · ... · P(w_n|w_1, w_2, ..., w_{n-1})
where P(w_i|w_1, w_2, ..., w_{i-1}) is the conditional probability that the i-th word appears, P(w_1) is the probability that the 1st word appears, and P(w_2|w_1) is the conditional probability that the 2nd word appears.
Step 3: fuse the first language model and the second language model to generate the final language model.
The final language model is defined as follows: for a sentence composed of the word sequence w_1, w_2, ..., w_n, the sequence probability P_final(w_1 w_2 ... w_n) is calculated by the following formula:
P_final(w_1 w_2 ... w_n) = λ_1 · P_general(w_1 w_2 ... w_n) + λ_2 · P_dialogue(w_1 w_2 ... w_n)
where λ_1 and λ_2 are interpolation coefficients that adjust the weights of the first language model and the second language model in P_final(w_1 w_2 ... w_n).
Step 4: build a speech recognition system with the final language model; this system improves the accuracy of speech recognition.
CN202010467020.5A 2020-05-28 2020-05-28 Recognition method for improving speech recognition accuracy by using jargon sentences Pending CN111696531A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010467020.5A CN111696531A (en) 2020-05-28 2020-05-28 Recognition method for improving speech recognition accuracy by using jargon sentences
PCT/CN2021/094080 WO2021238700A1 (en) 2020-05-28 2021-05-17 Recognition method employing speech statement to improve voice recognition accuracy


Publications (1)

Publication Number Publication Date
CN111696531A true CN111696531A (en) 2020-09-22

Family

ID=72478687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010467020.5A Pending CN111696531A (en) 2020-05-28 2020-05-28 Recognition method for improving speech recognition accuracy by using jargon sentences

Country Status (2)

Country Link
CN (1) CN111696531A (en)
WO (1) WO2021238700A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021238700A1 (en) * 2020-05-28 2021-12-02 升智信息科技(南京)有限公司 Recognition method employing speech statement to improve voice recognition accuracy

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880611A (en) * 2011-07-14 2013-01-16 腾讯科技(深圳)有限公司 Language modeling method and language modeling device
CN103577386A (en) * 2012-08-06 2014-02-12 腾讯科技(深圳)有限公司 Method and device for dynamically loading language model based on user input scene
CN103871402A (en) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 Language model training system, a voice identification system and corresponding method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043422B2 (en) * 2000-10-13 2006-05-09 Microsoft Corporation Method and apparatus for distribution-based language model adaptation
CN107154260B (en) * 2017-04-11 2020-06-16 北京儒博科技有限公司 Domain-adaptive speech recognition method and device
CN110310663A (en) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Words art detection method, device, equipment and computer readable storage medium in violation of rules and regulations
CN111696531A (en) * 2020-05-28 2020-09-22 升智信息科技(南京)有限公司 Recognition method for improving speech recognition accuracy by using jargon sentences



Also Published As

Publication number Publication date
WO2021238700A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
Xiong Fundamentals of speech recognition
US7505906B2 (en) System and method for augmenting spoken language understanding by correcting common errors in linguistic performance
KR101211796B1 (en) Apparatus for foreign language learning and method for providing foreign language learning service
CN113439301A (en) Reconciling between analog data and speech recognition output using sequence-to-sequence mapping
JP3323519B2 (en) Text-to-speech converter
EP3891732A1 (en) Transcription generation from multiple speech recognition systems
WO2020117507A1 (en) Training speech recognition systems using word sequences
EP2003572B1 (en) Language understanding device
Raux et al. Using task-oriented spoken dialogue systems for language learning: potential, practical applications and challenges
CN113168828A (en) Session proxy pipeline trained based on synthetic data
US20170345426A1 (en) System and methods for robust voice-based human-iot communication
JP2015187684A (en) Unsupervised training method, training apparatus, and training program for n-gram language model
JPH10504404A (en) Method and apparatus for speech recognition
CN113488026B (en) Speech understanding model generation method based on pragmatic information and intelligent speech interaction method
Georgila et al. Cross-domain speech disfluency detection
Rabiner et al. Speech recognition: Statistical methods
Yeung et al. Improving automatic forced alignment for dysarthric speech transcription.
CN111696531A (en) Recognition method for improving speech recognition accuracy by using jargon sentences
Chan et al. Discriminative pronunciation learning for speech recognition for resource scarce languages
CN116933806A (en) Concurrent translation system and concurrent translation terminal
JP4581549B2 (en) Audio processing apparatus and method, recording medium, and program
US20170337923A1 (en) System and methods for creating robust voice-based user interface
Tarján et al. Improved recognition of Hungarian call center conversations
Beaufays et al. Learning linguistically valid pronunciations from acoustic data.
CN104756183B (en) In the record correction of intelligent Chinese speech dictation ambiguous characters are effectively inputted using character describer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination