CN112634878B - Speech recognition post-processing method and system and related equipment - Google Patents

Speech recognition post-processing method and system and related equipment

Info

Publication number
CN112634878B
CN112634878B (application CN202011476615.3A)
Authority
CN
China
Prior art keywords
speech
word
language model
sentence
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011476615.3A
Other languages
Chinese (zh)
Other versions
CN112634878A (en)
Inventor
黄石磊
刘轶
程刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Original Assignee
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PKU-HKUST SHENZHEN-HONGKONG INSTITUTION, Peking University Shenzhen Graduate School filed Critical PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
Priority to CN202011476615.3A priority Critical patent/CN112634878B/en
Publication of CN112634878A publication Critical patent/CN112634878A/en
Application granted granted Critical
Publication of CN112634878B publication Critical patent/CN112634878B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19: Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech recognition post-processing method and system and related equipment. The method comprises the following steps: extracting the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech; re-scoring the N-best lists using a trained BERT bidirectional language model with part-of-speech information; and selecting the highest-scoring result from the N-best lists as the final recognition result. When re-scoring the N-best lists, the invention uses the part-of-speech-aware BERT bidirectional language model to exploit both the contextual text information and the contextual part-of-speech information, thereby further improving the performance of the speech recognition system.

Description

Speech recognition post-processing method and system and related equipment
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech recognition post-processing method and system and related equipment.
Background
Speech recognition is an interdisciplinary field. Speech recognition technology has advanced significantly over the last two decades and has begun to move from the laboratory to the market. A speech recognition system is composed of an acoustic model, a language model and a pronunciation dictionary. Language models can be divided into three general categories: rule-based language models, statistics-based language models and neural-network-based language models. The statistics-based N-gram language model is currently the one most commonly used in speech recognition; it assumes that the probability of any word depends on at most the N-1 words preceding it. Consequently, the preceding context that an N-gram language model can exploit is limited by the size of N. In theory, the larger N is, the more preceding context can be used; but the larger N is, the more severe the model's data-sparseness problem becomes. To alleviate data sparseness, a number of smoothing algorithms have been proposed: Laplace smoothing, interpolation and back-off. Today, neural-network-based language models are receiving widespread attention, and on this basis bidirectional language models, attention-based language models and the like have been proposed. How to apply neural-network-based language models to a speech recognition system to further improve its performance is an important current research direction.
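Written out, the N-gram assumption and the resulting sentence probability are as follows (standard notation, shown here for reference):

```latex
% A word depends on at most the N-1 preceding words:
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
% so an N-gram model factorises the sentence probability as
P(S) = \prod_{i=1}^{L} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```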
Disclosure of Invention
The invention aims to provide a speech recognition post-processing method and system and related equipment, so as to improve the performance of a speech recognition system.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In a first aspect of the present invention, a speech recognition post-processing method is provided, comprising: extracting the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech; re-scoring the N-best lists using a trained BERT (Bidirectional Encoder Representations from Transformers) bidirectional language model with part-of-speech information; and selecting the highest-scoring result from the N-best lists as the final recognition result.
In a possible implementation, the method further includes a training step of pre-training the part-of-speech-aware BERT bidirectional language model, the training step specifically comprising: preprocessing the text corpus used for training; performing word segmentation and part-of-speech tagging with a word segmentation tool to obtain the phrases and corresponding parts of speech in the text corpus, and then combining the B, I, E, S tags with the parts of speech to further assign a part-of-speech label to each character of each phrase; applying the same mask processing to the text information and the part-of-speech information of the text corpus; averaging the word vectors of the masked text information with the word vectors of the corresponding part-of-speech information and feeding the result into the network for training, to obtain the part-of-speech-aware BERT bidirectional language model; and, in the process of training the BERT bidirectional language model, disabling the next-sentence prediction (NSP) task and retaining only the Mask LM task, which trains the language model by masking.
In one possible implementation, re-scoring the N-best lists using the trained part-of-speech-aware BERT bidirectional language model includes: for the sentence formed by each result in the N-best lists, obtaining the part of speech of each word in the sentence with a word segmentation tool, and then combining the four tags B, I, E, S with the parts of speech to further assign a part-of-speech label to each character; constructing an input sample for each sentence using a sliding-window input scheme and word-by-word mask encoding, encoding the samples, and feeding them into the BERT bidirectional language model; and computing the probability and score of each sentence with the BERT bidirectional language model, thereby completing the re-scoring of the N-best lists.
In a possible implementation, constructing and encoding an input sample for each sentence using the sliding-window input scheme and word-by-word mask encoding, and then feeding the samples into the BERT bidirectional language model, includes: setting a sliding window of length max_length = 2M, where M is a positive integer; if the sentence length does not exceed max_length, constructing input samples for the whole sentence by masking it word by word, encoding them into a batch, and feeding the batch into the BERT bidirectional language model; if the sentence length exceeds max_length, moving the sliding window backwards from the beginning of the sentence in steps of M and extracting the sentence content inside each window in turn, where, if all words in the current window are being processed for the first time, input samples are constructed by masking word by word starting from the first word, and if the first M words of the current window were already processed in the previous window, input samples are constructed by masking word by word starting from the (M+1)-th word; finally, all input samples of the sentence are encoded into one batch and fed into the BERT bidirectional language model.
In a possible implementation, calculating the probability and the score of each sentence through the BERT bidirectional language model includes: calculating, with the BERT bidirectional language model, the probability of the text at each masked position in each sentence under the constraint of its context, denoted:
P(w_1 | w_2, w_3, ..., w_L), P(w_2 | w_1, w_3, ..., w_L), ..., P(w_L | w_1, w_2, ..., w_{L-1})
where w_1, w_2, w_3, ..., w_L denote the L words in the sentence;
the sentence is denoted by S, its length by L, its probability value by P(S) and its score by score(S), and the probability value and the score of the sentence are calculated as:
P(S) = P(w_1 | w_2, w_3, ..., w_L) P(w_2 | w_1, w_3, ..., w_L) ··· P(w_L | w_1, w_2, ..., w_{L-1});
score(S) = log P(S) = log P(w_1 | w_2, w_3, ..., w_L) + ··· + log P(w_L | w_1, w_2, ..., w_{L-1}).
In a second aspect of the present invention, a speech recognition post-processing system is provided, comprising: an extraction module for extracting the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech; and a re-scoring module for re-scoring the N-best lists using the trained part-of-speech-aware BERT bidirectional language model and selecting the highest-scoring result from the N-best lists as the final recognition result.
In a possible implementation, the system further comprises a training module for pre-training the part-of-speech-aware BERT bidirectional language model, the training module being specifically configured for: preprocessing the text corpus used for training; performing word segmentation and part-of-speech tagging with a word segmentation tool to obtain the phrases and corresponding parts of speech in the text corpus, and then combining the B, I, E, S tags with the parts of speech to further assign a part-of-speech label to each character of each phrase; applying the same mask processing to the text information and the part-of-speech information of the text corpus; averaging the word vectors of the masked text information with the word vectors of the part-of-speech information and feeding the result into the network for training, to obtain the part-of-speech-aware BERT bidirectional language model; and, in the process of training the BERT bidirectional language model, disabling the next-sentence prediction (NSP) task and retaining only the Mask LM task, which trains the language model by masking.
In a possible implementation, the re-scoring module is specifically configured for: for the sentence formed by each result in the N-best lists, obtaining the part of speech of each word in the sentence with a word segmentation tool, and then combining the four tags B, I, E, S with the parts of speech to further assign a part-of-speech label to each character; constructing an input sample for each sentence using a sliding-window input scheme and word-by-word mask encoding, encoding the samples, and feeding them into the BERT bidirectional language model; and computing the probability and score of each sentence with the BERT bidirectional language model, thereby completing the re-scoring of the N-best lists.
In a third aspect of the present invention, there is provided a computer device comprising a processor and a memory, the memory storing a program, the program comprising computer-executable instructions, the processor executing the computer-executable instructions stored in the memory when the computer device is running to cause the computer device to perform the speech recognition post-processing method according to the first aspect.
In a fourth aspect of the present invention, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising computer-executable instructions, which when executed by a computer device, cause the computer device to perform the speech recognition post-processing method as described in the first aspect.
By adopting the technical scheme, the invention has the following technical effects:
When the N-best lists are re-scored, the invention uses the BERT bidirectional language model to exploit both the preceding and following context, thereby improving the performance of the speech recognition system. Meanwhile, an attention mechanism is used in the BERT bidirectional language model, so information that is more strongly correlated with the context can be exploited. This overcomes the drawback that a typical RNN language model can only use the preceding context during re-scoring: for example, if two words in two different sentences share the same preceding context, they will be predicted to be the same when only the preceding context is used, but when they are constrained by different following context, the two words receive different and more reasonable predictions.
In addition, by using the part-of-speech-aware BERT bidirectional language model, the invention can predict the current word more accurately: because part-of-speech information is added, not only the contextual text information but also the contextual part-of-speech information can be used when predicting a word, since the known part of speech of one word directly constrains the part-of-speech assignment of neighbouring words. Meanwhile, for out-of-vocabulary (OOV) words, although they are all treated as the same character from the point of view of the text information, adding the part-of-speech information means that out-of-vocabulary words are further constrained by the part-of-speech feature.
As described above, by adding Chinese part-of-speech information to the BERT-based bidirectional language model, using the trained language model to re-score the N-best lists generated by the first-pass decoding in speech recognition, and selecting the highest-scoring result as the final recognition result, the performance of the speech recognition system can be effectively improved.
In a further embodiment, the invention performs input encoding using a sliding-window input scheme combined with word-by-word mask encoding, so that for the re-scoring task the BERT bidirectional language model is not limited by sentence length.
In a further embodiment, the invention keeps the scoring speed of the language model efficient through batch processing of both the input and the output.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments and the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a post-processing method for speech recognition according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a post-processing method for speech recognition according to an embodiment of the present invention;
FIG. 3 is a training flow diagram of a BERT bi-directional language model with parts of speech according to an embodiment of the invention;
FIG. 4 is a flow chart of the present invention for re-scoring using a BERT bi-directional language model;
FIG. 5 is a flow chart of the encoding phase of input samples in an embodiment of the present invention;
FIG. 6 is a flow chart of a decoding phase of an output probability in an embodiment of the invention;
FIG. 7 is a block diagram of a speech recognition post-processing system according to an embodiment of the present invention;
Fig. 8 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
The terms first, second, third and the like in the description and in the claims and in the above drawings, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
To facilitate understanding of the present invention, a general flow of post-speech recognition processing is first described as follows.
In a speech recognition task, a sentence of speech is input to the speech recognition system, and a word graph (lattice) corresponding to the recognition result of the audio is generated by the joint action of the acoustic model, the language model and the pronunciation dictionary. A lattice is essentially a directed acyclic graph in which each node represents the end time of a word and each edge represents a possible word together with the acoustic score and language-model score of that word. In speech recognition, the lattice is generally used to store candidate recognition paths, each candidate path representing one candidate recognition result. In a practical speech recognition system the best-scoring path does not necessarily match the true word sequence, so it is usually desirable to obtain several candidate paths with the top scores, i.e. the N-best lists (the top N best recognition results). The top N best recognition results (N-best lists) for the corresponding speech audio can therefore be extracted from the lattice generated by the first-pass decoding of the speech recognition system.
The N results in the N-best lists can then be decoded a second time, i.e. re-scored, with a new language model, and the highest-scoring result selected as the final recognition result. Currently, recurrent neural network (RNN) language models and long short-term memory (LSTM) language models are commonly used for this language-model re-scoring step: an LSTM language model is trained on a text corpus alone, the trained model is used to re-score the N-best lists, and the highest-scoring sentence is selected as the final recognition result. However, re-scoring with language models such as RNNs and LSTMs can only exploit the preceding context, not the following context at the same time, and the final effect is therefore not good enough.
Referring to fig. 1 and 2, an embodiment of the present invention provides a post-processing method for speech recognition to improve the performance of a speech recognition system. The method comprises the following steps:
S1, extracting the top N best recognition results, namely the N-best lists, from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech;
S2, re-scoring the N-best lists using a trained BERT (Bidirectional Encoder Representations from Transformers) bidirectional language model with part-of-speech information;
S3, selecting the result with the highest score from N-best lists as a final recognition result.
The key of the method is that a part-of-speech-aware BERT bidirectional language model is trained in advance and applied to the speech recognition system: the N-best lists generated by the first-pass decoding of the speech recognition system are re-scored and re-ranked with this model, and the highest-scoring result is finally selected as the final recognition result.
The method mainly comprises two parts, namely training a BERT bidirectional language model with part of speech, and then applying the model to a voice recognition system for re-scoring operation. The two parts are described in detail below.
1. The BERT bi-directional language model with parts of speech is trained.
The flow is shown in fig. 3, and includes the following steps.
1.1. The text corpus used to train the BERT bi-directional language model is preprocessed.
Optionally, the BERT bidirectional language model is trained on data with punctuation removed, so that the trained BERT bidirectional language model can be used directly for the re-scoring task. The preprocessing flow of the text corpus is as follows:
(1) Firstly, collecting corpus on a network, such as a public website;
(2) Then removing illegal characters in the corpus;
(3) Dividing the text corpus into lines by periods, semicolons, exclamation marks, ellipses and question marks so that each line represents a sentence;
(4) Finally, removing all punctuation from the text corpus.
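A minimal Python sketch of this preprocessing flow; the regular expressions below, and the notion of which characters count as "illegal", are illustrative assumptions rather than the exact rules of the method:

```python
import re

SENTENCE_DELIMS = r"[。；！？…;!?]"   # split points: period, semicolon, exclamation, ellipsis, question mark
ILLEGAL = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9。；！？…;!?，,、：:]")   # assumed notion of "illegal characters"
PUNCT = re.compile(r"[。；！？…;!?，,、：:]")

def preprocess(raw_text: str) -> list[str]:
    """Turn raw corpus text into one punctuation-free sentence per line."""
    text = ILLEGAL.sub("", raw_text)              # (2) remove illegal characters
    sentences = re.split(SENTENCE_DELIMS, text)   # (3) one sentence per line
    cleaned = [PUNCT.sub("", s).strip() for s in sentences]   # (4) drop all punctuation
    return [s for s in cleaned if s]

if __name__ == "__main__":
    sample = "今天天气很好！我们去公园散步，好吗？"
    for line in preprocess(sample):
        print(line)
```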
1.2. Only Mask LM training mode is used.
Since the next-sentence prediction task (NSP, Next Sentence Prediction) has no relation to the re-scoring task, the NSP task is not needed when continuing to train the BERT bidirectional language model; only the Mask LM task of the originally trained BERT bidirectional language model is retained. The Mask LM task refers to the task of training the language model (LM) by masking.
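In practice, keeping only the Mask LM objective can be realised, for example, with the Hugging Face transformers library by continuing pre-training a masked-LM-only model. The sketch below assumes that library and the bert-base-chinese checkpoint (model name, corpus path and hyper-parameters are illustrative), and it does not yet include the part-of-speech channel described in the next subsection:

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")      # masked-LM head only, no NSP head

dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=100),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mlm-only",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```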
1.3. Part-of-speech information is added to the BERT bidirectional language model.
(1) First, word segmentation and part-of-speech tagging are performed on the preprocessed text corpus with a word segmentation tool, such as the jieba word segmentation tool, to obtain the phrases and their corresponding parts of speech in the text corpus.
(2) Then, the four tags B, I, E, S are combined with the parts of speech to further assign the part of speech of each phrase to its individual characters, i.e. the part of speech of each character is determined; equivalently, the part of speech is combined with the word segmentation sequence, or word segmentation information is added into the part-of-speech sequence.
The meanings of the four tags B, I, E, S are as follows: when the length of a phrase is greater than one character, the tag B marks the beginning character of the corresponding part of speech, the tag I marks a middle character, and the tag E marks the ending character; the tag S is the tag appended to the part of speech when a phrase consists of only one Chinese character, indicating that the word is a single character, and the part of speech is combined directly with S to represent the part of speech of that word.
Examples:
The word 'pick up' is a verb (V) consisting of two Chinese characters; the part-of-speech tags for the two characters are 'BV' and 'EV', respectively;
The word 'beat' is a verb (V) consisting of a single Chinese character; 'SV' is used as the part-of-speech tag for that character;
The word 'Yangtze River Bridge' is a noun (N) consisting of four Chinese characters; the part-of-speech tags for the four characters are 'BN, IN, IN, EN', respectively.
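An illustrative sketch of this character-level tag expansion, assuming the jieba.posseg interface; jieba's own tag set is used (e.g. 'v' for verbs, 'n' for nouns), and the upper-casing merely mirrors the 'BV'/'SV'/'BN' style of the examples above:

```python
import jieba.posseg as pseg

def char_level_pos_tags(sentence: str) -> list[tuple[str, str]]:
    """Expand word-level POS tags into per-character B/I/E/S + POS tags."""
    tagged = []
    for word, pos in pseg.cut(sentence):
        pos = pos.upper()
        if len(word) == 1:
            tagged.append((word, "S" + pos))       # single-character word
        else:
            tagged.append((word[0], "B" + pos))    # beginning character
            for ch in word[1:-1]:
                tagged.append((ch, "I" + pos))     # middle characters
            tagged.append((word[-1], "E" + pos))   # ending character
    return tagged

if __name__ == "__main__":
    for ch, tag in char_level_pos_tags("南京市长江大桥"):
        print(ch, tag)
```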
(3) The same masking is applied to the text information and the part-of-speech information of the corpus, and the masked text is used as the label of the corresponding sample, thereby constructing the training input samples.
The masking process replaces the character at a given position in the text, and its corresponding part of speech, with the '[MASK]' token; the replaced character serves as the label information of the text. Input samples for training the language model are constructed in this way.
(4) The word vectors of the masked text information and the word vectors of the corresponding part-of-speech information are combined by an average weighted summation, and the result is used to train the part-of-speech-aware BERT bidirectional language model.
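One possible way to realise this combination of the two input channels is sketched below with PyTorch and a Hugging Face BERT; the extra tag-embedding layer, the equal 0.5/0.5 weighting and the vocabulary size for the B/I/E/S part-of-speech tags are assumptions of this illustration:

```python
import torch.nn as nn
from transformers import BertForMaskedLM

class PosAwareBertMLM(nn.Module):
    """Masked LM whose input embedding is the average of BERT's token embedding
    and an extra part-of-speech tag embedding (B/I/E/S + POS tag vocabulary)."""

    def __init__(self, bert_name: str = "bert-base-chinese", num_pos_tags: int = 200):
        super().__init__()
        self.bert = BertForMaskedLM.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.pos_embeddings = nn.Embedding(num_pos_tags, hidden)

    def forward(self, input_ids, pos_ids, attention_mask=None, labels=None):
        # At masked positions, input_ids holds [MASK] and pos_ids holds a reserved mask tag id.
        tok_emb = self.bert.bert.embeddings.word_embeddings(input_ids)
        pos_emb = self.pos_embeddings(pos_ids)
        mixed = 0.5 * tok_emb + 0.5 * pos_emb      # average weighted summation of the two channels
        return self.bert(inputs_embeds=mixed,
                         attention_mask=attention_mask,
                         labels=labels)
```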
2. Using the part-of-speech-aware BERT bidirectional language model for re-scoring.
2.0 First a general calculation formula is introduced.
A sentence of length L is denoted by S (i.e. the sentence comprises L words) and the probability value of the sentence is denoted by P (S). For a general unidirectional language model, P (S) is calculated by the following formula:
P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ··· P(w_L | w_1, w_2, ..., w_{L-1}).
In the invention, the BERT bi-directional language model is used, and the calculation formula of the modified P (S) is as follows:
P(S) = P(w_1 | w_2, w_3, ..., w_L) P(w_2 | w_1, w_3, ..., w_L) ··· P(w_L | w_1, w_2, ..., w_{L-1});
where P(w_1 | w_2, w_3, ..., w_L), P(w_2 | w_1, w_3, ..., w_L), ..., P(w_L | w_1, w_2, ..., w_{L-1}) are the probability values of the text at each masked position in the sentence under the context constraint, and w_1, w_2, w_3, ..., w_L denote the L words in the sentence, respectively.
Finally, the score of the sentence is denoted by score(S), and its calculation formula is:
score(S) = log P(S) = log P(w_1 | w_2, w_3, ..., w_L) + ··· + log P(w_L | w_1, w_2, ..., w_{L-1}).
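In code, this score computation reduces to summing log-probabilities over the masked positions; a minimal sketch follows (the probability values in the comment are made-up toy numbers, used only to show how two candidates would compare):

```python
import math

def sentence_score(masked_position_probs: list[float]) -> float:
    """score(S) = log P(S): sum of log P(w_j | context) over all masked positions."""
    return sum(math.log(p) for p in masked_position_probs)

# Toy illustration with made-up probabilities:
# sentence_score([0.31, 0.22, 0.47])  ->  about -3.44
# sentence_score([0.05, 0.40, 0.12])  ->  about -6.03, so the first candidate wins
```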
2.1. Acquiring the N-best lists recognition results corresponding to the audio.
(1) First, the input speech audio, for example audio whose content contains only one sentence, is fed into a Kaldi-based speech recognition system.
(2) The speech recognition system obtains the lattice generated by the first-pass decoding of the input speech, and the N-best lists corresponding to the audio are obtained from the lattice with the Kaldi toolkit. N is a positive integer whose specific value is determined by actual needs; assume N = 300, i.e. 300-best lists, so that 300 candidate recognition results are obtained for the audio. The N-best lists include N recognition results, each of which corresponds to one sentence; in other words, there are N sentences in the N-best lists. The purpose of the subsequent re-scoring operation is to score the N recognition results (sentences) so as to select the one with the highest score.
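As an illustration of this step, the sketch below calls Kaldi's lattice-to-nbest tool from Python; the archive names, the acoustic scale and the value N = 300 are assumptions of this example, and the conversion of the resulting N-best lattices into plain word sequences (e.g. with nbest-to-linear and the word symbol table) is only indicated in the comment:

```python
import subprocess

# Assumed archive names; adjust to the actual Kaldi setup.
LATTICE_ARCHIVE = "ark:lat.1"      # lattices from the first-pass decoding
NBEST_ARCHIVE = "ark:nbest.1"      # output: one linear lattice per candidate path

# Extract the 300 best paths from each utterance's lattice.
subprocess.run(
    ["lattice-to-nbest", "--n=300", "--acoustic-scale=0.1",
     LATTICE_ARCHIVE, NBEST_ARCHIVE],
    check=True,
)
# The N-best lattices can then be converted into word sequences (for example with
# nbest-to-linear plus the word symbol table) to obtain the N candidate sentences
# that are re-scored in the following steps.
```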
2.2. Re-scoring the N-best lists using the part-of-speech-aware BERT bidirectional language model.
The flow of the re-scoring operation is shown in fig. 4 and includes the following two stages.
(1) Encoding phase of input samples based on sliding window:
First, for each sentence in the N-best lists, the part of speech of each word in the sentence is obtained with a word segmentation tool, such as the jieba word segmentation tool, and then the four tags B, I, E, S are combined with the parts of speech to further assign a part-of-speech label to each character, where B marks the beginning character of a part of speech, I marks a middle character, E marks the ending character, and S indicates that the word consists of a single Chinese character, whose part of speech is combined directly with S to represent the part of speech of that word.
Then, the part-of-speech tag information is processed in exactly the same way as the text information: an input sample is constructed for each sentence using the sliding-window input scheme and word-by-word mask encoding, and the encoded samples are fed into the BERT bidirectional language model.
As shown in fig. 5, the specific steps are as follows:
Since the BERT bidirectional language model requires input text of a fixed length, a maximum input length max_length = 100 is set, so that sentences whose length does not exceed max_length can be fed into the model directly. If the sentence length is greater than max_length, the sliding-window input scheme is used.
The 300 candidate results (sentences) are denoted by S_i, i.e. S_1, S_2, ..., S_300. The probability and score of each result, namely P(S_i) and score(S_i), are then calculated using the trained BERT bidirectional language model, where each P(S_i) is computed with the word-by-word mask method.
For a sentence of length L that does not exceed max_length, input samples can be constructed for the whole sentence by masking word by word, i.e.: the first word in the sentence is masked (the first word is used as the label and is replaced in the sentence by the mask symbol '[MASK]'), which yields the input sample Input(S_i, 1) of the BERT bidirectional language model; the same operation is then performed for each of the L words in the sentence, producing the input samples Input(S_i, 2), ..., Input(S_i, L). The L input samples are encoded together into one batch and fed into the BERT bidirectional language model, as illustrated by the sketch below.
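A sketch of this word-by-word masking for a sentence that fits within max_length; token ids are assumed to be already produced by the tokenizer, and the [CLS]/[SEP] special tokens and the parallel part-of-speech channel are omitted for brevity:

```python
import torch

def build_masked_batch(token_ids: list[int], mask_id: int) -> torch.Tensor:
    """One input sample per token: sample j masks position j of the sentence,
    so a sentence of length L yields a batch of L samples."""
    batch = []
    for j in range(len(token_ids)):
        sample = list(token_ids)
        sample[j] = mask_id                 # replace the j-th token with [MASK]
        batch.append(sample)
    return torch.tensor(batch)              # shape (L, L): L samples of length L

# For a 4-token sentence the rows are:
# [M, t2, t3, t4], [t1, M, t3, t4], [t1, t2, M, t4], [t1, t2, t3, M]
```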
For sentences longer than max_length, a sliding window of length 2M is set, where M is a positive integer. One may take 2M = max_length, for example 100, so that M = 50. The sliding window is then moved backwards from the beginning of the sentence in steps of M, and the sentence content inside each window is extracted in turn; with M = 50, the window slides 50 positions each time, i.e. 50 characters of text are retained each time.
In the first window, before any sliding has taken place, all words are processed for the first time, so input samples are constructed by masking word by word starting from the first word. For a window obtained after sliding, the conditional probabilities of the first M (e.g. 50) words of the current window were already computed in the previous window, so word-by-word masking need not start from the first position but starts from the (M+1)-th (e.g. 51st) word instead. Finally, all input samples of the sentence are encoded and assembled into one batch, which is fed into the BERT bidirectional language model; a sketch of this windowing rule follows.
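A sketch of the sliding-window rule described above; it reuses the idea of the previous sketch and again omits special tokens and the part-of-speech channel (padding the variable-length rows into one batch is only noted in a comment):

```python
def build_windows(token_ids: list[int], max_length: int = 100):
    """Yield (window, first_mask_index) pairs: the window slides in steps of
    M = max_length // 2; masking starts at index 0 in the first window and at
    index M in every later window, because the first M tokens of a later window
    were already scored in the previous window."""
    M = max_length // 2
    L = len(token_ids)
    if L <= max_length:
        yield token_ids, 0
        return
    start = 0
    while True:
        window = token_ids[start:start + max_length]
        yield window, (0 if start == 0 else M)
        if start + max_length >= L:         # this window already reaches the sentence end
            return
        start += M

def window_masked_rows(window: list[int], first_mask: int, mask_id: int) -> list[list[int]]:
    """One masked copy of the window per position that still needs a probability."""
    rows = []
    for j in range(first_mask, len(window)):
        row = list(window)
        row[j] = mask_id
        rows.append(row)
    return rows

# All rows of all windows of one sentence are padded to a common length,
# encoded, and assembled into a single batch for the model.
```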
(2) And the output probability decoding stage.
The flow of this stage is shown in fig. 6.
Because the NSP task has been removed, the BERT bidirectional language model only adds a softmax layer after the BERT model. Through the normalization of the softmax layer, the BERT model's prediction at a masked position is converted into predicted probability values over all elements of the dictionary. Let v_i denote the i-th element of the pronunciation dictionary; the model produces a raw prediction score for v_i at the masked position, and after the softmax layer this score becomes the probability that the masked position is predicted to be v_i.
Let prob_list denote the output of the softmax layer for a masked position, and let id_v denote the index of the text v in the pronunciation dictionary. The probability of the word w_j in sentence S_i is then obtained by reading off prob_list at the index of w_j, i.e. P(w_j | context) = prob_list[id_{w_j}].
Therefore, after the final softmax layer, the probability of each masked word under the constraint of its context can be obtained directly by decoding at the word's index id in the dictionary; for a sentence of length L, the probabilities of its words are, respectively: P(w_1 | w_2, w_3, ..., w_L), ..., P(w_L | w_1, w_2, ..., w_{L-1}).
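A sketch of this decoding step, assuming a Hugging Face-style masked-LM model whose output exposes prediction scores as .logits of shape (batch, sequence length, vocabulary size); masked_positions and true_token_ids are assumed to be 1-D LongTensors aligned with the batch:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def masked_position_logprobs(model, batch_input_ids, masked_positions, true_token_ids):
    """Return log P(original token | context) at the masked position of every sample.

    masked_positions[i] is the index that was replaced by [MASK] in sample i,
    and true_token_ids[i] is the dictionary id of the original token there."""
    logits = model(input_ids=batch_input_ids).logits        # (batch, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)                # softmax layer, in log space
    rows = torch.arange(batch_input_ids.size(0))
    return log_probs[rows, masked_positions, true_token_ids]   # prob_list[id_v] per sample
```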
Then, the probability P (S) and score (S) of each sentence can be calculated, respectively:
P(S) = P(w_1 | w_2, w_3, ..., w_L) P(w_2 | w_1, w_3, ..., w_L) ··· P(w_L | w_1, w_2, ..., w_{L-1});
score(S) = log P(S) = log P(w_1 | w_2, w_3, ..., w_L) + ··· + log P(w_L | w_1, w_2, ..., w_{L-1}).
The scores of all sentences, score(S_1), score(S_2), ..., score(S_300), are thus obtained.
Finally, the sentence with the largest score among all sentence scores is selected as the final recognition result of the speech recognition system.
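Putting the pieces together, the final selection is a simple argmax over the candidate scores; the sketch below assumes a score_fn that computes score(S) for one sentence with the part-of-speech-aware BERT bidirectional language model, for example by chaining the helper sketches above:

```python
def rescore_nbest(candidates: list[str], score_fn) -> str:
    """Re-score each candidate sentence from the N-best lists and return the best one.

    score_fn(sentence) is assumed to compute score(S) = log P(S), e.g. by chaining
    the helper sketches above (char-level POS tags, masked batches, log-probs, sum)."""
    scores = [score_fn(s) for s in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

# usage sketch:
# best_sentence = rescore_nbest(nbest_sentences, score_fn=bert_sentence_score)
```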
The speech recognition post-processing method of the present invention is described in detail above.
In order to facilitate the implementation of the invention, the embodiment of the invention also provides a corresponding system and equipment.
Referring to fig. 7, in one embodiment of the present invention, a speech recognition post-processing system is provided, including:
The extraction module 71 is configured to extract the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech;
The re-scoring module 72 is configured to re-score the N-best lists using the trained part-of-speech-aware BERT bidirectional language model and to select the highest-scoring result from the N-best lists as the final recognition result.
Further, the system further comprises a training module 73 for pre-training the BERT bi-directional language model with parts of speech, the training module 73 being specifically configured to:
preprocessing the text corpus for training;
performing word segmentation and part-of-speech tagging with a word segmentation tool to obtain the phrases and corresponding parts of speech in the text corpus, and then combining the B, I, E, S tags with the parts of speech to further assign a part-of-speech label to each character of each phrase;
applying the same mask processing to the text information and the part-of-speech information of the text corpus;
averaging the word vectors of the masked text information with the word vectors of the part-of-speech information and feeding the result into the network for training, to obtain the part-of-speech-aware BERT bidirectional language model;
and, in the process of training the BERT bidirectional language model, disabling the next-sentence prediction (NSP) task and retaining only the Mask LM task, which trains the language model by masking.
Further, the re-scoring module 72 may be specifically configured to:
For the sentence formed by each result in the N-best lists, obtaining the part of speech of each word in the sentence with a word segmentation tool, and then combining the four tags B, I, E, S with the parts of speech to further assign a part-of-speech label to each character; constructing an input sample for each sentence using a sliding-window input scheme and word-by-word mask encoding, encoding the samples, and feeding them into the BERT bidirectional language model; and computing the probability and score of each sentence with the BERT bidirectional language model, thereby completing the re-scoring of the N-best lists.
Referring to fig. 8, in one embodiment of the present invention there is further provided a computer device comprising a processor 81 and a memory 82, wherein the memory 82 stores a program that includes computer-executable instructions; when the computer device 80 runs, the processor 81 executes the computer-executable instructions stored in the memory 82, so that the computer device 80 performs the speech recognition post-processing method described above.
An embodiment of the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs comprising computer-executable instructions, which when executed by a computer device, cause the computer device to perform a speech recognition post-processing method as described hereinbefore.
In summary, the embodiments of the invention disclose a speech recognition post-processing method and system and related equipment. By adopting the above technical scheme, the invention has the following technical effects:
When the N-best lists are re-scored, the invention uses the BERT bidirectional language model to exploit both the preceding and following context, thereby improving the performance of the speech recognition system. Meanwhile, an attention mechanism is used in the BERT bidirectional language model, so information that is more strongly correlated with the context can be exploited. This overcomes the drawback that a typical RNN language model can only use the preceding context during re-scoring: for example, if two words in two different sentences share the same preceding context, they will be predicted to be the same when only the preceding context is used, but when they are constrained by different following context, the two words receive different and more reasonable predictions.
In addition, by using the part-of-speech-aware BERT bidirectional language model, the invention can predict the current word more accurately: because part-of-speech information is added, not only the contextual text information but also the contextual part-of-speech information can be used when predicting a word, since the known part of speech of one word directly constrains the part-of-speech assignment of neighbouring words. Meanwhile, for out-of-vocabulary (OOV) words, although they are all treated as the same character from the point of view of the text information, adding the part-of-speech information means that out-of-vocabulary words are further constrained by the part-of-speech feature.
As described above, by adding Chinese part-of-speech information to the BERT-based bidirectional language model, using the trained language model to re-score the N-best lists generated by the first-pass decoding in speech recognition, and selecting the highest-scoring result as the final recognition result, the performance of the speech recognition system can be effectively improved.
In a further embodiment, the invention performs input encoding using a sliding-window input scheme combined with word-by-word mask encoding, so that for the re-scoring task the BERT bidirectional language model is not limited by sentence length.
In a further embodiment, the invention keeps the scoring speed of the language model efficient through batch processing of both the input and the output.
In the foregoing embodiments, the descriptions of the embodiments are each focused, and for those portions of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; those of ordinary skill in the art will appreciate that: the technical scheme described in the above embodiments can be modified or some technical features thereof can be replaced equivalently; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of post-processing speech recognition, comprising:
extracting the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech;
re-scoring the N-best lists using a trained part-of-speech-aware BERT bidirectional language model; and selecting the highest-scoring result from the N-best lists as the final recognition result;
wherein the method further comprises a training step of pre-training the part-of-speech-aware BERT bidirectional language model, the training step specifically comprising:
preprocessing the text corpus for training;
performing word segmentation and part-of-speech tagging with a word segmentation tool to obtain the phrases and corresponding parts of speech in the text corpus, and then combining the B, I, E, S tags with the parts of speech to further assign a part-of-speech label to each character of each phrase; applying the same mask processing to the text information and the part-of-speech information of the text corpus; wherein, when the length of a phrase is greater than one character, the tag B denotes the beginning character of the corresponding part of speech, the tag I denotes a middle character, and the tag E denotes the ending character; when a phrase consists of only one Chinese character, the tag S denotes that the word consists of a single Chinese character, and the part of speech is combined directly with S to represent the part of speech of that word;
averaging the word vectors of the masked text information with the word vectors of the corresponding part-of-speech information, and training to obtain the part-of-speech-aware BERT bidirectional language model;
in the process of training the BERT bidirectional language model, disabling the next-sentence prediction (NSP) task and retaining only the Mask LM task, which trains the language model by masking.
2. The method of claim 1, wherein the re-scoring N-best lists using a trained BERT bi-directional language model with parts of speech comprises:
for the sentence formed by each result in the N-best lists, obtaining the part of speech of each word in the sentence with a word segmentation tool, and then combining the four tags B, I, E, S with the parts of speech to further assign a part-of-speech label to each character;
constructing an input sample for each sentence using a sliding-window input scheme and word-by-word mask encoding, performing encoding processing, and then inputting the input samples into the BERT bidirectional language model;
and calculating the probability and score of each sentence through the BERT bidirectional language model, thereby completing the re-scoring of the N-best lists.
3. The method of claim 2, wherein the input sample is constructed and encoded for each sentence using a sliding window based input sample method and a word-by-word mask encoding method, and then input to the BERT bi-directional language model, comprising:
setting a sliding window of length max_length = 2M, where M is a positive integer;
If the length of the sentence does not exceed max_length, constructing an input sample in a word-by-word mask mode for the whole sentence, constructing a batch after coding processing, and inputting the batch into the BERT bi-directional language model;
If the length of the sentence exceeds max_length, sequentially extracting the sentence content in each sliding window by moving the sliding window backwards by a step length M from the beginning of the sentence, constructing an input sample by adopting a word-by-word mask mode from the first word if all words in the current sliding window are processed for the first time, constructing the input sample by adopting a word-by-word mask mode from the M+1st word if the first M words in the current sliding window are processed in the previous sliding window, finally constructing a batch after coding all the input samples of the sentence, and inputting the batch into the BERT bi-directional language model.
4. The method of claim 2, wherein computing the probability and score for each sentence via the BERT bi-directional language model comprises:
calculating by the BERT bi-directional language model to obtain the probability value of the text of each mask position in each sentence under the condition of context constraint, and marking as:
P(w_1 | w_2, w_3, ..., w_L), P(w_2 | w_1, w_3, ..., w_L), ..., P(w_L | w_1, w_2, ..., w_{L-1})
where w_1, w_2, w_3, ..., w_L denote the L words in the sentence, respectively;
The sentence is represented by S, the length of the sentence is represented by L, the probability value of the sentence is represented by P (S), the score of the sentence is represented by score (S), and the probability value and the score of the sentence are calculated as follows:
P(S) = P(w_1 | w_2, w_3, ..., w_L) P(w_2 | w_1, w_3, ..., w_L) ··· P(w_L | w_1, w_2, ..., w_{L-1});
score(S) = log P(S) = log P(w_1 | w_2, w_3, ..., w_L) + ··· + log P(w_L | w_1, w_2, ..., w_{L-1}).
5. a speech recognition post-processing system, comprising:
an extraction module for extracting the top N best recognition results (N-best lists) from the word graph (lattice) generated by the speech recognition system during the first-pass decoding of the input speech;
The re-scoring module is used for re-scoring the N-best lists by using the trained BERT bi-directional language model with the part of speech; selecting the result with the highest score from N-best lists as a final recognition result;
the system also comprises a training module for pre-training the BERT bi-directional language model with parts of speech, wherein the training module is specifically used for:
preprocessing the text corpus for training;
performing word segmentation and part-of-speech tagging with a word segmentation tool to obtain the phrases and corresponding parts of speech in the text corpus, and then combining the B, I, E, S tags with the parts of speech to further assign a part-of-speech label to each character of each phrase; applying the same mask processing to the text information and the part-of-speech information of the text corpus; averaging the word vectors of the masked text information with the word vectors of the part-of-speech information and feeding the result into the network for training, to obtain the part-of-speech-aware BERT bidirectional language model; wherein, when the length of a phrase is greater than one character, the tag B denotes the beginning character of the corresponding part of speech, the tag I denotes a middle character, and the tag E denotes the ending character; when a phrase consists of only one Chinese character, the tag S denotes that the word consists of a single Chinese character, and the part of speech is combined directly with S to represent the part of speech of that word;
in the process of training the BERT bidirectional language model, disabling the next-sentence prediction (NSP) task and retaining only the Mask LM task, which trains the language model by masking.
6. The system of claim 5, wherein the re-scoring module is specifically configured to:
for the sentence formed by each result in the N-best lists, obtaining the part of speech of each word in the sentence with a word segmentation tool, and then combining the four tags B, I, E, S with the parts of speech to further assign a part-of-speech label to each character; constructing an input sample for each sentence using a sliding-window input scheme and word-by-word mask encoding, encoding the samples, and then inputting them into the BERT bidirectional language model; and calculating the probability and score of each sentence through the BERT bidirectional language model, thereby completing the re-scoring of the N-best lists.
7. A computer device comprising a processor and a memory, the memory having stored therein a program comprising computer-executable instructions that, when the computer device is running, cause the computer device to perform the speech recognition post-processing method of claim 1.
8. A computer readable storage medium storing one or more programs, the one or more programs comprising computer-executable instructions, which when executed by a computer device, cause the computer device to perform the speech recognition post-processing method of claim 1.
CN202011476615.3A 2020-12-15 2020-12-15 Speech recognition post-processing method and system and related equipment Active CN112634878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011476615.3A CN112634878B (en) 2020-12-15 2020-12-15 Speech recognition post-processing method and system and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011476615.3A CN112634878B (en) 2020-12-15 2020-12-15 Speech recognition post-processing method and system and related equipment

Publications (2)

Publication Number Publication Date
CN112634878A CN112634878A (en) 2021-04-09
CN112634878B true CN112634878B (en) 2024-05-17

Family

ID=75313019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011476615.3A Active CN112634878B (en) 2020-12-15 2020-12-15 Speech recognition post-processing method and system and related equipment

Country Status (1)

Country Link
CN (1) CN112634878B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117558273A (en) * 2022-08-03 2024-02-13 顺丰科技有限公司 Customer service voice recognition method, customer service voice recognition device, customer service voice recognition equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111062217A (en) * 2019-12-19 2020-04-24 江苏满运软件科技有限公司 Language information processing method and device, storage medium and electronic equipment
CN111554275A (en) * 2020-05-15 2020-08-18 深圳前海微众银行股份有限公司 Speech recognition method, device, equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401084B (en) * 2018-02-08 2022-12-23 腾讯科技(深圳)有限公司 Method and device for machine translation and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN110083710A (en) * 2019-04-30 2019-08-02 北京工业大学 It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110489750A (en) * 2019-08-12 2019-11-22 昆明理工大学 Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111062217A (en) * 2019-12-19 2020-04-24 江苏满运软件科技有限公司 Language information processing method and device, storage medium and electronic equipment
CN111554275A (en) * 2020-05-15 2020-08-18 深圳前海微众银行股份有限公司 Speech recognition method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112634878A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN110135457B (en) Event trigger word extraction method and system based on self-encoder fusion document information
CN111480197B (en) Speech recognition system
CN110489555B (en) Language model pre-training method combined with similar word information
CN111557029B (en) Method and system for training a multilingual speech recognition network and speech recognition system for performing multilingual speech recognition
CN107357789B (en) Neural machine translation method fusing multi-language coding information
Belinkov et al. Arabic diacritization with recurrent neural networks
CN109979429A (en) A kind of method and system of TTS
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN110603583A (en) Speech recognition system and method for speech recognition
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN110826334A (en) Chinese named entity recognition model based on reinforcement learning and training method thereof
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN107797988A (en) A kind of mixing language material name entity recognition method based on Bi LSTM
CN107797986B (en) LSTM-CNN-based mixed corpus word segmentation method
CN115510863A (en) Question matching task oriented data enhancement method
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN112634878B (en) Speech recognition post-processing method and system and related equipment
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant