CN113012690B - Decoding method and device supporting domain customization language model - Google Patents

Decoding method and device supporting domain customization language model

Info

Publication number
CN113012690B
CN113012690B (application CN202110192804.6A)
Authority
CN
China
Prior art keywords
decoding
score
matrix
language model
idx
Prior art date
Legal status
Active
Application number
CN202110192804.6A
Other languages
Chinese (zh)
Other versions
CN113012690A (en)
Inventor
谢东平 (Xie Dongping)
Current Assignee
Suzhou Collaborative Innovation Intelligent Manufacturing Equipment Co ltd
Original Assignee
Suzhou Collaborative Innovation Intelligent Manufacturing Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Collaborative Innovation Intelligent Manufacturing Equipment Co ltd
Priority to CN202110192804.6A
Publication of CN113012690A
Application granted
Publication of CN113012690B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a decoding method supporting a domain-customized language model which, based on beam search decoding, comprises the following steps: generating a first decoding network as the first-pass decoding search network; generating a second decoding network from the custom-domain language model; expanding the beams while rescoring with the custom-domain language model; and outputting all decoding hypotheses once enough decoding hypotheses exist in the final decoding hypothesis set. Because the decoding score of the custom-domain language model is folded into the beam expansion process, an end-to-end language model can be applied quickly to a new domain, improving the performance of the speech recognition system and optimizing its robustness and recognition efficiency. The decoding apparatus provided by the invention implements this decoding method and therefore has the corresponding advantages; it requires only limited resources, which favors the further popularization and application of speech recognition systems.

Description

Decoding method and device supporting domain customization language model
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a decoding method, and a corresponding decoding apparatus, supporting a domain-customized language model in end-to-end speech recognition.
Background
The information disclosed in this background section is provided only to enhance understanding of the background of the disclosure, and may therefore contain information that does not form prior art already known to a person of ordinary skill in the art. The matters in this section are merely publicly known and do not necessarily represent prior art in the field.
Existing end-to-end speech recognition systems mainly fall into three types: those based on Connectionist Temporal Classification (CTC), Sequence-to-Sequence (Seq2Seq) models, or the RNN Transducer (RNN-T). Because of its small footprint, an RNN-T-based end-to-end system is well suited to speech recognition services deployed on a terminal, where it can provide high-quality recognition to users in an offline environment. On large datasets, RNN-T end-to-end speech recognition has proven to have a simpler pipeline and better model recognition accuracy than the traditional hybrid system. End-to-end speech recognition has become the most popular research direction in speech recognition over the last two years; it has achieved remarkable results and performance improvements in large-scale speech recognition and has drawn strong attention from the speech recognition community. One of the foundations of speech recognition is the language model. Language models generally come in three types: (1) generative models, (2) analytical models, and (3) recognition models. On the basis of the generative and analytical models, the two can be combined into a model of practical value that supports decoding with a domain-customized language model, namely a recognition model. Starting from a set of language elements and a rule system, a recognition model can determine, through a finite number of operations, whether those elements are a disordered jumble of words or a well-formed sentence of the language; an example is the syntactic type calculus proposed through the mathematical-logic approach of Bar-Hillel. For a language sequence, the language model computes the probability of the sequence. From a machine learning perspective, the language model models the probability distribution over sentences: it can be understood as a probability distribution model P that assigns a probability P(S) to every string S in the language. Language models are important in speech recognition, machine translation, syntactic analysis, handwriting recognition, and other application fields. In speech recognition, the input speech may be recognized as several candidate sentences, some with wrong words or grammar that do not read as human language; a language model is then used to output the plausibility probability of each candidate sentence. At the heart of a language model is computing the probability of a sentence, and obtaining this probability requires computing a series of conditional probabilities. These conditional probabilities are the parameters of the language model, and in the prior art they are derived mainly by neural network methods; that is, training a language model through a neural network requires inputting a sufficiently large number of speech data samples.
Various intelligent products supporting speech recognition have gradually reached every corner of users' work and life, such as intelligent robots and intelligent vehicle-mounted devices; such products provide more intelligent services through speech recognition. In practical applications these products must continuously update newly added entries from the application scenario into the decoding network, so that they can adapt in time to ever-changing scenarios. The decoding network is in fact a state network, and the speech recognition process is in fact a search for the path in the state network that best matches the speech; this search is called the decoding process.
It can thus be seen that the core of speech recognition is searching the decoding network for the decoding path of the speech signal and outputting the nodes strung together along the optimal path as the recognition result; the decoding method therefore directly affects the efficiency and accuracy of speech recognition. The decoding method in current end-to-end speech recognition uses a technique called the Beam Search Decoder, which is also commonly applied in OCR, text translation, speech recognition, and similar algorithmic applications. For each audio stream to be decoded, this decoding method maintains a set of decoding hypotheses of size B (each decoding hypothesis in the set contains the sequence of words output from the beginning up to the current step). In each forward pass, the output of every hypothesis is expanded over the whole vocabulary of size V, and a scoring matrix of total size B x V is computed, recording the log probability distribution of every hypothesis over the vocabulary V. Alongside this probability distribution matrix there is the score of each existing decoding hypothesis, called score in this application (a vector of dimension B, since there are B decoding hypotheses). The current score is added to the corresponding row of the B x V matrix to form a new B x V score matrix. The elements of this score matrix are sorted by value, the B highest-scoring elements are found, and their information is recorded, chiefly: 1. which decoding hypothesis each element comes from, i.e., which of the B sets; 2. which word in the vocabulary V each element corresponds to. Based on these two pieces of information, the existing B decoding hypotheses can be updated: according to which hypothesis each of the new B elements comes from and its corresponding new information, the decoding hypothesis is updated and written to the corresponding position of the hypothesis set, until all B updates are complete. This update process repeats until decoding reaches the last frame, at which point the corresponding decoding hypotheses are placed into a FINAL set of decoding paths, whose elements are the decoding paths of the hypotheses; the elements of a decoding path include the score, the beam, and the decoding hypothesis. When the elements in the FINAL decoding path set reach a preset number, the decoding process stops and the final decoding result is output. A neural-network-based end-to-end speech recognition system can decode well this way and output the corresponding recognition result. Its major problem, however, is that convenient domain customization and domain extension are difficult: decoding is limited to the speech domains already covered when the language model was trained by the neural network. For speech domains not covered in training, the decoding effect of speech recognition weakens considerably; that is, robustness is poor when recognizing speech from new or customized domains. For example, a language model trained on everyday-life corpora may have a poor recognition rate when used in professional engineering scenarios.
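For concreteness, the ranking-and-update step just described can be sketched in a few lines of numpy (an illustrative sketch only: the random distribution and all variable names are stand-ins, not the implementation of any particular system):

import numpy as np

np.random.seed(0)
B, V = 4, 10                       # beam size and vocabulary size
scores = np.zeros(B)               # current score of each decoding hypothesis
# Log probability distribution of every hypothesis over the vocabulary, [B, V].
log_probs = np.log(np.random.dirichlet(np.ones(V), size=B))

# Add each hypothesis score to its row, giving the new [B, V] score matrix.
new_scores = scores[:, None] + log_probs

# Rank all B*V candidates and keep the B highest-scoring ones.
flat = new_scores.ravel()
best = np.argsort(flat)[::-1][:B]
beam_ids, word_ids = np.unravel_index(best, (B, V))
for s, b, w in zip(flat[best], beam_ids, word_ids):
    print(f"score={s:.3f}: extend hypothesis {b} with vocabulary entry {w}")

Each printed triple corresponds to the two recorded pieces of information above: which hypothesis the element extends and which vocabulary word it appends.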
To improve the language model's performance in a new domain, the existing decoding method requires collecting speech data in the new domain and retraining the end-to-end neural-network language model from scratch, so its support for recognition content that must be customized quickly is very poor, and the existing decoding method cannot meet practical speech recognition requirements. These limitations of the prior art hinder the further optimization, popularization, and application of speech recognition technology.
Disclosure of Invention
To solve all or some of the problems in the prior art, the present application provides a decoding method supporting a domain-customized language model for end-to-end speech recognition, which enables an end-to-end language model to adapt quickly to a new domain. Another aspect of the present application provides a decoding apparatus using this decoding method, which can implement the decoding method of the present application, for use in an end-to-end speech recognition system.
An RNN-T-based end-to-end neural network generally consists of a Transcription (Encoder) network, a Prediction (Decoder) network, and a Joint network. In this application, the RNN-T end-to-end neural network adopts a hybrid convolution-and-Transformer structure. As shown in fig. 1, the application adopts an RNN-T architecture based on convolution and Transformer: the Transcription network is formed by stacking three basic modules, each composed of a three-layer Time Delay Neural Network (TDNN) and one Transformer layer. The Prediction network likewise comprises three basic modules, each comprising one layer of one-dimensional causal convolution and one Transformer layer. The terms and definitions used in this application are illustrative only and not limiting. Before explaining the application in further detail, the terms and terminology involved in its embodiments are explained below; these explanations apply throughout the embodiments of the application.
WFST (Weighted Finite-State Transducer): a weighted finite-state transducer used in large-scale speech recognition. A WFST defines a binary relation between sequences, the first being the input labels and the second the output labels. The same input may have different output paths, so each sequence pair carries a weight. A WFST is an abstract mathematical structure describing a mapping between an input sequence and an output sequence: it receives an input sequence and, through internal state transitions, produces an output sequence with a corresponding transition cost. Because WFSTs describe sequence mappings in a unified mathematical language, different WFSTs can be combined through the mathematical operations defined on them to express all kinds of complex sequence mappings. The WFST operations used in speech recognition mainly comprise composition plus some optimization operations: determinization, minimization, weight pushing, and so on. WFST technology is well known in the speech recognition field and can be generated directly by the openfst tool and the kaldi tool, so it is not described further here.
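As a toy illustration of the mapping a WFST defines (real decoding networks are built with the openfst/kaldi tools mentioned above; this hand-rolled Python structure and its labels are assumptions for illustration only), each arc carries an input label, an output label, and a weight, and walking the arcs yields an output sequence plus an accumulated cost:

# Toy WFST: arcs[state][input_label] -> (next_state, output_label, weight).
arcs = {
    0: {"a": (1, "A", 0.5)},
    1: {"b": (2, "B", 0.3)},
}
final_states = {2}

def transduce(inputs, start=0):
    """Feed an input sequence through the WFST; return (outputs, cost) or None."""
    state, outputs, cost = start, [], 0.0
    for sym in inputs:
        if sym not in arcs.get(state, {}):
            return None                       # no transition accepts this symbol
        state, out, weight = arcs[state][sym]
        outputs.append(out)
        cost += weight
    return (outputs, cost) if state in final_states else None

print(transduce(["a", "b"]))                  # (['A', 'B'], 0.8)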
Speech frame: feature extraction converts an input sequence of samples into a sequence of feature vectors, each vector representing one speech segment, called a speech frame (Frame).
A Decoder is one of the cores of a speech recognition system; its task is to find, based on the acoustics, the language model, and the dictionary, the word string with the maximum probability of producing the input signal.
Knowledge sources, i.e., the sources of knowledge a Decoder needs in order to decode the feature sequence of a speech signal into a recognition result. The WFST-based knowledge sources referred to herein include: the pronunciation dictionary, denoted L (input symbols: phonemes; output symbols: words), a collection of words and their pronunciations; and the Language Model (LM), denoted here by G, whose input symbols are the same as its output symbols and which is a knowledge representation of language structure (rules between words and sentences, e.g., grammar and usual word collocations). The language model probability P(W) represents the prior probability that a sequence W of speech units occurs in a speech signal.
The decoding network (Search Space), also called the search space, is a single network compiled by fusing the various knowledge sources with WFSTs. Decoding is finding the optimal output character sequence in the dynamic network space of the WFST structure; the decoder's job is to search out the optimal path for a given speech input.
The decoding method supporting a domain-customized language model provided by the invention comprises the following steps. S1: generate a first decoding network serving as the first-pass decoding search network, obtaining a first score set; generate a second decoding network from the custom-domain language model, serving as the language-model score query network, obtaining a second score set; the elements of the first score set are first scores and the elements of the second score set are second scores. S2: expand the beams while rescoring with the custom-domain language model, iterating the following operations: take the submatrix of the encoder expansion matrix corresponding to the current speech frame together with the decoder output matrix, input the spliced matrix into the output layer for a forward computation, and generate a first score matrix; add the first score set and the second score set to the corresponding positions of the first score matrix to obtain a second score matrix; based on the second score matrix, expand the beams and put them into the decoding hypothesis set; update the first scores and the second scores. S3: judge whether enough decoding hypotheses exist in the final decoding hypothesis set; if so, output all decoding hypotheses; if not, return to step S2 and continue until there are enough decoding hypotheses.
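The S2/S3 control flow amounts to a simple loop; the sketch below is structural only, with expand_step a hypothetical stand-in for the whole S2 body:

def beam_search_decode(expand_step, enough=4):
    """Sketch of steps S2-S3: keep expanding beams (and rescoring with the
    custom-domain LM inside expand_step) until the final hypothesis set is full."""
    final_hypotheses = []
    while len(final_hypotheses) < enough:      # S3: enough decoding hypotheses?
        final_hypotheses += expand_step()      # S2: one expansion-and-rescore pass
    return final_hypotheses

# Toy stand-in for S2 that "finishes" one hypothesis per pass.
counter = iter(range(100))
print(beam_search_decode(lambda: [f"hyp_{next(counter)}"]))
# ['hyp_0', 'hyp_1', 'hyp_2', 'hyp_3']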
In general, initializing the speech signal before step S2 comprises: extracting features from the input speech sequence, denoting the number of frames of the speech sequence num_frames and the dimension of each speech frame dim_feature, and obtaining a speech feature matrix with these features as its elements;
inputting the speech feature matrix into the encoder to obtain, after encoding, a matrix [num_frames, enc_dim], which is expanded into an encoder expansion matrix [B, num_frames, enc_dim], where B represents the size of the whole beam; the size of the preset vocabulary is denoted V.
Initializing the speech signal further comprises: initializing a set of B decoded words, each word representing the sentence start symbol, and obtaining each word's vector representation; inputting it into the decoder and initializing the decoder output matrix.
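A numpy sketch of this initialization, with random matrices standing in for the actual encoder and decoder outputs (shapes and names follow the text; the networks themselves are not modeled and all concrete values are assumptions):

import numpy as np

num_frames, dim_feature, enc_dim, dec_dim, B = 50, 40, 8, 6, 4
rng = np.random.default_rng(0)

features = rng.standard_normal((num_frames, dim_feature))   # speech feature matrix
encoded = rng.standard_normal((num_frames, enc_dim))        # stand-in encoder output
enc_expanded = np.broadcast_to(encoded, (B, num_frames, enc_dim)).copy()

start_tokens = np.zeros(B, dtype=int)                       # B copies of the <s> symbol
decoder_out = rng.standard_normal((B, dec_dim))             # stand-in initial decoder output
print(features.shape, enc_expanded.shape, decoder_out.shape)  # (50, 40) (4, 50, 8) (4, 6)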
The first score set and the second score set are vectors of dimension B. The first score matrix and the second score matrix are matrices of size [B, V].
At the beginning of step S2, initializing the decoding environment comprises: recording a set t_idx of size B holding the decoding timestamp of each current beam, where each element t_idx_i is the timestamp of the current speech frame of the i-th beam; initializing the first score set, denoted in the embodiments { score_1, score_2, ..., score_B }, of size B; initializing the second score set, denoted in the embodiments lm_score = { lm_score_1, lm_score_2, ..., lm_score_B }, of size B; and recording the set of Cartesian-product states of the first and second decoding networks corresponding to each current beam, denoted the Cartesian product score pair set { Cartesian_1, Cartesian_2, ..., Cartesian_B }, of size B, whose elements are sets of Cartesian product pairs of the states of the first and second decoding networks together with the corresponding transition cost, written (< LState, LMState >, cost), comprising the Cartesian product pair < LState, LMState > and the transition cost cost. Here "i" denotes any number between the initial value and B.
The initial value of t_idx_i is set to 1.
Each first score is initialized to 0, and each second score is initialized to 0.
The initial value of the state LState of the first decoding network is set to startL and the initial value of the state LMState of the second decoding network is set to startLM, where start denotes the initial state of the corresponding decoding network; the initial cost is set to 0.
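In Python terms, the initialized decoding environment might look as follows (a minimal sketch; representing a Cartesian product pair as a tuple of integer state ids, with 0 standing in for startL and startLM, is an assumption):

B = 4
start_L, start_LM = 0, 0          # stand-ins for startL / startLM

t_idx = [1] * B                   # timestamp of the current frame per beam
score = [0.0] * B                 # first score set
lm_score = [0.0] * B              # second score set
# One Cartesian product score pair set per beam: {(<LState, LMState>, cost)}.
cartesian = [{((start_L, start_LM), 0.0)} for _ in range(B)]
print(t_idx, score[0], cartesian[0])   # [1, 1, 1, 1] 0.0 {((0, 0), 0.0)}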
In step S2, the "spliced matrix" is the matrix obtained by splicing the submatrix of the encoder expansion matrix with the decoder output matrix along their remaining dimensions while keeping B unchanged.
Initializing the decoding environment further comprises: selecting the submatrix of the encoder expansion matrix corresponding to the initial t_idx positions and the corresponding decoder output matrix, splicing them, and inputting the resulting matrix into the output layer to generate the initial first score matrix.
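The splice and the two score matrices of step S2 can be sketched with numpy as follows (a random linear layer with log-softmax stands in for the real output layer; the shapes follow the text, everything else is assumed for illustration):

import numpy as np

B, num_frames, enc_dim, dec_dim, V = 3, 50, 8, 6, 20
rng = np.random.default_rng(0)

enc_expanded = rng.standard_normal((B, num_frames, enc_dim))  # encoder expansion matrix
decoder_out = rng.standard_normal((B, dec_dim))               # decoder output matrix
W = rng.standard_normal((enc_dim + dec_dim, V))               # stand-in output layer

t_idx = np.array([1, 4, 7])                          # current frame per beam (1-based)
sub = enc_expanded[np.arange(B), t_idx - 1]          # [B, enc_dim] submatrix
spliced = np.concatenate([sub, decoder_out], axis=1) # [B, enc_dim + dec_dim], B unchanged

logits = spliced @ W                                 # forward pass through the output layer
first = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax, [B, V]

score, lm_score = np.zeros(B), np.zeros(B)           # first and second score sets
second = first + score[:, None] + lm_score[:, None]  # second score matrix, [B, V]
print(second.shape)                                  # (3, 20)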
In step S2, the iteration traverses i from the initial value of t_idx_i to B.
As a further refinement, when t_idx_i equals num_frames, both the first score and the second score of the current beam are set to minus infinity. Setting them to minus infinity ensures that the current beam cannot be expanded further.
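A short sketch of this masking (assuming numpy arrays for the score sets; the concrete values are illustrative):

import numpy as np

num_frames = 50
score = np.array([-1.2, -0.7, -2.1])
lm_score = np.array([-0.3, -0.5, -0.1])
t_idx = np.array([50, 12, 50])        # beams 0 and 2 have reached the last frame

done = t_idx == num_frames
score[done] = -np.inf                 # a beam with score -inf can never rank
lm_score[done] = -np.inf              # among the top B, so it is frozen
print(score, lm_score)                # [-inf -0.7 -inf] [-inf -0.5 -inf]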
In step S2, "based on the second score matrix" means sorting the elements of the second score matrix by value and taking the B best-scoring valid score values; traversing i from the initial value to B, recording the word corresponding to each valid score value, denoted idx_i, and which beam idx_i comes from; judging whether idx_i corresponds to the null character; and expanding the beam according to the judgment.
If idx_i corresponds to a non-null character, a first process is performed: update the Cartesian product score pair set corresponding to the current beam; update the second score of each beam to the best score among the new costs in the updated Cartesian product score pair set; and update the i-th row of the decoder output matrix with idx_i. Then a second process is performed: take the first score of the current beam plus the element of the first score matrix corresponding to the current i-th valid score value as the updated first score. If idx_i corresponds to the null character, advance the speech frame to the next position by setting t_idx_i = t_idx_i + 1, take the Cartesian product score pair set of the current beam directly as the updated set, and take the second score of the current beam directly as the updated second score; then perform the second process directly.
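The branch on idx_i might be sketched as follows (illustrative only: BLANK, the tuple layout, and the sign convention used to derive the second score from the costs are assumptions, and update_cartesian is a stub for the first process detailed below):

BLANK = 0                                     # assumed index of the null character

def update_cartesian(pairs, idx):             # stub for the first process (see below)
    return pairs

def expand(cands, t_idx, score, lm_score, cart, first):
    """One expansion pass over the top-B candidates (i, b_i, idx_i): the slot being
    filled, the beam it came from, and the chosen vocabulary index."""
    out = []
    for i, b_i, idx_i in cands:
        if idx_i != BLANK:                    # non-null character: run the first process
            new_cart = update_cartesian(cart[b_i], idx_i)
            new_lm = max(-c for _, c in new_cart)   # sign convention is an assumption
            new_t = t_idx[b_i]
            # (row i of the decoder output matrix would be recomputed with idx_i here)
        else:                                 # null character: advance one speech frame
            new_cart, new_lm, new_t = cart[b_i], lm_score[b_i], t_idx[b_i] + 1
        new_score = score[b_i] + first[b_i][idx_i]  # the second process
        out.append((new_t, new_score, new_lm, new_cart))
    return out

cart0 = {((0, 0), 0.0)}
print(expand([(0, 0, 2), (1, 0, BLANK)], t_idx=[1], score=[0.0],
             lm_score=[0.0], cart=[cart0], first=[[-1.0, -2.0, -0.5]]))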
In the first process, it is confirmed that the state LState of the first decoding network in the Cartesian product score pair set of each beam is a final state of the first decoding network. A final state guarantees that a whole word has been output; otherwise the word has not yet been fully output.
In the first process, updating the Cartesian product score pair set corresponding to the current beam comprises, before all Cartesian product pairs and their transition costs have been updated, traversing i as follows. First step: from the Cartesian product score pair set of the current beam's state, find all Cartesian product pairs and their transition costs, and input the current character idx_i. Second step: judge whether the current state LState accepts idx_i and jumps to a state LState_j (1 <= j <= m). If not, i.e., no corresponding state is found (m is 0), skip the current (< LState, LMState >, cost) and return to the first step. If so, continue judging: if the arc from the current state LState to LState_j in the first decoding network outputs a non-empty word W, find the state LMState of the second decoding network and feed it W, obtaining a new state LMState_j, and update the cost to cost plus the additional transition cost Δ incurred when the language model jumps from state LMState to state LMState_j, i.e., cost' = cost + Δ; replace the original (< LState, LMState >, cost) with the new (< LState_j, LMState_j >, cost') to update the Cartesian product score pair set of the current beam's state. If no non-empty word is output from LState to LState_j, replace the old (< LState, LMState >, cost) with the new (< LState_j, LMState >, cost) and update the Cartesian product score pair set of the state corresponding to the current beam b_i.
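A runnable sketch of this first process, with small Python dicts standing in for the arcs of the first decoding network (L.fst) and the second decoding network (G.fst); the toy states, labels, and costs are assumptions, and pairs for which the language model has no matching arc are simply dropped here (backoff arcs and similar details are omitted):

# Toy L.fst: L_arcs[state][char] -> list of (next_state, output_word or None).
L_arcs = {
    0: {"s": [(1, None)]},
    1: {"r": [(0, "sr")]},        # emitting the word "sr" returns to state 0
}
# Toy G.fst: G_arcs[state][word] -> (next_state, delta_cost).
G_arcs = {
    0: {"sr": (1, 0.7)},
    1: {"sr": (1, 1.2)},
}

def first_process(cartesian, char):
    """Advance every (<LState, LMState>, cost) pair on the input character."""
    updated = set()
    for (l_state, lm_state), cost in cartesian:
        for l_next, word in L_arcs.get(l_state, {}).get(char, []):
            if word is None:                     # no word finished yet: LM unchanged
                updated.add(((l_next, lm_state), cost))
            elif word in G_arcs.get(lm_state, {}):
                lm_next, delta = G_arcs[lm_state][word]
                updated.add(((l_next, lm_next), cost + delta))
    return updated

cart = {((0, 0), 0.0)}
cart = first_process(cart, "s")      # mid-word: {((1, 0), 0.0)}
cart = first_process(cart, "r")      # word "sr" emitted: {((0, 1), 0.7)}
print(cart)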
The t_idx set is updated, and when t_idx_i equals num_frames the current beam is output into the final decoding hypothesis set.
The first decoding network is a WFST network generated from a word-to-character dictionary: it is a mapping from words (word) to individual characters (character), i.e., it splits a word into a character sequence.
Another aspect of the present invention provides a decoding apparatus supporting a domain-customized language model, comprising an input device and a decoder communicatively connected to each other. The input device is used for acquiring a speech signal. The decoder comprises a processor and a memory, the memory storing a computer program; the processor comprises an extraction module, a decoding network module, and a decoding module. The extraction module extracts the features of the speech sequence; the decoding network module generates decoding networks based on the knowledge sources; the decoding module searches for the optimal path in the decoding network. The processor performs the decoding method of the present invention according to the computer program instructions.
Compared with the prior art, the invention has the main beneficial effects that:
1. The decoding method supporting a domain-customized language model introduces the custom-domain language model, folds the second score into the beam expansion process, and rescores with the custom-domain language model based on the second score matrix. This solves the problem traditional end-to-end recognition has with quick recognition in new custom domains: end-to-end speech recognition can be extended rapidly to new domains, its effect and extensibility improve, its range of application widens greatly, it can be customized and extended quickly according to customers' needs, and the user experience improves.
2. The decoding apparatus supporting a domain-customized language model adopts the above decoding method and therefore has the corresponding advantages; it needs only limited resources, which favors the further popularization of speech recognition systems.
Drawings
FIG. 1 is a schematic diagram of the convolution- and Transformer-based RNN-T architecture employed by the present invention.
Fig. 2 is a schematic diagram of the decoding method supporting a domain-customized language model according to the first embodiment of the present invention.
Fig. 3 is a schematic diagram of the decoding apparatus supporting a domain-customized language model according to the first embodiment of the present invention.
Fig. 4 is a schematic diagram of the overall process of the second embodiment of the present invention.
Fig. 5 is a schematic diagram of the steps involved in the first process according to the third embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is intended to be clear and complete; evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
The foregoing and/or additional aspects and advantages of the present invention will become apparent from the following description of the embodiments taken with the accompanying drawings. In the figures, parts with the same structure or function are denoted by the same reference numerals, and for clarity not every such part is labeled in every figure. Certain conventional English terms or letters are used for clarity of description only; they are exemplary rather than limiting, and the scope of the invention should not be limited by their possible Chinese translations or by the specific letters chosen.
The following embodiments describe their operations in a particular order so that the details of the embodiments can be thoroughly understood, but this order does not necessarily correspond one-to-one with the decoding method of the invention, nor does it limit the scope of the invention.
It should be noted that the flowcharts and block diagrams in the figures illustrate operational processes that may be implemented by methods according to the embodiments of the present invention. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending upon the objectives of the steps involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Example 1
In a first embodiment of the present invention, the end-to-end speech recognition encoder and decoder adopt, as shown in fig. 1, an RNN-T architecture based on convolution and Transformer. In this embodiment, the decoding method supporting a domain-customized language model, based on beam search decoding and as shown in fig. 2, comprises: S1, generating a first decoding network serving as the first-pass decoding search network, and obtaining a first score set; generating a second decoding network from the custom-domain language model, serving as the language-model score query network, and obtaining a second score set. S2, expanding the beams while rescoring with the custom-domain language model, iterating the following operations: taking the submatrix of the encoder expansion matrix corresponding to the current speech frame together with the decoder output matrix, inputting the spliced matrix into the output layer for a forward computation, and generating a first score matrix; adding the first score set and the second score set to the corresponding positions of the first score matrix to obtain a second score matrix; based on the second score matrix, expanding the beams and putting them into the decoding hypothesis set; updating the first scores and the second scores. S3, judging whether enough decoding hypotheses exist in the final decoding hypothesis set; if so, outputting all decoding hypotheses; if not, returning to step S2 and continuing.
As shown in fig. 3, the decoding apparatus of this embodiment comprises an input device 1 and a decoder 2 communicatively connected to each other; in this embodiment a speech sequence is acquired through the input device 1. The decoder 2 comprises a processor 21 and a memory 22, the memory 22 storing a computer program. The processor comprises an extraction module, a decoding network module, and a decoding module: the extraction module extracts the features of the speech sequence; the decoding network module generates decoding networks based on the knowledge sources; the decoding module searches for the optimal path in the decoding network. In this embodiment the decoding method of the invention is performed by the processor 21 running said computer program instructions. It should be understood that the foregoing description refers to the drawings for ease of understanding; in practice, the decoder 2 may be a single device, or a system formed by several devices in different places communicatively connected to each other, and the functional modules (the extraction module, the decoding network module, and the decoding module) may reside in one or more memories of one device or be distributed over several devices of the system, without limitation. In this embodiment the decoding method of the invention is implemented by the processor 21 running the computer program stored in the memory 22; in practice it may also be implemented by the computer program in combination with several different computer programs, or as a subroutine of speech recognition software combined with the other functions of a speech recognition system. In this embodiment the output result is stored in the memory 22 in the form of computer code; in practice it may also be displayed as text on a computer display or other display device, or recorded as a text file, without limitation here.
Example two
The second embodiment differs from the first in that, before step S2, it further initializes the speech signal to obtain the encoder expansion matrix and the decoder output matrix. As shown in fig. 4, in this embodiment the speech signal input to the input device is first processed by the aforementioned extraction module to obtain the features of the speech sequence; the number of frames of the speech sequence is num_frames, the dimension of each speech frame is dim_feature, and a speech feature matrix with these features as elements is obtained. The speech feature matrix is input into the encoder of the end-to-end speech recognition system to obtain a transformed matrix [num_frames, enc_dim], which is expanded into the encoder expansion matrix [B, num_frames, enc_dim]. A set of B decoded words is initialized, where B represents the size of the whole beam. Each word is marked < s >, representing the sentence start symbol; its vector representation is obtained and input into the decoder, giving a decoder output matrix of size [B, dec_dim] as the initial decoder output matrix. The size of the preset vocabulary is V.
The first decoding network generated by the decoding network module in this embodiment is a WFST network generated from the word-to-character dictionary, denoted L.fst, which serves as the first-pass decoding search network. In the WFST framework, the speech recognition problem is treated as a transduction process that converts an input speech signal into a word sequence. Each model in the speech recognition system is interpreted as a WFST, and the negative log probability of the model's score is taken as the WFST weight. The WFST framework in this embodiment is generated directly by the openfst tool or the kaldi tool and is not described further here. Each entry of the knowledge source L has the following form, the file being terminated by EOF:
<word> <character_1> <character_2> ... <character_n>
EOF
The first field is a word, in one-to-one correspondence with the vocabulary of the language model; the following fields are its characters, forming the word-to-character mapping.
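In Python terms, the knowledge source L is simply a word-to-character-sequence map; the entries below are illustrative only, not taken from the patent:

# Illustrative word-to-character lexicon entries (the real file is one word per
# line followed by its characters, terminated by EOF).
lexicon = {
    "speech": ["s", "p", "e", "e", "c", "h"],
    "recognition": ["r", "e", "c", "o", "g", "n", "i", "t", "i", "o", "n"],
}
print(" ".join(lexicon["speech"]))   # s p e e c h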
In this embodiment, step S2 begins by initializing the decoding environment, including the size B of the beams to be decoded. The beam decoding search in this embodiment uses Viterbi beam search: when the decoder searches for the best path in the decoding network, only beam-width many nodes are retained at each state of the extended paths. Unnecessary paths are thus pruned; because no full-path search is required, memory consumption is reduced and decoding efficiency improved.
In this embodiment, initializing the decoding environment specifically records a set t_idx of size B holding the decoding timestamp of each current beam; each element t_idx_i is the timestamp of the current speech frame of the i-th beam and has initial value 1. The first score set is initialized as score = { score_1, score_2, ..., score_B } of size B, each score initialized to 0; the initial language model score set, i.e., the second score set, is denoted lm_score = { lm_score_1, lm_score_2, ..., lm_score_B } of size B, each score initialized to 0. The submatrix [B, enc_dim] of the encoder expansion matrix corresponding to the initial t_idx positions and the decoder output matrix [B, dec_dim] are selected and, keeping B unchanged, spliced into a matrix [B, enc_dim + dec_dim], which is input into the output layer to obtain a score matrix of size [B, V], denoted T-Matrix; this initializes the first score matrix. The first score set and the second score set are vectors of dimension B. Hereinafter, "i" denotes any number from the initial value to B.
The set of Cartesian-product states of the first and second decoding networks corresponding to each current beam is recorded as { Cartesian_1, Cartesian_2, ..., Cartesian_B } of size B, where each element Cartesian_i is itself a set, hereinafter the Cartesian product score pair set: the set of Cartesian product pairs of states of the first and second decoding networks together with their transition costs, written (< LState, LMState >, cost), comprising the Cartesian product pair < LState, LMState > and the transition cost cost. The initial value of the first decoding network's state LState is startL and that of the second decoding network's state LMState is startLM, where start denotes the initial state of the corresponding network; cost is initialized to 0. That is, Cartesian_i = { (< LState = startL, LMState = startLM >, 0) }.
The second decoding network generated in this embodiment is a G network generated from the custom-domain language model, denoted G.fst.
In this embodiment, the iteration in step S2 traverses i from the initial value of t_idx_i to B.
Starting from the initial t_idx positions, the submatrix of the [B, num_frames, enc_dim] encoder expansion matrix at the corresponding frame positions is selected according to the values in t_idx: if the current beam i corresponds to t_idx_i, the submatrix at position [i, t_idx_i, enc_dim] is selected, until all B submatrices have been selected, giving an encoder submatrix of size [B, enc_dim]. If idx_i corresponds to a non-null character, the corresponding decoder output is updated; if idx_i corresponds to the null character, the frame advances to the next position with t_idx_i = t_idx_i + 1; once all B outputs are updated, a [B, dec_dim] decoder output matrix is obtained. The two matrices are spliced into a matrix [B, enc_dim + dec_dim] and input into the final output layer, giving the first score matrix T-Matrix of size [B, V]. This completes the traversal from 1 to B.
In this embodiment, in step S2, "based on the second score matrix" means sorting the elements of the second score matrix by value and taking the B best-scoring valid score values; traversing i from the initial value to B, recording the word corresponding to each valid score value, denoted idx_i, and which beam idx_i comes from; judging whether idx_i corresponds to the null character; and expanding the beam according to the judgment. In this embodiment, the first score set score, the second score set lm_score, and the first score matrix T-Matrix are added position-wise: partial_score = scores + lm_scores + T-Matrix, a score matrix of size [B, V], which is the second score matrix. Its elements are sorted by value to obtain the B best score values, the valid score values. The B sorted valid scores are recorded together with which beam each came from and the corresponding word idx (the idx of a word is its position in the vocabulary of V words, 0 <= idx <= V-1), giving B tuples (score_1, b_1, idx_1), ..., (score_B, b_B, idx_B). For example, for [1.2, 2.2, 3.2, 3.5; 1.8, 2.3, 3.3, 4.5; 1.9, 2.7, 3.6, 4.9], the highest scores are 4.9, 4.5, and 3.6, coming from beams 3, 2, and 3 respectively; their column coordinates, counting from 0, are 3, 3, and 2.
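This worked example can be checked directly with numpy (the values are distinct, so the tie-breaking order of argsort does not matter):

import numpy as np

partial = np.array([[1.2, 2.2, 3.2, 3.5],
                    [1.8, 2.3, 3.3, 4.5],
                    [1.9, 2.7, 3.6, 4.9]])   # [B, V] second score matrix, B=3, V=4

best = np.argsort(partial.ravel())[::-1][:3]
beams, cols = np.unravel_index(best, partial.shape)
print(partial.ravel()[best])   # [4.9 4.5 3.6]
print(beams + 1)               # beams 3, 2, 3 (1-based, as in the text)
print(cols)                    # columns 3, 3, 2 (0-based)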
For each idx_i, i is traversed from 1 to B to perform the expansion, after which the second process is performed. Performing the second process comprises: denoting the first score of the current beam score_i, the sum of score_i and the element of the first score matrix corresponding to the current i-th valid score value is taken as the updated score_i.
After the iteration traversing i from 1 to B is complete, when t_idx_i equals num_frames the current beam is output into the final decoding hypothesis set, i.e., the decoding result of the current beam is output into the final output result, completing the decoding of the last speech frame. In a preferred implementation of this embodiment, when t_idx_i equals num_frames, both the first score and the second score of the current beam are set to minus infinity; setting them to minus infinity ensures that the current beam cannot be expanded further.
Example III
In the third embodiment, as shown in fig. 5, the beam expansion process first judges whether idx_i is the null character and expands the beam according to the result. If idx_i corresponds to the null character, the speech frame advances to the next position with t_idx_i = t_idx_i + 1, the Cartesian product score pair set of the current beam is taken directly as the updated set, and the second score of the current beam directly as the updated second score; the second process is then performed directly. If idx_i corresponds to a non-null character, the first process is performed.
The first process is now described in detail. First step: from the Cartesian product score pair set Cartesian_{b_i} of the state corresponding to the current beam b_i, take all Cartesian product pairs and their transition costs, and input the current character idx_i. Judge whether every (< LState, LMState >, cost) in Cartesian_{b_i} has completed its update; one of two operations follows according to the result. First, if not all (< LState, LMState >, cost) in Cartesian_{b_i} have been updated, perform the second step: judge whether the current state LState accepts idx_i and jumps to a state LState_j (1 <= j <= m, with { LState_1, LState_2, ..., LState_m } the states reachable on idx_i). If not, i.e., no corresponding state is found (m is 0), skip the current (< LState, LMState >, cost) and return to the first step. If so, continue judging: if the arc in L.fst from the current state LState to LState_j outputs a non-empty word W, find the state LMState of G.fst and feed it W, obtaining a new state LMState_j, and update the cost to cost plus the additional transition cost Δ incurred when the language model jumps from state LMState to state LMState_j, i.e., cost' = cost + Δ; replace the original (< LState, LMState >, cost) with the new (< LState_j, LMState_j >, cost') to update the Cartesian product score pair set of the state corresponding to the current beam b_i, and return to the first step. If no non-empty word is output from LState to LState_j, replace the original (< LState, LMState >, cost) with the new (< LState_j, LMState >, cost) to update the Cartesian product score pair set of the state corresponding to the current beam b_i, and return to the first step. Second, if all (< LState, LMState >, cost) in Cartesian_{b_i} have completed updating, go directly to the third step.
Third step: the second score lm_score of each beam is updated to the best score among the new costs in NewCartesian_i. It is confirmed that the state LState of the first decoding network in each beam's Cartesian product score pair set is a final state of the first decoding network; a final state guarantees that a whole word has been output, otherwise the word has not yet been fully output. The i-th row of the [B, dec_dim] decoder output matrix is updated with idx_i.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by a program stored in a memory which, when executed, performs one of the steps of the method embodiments or a combination thereof. The functional modules in the embodiments of the present invention may be integrated in one processor, may exist separately and physically, or two or more of them may be integrated in one processor. The functional modules may be implemented in the form of hardware or in the form of software functional modules; when implemented as software functional modules and sold or used as a stand-alone product, the integrated modules may also be stored in a memory.
The memory of the present invention may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present invention have been shown and described above, the embodiments are illustrative and not to be construed as limiting the invention; those of ordinary skill in the art may make variations, modifications, substitutions, and alterations to them within the scope of the invention.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing detailed description of the invention has been presented in terms of a specific embodiment for illustrating the principles of operation and structure of the invention, and the above description of the embodiment is provided for the purpose of facilitating understanding of the principles and concepts of the invention and is not intended to limit the scope of the invention to the details of the embodiment. It should be noted that it will be apparent to those skilled in the art that various improvements and modifications can be made to the present invention without departing from the principles of the invention, and such improvements and modifications fall within the scope of the appended claims.

Claims (15)

1. A decoding method supporting a domain-customized language model, characterized in that it is based on beam search decoding and comprises:
S1, generating a first decoding network serving as the first-pass decoding search network, and obtaining a first score set; generating a second decoding network from the custom-domain language model, serving as the language-model score query network, and obtaining a second score set; the elements of the first score set being first scores and the elements of the second score set being second scores;
S2, expanding the beams while rescoring with the custom-domain language model, iterating the following operations:
taking the submatrix of the encoder expansion matrix corresponding to the current speech frame together with the decoder output matrix, inputting the spliced matrix into the output layer for a forward computation, and generating a first score matrix; adding the first score set and the second score set to the corresponding positions of the first score matrix to obtain a second score matrix;
based on the second score matrix, expanding the beams and putting the expanded beams into a decoding hypothesis set; and updating the first scores and the second scores;
S3, judging whether enough decoding hypotheses exist in the final decoding hypothesis set; if so, outputting all decoding hypotheses; if not, returning to step S2 and continuing until there are enough decoding hypotheses;
wherein initializing the speech signal before step S2 comprises: extracting features from the input speech sequence, denoting the number of frames of the speech sequence num_frames and the dimension of each speech frame dim_feature, and obtaining a speech feature matrix with these as its elements;
inputting the speech feature matrix into the encoder to obtain, after encoding, a matrix [num_frames, enc_dim], which is expanded into an encoder expansion matrix [B, num_frames, enc_dim], where B represents the size of the whole beam; the size of the preset vocabulary being denoted V;
initializing the decoding environment at the beginning of step S2 comprises:
recording a set t_idx of size B holding the decoding timestamp of each current beam, where each element t_idx_i is the timestamp of the current speech frame of the i-th beam;
initializing the first score set, of size B; initializing the second score set, of size B;
recording the set of Cartesian-product states of the first and second decoding networks corresponding to each current beam, denoted the Cartesian product score pair set, of size B, whose elements are Cartesian product pairs of states of the first and second decoding networks together with the corresponding transition cost, written (< LState, LMState >, cost), comprising the Cartesian product pair < LState, LMState > and the transition cost;
in step S2, the iteration traverses i from the initial value of t_idx_i to B;
in step S2, "based on the second score matrix" means sorting the elements of the second score matrix by value and taking the B best-scoring valid score values; traversing i from the initial value to B, recording the word corresponding to each valid score value, denoted idx_i, and which beam idx_i comes from; judging whether idx_i corresponds to the null character; and expanding the beam according to the judgment;
if idx_i corresponds to a non-null character, a first process is performed: updating the Cartesian product score pair set corresponding to the current beam; updating the second score of each beam to the best score among the new costs in the updated Cartesian product score pair set; and updating the i-th row of the decoder output matrix with idx_i; then a second process is performed: taking the first score of the current beam plus the element of the first score matrix corresponding to the current i-th valid score value as the updated first score;
if idx_i corresponds to the null character, advancing the speech frame to the next position by setting t_idx_i = t_idx_i + 1, taking the Cartesian product score pair set of the current beam directly as the updated Cartesian product score pair set, and taking the second score of the current beam directly as the updated second score; then performing the second process directly.
2. The method for decoding a domain-specific language model of claim 1, wherein: the initializing the voice signal further comprises: initializing a set of decoded B words; each word represents a sentence initial symbol, and the size of each word is obtained as a vector representation;
and inputting the decoder, and initializing the output matrix of the decoder.
3. The method for decoding a domain-specific language model of claim 1, wherein: the first set of scores and the second set of scores are vectors of dimension B.
4. The method for decoding a domain-specific language model of claim 1, wherein: the first scoring matrix and the second scoring matrix are matrices of size [ B, V ].
5. The method for decoding a domain-specific language model of claim 1, wherein: the initial value of t_idx_i is set to 1.
6. The method for decoding a domain-specific language model of claim 1, wherein: each first score initial value is set to 0; each of the second score initial values is set to 0.
7. The method for decoding a domain-specific language model of claim 1, wherein: the initial value of the state LSstate of the first decoding network is set as startL, the initial value of the state LMState of the second decoding network is set as startLM, and the start represents the initial states of the corresponding first decoding network and second decoding network; the cost initial value is set to 0.
8. The method for decoding a domain-specific language model of claim 1, wherein: in step S2, the "spliced matrix" is the matrix obtained by performing a splicing operation, with B kept unchanged, on the sub-matrix of the encoder expansion matrix and the corresponding elements of the decoder output matrix.
9. The method for decoding a domain-specific language model of claim 8, wherein: the initializing a decoding environment further comprises: selecting the sub-matrix of the encoder expansion matrix corresponding to the initial t_idx position together with the corresponding decoder output matrix, splicing them, and inputting the resulting matrix into the output layer to generate the initial first score matrix.
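The splicing of claims 8 and 9 can be sketched in a few lines of NumPy; the array shapes, the 0-based frame indexing, and the linear output layer are assumptions, since the claims do not fix them.

```python
import numpy as np

def spliced_matrix(encoder_expanded, decoder_out, t_idx):
    """encoder_expanded: [B, T, D_enc]; decoder_out: [B, D_dec];
    t_idx: length-B array of speech-frame indices, one per bundle
    (assumed 0-based here)."""
    B = decoder_out.shape[0]
    enc_sub = encoder_expanded[np.arange(B), t_idx]         # [B, D_enc]
    # splice along the feature axis; B stays unchanged (claim 8)
    return np.concatenate([enc_sub, decoder_out], axis=-1)  # [B, D_enc + D_dec]

def initial_first_scores(spliced, W_out, b_out):
    """Claim 9: feed the spliced matrix through a (hypothetical) linear
    output layer to obtain the initial first score matrix of size [B, V]."""
    return spliced @ W_out + b_out  # W_out: [D_enc + D_dec, V]; b_out: [V]
```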
10. The method for decoding a domain-specific language model of claim 1, wherein: when t_idx_i equals num_frames, decoding of the current bundle is complete, and both its first score and second score are set to minus infinity.
11. The method for decoding a domain-specific language model of claim 1, wherein: in the first process, it is confirmed that the state LState of the first decoding network in the Cartesian product score pair set corresponding to each bundle is a final state of the first decoding network.
12. The method for decoding a domain-specific language model of claim 1, wherein: updating the Cartesian product score pair set corresponding to the current bundle in the first process comprises traversing all the Cartesian product pairs and their corresponding jump costs prior to the update:
the first step: finding, from the Cartesian product score pair set of the state corresponding to the current bundle, all the Cartesian product pairs and their corresponding jump costs, with the current character idx_i as input;
the second step: judging whether the current state LState accepts idx_i and jumps to a state LState_j (1 <= j <= m); if not, i.e. no such state is found and m is 0, skipping the current (<LState, LMState>, cost) and returning to the first step; if yes, continuing to judge: if a non-empty word W is output on the edge of the first decoding network on the transition from the current state LState to LState_j, finding the new state LMState_j obtained when the second decoding network in state LMState accepts W, and updating the cost to the original cost plus the transition cost delta incurred when the language model transitions from state LMState to state LMState_j, i.e. cost' = cost + delta; replacing the original (<LState, LMState>, cost) with the new (<LState_j, LMState_j>, cost') to update the Cartesian product score pair set of the state corresponding to the current bundle; if no non-empty word is output on the transition from LState to LState_j, replacing the old (<LState, LMState>, cost) with the new (<LState_j, LMState>, cost) to update the Cartesian product score pair set of the state corresponding to the current bundle b_i.
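Here is a minimal sketch of this claim-12 update, under the assumption that both decoding networks are stored as plain transition tables; the table layout, the EPS marker for word-free edges, and the rule of keeping the better cost when two paths merge on the same state pair are all assumptions, not claim text.

```python
EPS = None  # marks an edge of the first decoding network with no output word

def update_pair_set(pairs, idx_i, first_net, second_net):
    """pairs: {(LState, LMState): cost}.
    first_net:  {(LState, idx): [(LState_j, output word W or EPS), ...]}
    second_net: {(LMState, W): (LMState_j, delta)} -- LM transition cost."""
    new_pairs = {}
    for (lstate, lmstate), cost in pairs.items():
        # second step: does LState accept idx_i? If m == 0, skip this pair.
        for lstate_j, word in first_net.get((lstate, idx_i), []):
            if word is not EPS:
                # a non-empty word W is output: advance the language model
                if (lmstate, word) not in second_net:
                    continue  # the second network does not accept W
                lmstate_j, delta = second_net[(lmstate, word)]
                key, new_cost = (lstate_j, lmstate_j), cost + delta  # cost' = cost + delta
            else:
                # no word output: only the first network's state advances
                key, new_cost = (lstate_j, lmstate), cost
            # when several paths reach the same state pair, keep the better
            # cost (merging rule assumed; larger treated as better, matching
            # the "maximum score" wording of claim 1)
            if key not in new_pairs or new_cost > new_pairs[key]:
                new_pairs[key] = new_cost
    return new_pairs
```

In the sketch after claim 1, update_pairs would then be bound as, for example, lambda p, i: update_pair_set(p, i, first_net, second_net).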
13. The method for decoding a domain-specific language model of claim 1, wherein: the set of t_idx values is updated, and when t_idx_i equals num_frames, the current bundle is output to the final set of decoding hypotheses.
14. A method of decoding a domain-specific language model according to any one of claims 1-13, wherein: the first decoding network is a WFST network generated by a decoder from a word-to-word dictionary.
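As a toy illustration of claim 14, the first decoding network could take the transition-table form used in the sketches above; the two hypothetical entries assume the dictionary maps decoder units such as characters to words, which the claim itself does not spell out.

```python
EPS = None
# Toy lexicon WFST as a transition table: units 3 then 4 spell word 10;
# unit 5 alone spells word 11. States are integers; 0 is the start state.
first_net = {
    (0, 3): [(1, EPS)],  # consume unit 3, no word output yet
    (1, 4): [(0, 10)],   # consume unit 4, output word 10, return to start
    (0, 5): [(0, 11)],   # consume unit 5, output word 11 immediately
}
# Only the arcs carrying an output word (10 or 11) make update_pair_set(...)
# advance the second decoding network, i.e. the language model.
```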
15. A decoding device supporting a domain customization language model, characterized in that: it comprises an input device and a decoder communicatively connected to each other;
the input device is used for acquiring a voice signal;
the decoder comprises a processor and a memory, the memory storing a computer program;
the processor comprises an extraction module, a decoding network module and a decoding module; the extraction module is used for extracting features of the voice sequence; the decoding network module is used for generating a decoding network based on a knowledge source; and the decoding module is used for searching for an optimal path in the decoding network;
the processor executes, according to the computer program instructions, the method for decoding a domain-specific language model according to any one of claims 1 to 14.
CN202110192804.6A 2021-02-20 2021-02-20 Decoding method and device supporting domain customization language model Active CN113012690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110192804.6A CN113012690B (en) 2021-02-20 2021-02-20 Decoding method and device supporting domain customization language model

Publications (2)

Publication Number Publication Date
CN113012690A (en) 2021-06-22
CN113012690B (en) 2023-10-10

Family

ID=76404252

Country Status (1)

Country Link
CN (1) CN113012690B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106937531A * 2014-06-14 2017-07-07 Magic Leap, Inc. Method and system for producing virtual and augmented reality
GB201919130D0 (en) * 2019-12-23 2020-02-05 Nokia Technologies Oy The merging of spatial audio parameters
GB201919131D0 (en) * 2019-12-23 2020-02-05 Nokia Technologies Oy Combining of spatial audio parameters
CN110908612A * 2019-11-27 2020-03-24 Tencent Technology (Shenzhen) Co., Ltd. Cache management method, device, equipment and storage medium
CN112052471A * 2020-09-17 2020-12-08 Qingdao University Information hiding method based on social network space

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Software defect prediction method based on deep auto-encoder networks; Zhou Mo; Xu Ling; Yang Mengning; Liao Shengping; Yan Meng; Computer Engineering and Science (10); full text *
Nearest-neighbor retrieval algorithm using multi-index additive quantization coding; Liu Heng; Yao Yu; Zeng Ling; Tao Pan; Journal of Image and Graphics (05); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant