CN110364171B - Voice recognition method, voice recognition system and storage medium - Google Patents

Voice recognition method, voice recognition system and storage medium

Info

Publication number
CN110364171B
CN110364171B
Authority
CN
China
Prior art keywords
language model
companion
state
token
decoding
Prior art date
Legal status
Active
Application number
CN201910741739.0A
Other languages
Chinese (zh)
Other versions
CN110364171A (en)
Inventor
黄羿衡
蒲松柏
Current Assignee
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201910741739.0A
Publication of CN110364171A
Application granted
Publication of CN110364171B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • G10L25/69 - Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Abstract

The invention provides a voice recognition method applied to a voice recognition system, comprising the following steps: collecting a voice signal in analog form and processing it to form a voice signal in digital form; preprocessing the digital voice signal; extracting voice features from the digital voice signal and performing feature compensation and feature normalization on them to form dynamic features corresponding to the digital voice signal; and decoding, by a decoder of the voice recognition system, the dynamic features corresponding to the digital voice signal to form a corresponding voice recognition result. The invention also provides a voice recognition system and a storage medium. The invention collects voice signals in analog form and forms the corresponding voice recognition result through the processing of the voice recognition system, thereby achieving accurate recognition of voice signals in analog form.

Description

Voice recognition method, voice recognition system and storage medium
Description of related application
This application is a divisional of the Chinese patent application with application number 201810020113.6, filed in 2018 and entitled "Decoding method, decoder and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to computer technology, and more particularly to a speech recognition method, a speech recognition system, and a storage medium in the field of automatic speech recognition technology.
Background
Automatic speech recognition technology converts an analog speech signal into text that a computer can process, and is widely applied in services such as voice dialing, telephone ticket booking, voice input, translation systems, and voice navigation. Artificial Intelligence (AI) provides a solution for training a suitable speech recognition system to support these applications. Artificial intelligence is the theory, method, and technology of using a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence: to sense the environment, acquire knowledge, and use that knowledge to obtain the best result. Research on artificial intelligence covers the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision making; in the field of speech processing, this means using a digital computer, or a machine controlled by one, to recognize speech.
The decoder in a speech recognition system provided by AI technology is the core of the automatic speech recognition system: it searches the decoding network for a decoding path of the speech signal and outputs the nodes connected in series along the optimal decoding path as the speech recognition result. The decoder therefore directly influences the recognition efficiency and accuracy of the automatic speech recognition system.
The decoder provided by the related art relies on searching for a decoding path in a decoding space constructed from knowledge sources such as a language model. An industrial-scale language model is often very large, and the decoding space constructed on it is larger still. Ensuring decoding efficiency then requires deploying a large amount of storage and computing resources, so the limited resources available in industrial applications restrict decoding efficiency and affect the accuracy with which the speech recognition system recognizes an analog speech signal.
Disclosure of Invention
Embodiments of the present invention provide a speech recognition method, a speech recognition system, and a storage medium, which can implement efficient decoding of a speech signal in a resource-efficient manner.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a voice recognition method, which is applied to a voice recognition system and comprises the following steps:
collecting a voice signal in analog form, and processing the analog voice signal to form a voice signal in digital form;
preprocessing the digital voice signal to eliminate background noise and divide the signal into overlapping frames;
extracting the voice features of the digital voice signal, and performing feature compensation and feature normalization on the voice features to form dynamic features corresponding to the digital voice signal;
and decoding, by a decoder of the voice recognition system, the dynamic features corresponding to the digital voice signal to form a corresponding voice recognition result.
In the foregoing solution, decoding the dynamic features corresponding to the digital voice signal by the decoder of the speech recognition system includes:
splitting an original language model into a low-order language model and a differential language model, wherein the order of the low-order language model is lower than that of the original language model, and the differential language model is the difference between the original language model and the low-order language model;
decoding the voice signal using a first decoding network formed based on the low-order language model to obtain paths and corresponding scores, and,
re-scoring the decoded paths using a second decoding network formed based on the differential language model;
and outputting the output symbols included in the path that satisfies the scoring condition as the recognition result.
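As an illustrative sketch (not the patent's own implementation), the split can be pictured on a word-level n-gram language model: the differential model stores, for each n-gram, the difference between the original model's log-probability and the low-order model's log-probability, so that a first-pass score plus the re-scoring difference reproduces the original model's score. All names below (`full_lm`, `low_lm`, `diff_score`) are hypothetical.

```python
import math

# Hypothetical toy log-probabilities for a 3-gram original model and its
# 2-gram (low-order) counterpart; industrial models hold billions of entries.
full_lm = {("we", "like", "speech"): math.log(0.30)}   # P(speech | we, like)
low_lm = {("like", "speech"): math.log(0.20)}          # P(speech | like)

def diff_score(trigram):
    """Differential-model score: original score minus low-order score,
    so that low-order score + differential score == original score."""
    _, w2, w3 = trigram
    return full_lm[trigram] - low_lm[(w2, w3)]

# The first pass scores with the small model; the second pass adds the difference:
first_pass = low_lm[("like", "speech")]
rescored = first_pass + diff_score(("we", "like", "speech"))
assert abs(rescored - full_lm[("we", "like", "speech")]) < 1e-12
```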
In the above scheme, the method further comprises: adding the hypothesis set linked list to the companion hypothesis set of the next token when the existing companion hypothesis set of the next token is empty.
An embodiment of the present invention provides a speech recognition system, including:
the sampling and analog-to-digital conversion module, configured to collect the voice signal in analog form and process the analog voice signal to form a voice signal in digital form;
the preprocessing module, configured to preprocess the digital voice signal to eliminate background noise and divide the signal into overlapping frames;
the feature extraction module, configured to extract the voice features of the digital voice signal;
the feature processing module, configured to perform feature compensation and feature normalization on the voice features to form dynamic features corresponding to the digital voice signal;
and the decoder, configured to decode the dynamic features corresponding to the digital voice signal to form a corresponding voice recognition result.
In the foregoing solution, the decoder includes:
the decoding network module is used for splitting an original language model into a low-order language model and a differential language model, wherein the order of the low-order language model is lower than that of the original language model, and the differential language model is the difference between the original language model and the low-order language model;
a decoding module, configured to decode the voice signal using the first decoding network formed based on the low-order language model to obtain paths and corresponding scores, and,
the decoding module is further configured to re-score the decoded paths using the second decoding network formed based on the differential language model;
and to output the output symbols included in the path that satisfies the scoring condition as the recognition result.
In the above solution, the decoding network module is further configured to fuse the low-order language model in a weighted finite state transducer to obtain the first decoding network, or,
to fuse the low-order language model, the pronunciation dictionary and the acoustic model in a weighted finite state transducer to obtain the first decoding network.
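The fusion can be pictured very loosely as chaining a pronunciation dictionary in front of the language model, so that phoneme sequences map to scored word sequences. This is not the WFST composition algorithm itself, and `lexicon`, `lm_score`, and `lg_decode` are hypothetical names:

```python
# Toy pronunciation dictionary L (phonemes -> word) and low-order LM G (word -> score).
lexicon = {("l", "ay", "k"): "like", ("s", "p", "iy", "ch"): "speech"}
lm_score = {"like": -0.75, "speech": -0.5}

def lg_decode(phonemes):
    """Greedy sketch of an LG-style network: segment the phoneme stream with L,
    score each emitted word with G. Real WFST composition explores all
    segmentations in parallel; this toy takes the first match."""
    words, score, i = [], 0.0, 0
    while i < len(phonemes):
        for pron, word in lexicon.items():
            if tuple(phonemes[i:i + len(pron)]) == pron:
                words.append(word)
                score += lm_score[word]
                i += len(pron)
                break
        else:
            raise ValueError("no pronunciation matches at position %d" % i)
    return words, score

print(lg_decode(["l", "ay", "k", "s", "p", "iy", "ch"]))  # (['like', 'speech'], -1.25)
```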
In the foregoing solution, the decoding module is further configured to perform the following processing for each frame of the speech signal:
initializing the token list in the first decoding network, and traversing the tokens in the token list;
wherein the following processing is executed for the currently traversed target token: traversing the edges of the first decoding network from the state corresponding to the target token, calculating for the target frame the sum of the acoustic model score and the language model score of each traversed edge, and taking the sum as the score of the traversed edge.
In the above solution, the decoding network module is further configured, before the tokens in the token list are traversed,
to determine the token with the best score at the current time point among the tokens in the token list, and to calculate the beam width used in the next beam search according to the beam width set for the determined token.
In the above solution, the decoding network module is further configured to initialize the score of the first token in the token list and assign its preamble pointer to be null;
and to perform a hash lookup construction on the second decoding network, storing the edges connected to the same state of the second decoding network in a hash table,
where the search key on each state of the second decoding network is the input symbol of the corresponding state, and the value corresponding to the key is the edge connecting the corresponding state and the state jumped to from the corresponding state.
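A minimal sketch of this hash organization, assuming a dict-of-dicts representation (state, then input symbol, then edge); the edge fields and all names here are illustrative, not the patent's data layout:

```python
# Each state of the second decoding network indexes its outgoing edges by
# input symbol, so re-scoring can locate the matching edge in O(1).
# The edge value (next_state, lm_score) is an illustrative simplification.
arcs_by_state = {
    0: {"like": (1, -0.7), "hate": (2, -2.3)},  # state 0: symbol -> (next_state, lm_score)
    1: {"speech": (3, -0.4)},
}

def lookup(state, symbol):
    """Return the (next_state, lm_score) edge for `symbol` leaving `state`, or None."""
    return arcs_by_state.get(state, {}).get(symbol)

print(lookup(0, "like"))  # (1, -0.7)
print(lookup(1, "text"))  # None -> no matching edge at this state
```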
In the foregoing solution, the decoding module is further configured to determine the next state among the states corresponding to the traversed edge when the score of the traversed edge does not exceed the score threshold;
to create an edge connecting the state corresponding to the target token and the next state, recording in the created edge the input symbol, output symbol, acoustic model score and language model score of the traversed edge, and to point from the target token to the next token;
wherein the state corresponding to the next token in the second decoding network is the next state pointed to from the traversed edge in the first decoding network; and to traverse the hypotheses in the hypothesis set of the target token, and the companion hypothesis set of each traversed hypothesis.
In the foregoing solution, the decoding module is further configured to, in the process of traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, add the hypotheses in the hypothesis set of the target token into the pre-established hypothesis set linked list (initialized to empty) in ascending order of score when the output symbol corresponding to the traversed edge is a null symbol.
In the foregoing solution, the decoding module is further configured to, in the process of traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, when the output symbol corresponding to the traversed edge is not a null symbol, locate in the second decoding network the state used for re-scoring and the edges starting from that re-scoring state, expand all edges starting from the re-scoring state, and form, in the process of expansion, a hypothesis set linked list used for storing companion hypotheses.
In the foregoing solution, the decoding module is further configured to, when the hash table of the re-scoring state is used to query the edge and state corresponding to an input symbol, generate a new companion hypothesis set corresponding to the next state pointed to by the queried edge, assign the state of the new companion hypothesis set to be the next state pointed to by the queried edge, and take as the preamble pointer of the new companion hypothesis set the output symbol of the currently traversed companion hypothesis set;
to calculate the score of the new companion hypothesis set as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge; and to add the companion hypotheses in the new companion hypothesis set to the pre-established hypothesis set linked list (initialized to empty) in ascending order of score.
In the above solution, the decoding module is further configured to, when the hash table of the re-scoring state is used to query the edge and state corresponding to an input symbol and only the corresponding edge is found, point the state jumped to from the re-scoring state to the next state pointed to by the queried edge; to replace the hypothesis set of the target token with the new companion hypothesis set; and to calculate the score of the new companion hypothesis set as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge.
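The four-term score sum described above can be read as the following sketch; the structures are hypothetical simplifications of the companion-hypothesis bookkeeping:

```python
from dataclasses import dataclass

@dataclass
class CompanionHypothesis:
    rescore_state: int    # state in the second decoding network
    score: float          # accumulated score
    output: tuple         # output symbols emitted so far

def extend(cur: CompanionHypothesis, am_score: float, lm_score: float,
           queried_edge_lm: float, next_state: int, out_sym: str) -> CompanionHypothesis:
    """New companion hypothesis score = score of the current companion hypothesis
    + acoustic model score of the traversed edge + language model score of the
    traversed edge + language model score of the queried (re-scoring) edge."""
    return CompanionHypothesis(
        rescore_state=next_state,
        score=cur.score + am_score + lm_score + queried_edge_lm,
        output=cur.output + (out_sym,),
    )

h = CompanionHypothesis(rescore_state=0, score=-1.25, output=())
print(extend(h, am_score=-0.5, lm_score=-0.75, queried_edge_lm=-0.25,
             next_state=1, out_sym="like").score)  # -2.75
```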
In the foregoing solution, the decoding module is further configured to add the hypothesis set linked list to the companion hypothesis set of the next token when the existing companion hypothesis set of the next token is empty.
In the above scheme, the decoding module is further configured to, if the existing companion hypothesis set of the next token is not empty: merge the existing hypothesis set with the companion hypothesis sets in the hypothesis set linked list in ascending order of score if a hypothesis set exists among the companion hypothesis sets of the next token whose first companion hypothesis set has the same state as the first companion hypothesis set of the hypothesis set linked list;
and insert the hypothesis set linked list into the hypothesis set of the next token according to the score order of the head of the companion hypothesis set if the state of the first companion hypothesis set of the existing hypothesis set differs from that of the first companion hypothesis set of the hypothesis set linked list.
In the above scheme, the decoding module is further configured to, after traversing hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, move the target token out of the token list, and add the next token into the token list until all tokens have been moved out of the token list.
In the foregoing scheme, the decoding module is further configured to search for the companion hypothesis set with the highest score, and to output the output symbols corresponding to the companion hypothesis set with the highest score as the recognition result.
An embodiment of the present invention provides a decoder, including:
a memory for storing executable instructions;
and the processor is used for realizing the decoding method provided by the embodiment of the invention when executing the executable instructions stored in the memory.
The embodiment of the invention provides a storage medium, which stores executable instructions, wherein the executable instructions are used for executing the decoding method provided by the embodiment of the invention.
The embodiment of the invention has the following beneficial effects:
the original language model is split and two-stage decoding is performed; the two-stage decoding can match, in recognition accuracy, a decoding network constructed directly from the original model, so decoding accuracy is guaranteed. Meanwhile, decoding with the first decoding network formed from the low-order language model and re-scoring with the second decoding network formed from the differential language model significantly reduce the size of the decoding network, saving storage resources and improving decoding efficiency.
Drawings
FIG. 1A is a diagram illustrating an alternative structure of a finite state automaton according to an embodiment of the present invention;
FIG. 1B is a diagram illustrating an alternative structure of a weighted finite state automaton according to an embodiment of the present invention;
FIG. 1C is a schematic diagram of an alternative structure of a weighted finite state transducer according to an embodiment of the present invention;
FIG. 2 is an alternative functional schematic diagram of an automatic speech recognition system provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative architecture of the automatic speech recognition system 100 provided by the embodiments of the present invention;
FIG. 4 is a schematic diagram of an alternative implementation of a decoding process performed by a decoder according to an embodiment of the present invention;
FIG. 5 is a diagram of an alternative hardware configuration of an automatic speech recognition system provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a decoding scheme provided by an embodiment of the invention;
FIG. 7 is an alternative schematic diagram of a decoding scheme provided by an embodiment of the invention;
FIG. 8 is an alternative flow diagram of a decoding scheme provided by an embodiment of the present invention;
FIG. 9A is an alternative structural diagram of a TLG decoding network according to an embodiment of the present invention;
FIG. 9B is an alternative structural diagram of a TLG decoding network according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an alternative application scenario of a speech recognition system applying the decoding scheme provided by the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the attached drawings; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Before further detailed description of the present invention, terms and expressions referred to in the embodiments of the present invention are described, and the terms and expressions referred to in the embodiments of the present invention are applicable to the following explanations.
1) Automatic Speech Recognition (ASR), a technique for converting human speech into text; its aim is to enable a device running a speech recognition system to recognize the text contained in continuous speech uttered by different persons.
2) Finite-State Automata (FSA, Finite-State Automata); see FIG. 1A, an optional structural schematic diagram of a finite-state automaton provided in an embodiment of the present invention. In the finite-state automaton, a node (Node) represents a state (State); a bold circle represents the initial state and a double-line circle represents a final state; when a state is both the initial state and a final state, it is drawn as a double thick-line circle, and a non-initial state is drawn as a single thin-line circle.
The decoding score and related information at a certain state at a certain moment are recorded in a data structure called a token (Token), and the finite-state automaton is traversed in a token passing (Token Passing) manner: a token enters the finite-state automaton at the initial state and transfers to the next state according to the input symbols; each transfer between states is represented by a directed edge (Arc, also called a transfer edge or transfer arc herein). When the token reaches a final state after its last transfer is completed, the series of states and edges recorded in the token during the transfers from the initial state to the final state forms a path (Path).
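A minimal token-passing sketch over a toy automaton (the transition table and all names are illustrative, not taken from FIG. 1A):

```python
# A token records the state it occupies and the path of states it has traversed.
transitions = {  # state -> {input symbol: next state}
    0: {"a": 1},
    1: {"b": 2},
}
final_states = {2}

def pass_token(symbols, start=0):
    """Move one token through the automaton; return it only if it ends in a final state."""
    token = {"state": start, "path": [start]}
    for sym in symbols:
        nxt = transitions.get(token["state"], {}).get(sym)
        if nxt is None:
            return None  # no transfer for this symbol: the token dies
        token["state"] = nxt
        token["path"].append(nxt)
    return token if token["state"] in final_states else None

print(pass_token(["a", "b"]))  # {'state': 2, 'path': [0, 1, 2]}
```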
3) Weighted Finite-State Automata (WFSA); referring to FIG. 1B, an optional structural schematic diagram of a weighted finite-state automaton according to an embodiment of the present invention: the weighted finite-state automaton assigns, on the basis of the finite-state automaton, scores (also referred to as weights) indicating probabilities to the different transfers, and the score of a path is the sum of the scores of all transfers included in the path.
4) Weighted Finite-State Transducer (WFST); referring to FIG. 1C, an optional structural schematic diagram of a weighted finite-state transducer provided in an embodiment of the present invention: on the basis of the weighted finite-state automaton, each transfer carries both an input symbol and an output symbol, connected by ':', and the output symbols of a path of the WFST are the concatenation of the output symbols of all transfers in the path.
5) The knowledge source, i.e., the source of knowledge required for a Decoder (Decoder) to decode a recognition result based on a feature sequence of a speech signal, includes several knowledge sources expressed by WFST.
5.1) Acoustic Model (AM), a knowledge representation of the variability of acoustics, phonetics, environment, speaker gender, accent, and the like, including acoustic models based on the Hidden Markov Model (HMM), such as the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Deep Neural Network-Hidden Markov Model (DNN-HMM) representations; the hidden Markov model is a weighted finite-state automaton in the discrete time domain. End-to-End acoustic models may of course also be included, such as the Connectionist Temporal Classification-Long Short-Term Memory (CTC-LSTM) model and the Attention model.
Each state of the acoustic model represents the probability distribution of the speech features of a speech unit (such as a word, syllable, or phoneme), and the transfers between states connect the states into an ordered state sequence, i.e., a sequence of speech units represented by the speech signal. Assume W is a sequence of speech units, denoted:
W = w1, w2, ..., wn
The acoustic model probability P(O|W) represents the degree of matching between W and the observation sequence O.
5.2) Language Model (LM), denoted herein by G, whose input symbols are identical to its output symbols; a knowledge representation of language structure (the rules between words and sentences, e.g., grammar and common word collocations). The language model probability P(W) represents the prior probability of the speech unit sequence W occurring in a segment of the speech signal.
5.3) Acoustic context factor model, denoted by C, also called the triphone model; input symbols: context-dependent phones (Triphones); output symbols: phonemes (Monophones); it represents the correspondence from triphones to phonemes.
5.4) Pronunciation dictionary, denoted by L; input symbols: phonemes; output symbols: words; it contains the set of words and their pronunciations.
6) A character set (Alphabet) is the collection of all characters (symbols); a finite-length sequence of characters is called a string (String); the collection of strings forms a language; and the operation of joining two strings together is referred to herein as concatenation (Concatenate).
7) A decoding network (Search Space), also called a search space, is built from WFST-fused knowledge sources, always including a language model, and may also include at least one of an acoustic model, an acoustic context factor model, and a pronunciation dictionary. For example, the monophone decoding network composed of L and G is denoted LG; the decoding network composed of C, L and G is denoted CLG; the CLG decoding network represented using a hidden Markov model (H) is denoted HCLG; in addition, the decoding network formed by an end-to-end acoustic model (denoted T), the pronunciation dictionary and G is referred to herein as the TLG decoding network.
The purpose of the decoder's search in the decoding network is to find, for the feature sequence extracted from the collected speech signal, the path with the highest score, i.e., the optimal path; the output symbols of the transfers connected in series along the optimal path form the word string W*, such that P(W|O) reaches its maximum, and W* is output as the recognition result of the speech signal, where W* is expressed as:
W* = argmax_W P(W|O) = argmax_W P(O|W) P(W)    (1)
8) Pruning, i.e., Viterbi beam search (also called Beam Pruning or Beam Search): when the decoder searches for the optimal path in the decoding network, only a beam-width (Beam Width) number of nodes is retained at each state of the expanded paths, so that unnecessary paths are removed.
9) Word graph: during decoding, the decoder performs token-passing-based pruning in the decoding network and records the paths traversed by all tokens (Token) that can reach a final state; the resulting Directed Acyclic Graph is the word graph. Each node of the word graph represents the end time point of a word, and each edge represents a possible word together with information such as its acoustic score, language model score, and time points.
10) MERGE-SORT (MERGE), also known as merging, is an efficient sorting algorithm and a typical application of Divide and Conquer: ordered subsequences are combined to obtain a completely ordered sequence; that is, each subsequence is sorted first, and the subsequences are then merged in order.
One example of the merging process: compare a[i] and b[j]; if a[i] <= b[j], copy the element a[i] of the first ordered list into r[k] and add 1 to i and k respectively; otherwise, copy the element b[j] of the second ordered list into r[k] and add 1 to j and k respectively; repeat until one of the ordered lists is exhausted, then copy the remaining elements of the other ordered list into r from index k to index t. The merge-sort algorithm is usually implemented by recursion: the interval [s, t] to be sorted is first divided at its midpoint, the left subinterval is sorted, then the right subinterval is sorted, and finally the left and right intervals are merged into one ordered interval [s, t] by a single merge operation.
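The merging step reads directly as code (a textbook sketch; the same by-score merge is used later for companion hypothesis sets):

```python
def merge(a, b):
    """Merge two ascending lists a and b into one ascending list r,
    following the i/j/k procedure described above."""
    i = j = 0
    r = []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            r.append(a[i]); i += 1
        else:
            r.append(b[j]); j += 1
    r.extend(a[i:])  # copy the remainder of whichever list is not yet exhausted
    r.extend(b[j:])
    return r

print(merge([1, 4, 7], [2, 3, 9]))  # [1, 2, 3, 4, 7, 9]
```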
An automatic speech recognition system for performing automatic speech recognition implementing an embodiment of the present invention is described below.
The automatic speech recognition system provided by the embodiment of the present invention performs isolated character (word) recognition, keyword detection, and continuous speech recognition. The recognition object of isolated-word recognition is a character, word, or phrase; a model is trained for each object, and the models are combined into a vocabulary, e.g., 'I', 'you', 'he'. The recognition object of keyword detection is a continuous speech signal, but only one or several keywords in the signal are recognized; continuous speech recognition recognizes an arbitrary segment of speech.
In some embodiments, automatic speech recognition systems can be classified into speaker-dependent and speaker-independent systems according to the degree of dependence on the speaker; the model of a speaker-dependent automatic speech recognition system is trained only on the speech data of one person, and needs to be retrained when used to recognize the speech of other persons.
In some embodiments, the automatic speech recognition system may be classified into a small vocabulary, a medium vocabulary, a large vocabulary, and an infinite vocabulary automatic speech recognition system, depending on the size of the recognition vocabulary.
In some embodiments, the automatic speech recognition system can be classified into a desktop (PC) automatic speech recognition system, a telephone automatic speech recognition system, and an embedded-device (e.g., cell phone, tablet) automatic speech recognition system, according to the recording device and channel.
Referring to FIG. 2, an optional functional schematic diagram of an automatic speech recognition system according to an embodiment of the present invention: the speech signal is preprocessed, speech features are extracted, and pattern matching against a pre-trained template library forms the recognition result of the speech signal.
In some embodiments, the structure of the automatic speech recognition system may differ for different recognition tasks, but as shown in FIG. 2, the basic techniques and processing flow are substantially the same. An exemplary structure of the automatic speech recognition system is described below; it should be understood that the system described is only an example for implementing the embodiment of the present invention, and various exemplary structures can be foreseen from the functional diagram shown in FIG. 2.
Referring to FIG. 3, an alternative structural schematic diagram of the automatic speech recognition system 100 according to the embodiment of the present invention, which involves two parts, a front end 110 and a back end 120: the front end 110 includes a sampling and analog-to-digital (A/D) conversion module 111, a preprocessing module 112, a feature extraction module 113, and a feature processing module 114; the back end 120 includes a decoder 121 together with two knowledge sources, an acoustic model 122 and a context model 123, and may of course also include other types of knowledge sources, such as a pronunciation dictionary and a language model.
The sampling and analog-to-digital (A/D) conversion module 111 collects the analog voice signal and converts the sound from its physical form into a signal that is discrete in time and continuous in amplitude, sampling at a frequency of more than 2 times the highest frequency of the sound; analog-to-digital conversion is generally performed using Pulse Code Modulation (PCM) or uniform quantization to form the digital voice signal.
The preprocessing module 112 preprocesses the digital speech signal, which involves pre-emphasis, windowing, framing, endpoint detection, and filtering. Pre-emphasis boosts the high-frequency part of the speech signal and flattens its spectrum. Windowing and framing divide the speech signal, according to its time-varying characteristic, into a number of mutually overlapping frames using a rectangular window, Hamming window, or the like; for example, into frames of 20 milliseconds (ms) with an overlap of 10 ms between adjacent frames. Endpoint detection finds the beginning and end of the speech signal, and filtering removes background noise. From the preprocessed speech signal, the feature extraction module 113 extracts, according to a feature extraction method, the speech features that best represent the signal, forming a normalized feature sequence in time order; the feature processing module 114 then performs feature compensation and feature normalization to form dynamic features.
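The pre-emphasis, framing, and windowing steps can be sketched as follows, assuming a 16 kHz signal and the 20 ms / 10 ms framing of the example above; the pre-emphasis coefficient 0.97 is a conventional choice, not a value from the patent:

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=20, hop_ms=10, alpha=0.97):
    """Pre-emphasis, then framing into overlapping Hamming-windowed frames."""
    # Pre-emphasis boosts the high-frequency part: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = sample_rate * frame_ms // 1000   # 20 ms -> 320 samples
    hop = sample_rate * hop_ms // 1000           # 10 ms hop -> 10 ms overlap
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)        # windowing smooths frame edges

frames = preprocess(np.random.randn(16000))      # one second of toy audio
print(frames.shape)                              # (99, 320)
```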
The speech features may be represented by time-domain and frequency-domain features. Their sources include features based on the human production mechanism, such as Linear Predictive Cepstral Coefficients (LPCC), and features based on human auditory perception, such as Mel-Frequency Cepstral Coefficients (MFCC). Besides these static speech features, log energy, or the dynamic features formed by computing first-order and second-order differences of the static features, may be concatenated with them to form new features.
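The dynamic features can be sketched as below; for brevity a simple frame-to-frame difference stands in for the usual regression-window delta formula:

```python
import numpy as np

def add_dynamic_features(static):
    """Concatenate first- and second-order differences onto static features
    (e.g., MFCCs). `static` has shape (frames, dims)."""
    delta = np.diff(static, axis=0, prepend=static[:1])   # first-order difference
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])    # second-order difference
    return np.concatenate([static, delta, delta2], axis=1)

mfcc = np.random.randn(99, 13)           # toy static features
print(add_dynamic_features(mfcc).shape)  # (99, 39)
```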
Knowledge sources such as the acoustic context factor model, pronunciation dictionary, acoustic model, and language model are fused in a WFST-based network to form the decoding network. The acoustic model is obtained by training on a speech database and the language model by training on a text database; the process of training the acoustic and language models fuses speech and linguistic knowledge, signal processing technology, data mining technology, and statistical modeling methods. The decoder 121 searches for the optimal path in a certain search mode, and the output symbols of the series of edges connected along the optimal path form a word string, which is output as the recognition result of the speech signal.
The back end 120 performs decoding through the decoder 121: given the feature sequence of the speech signal, it searches for the optimal path in the decoding network. At time t, the best score reaching each state of the decoding network is kept; the result at time t+1 is obtained from the result at time t; and when the last time is reached, the optimal path is obtained by backtracking from the state with the highest score.
Referring to FIG. 4, a schematic diagram of an alternative implementation of the decoding process performed by the decoder according to an embodiment of the present invention: the decoder integrates various knowledge sources, such as the acoustic model, pronunciation dictionary, context factor model, and language model, into a WFST, and performs search and matching operations on the feature sequence of the input speech signal until the path containing the word string with the highest output probability is found as the recognition result.
The decoding network module 1211 of the decoder 121 implements model integration and model optimization: the context-dependent acoustic model, pronunciation dictionary, and acoustic context factor model are integrated into a single WFST (hereinafter the integrated WFST), i.e., the decoding network, using an integration algorithm; model optimization includes performing determinization with a determinization algorithm and minimization with a minimization algorithm, thereby reducing recognition time and storage-space occupancy and improving recognition efficiency.
Regarding determinization (Determinization): in a determinized integrated WFST, at most one edge per input symbol leaves each state of the integrated WFST. The effect is that, for the feature sequence of a speech signal input into the automatic speech recognition system, only one path in the decoding network corresponds to the feature sequence, because repeated paths in the decoding network have been eliminated, which reduces the time and space consumed by decoding.
Regarding minimization (Minimization): the minimized integrated WFST is equivalent to the integrated WFST before minimization, and among all determinized integrated WFSTs it has the fewest states and the fewest edges.
The search module 1213 of the decoder 121 searches for the optimal path in the constructed decoding network, which involves initialization, evaluation, pruning, and backtracking of paths. Pruning of paths includes global cumulative probability pruning, language model pruning, histogram pruning, and other modes; cutting unnecessary paths prevents the number of paths from growing explosively.
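Beam pruning and histogram pruning can be sketched together as follows (scores here are "higher is better"; all names are illustrative):

```python
def prune(tokens, beam=10.0, max_tokens=1000):
    """Beam pruning: drop tokens scoring worse than (best - beam).
    Histogram pruning: keep at most max_tokens of the survivors."""
    best = max(score for _, score in tokens)
    survivors = [(state, score) for state, score in tokens if score >= best - beam]
    survivors.sort(key=lambda t: t[1], reverse=True)
    return survivors[:max_tokens]

print(prune([("s0", -5.0), ("s1", -9.0), ("s2", -30.0)]))
# [('s0', -5.0), ('s1', -9.0)] -- s2 falls outside the beam and is cut
```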
Continuing with the hardware structure of the automatic speech recognition system provided by the embodiment of the present invention, referring to FIG. 5, a schematic diagram of an alternative hardware structure: the automatic speech recognition system 200 shown in FIG. 5 may include at least one processor 210, at least one communication bus 240, a user interface 230, at least one network interface 220, and memory 250. The various components in the automatic speech recognition system 200 are coupled together by the communication bus 240, which is used to enable communication among the components. Besides a data bus, the communication bus 240 includes a power bus, a control bus, and a status signal bus, but for clarity of illustration the various buses are labeled in FIG. 5 as the communication bus 240.
User interface 230 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen. Network interface 220 may include standard wired, wireless interfaces; typically, the wireless interface may be a WiFi interface.
It is understood that the Memory 250 may be a high-speed RAM Memory or a Non-Volatile Memory (Non-Volatile Memory), such as at least one disk Memory. Memory 250 may also be at least one memory system remote from processor 210.
The method applied to the automatic speech recognition system provided by the embodiment of the present invention may be implemented in the processor 210. The processor 210 may be an integrated circuit chip with signal processing capability. In implementation, the different operations of the decoding method applied to the automatic speech recognition system may be performed by integrated logic circuits of hardware in the processor 210 or by instructions in the form of software. The processor 210 may be a general-purpose processor, a DSP or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. The processor 210 may implement or execute the methods, steps, and logic blocks of the embodiments of the present invention applied to an automatic speech recognition system. A general-purpose processor may be a microprocessor or any conventional processor. The decoding method applied to the automatic speech recognition system provided by the embodiment of the present invention may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware in the decoding processor and software modules.
As an example of a software implementation of the automatic speech recognition system, the software may reside in a storage medium located in the memory 250; the storage medium stores the software of the speech recognition system 100, which includes the decoder 121 (the other software modules of the speech recognition system 100 can be understood from FIG. 3 and are not described again). The processor 210 reads the executable instructions in the memory 250 and, in combination with its hardware, completes the decoding method applied to the automatic speech recognition system provided by the embodiment of the present invention.
The decoder 121 includes a decoding network module 1211 and a decoding module 1212, which are described below.
A decoding network module 1211 configured to split an original language model into a low-order language model and a differential language model, wherein the low-order language model has a lower order than the original language model, and the differential language model is a difference between the original language model and the low-order language model;
a decoding module 1212, configured to decode the voice signal using the first decoding network formed based on the low-order language model to obtain paths and corresponding scores, and to re-score the decoded paths using the second decoding network formed based on the differential language model; and to output the output symbols included in the path that satisfies the scoring condition as the recognition result.
In some embodiments, the decoding network module 1211 is further configured to fuse the low-order language model in a weighted finite state transducer to obtain the first decoding network, or to fuse the low-order language model, a pronunciation dictionary and an acoustic model in a weighted finite state transducer to obtain the first decoding network.
In some embodiments, the decoding module 1212 is further configured to initialize a token list in the first decoding network and traverse tokens in the token list for each frame of the speech signal; wherein the following is performed for the currently traversed target token:
traversing the edges of the first decoding network from the state corresponding to the target token, calculating for the target frame (i.e., the currently traversed frame) the sum of the acoustic model score and the language model score of each traversed edge, and taking the sum as the score of the traversed edge.
In some embodiments, the decoding network module 1211 is further configured to determine, before the tokens in the token list are traversed, the token with the best score at the current time point among the tokens in the token list, and to calculate the beam width used in the next beam search according to the beam width set for the determined token.
In some embodiments, the decoding network module 1211 is further configured to initialize the score and the preamble pointer of the first token in the token list to be null; and to perform a hash lookup construction on the second decoding network, storing the edges connected to the same state of the second decoding network in a hash table,
where the search key on each state is the input symbol of the corresponding state, and the value corresponding to the key is the edge connecting the corresponding state and the state jumped to from the corresponding state.
In some embodiments, the decoding module 1212 is further configured to determine the next state among the states corresponding to the traversed edge when the score of the traversed edge does not exceed the score threshold; to create an edge between the state corresponding to the target token and the next state, recording in the created edge the input symbol, output symbol, acoustic model score and language model score of the traversed edge, and to point from the target token to the next token, i.e., to connect the state pointed to by the target token in the first decoding network with the state corresponding to the next token in the first decoding network; wherein the state corresponding to the next token in the second decoding network is the next state pointed to from the traversed edge; and to traverse the hypotheses in the hypothesis set of the target token, and the companion hypothesis set of each traversed hypothesis.
In some embodiments, the decoding module 1212 is further configured to, in traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, add the hypotheses in the hypothesis set of the target token to the pre-established hypothesis set linked list (initialized to empty) in ascending order of score when the output symbol corresponding to the traversed edge is a null symbol.
In some embodiments, the decoding module 1212 is further configured to, in traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, when the output symbol corresponding to the traversed edge is not a null symbol, locate in the second decoding network the state used for re-scoring and the edges starting from the re-scoring state, expand all edges starting from the re-scoring state, and form, in the expansion process, a hypothesis set linked list used for storing companion hypotheses.
In some embodiments, the decoding module 1212 is further configured to, when the hash table of the re-scoring state is used to query the edge and state corresponding to an input symbol, generate a new companion hypothesis set corresponding to the next state pointed to by the queried edge, where the state of the new companion hypothesis set is assigned to be the next state pointed to by the queried edge, and the preamble pointer of the new companion hypothesis set is the output symbol of the currently traversed companion hypothesis set; to calculate the score of the new companion hypothesis set as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge; and to add the companion hypotheses in the new companion hypothesis set to the pre-established hypothesis set linked list (initialized to empty) in ascending order of score.
In some embodiments, the decoding module 1212 is further configured to, when the hash table of the re-scoring state is used to query the edge and state corresponding to an input symbol and only the corresponding edge is found, point the state jumped to from the re-scoring state to the next state pointed to by the queried edge; to replace the hypothesis set of the target token with the new companion hypothesis set; and to calculate the score of the new companion hypothesis set as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge.
In some embodiments, the decoding module 1212 is further configured to add a hypothesis set linked list to the companion hypothesis set of the next token when the existing companion hypothesis set of the next token is empty.
In some embodiments, the decoding module 1212 is further configured to, if the existing companion hypothesis set of the next token is not empty: merge the existing hypothesis set with the companion hypothesis sets in the hypothesis set linked list in ascending order of score if a hypothesis set exists among the companion hypothesis sets of the next token whose first companion hypothesis set has the same state as the first companion hypothesis set of the hypothesis set linked list; and insert the hypothesis set linked list into the hypothesis set of the next token according to the score order of the head of the companion hypothesis set if the state of the first companion hypothesis set of the existing hypothesis set differs from that of the first companion hypothesis set of the hypothesis set linked list.
In some embodiments, the decoding module 1212 is further configured to, after traversing the hypotheses in the set of hypotheses of the target token and the traversed companion set of hypotheses for each hypothesis, move the target token out of the token list and add the next token into the token list until all tokens have been moved out of the token list.
In some embodiments, the decoding module 1212 is further configured to find the companion hypothesis set with the highest score, and to output the output symbols corresponding to the companion hypothesis set with the highest score as the recognition result.
As a hardware implementation example of the automatic speech recognition system shown in FIG. 5, the configuration of a hardware platform for the automatic speech recognition system provided by the embodiment of the present invention may be: 2 × 14-core CPUs (E5-2680 v4) with a 2.4 GHz clock and 256 GB of memory; a disk array (RAID) with 2 × 300 GB Serial Attached SCSI (SAS) disks and 6 × 800 GB solid state drives (SSD); 2 × 40G network ports (optical, multimode); and 8 × Graphics Processing Units (GPU), model Tesla M40. Of course, the hardware platform configuration shown above is merely an example and can be flexibly changed as needed.
As mentioned above, in continuous speech recognition, decoding is the process of computing the word sequence with the maximum a posteriori probability for the feature sequence of the input speech signal. A good decoding algorithm should therefore be: accurate, effectively using all kinds of knowledge so that the recognition result is as accurate as possible; efficient, producing the recognition result as soon as possible, ideally immediately after the voice signal is input into the automatic speech recognition system; and low-consumption, occupying as few hardware resources, including memory and processor, as possible.
Referring to FIG. 6, a schematic diagram of a decoding scheme according to an embodiment of the present invention, in which two-stage (2-PASS) decoding is performed in the automatic speech recognition system: in stage 1, a word graph (Lattice) comprising multiple candidate paths is obtained using an HCLG decoding network; in stage 2, the multiple paths from stage 1 are re-scored (rescoring) using a language model decoding network (denoted G.fst); the path with the highest score is selected as the optimal path, and the output symbols of the optimal path are output as the recognition result of the speech signal.
In stage 1, the HCLG decoding network is formed by fusing an industrial-scale language model with WFST, and its volume occupies too much memory space to be applied in industry. For example, for a language model of 2 gigabytes (GB), the volume of the decoding network formed from it approaches 20 GB, which cannot serve the large-scale concurrent speech recognition situations found in industry, and the decoding speed is noticeably slowed because of the large volume of the decoding network.
First, when decoding is implemented on an automatic speech recognition system with an HCLG decoding network, re-scoring the hypotheses on a token during token passing in the HCLG decoding network has the defect that there is an upper limit on the companion hypothesis set of each token. Since this upper limit is set manually, it does not suit an actual industrial-scale decoding system; even if the upper limit is set according to practical industrial experience, paths carrying the correct recognition result are inevitably pruned, which affects decoding accuracy.
Secondly, a practical acceleration scheme is lacking for the search of the language model decoding network, which is the most time-consuming part of decoding. In industrial applications, a corresponding decoding network needs to be generated for a language model of tens or even hundreds of GB; since the decoding network expresses knowledge of language structure on top of the language model, its volume inevitably expands further beyond that of the language model. If the decoding scheme shown in fig. 6 is applied, the lack of a concrete scheme for generating a corresponding decoding network for an industrial-level language model affects the recognition efficiency of the automatic speech recognition system as a whole.
It can be seen that the decoding scheme shown in fig. 6 has the following problems:
1) During decoding, an upper limit is imposed on the number of companion hypothesis sets retained by each token, which directly affects decoding speed and efficiency;
2) During decoding, the expansion process of the decoding network used in stage 2 is not optimized or accelerated, which directly affects search speed and thus decoding efficiency;
3) The decoding network generated by the scheme proposed for the conventional HCLG network is too large for the limited memory resources of industrial applications, so the scheme has no industrial practical value.
Referring to fig. 7 and fig. 8, fig. 7 is an alternative schematic diagram of a decoding scheme provided by an embodiment of the present invention, and fig. 8 is an alternative flow diagram of that decoding scheme; together they provide an industrially practical solution to the defects of the decoding scheme shown in fig. 6.
In operation 11, the original language model is split into a low-order language model having an order lower than that of the original language model and a differential language model which is a difference between the original language model and the low-order language model.
In operation 12, the voice signal is decoded to obtain paths and corresponding scores using a first decoding network formed based on a low-order language model, and the decoded paths are re-scored using a second decoding network formed based on a differential language model.
In operation 13, output symbols included in the path satisfying the scoring condition are output as a recognition result.
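The split in operation 11 can be illustrated with a minimal Python sketch, assuming the language models are represented simply as dictionaries mapping n-grams to log scores; the function names and the truncation of the history are assumptions for illustration, not the patent's actual implementation:

```python
def split_language_model(original_lm, low_order_lm, low_order=2):
    """Compute the differential language model as the difference between
    the original model's score and the low-order model's score.

    original_lm, low_order_lm: dict mapping (history, word) -> log score,
    where history is a tuple of preceding words.
    """
    differential_lm = {}
    for (history, word), orig_score in original_lm.items():
        # View the same event through the low-order model by truncating
        # the history to that model's length (assumed convention).
        low_key = (history[-(low_order - 1):], word)
        low_score = low_order_lm.get(low_key, 0.0)
        differential_lm[(history, word)] = orig_score - low_score
    return differential_lm

def rescored(first_pass_score, ngram, differential_lm):
    """Second-pass score = first-pass score plus the differential score."""
    return first_pass_score + differential_lm.get(ngram, 0.0)
```

With this representation, decoding with the low-order network and then adding the differential score reproduces, path by path, the score the original language model would have assigned.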
In some embodiments, the first decoding network is formed as follows: the low-order language model is fused in a weighted finite state transducer (WFST) to obtain the first decoding network; or the low-order language model, the pronunciation dictionary, and the acoustic model are fused in a weighted finite state transducer to obtain the first decoding network.
In some embodiments, when the first decoding network formed based on the low-order language model is used to decode the voice signal to obtain the paths and corresponding scores, the following implementation may be adopted:
for each frame of the speech signal, the token list in the first decoding network is initialized and the tokens in the token list are traversed, with the following performed for the currently traversed target token: the edges of the first decoding network are traversed starting from the state corresponding to the target token, and the sum of the acoustic model score and the language model score of each traversed edge is calculated using the target frame (i.e., the currently traversed frame) and taken as the score of the traversed edge.
In some embodiments, before traversing the tokens in the token list, the token with the best score at the current time point may further be determined among the tokens in the token list, and the beam width used in the next beam search is calculated according to the beam width set for the determined token.
In some embodiments, when initializing the token list in the first decoding network, the following implementation may be adopted: the score of the first token in the token list is initialized and its preamble pointer is assigned null; a hash lookup is constructed for the second decoding network, and the edges connected to the same state of the second decoding network are stored in hash mode; the lookup key on each state is the input symbol of the corresponding state, and the value corresponding to the key is the edge connecting the corresponding state to its jump state.
In some embodiments, when the score of the traversed edge does not exceed the score threshold, the next state of the state corresponding to the traversed edge is determined; an edge is created between the state corresponding to the target token and the next state, recording in the created edge the input symbol, output symbol, acoustic model score, and language model score of the traversed edge, and the created edge points from the target token to the next token, i.e., from the state corresponding to the target token in the first decoding network to the state corresponding to the next token in the first decoding network; the state corresponding to the next token in the second decoding network is the next state pointed to from the traversed edge; the hypotheses of the hypothesis set of the target token, and the companion hypothesis set of each traversed hypothesis, are then traversed.
In some embodiments, when the decoding paths are rescored using the second decoding network formed based on the differential language model and the output symbol corresponding to the connection is a null symbol, the following implementation may be adopted: in the process of traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, when the output symbol corresponding to the traversed edge is a null symbol, the hypotheses in the hypothesis set of the target token are added, in order of score from small to large, into the pre-established hypothesis set linked list that was initialized to empty.
In some embodiments, when the second decoding network formed based on the differential language model is used and the output symbol corresponding to the connection is a non-null symbol, the following implementation may be adopted: in the process of traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, when the output symbol corresponding to the traversed edge is not a null symbol, the state used for rescoring and the edges starting from that rescoring state are located in the second decoding network; in the second decoding network, all edges starting from the rescoring state are expanded, and a hypothesis set linked list for storing the companion hypotheses is formed during the expansion.
In some embodiments, in forming the hypothesis set linked list that stores the companion hypotheses during expansion, the processing depends on whether the hash table of the rescoring state returns both an edge and a state for the input symbol, as follows (see the sketch after this list):
1) When the hash table of the rescoring state returns both the edge and the state corresponding to the input symbol, a new companion hypothesis set is generated corresponding to the next state pointed to by the queried edge; the state corresponding to the new companion hypothesis set is assigned as the next state pointed to by the queried edge, and the preamble pointer of the new companion hypothesis set is the output symbol of the currently traversed companion hypothesis set; the score of the new companion hypothesis set is calculated as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge; the companion hypotheses in the new companion hypothesis set are added, in order from small to large, into the pre-established hypothesis set linked list that was initialized to empty;
2) When the hash table of the rescoring state is queried for the edge and state corresponding to the input symbol and only the edge is found, the jump state of the rescoring state is pointed to the next state pointed to by the queried edge; the hypothesis set of the target token is replaced with a new companion hypothesis set, whose score is computed as the sum of the following scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the queried edge.
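A minimal sketch of this expansion step, assuming per-state hash tables as built in operation 21 below and treating scores as additive costs; the record and table layouts are assumptions for illustration:

```python
from collections import namedtuple

# Illustrative companion-hypothesis record: rescoring state, accumulated
# score, and preamble pointer (last output symbol).
CoHyp = namedtuple("CoHyp", ["state", "score", "preamble"])

EPS = 0  # assumed id of the null symbol <eps>

def expand_rescoring(cohyp, olabel, ac_cost, graph_cost, hash_tables):
    """Expand one companion hypothesis through the rescoring network for a
    non-null output symbol olabel (cases 1 and 2 above).

    hash_tables[state] maps input symbol -> (lm_cost, next_state); the EPS
    key marks the state's epsilon edge. Returns the new companion
    hypothesis, or None if the network offers no continuation.
    """
    state = cohyp.state
    score = cohyp.score + ac_cost + graph_cost
    while True:
        table = hash_tables[state]
        if olabel in table:                   # case 1: edge and state found
            lm_cost, next_state = table[olabel]
            return CoHyp(next_state, score + lm_cost, olabel)
        if EPS in table:                      # case 2: only an epsilon edge;
            lm_cost, next_state = table[EPS]  # jump states and query again
            score += lm_cost
            state = next_state
            continue
        return None
```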
In some embodiments, for the next token corresponding to the currently traversed token, the pre-established hypothesis set linked list is processed as follows, according to whether the existing companion hypothesis set of the next token is empty (a merge sketch follows this list):
1) Adding the hypothesis set linked list into the companion hypothesis set of the next token when the existing companion hypothesis set of the next token is empty;
2) When the existing companion hypothesis set of the next token is not empty, the hypothesis set linked list is processed according to the hypothesis sets already present in the companion hypothesis set of the next token:
2.1) If a hypothesis set exists in the companion hypothesis set of the next token whose first companion hypothesis set has the same state as the first companion hypothesis set of the hypothesis set linked list, the existing hypothesis set and the companion hypothesis sets in the hypothesis set linked list are merged in order of score from small to large;
2.2) If the state of the first companion hypothesis set of the existing hypothesis set differs from that of the first companion hypothesis set of the hypothesis set linked list, the hypothesis set linked list is inserted into the hypothesis set of the next token according to the score order of the companion hypothesis set heads.
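Cases 1, 2.1 and 2.2 amount to attaching a sorted list to a sorted collection. A minimal sketch, assuming each hypothesis set is a Python list of CoHyp records (as sketched above) kept sorted by score; the container layout is an assumption for illustration:

```python
import heapq

def merge_companion_sets(existing, incoming):
    """Case 2.1: merge two score-sorted companion hypothesis lists."""
    return list(heapq.merge(existing, incoming, key=lambda h: h.score))

def attach_hyp_list(next_token_sets, hyp_list):
    """Attach the newly built hypothesis set linked list to the next token.

    next_token_sets: list of score-sorted companion hypothesis lists,
    ordered by the score of each list's head element.
    """
    if not next_token_sets:                            # case 1: empty
        next_token_sets.append(hyp_list)
        return
    for i, existing in enumerate(next_token_sets):     # case 2.1: same head state
        if existing and hyp_list and existing[0].state == hyp_list[0].state:
            next_token_sets[i] = merge_companion_sets(existing, hyp_list)
            return
    # case 2.2: insert by the score of the companion hypothesis set head
    pos = next((i for i, s in enumerate(next_token_sets)
                if s[0].score > hyp_list[0].score), len(next_token_sets))
    next_token_sets.insert(pos, hyp_list)
```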
In some embodiments, after traversing the hypotheses in the hypothesis set of the target token and the companion hypothesis set of each traversed hypothesis, the target token is moved out of the token list and the next token is added to the token list, until all tokens have been moved out of the token list.
In some embodiments, when the output symbols included in a path satisfying the scoring condition are output as the recognition result, the following implementation may be adopted: the companion hypothesis set with the highest score is found, and the output symbol corresponding to that companion hypothesis set is output as the recognition result.
The algorithm implementation of the decoding scheme provided by the embodiment of the present invention is now described, beginning with the related abbreviations.
<eps> represents the null symbol; ilabel represents the input symbol; olabel represents the output symbol; hypslist refers to the set of hypotheses on a token at decoding time; cohyp refers to the companion hypothesis set generated after rescoring for a given hypothesis set; a token is a data structure that records the scores (including the acoustic model score and the language model score) and information at a certain state at a certain time during decoding; arc denotes an edge.
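Rendered as a minimal Python sketch, the token structure might look as follows; the field names are assumptions chosen to mirror the abbreviations above, not the patent's actual definitions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Token:
    """Token: records the scores (acoustic model and language model) and
    information at a certain state at a certain time during decoding."""
    state: int                       # corresponding state in the first decoding network
    ac_cost: float = 0.0             # accumulated acoustic model score
    graph_cost: float = 0.0          # accumulated language model score
    back: Optional["Token"] = None   # preamble (back) pointer
    hyps: list = field(default_factory=list)  # hypslist: hypothesis sets on this token
```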
In operation 21, the token list is initialized: the score of the first token is initialized to 0.0 and its preamble (back) pointer is assigned NULL; a hash lookup is constructed on G.fst, and the multiple edges originally connected to the same state are stored in hash mode: the lookup key (Key) of each state is an input label (ilabel), and the value (Value) is the edge starting from the state together with its jump state.
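A sketch of this hash construction, assuming the edges of G.fst are available as flat tuples; the representation is an assumption for illustration:

```python
from collections import defaultdict

def build_hash_lookup(g_fst_arcs):
    """Index the edges of G.fst by state and input label (operation 21).

    g_fst_arcs: iterable of (src_state, ilabel, olabel, weight, next_state).
    Returns: dict state -> {ilabel: (weight, next_state)}, so the edges
    leaving one state are stored in hash mode, keyed by input label, with
    the edge weight and jump state as the value.
    """
    tables = defaultdict(dict)
    for src, ilabel, olabel, weight, nxt in g_fst_arcs:
        tables[src][ilabel] = (weight, nxt)
    return dict(tables)
```

These per-state tables are what the rescoring expansion sketched earlier queries by output symbol.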
In operation 22, the frame pointer is incremented by 1, and the existing tokens are processed (the frame to be processed, pointed to by the frame pointer, is also referred to as the current frame or the target frame).
In operation 22.1, the tokens in the token list are traversed, the token with the best score at the current time point is found, and the beam width for pruning in the next search is calculated according to the beam width set in the current search.
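A sketch of the pruning cutoff derived in this operation, reusing the Token sketch above and treating scores as costs where lower is better (an assumption of this sketch):

```python
def pruning_cutoff(tokens, beam_width):
    """Find the best token score of the current frame and derive the
    threshold beyond which edges are pruned (operation 22.1)."""
    best = min(t.ac_cost + t.graph_cost for t in tokens)
    return best + beam_width
```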
In operation 22.2, the token list is traversed again.
In operation 22.3, suppose a certain token, token A, is traversed (hereinafter the currently traversed token is also referred to as the current token, target token, or pending token).
In operation 22.3.1, the state corresponding to token A in TLG.fst (which may be replaced by a dynamic decoding network) is found; this state is denoted state1.
In operation 22.3.2, all edges starting from state1 are traversed in turn; the currently traversed edge is denoted arc1.
In operation 22.3.3, the acoustic model score ac_cost and the language model score graph_cost of edge arc1 are calculated using the current frame, and the scores are recorded.
In operation 22.3.4, if the score exceeds the preset pruning value (i.e., the score threshold), the edge is discarded and processing moves to the next edge; if the score does not exceed the pruning value, a token B is newly created or found in the existing token list; the state corresponding to token B in TLG.fst is the next state pointed to by arc1, which is denoted state2.
In operation 22.3.5, a new connection (Link), i.e., an edge, is created; it records the input symbol, output symbol, acoustic model score, and language model score of the current edge arc1, and points from token A to token B, i.e., it connects the state corresponding to token A in TLG.fst with the state corresponding to token B in TLG.fst.
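Operations 22.3.1 through 22.3.5 can be summarized in one sketch, reusing the Token sketch above; the edge and frame-score representations are assumptions for illustration:

```python
def expand_token(token_a, tlg_arcs, frame_scores, cutoff, token_index):
    """Expand all edges leaving token A's state in TLG.fst, prune by the
    cutoff, and create or reuse token B (operations 22.3.1-22.3.5).

    tlg_arcs[state]: list of (ilabel, olabel, graph_cost, next_state);
    frame_scores[ilabel]: acoustic cost of the current frame for an edge's
    input symbol; token_index: dict state -> existing Token.
    """
    links = []
    for ilabel, olabel, graph_cost, next_state in tlg_arcs[token_a.state]:
        ac_cost = frame_scores[ilabel]            # operation 22.3.3
        score = token_a.ac_cost + token_a.graph_cost + ac_cost + graph_cost
        if score > cutoff:                        # operation 22.3.4: prune
            continue
        token_b = token_index.get(next_state)     # find or create token B
        if token_b is None:
            token_b = Token(state=next_state,
                            ac_cost=token_a.ac_cost + ac_cost,
                            graph_cost=token_a.graph_cost + graph_cost)
            token_index[next_state] = token_b
        # operation 22.3.5: the Link records the edge's symbols and scores
        links.append((token_a, token_b, ilabel, olabel, ac_cost, graph_cost))
    return links
```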
In operation 22.3.6, all hypothesis sets are taken from token A, denoted hyps, and the hypotheses in hyps are traversed.
In operation 22.3.7, assume the currently traversed hypothesis is hypA; it is ensured that the companion hypothesis sets in hypA are ordered by score from small to large.
In operation 22.3.8, a linked list recording multiple companion hypothesis sets is created, denoted hypA', and initially assigned null.
In operation 22.3.9, all companion hypothesis sets in hypothesis hypA are traversed; assume the currently selected companion hypothesis set is cohypA.
In operation 22.3.10, if the output symbol olabel of the edge is not <eps>, operations 22.3.11 through 22.3.12.2 are performed.
In operation 22.3.11, the state corresponding to companion hypothesis set cohypA is found in G2.fst and marked stateA, and all edges starting from stateA in the G.fst used for rescoring are located.
In operation 22.3.12, all edges in G.fst starting from stateA are expanded.
In operation 22.3.12.1, using the hash table on stateA, the edge and state whose input symbol is olabel (i.e., the input symbol of the connection) are queried;
if only the edge with input symbol olabel exists and the state does not, processing continues at operation 22.3.12.2;
if both the edge and the state with input symbol olabel exist, the queried edge is denoted arc2, and the next state pointed to by arc2 is found and recorded as stateA'; a new companion hypothesis set is generated, denoted cohypA', whose state value is set to stateA' and whose preamble pointer is the output symbol of the currently traversed companion hypothesis set cohypA; the score of cohypA' is the sum of: the score of companion hypothesis set cohypA, the acoustic model score ac_cost of the currently traversed edge (i.e., the edge arc1 traversed from token A in TLG.fst), the language model score graph_cost of that edge, and the language model score graph_cost corresponding to arc2;
the new companion hypothesis set cohypA' is added to the hypothesis set linked list hypA' in order of score from small to large, and processing proceeds to operation 22.3.14.
In operation 22.3.12.2, the edge arc2 whose input symbol equals <eps> is traversed, and the jump from stateA points to the next state stateA' indicated by arc2; companion hypothesis set cohypA is replaced by a companion hypothesis set cohypA' whose score is the sum of: the score of cohypA, the acoustic model score ac_cost of edge arc1, the language model score graph_cost of edge arc1, and the graph_cost corresponding to edge arc2; the state corresponding to cohypA' is stateA'. cohypA' is swapped in for cohypA, and processing returns to operation 22.3.10, recursively executing operations 22.3.10 through 22.3.12.2.
In operation 22.3.13, if the output symbol olabel corresponding to the connection is <eps>, operation 22.3.13.1 is performed.
In operation 22.3.13.1, companion hypothesis set cohypA is added directly to the hypothesis set linked list hypA' in order of score from small to large.
In operation 22.3.14, processing returns to operation 22.3.9 and traversal continues until the companion hypothesis sets in hypothesis hypA have been completely traversed.
In operation 22.3.15, for the hypothesis set linked list hypA' generated through the above process, the existing hypothesis sets of token B are checked, ensuring that the hypothesis sets of token B are ordered from small to large by the minimum companion hypothesis set score of each hypothesis set;
if the existing companion hypothesis set of token B is empty, hypA' is added directly into the companion hypothesis set of token B;
if the companion hypothesis set of token B is not empty, it is traversed first: if there is a hypothesis set hypB in the companion hypothesis set of token B whose first companion hypothesis set has the same state as the first companion hypothesis set of hypA', then hypB and hypA' are merged in order of score from small to large; otherwise, hypA' is inserted directly into the hypothesis set of token B in order of the score of the companion hypothesis set head (cohyp_head).
In operation 22.3.16, processing returns to operation 22.3.6, and operations 22.3.6 through 22.3.15 are repeated until all hypothesis sets have been traversed.
In operation 22.3.17, token A is moved out of the token list and token B is added to the token list; processing returns to operation 22.2, and operations 22.2 through 22.3.16 are repeated until all tokens have been moved out of the token list.
In operation 22.4, processing returns to operation 22 and the steps from operation 22 through operation 22.4 are repeated until all frames have been traversed. In operation 22.5, the companion hypothesis set with the highest score is found, and the output symbols corresponding to that companion hypothesis set are output; these output symbols are the recognition result of the decoder.
From the above algorithm implementation, the following can be seen:
Firstly, the decoding network is generated from the low-order language model, so the volume of the generated decoding network is smaller than one generated from the original language model. In addition, the scores of the differential language model are added in real time during decoding and used to rescore the tokens expanded in real time, so each token records the better score it would obtain after rescoring by the original language model; this accelerates decoding while producing decoding results with the same precision as the large language model, i.e., decoding speed is significantly improved without affecting decoding precision.
Secondly, a divide-and-conquer scheme solves the fast sorting and merging needed when each token has an excessive number of companion hypothesis sets, so that more companion hypothesis sets can be used, and a hash method accelerates matching during the expansion of the edges of the decoding network.
Thirdly, a tool is provided for generating the corresponding language model decoding network (denoted G.fst) for a language model, which solves the excessive memory consumption of open-source tools (e.g., OpenFST) when generating large decoding networks, makes decoding and recognition with language models of hundreds of GB possible, and improves the accuracy and real-time performance of the whole automatic speech recognition system.
As can be seen from the above, the decoding scheme provided by the embodiment of the present invention uses a divide-and-conquer scheme to sort and merge the companion hypothesis sets in each token, and when the language model decoding network is expanded, a mid-square hash, for example, may be used to accelerate the search: the square (Key)^2 of the key is computed, and the middle part of (Key)^2 is taken as the hash value for the key; this can significantly increase the speed of searching for the optimal path in the language model.
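A minimal sketch of mid-square hashing; the bit-width parameters are illustrative:

```python
def mid_square_hash(key: int, table_bits: int = 16, key_bits: int = 32) -> int:
    """Square the key and take the middle bits of the square as the hash."""
    square = key * key
    shift = (2 * key_bits - table_bits) // 2   # centre the extracted window
    return (square >> shift) & ((1 << table_bits) - 1)
```

Because the middle bits of the square depend on every bit of the key, this gives a cheap, well-spread index for the per-state edge tables.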
Therefore, the decoding scheme provided by the embodiment of the present invention can generate a corresponding decoding network for a large language model and decode with it, making up for the lack of a practical scheme for generating a decoding network from an original language model, and provides a high-precision, industrial-level large-model solution.
In some embodiments, a dynamic decoder is provided as an alternative to the above decoding scheme: the decoding paths are dynamically expanded through a dictionary, and a language model decoding network then dynamically rescores and prunes the decoded paths. The advantage of this alternative is that there is no need to generate a TLG decoding network by combining the pronunciation dictionary with the language model; only the decoding network corresponding to the language model is needed, which simplifies the preparation for decoding and further reduces the memory space consumed by the decoding network.
Continuing with an illustration of an embodiment of the present invention, referring to fig. 9A and 9B, fig. 9A is an optional structural schematic diagram of a TLG decoding network provided by the embodiment of the present invention, and fig. 9B is an optional structural schematic diagram of a language model decoding network (G.fst) provided by the embodiment of the present invention. Decoding is first performed in TLG.fst: one decoded path is 0-1-2-4-6, whose output symbols are "today weather" and whose score is 0+0.8+1.2=2.0; another decoded path is 0-1-2-4-7, whose output symbols are "today day start" and whose score is 0+0.8+1.0=1.8, so this path initially appears better than the path with output "today weather".
After rescoring with G.fst, "today weather" is found to incur an additional 0.1, giving a final path score of 2.0+0.1=2.1, while "today day start" incurs an additional 0.4 in G.fst, giving a final score of 1.8+0.4=2.2.
After rescoring, the score 2.1 of "today weather" is smaller than the score 2.2 of "today day start", so the final output recognition result is "today weather".
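Restated as a short calculation (scores treated as costs, consistent with the lower-is-better comparison above; the variable names are illustrative):

```python
# First-pass scores in TLG.fst (fig. 9A):
today_weather = 0 + 0.8 + 1.2     # path 0-1-2-4-6 -> 2.0
today_day_start = 0 + 0.8 + 1.0   # path 0-1-2-4-7 -> 1.8

# Second-pass increments read off G.fst (fig. 9B):
today_weather += 0.1              # -> 2.1
today_day_start += 0.4            # -> 2.2

# Rescoring flips the ranking: 2.1 < 2.2, so "today weather" is output.
assert today_weather < today_day_start
```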
Referring to fig. 10, fig. 10 is a schematic diagram of an optional application scenario of a speech recognition system applying the decoding scheme provided in the embodiment of the present invention, illustrating an actual application scenario.
As an example, the automatic speech recognition system provided in the embodiment of the present invention may be implemented as an offline recognition scheme of a terminal (e.g., a smart phone, a tablet computer, etc.), where the terminal obtains relevant data of speech recognition from a cloud in advance, and relies on a processor and a memory of the terminal to perform speech recognition independent of a server, such as speech input in various APPs.
As another example, the automatic speech recognition system provided by the embodiment of the present invention may be implemented as a cloud speech recognition scheme. The applicable products are scenes that need to call a speech recognition function, such as smart home, speech input transcription, vehicle navigation systems, and smart speakers; the scene application is completed by calling the speech recognition capability of the cloud. The application can be packaged as a speech recognition APP, and speech recognition engines embedded in various APPs provide effective speech recognition support for various intelligent speech interaction scenarios.
In summary, the embodiment of the present invention provides an automatic speech recognition system with a decoding scheme that can improve the recognition accuracy of automatic speech recognition systems provided by the related art while maintaining or improving their recognition speed. Because the embodiment of the present invention can rescore using a decoding network generated from an industrial-level language model, it contrasts with the related art, which cannot generate a practical decoding network for an industrial-level language model (a TLG decoding network generated from a language model of the same scale is too large to be practical). Compared with the HCLG decoding network used in stage 1, the low-order TLG decoding network used in stage 1 is significantly smaller, and rescoring with the decoding network constructed from the differential language model keeps the recognition accuracy consistent with that obtained when an HCLG decoding network is used in stage 1.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (14)

1. A speech recognition method, applied in a speech recognition system, comprising:
collecting a voice signal in an analog form, and processing the voice signal in the analog form to form a voice signal in a digital form;
preprocessing the voice signal in the digital form to eliminate background noise and overlapped frames in the voice signal in the digital form;
extracting the voice characteristics of the voice signals in the digital form, and performing characteristic compensation processing and characteristic normalization processing on the voice characteristics to form dynamic characteristics corresponding to the voice signals in the digital form;
splitting an original language model into a low-order language model and a differential language model, wherein the order of the low-order language model is lower than that of the original language model, and the differential language model is the difference between the original language model and the low-order language model;
decoding a speech signal using a first decoding network formed based on the low-order language model to obtain a decoding path and a corresponding score, and,
re-scoring the decoding path using a second decoding network formed based on the differential language model;
and outputting the output symbols included by the decoding paths meeting the scoring conditions as recognition results to form corresponding voice recognition results.
2. The speech recognition method of claim 1, further comprising:
fusing the low-order language model in a weighted finite state transducer to obtain the first decoding network, or,
fusing the low-order language model, the pronunciation dictionary and the acoustic model in a weighted finite state transducer to obtain the first decoding network.
3. The speech recognition method of claim 1, wherein decoding the speech signal using the first decoding network formed based on the low-order language model to obtain a decoding path and a corresponding score comprises:
performing the following processing for each frame of the speech signal:
initializing a token list in the first decoding network, and traversing tokens in the token list;
wherein, the following processing is executed for the currently traversed target token:
traversing the edges of the first decoding network from the state corresponding to the target token, calculating the sum of the acoustic model score and the language model score of the traversed edges by using the target frame, and taking the sum as the score of the traversed edges.
4. The speech recognition method of claim 3, further comprising:
before traversing the tokens in the token list,
determining the token with the optimal score at the current time point among the tokens in the token list, and calculating the beam width used in the next beam search according to the beam width set for the determined token.
5. The speech recognition method of claim 3, wherein initializing the token list in the first decoding network comprises:
initializing the score of a first token in the token list and assigning a preamble pointer to be null;
carrying out hash lookup construction on the second decoding network, and storing the edges connected to the same state of the second decoding network in a hash mode;
the search key on each state of the second decoding network is an input symbol of the corresponding state, and the value corresponding to the key is an edge connecting the corresponding state and a jump state of the corresponding state.
6. The speech recognition method of claim 5, further comprising:
when the score of the traversed edge does not exceed the score threshold, determining the next state of the states corresponding to the traversed edge;
creating an edge connecting the state corresponding to the target token and the next state, recording in the created edge the input symbol, output symbol, acoustic model score and language model score of the traversed edge, and making the created edge point from the target token to the next token;
wherein the state corresponding to the next token in the second decoding network is a next state pointed to from the traversed edge in the first decoding network;
traversing hypotheses of the set of hypotheses of the target token, and a companion set of hypotheses for each hypothesis traversed.
7. The speech recognition method of claim 6, wherein re-scoring the decoding path using a second decoding network formed based on the differential language model comprises:
in traversing the hypotheses of the set of hypotheses of the target token and the companion set of hypotheses for each of the traversed hypotheses,
and when the output symbol corresponding to the traversed edge is a null symbol, adding the hypotheses in the hypothesis set of the target token into the hypothesis set linked list which is pre-established and assigned as null according to the sequence of scores from small to large.
8. The speech recognition method of claim 6, wherein re-scoring the decoding path using a second decoding network formed based on the differential language model comprises:
in traversing the hypotheses of the set of hypotheses for the target token and the set of companion hypotheses for each of the traversed hypotheses,
locating, in the second decoding network, a state for rescoring and an edge starting from the rescored state when the output symbol corresponding to the traversed edge is not a null symbol, and,
and in the second decoding network, expanding all edges starting from the rescored state, and forming a hypothesis set linked list for storing the companion hypothesis in the expanding process.
9. The speech recognition method of claim 8, wherein forming a linked list for storing companion hypotheses during the expanding comprises:
when the edge and the state corresponding to the input symbol are inquired by using the hash table of the re-grading state,
generating a corresponding new companion hypothesis set corresponding to the next state pointed by the queried edge, assigning the state corresponding to the new companion hypothesis set to be the next state pointed by the queried edge, and enabling a preamble pointer corresponding to the new companion hypothesis set to be an output symbol of the currently traversed companion hypothesis set;
calculating a score for the new companion hypothesis set as the sum of the scores: scoring a currently traversed companion hypothesis set, scoring an acoustic model of a currently traversed edge, scoring a language model of the currently traversed edge, and scoring a language model corresponding to the queried edge;
and adding the companion hypotheses in the new companion hypothesis set to the hypothesis set linked list which is pre-established and assigned to be empty according to the sequence from small to large.
10. The speech recognition method of claim 8, wherein forming a linked list of hypothesis sets for storing companion hypotheses in the expanding process comprises:
when the hash table of the re-scoring state is used to query the corresponding edge and state of the input symbol and only the corresponding edge is queried,
directing a jump state from the rescored state to a next state pointed to by the queried edge;
replacing the hypothesis set of the target token with a new companion hypothesis set;
calculating a score for the new companion hypothesis set as the sum of the scores: the score of the currently traversed companion hypothesis set, the acoustic model score of the currently traversed edge, the language model score of the currently traversed edge, and the language model score corresponding to the currently queried edge.
11. The speech recognition method of claim 6, further comprising:
when the existing companion assumption set for the next token is not empty,
if a hypothesis set exists in the companion hypothesis set of the next token and the state of the first companion hypothesis set of the existing hypothesis set is the same as that of the first companion hypothesis set of the hypothesis set linked list, merging the existing hypothesis set and the companion hypothesis sets in the hypothesis set linked list in order of scores from small to large,
and if the state of the first companion hypothesis set of the existing hypothesis set is different from that of the first companion hypothesis set of the hypothesis set linked list, inserting the hypothesis set linked list into the hypothesis set of the next token according to the scoring sequence of the head part of the companion hypothesis set.
12. A speech recognition system, the system comprising:
the sampling analog/digital conversion module is used for acquiring a voice signal in an analog form and processing the voice signal in the analog form to form a voice signal in a digital form;
the preprocessing module is used for preprocessing the voice signal in the digital form so as to eliminate background noise and overlapped frames in the voice signal in the digital form;
the characteristic extraction module is used for extracting the voice characteristics of the voice signal in the digital form;
the characteristic processing module is used for carrying out characteristic compensation processing and characteristic normalization processing on the voice characteristics to form dynamic characteristics corresponding to the voice signals in the digital form;
the decoder is used for splitting an original language model into a low-order language model and a differential language model, wherein the order of the low-order language model is lower than that of the original language model, and the differential language model is the difference between the original language model and the low-order language model;
the decoder is used for decoding the speech signal to obtain a decoding path and a corresponding score by using a first decoding network formed based on the low-level language model, and,
the decoder is used for re-scoring the decoding path by using a second decoding network formed based on the differential language model;
and the decoder is used for outputting the output symbols included by the decoding paths meeting the scoring conditions as recognition results so as to form corresponding voice recognition results.
13. A speech recognition system, characterized in that the speech recognition system comprises:
a memory for storing executable instructions;
a processor for implementing the speech recognition method of any one of claims 1 to 11 when executing the executable instructions stored by the memory.
14. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the speech recognition method of any one of claims 1 to 11.
CN201910741739.0A 2018-01-09 2018-01-09 Voice recognition method, voice recognition system and storage medium Active CN110364171B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910741739.0A CN110364171B (en) 2018-01-09 2018-01-09 Voice recognition method, voice recognition system and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810020113.6A CN108305634B (en) 2018-01-09 2018-01-09 Decoding method, decoder and storage medium
CN201910741739.0A CN110364171B (en) 2018-01-09 2018-01-09 Voice recognition method, voice recognition system and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810020113.6A Division CN108305634B (en) 2018-01-09 2018-01-09 Decoding method, decoder and storage medium

Publications (2)

Publication Number Publication Date
CN110364171A CN110364171A (en) 2019-10-22
CN110364171B true CN110364171B (en) 2023-01-06

Family

ID=62868393

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201810020113.6A Active CN108305634B (en) 2018-01-09 2018-01-09 Decoding method, decoder and storage medium
CN201910741739.0A Active CN110364171B (en) 2018-01-09 2018-01-09 Voice recognition method, voice recognition system and storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201810020113.6A Active CN108305634B (en) 2018-01-09 2018-01-09 Decoding method, decoder and storage medium

Country Status (1)

Country Link
CN (2) CN108305634B (en)

Also Published As

Publication number Publication date
CN108305634A (en) 2018-07-20
CN108305634B (en) 2020-10-16
CN110364171A (en) 2019-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant