CN113707137B - Decoding realization method and device - Google Patents


Info

Publication number
CN113707137B
CN113707137B (application CN202111007250.4A)
Authority
CN
China
Prior art keywords
score
path
state
self
jump
Prior art date
Legal status
Active
Application number
CN202111007250.4A
Other languages
Chinese (zh)
Other versions
CN113707137A (en)
Inventor
肖艳红
赵茂祥
李全忠
何国涛
蒲瑶
Current Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee
Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202111007250.4A
Publication of CN113707137A
Application granted
Publication of CN113707137B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142: Hidden Markov Models [HMMs]
    • G10L15/148: Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Abstract

The invention relates to a decoding realization method and device, comprising: providing a topology for the HMM model of a modeling unit, the topology comprising a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are: when each frame of audio is decoded, calculate the acoustic and language scores of the blank character used by the self-jump path and the acoustic and language scores of the valid character used by the transfer path; compare the scores of the paths and take the highest as the emission state score; and perform sequence alignment according to the emission state score. The invention can greatly reduce the number of models in the decoding network, and thereby greatly reduce the memory required during decoding.

Description

Decoding realization method and device
Technical Field
The invention belongs to the technical field of neural networks, and particularly relates to a decoding realization method and device.
Background
In speech recognition, the input speech sequence and the output sequence are of unequal length: it is difficult to assign a pronunciation unit to a single frame of data, whereas tens of frames can readily be judged to correspond to a pronunciation unit. Traditional acoustic model training for speech recognition requires knowing what each frame of data corresponds to, so the speech must be force-aligned as a preprocessing step before training. By contrast, acoustic model training that uses CTC as the loss function is fully end to end: the data need not be aligned in advance, and only an input sequence and an output sequence are required. The CTC model introduces a blank character for alignment with the input features; the blank character has no output meaning. In decoding based on a CTC model, because each modeling unit is connected to a blank character, the decoding network contains a large number of blank-character models, even though the blank character has no actual output meaning.
The HMM is a model commonly used for sequence alignment problems and plays an important role in the decoding stage of speech recognition. It comprises the following parts:
a set of N emission states; state transition probabilities; an observation sequence, where each observation o_t belongs to the set U of acoustic model modeling units; emission probabilities, i.e. acoustic model likelihoods, each representing the probability of seeing observation o_t in state i; and two special non-emitting states (a start state and an end state), which make it convenient to splice several HMMs into a larger HMM.
In the related art, the topology of the HMM model includes a start state, an end state, and an emission state. The edges between states represent the direction and weight of each jump. Each emission state represents a modeling unit of the acoustic model (the modeling unit may be a phoneme, a pinyin syllable, a word, etc.), and its emission probability at time t is the acoustic model likelihood score of that modeling unit at time t.
The topology and the sequence alignment process are as follows: in decoding based on a CTC model, each modeling unit has one HMM, and each HMM has three states; the HMM topology of the blank character can self-jump, while the HMM topologies of the other modeling units (the valid characters) cannot.
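As a rough illustration (all names are hypothetical; the patent gives no code), the prior-art per-unit topology described above can be sketched as follows, where only the blank character's HMM receives a self-jump edge:

```python
from dataclasses import dataclass, field

@dataclass
class HmmTopology:
    unit: str            # modeling unit (phoneme, pinyin syllable, word, ...)
    can_self_jump: bool  # True only for the blank character's HMM
    # edges as (from_state, to_state, weight); states: "start", "emit", "end"
    edges: list = field(default_factory=list)

def build_unit_hmm(unit: str, is_blank: bool) -> HmmTopology:
    """Build the three-state HMM of one modeling unit (prior-art topology)."""
    edges = [("start", "emit", 1.0), ("emit", "end", 1.0)]
    if is_blank:
        edges.append(("emit", "emit", 1.0))  # blank's HMM may self-jump
    return HmmTopology(unit=unit, can_self_jump=is_blank, edges=edges)
```

In this conventional network a blank HMM sits between every pair of valid-character HMMs, which is exactly the model count the invention targets.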
Because the prior-art decoding network contains a large number of blank-character HMMs, and the blank character has no actual output meaning, the decoding network is large, so speech recognition decoding requires a large amount of memory.
Disclosure of Invention
In view of the above, the present invention aims to overcome the shortcomings of the prior art and to provide a decoding implementation method and apparatus, so as to solve the prior-art problem that the large decoding network makes speech recognition decoding memory-intensive.
In order to achieve the above purpose, the invention adopts the following technical scheme: a decoding implementation method, comprising:
providing a topology of the HMM model of a modeling unit, the topology including a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are as follows:
when each frame of audio is decoded, calculate the acoustic and language scores of the blank character used by the self-jump path and the acoustic and language scores of the valid character used by the transfer path;
compare the scores of the paths, and determine the highest score as the emission state score;
and perform sequence alignment according to the emission state score.
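A minimal sketch of these three steps for a single emission state, assuming log-domain scores that are simply summed (function and variable names are illustrative, not from the patent):

```python
def emission_state_score(blank_acoustic: float, blank_language: float,
                         char_acoustic: float, char_language: float):
    """Return (score, winning_path) for one emission state in one frame."""
    # self-jump path uses the blank character's acoustic and language scores
    self_jump = blank_acoustic + blank_language
    # transfer path uses the valid character's acoustic and language scores
    transfer = char_acoustic + char_language
    if self_jump >= transfer:
        return self_jump, "self_jump"  # frame aligns with the blank character
    return transfer, "transfer"        # frame aligns with the valid character
```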
Further, the decoding uses the Viterbi algorithm to calculate the acoustic score and the language score of the blank character used by the self-jump path, and the acoustic score and the language score of the modeling units other than the blank character used by the transfer path.
Further, comparing the scores of the paths and determining the highest-scoring path includes:
comparing the sum of the acoustic score and the language score of each path;
determining the path whose sum of acoustic and language scores is highest as the highest-scoring path.
Further, performing sequence alignment according to the highest-scoring path includes:
if the emission state score of the current frame is the score of the blank character, the frame is aligned with the blank character;
if the emission state score of the current frame is the score of a valid character, the frame is aligned with the modeling unit to which the valid character belongs.
Further, if the current frame is aligned with the blank character, this represents a self-jump of the modeling unit; if the current frame is aligned with a valid character, the HMM state of the modeling unit to which the valid character belongs jumps from the emission state to the end state, and the end state continues to expand to the start states of other modeling units, until decoding ends.
Further, voice data is obtained, and the toned pinyin corresponding to the voice data is modeled using initials, finals, and tones, generating a plurality of modeling units.
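How a toned pinyin syllable might be split into initial, final, and tone can be sketched as below; the initial inventory is a simplified assumption, since the patent does not list one:

```python
# Simplified pinyin initials; multi-letter initials must be tried first.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable: str):
    """Split e.g. 'ni3' into ('n', 'i', '3'): initial, final, tone."""
    tone = syllable[-1] if syllable[-1].isdigit() else "5"  # 5 = neutral tone
    body = syllable[:-1] if syllable[-1].isdigit() else syllable
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone  # zero-initial syllable such as 'an1'
```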
An embodiment of the present application provides a decoding implementation apparatus, including:
the building module is used for providing a topology of the HMM model of a modeling unit, the topology including a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are as follows:
when each frame of audio is decoded, calculate the acoustic and language scores of the blank character used by the self-jump path and the acoustic and language scores of the valid character used by the transfer path;
compare the scores of the paths, and determine the highest score as the emission state score;
and perform sequence alignment according to the emission state score.
By adopting the technical scheme, the invention has the following beneficial effects:
the invention provides a decoding realization method and a decoding realization device, comprising a topological structure of an HMM model of a modeling unit, wherein the topological structure comprises a starting state, a transmitting state and an ending state; setting a self-jump edge in the transmitting state for self-jump of the transmitting state; the transmit state includes a self-hop path and a transfer path such that the topology completes sequence alignment; the step of aligning the topology completion sequence is as follows: when each frame of audio is decoded, calculating the acoustic score and the language score of blank characters used by the self-jump path and the acoustic score and the language score of effective characters used by the transfer path; comparing the scores of each path, and determining the highest score as the emission state score; and performing sequence alignment according to the emission state score. The invention can greatly reduce the number of models in the decoding network, thereby greatly reducing the memory required in the decoding process.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a topology sequence alignment step of an HMM model in the prior art;
FIG. 2 is a schematic diagram of a topology of an HMM model provided by the present invention;
FIG. 3 is a schematic diagram of steps for alignment of a topology completion sequence according to the present invention;
FIG. 4 is a flow chart of the decoding process according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be described in detail below. It is apparent that the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the protection scope of the invention.
The prior-art HMM topology realizes sequence alignment as follows. As shown in fig. 1, the input audio ("hello") is framed and windowed, features are extracted, and the acoustic model produces an acoustic posterior sequence. The third part of fig. 1 is the HMM sequence alignment path; only the start state of the first HMM and the end state of the last HMM are kept in the figure, and the start and end states of the intermediate HMMs are omitted. When the posterior sequence is aligned with a modeling unit, that modeling unit's HMM is used, and the emission probability of the model is the likelihood score of the modeling unit at time t; here we assume the modeling unit is pinyin. While consecutive speech frames align with the blank character, the blank character's HMM self-jumps in its emission state. When the alignment result at some time (t=3) is another modeling unit such as ni3, the blank character's HMM transfers from its emission state to its end state; that end state connects to the start state of the other modeling unit (ni3), which jumps from its start state to its emission state, and the emission probability for the frame is the acoustic likelihood score of ni3. Since ni3 cannot self-jump, it then jumps from its emission state to its end state and expands a new round of connections to the start states of other modeling units. The whole decoding process uses the Viterbi algorithm: at each time t, the acoustic and language scores of the blank character and of the other modeling units are computed, pruning is performed, and the optimal decoding result is finally obtained.
And performing sequence alignment according to the decoding result.
A specific decoding implementation method and apparatus provided in the embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 3, the decoding implementation method provided in the embodiments of the present application comprises:
providing a topology of the HMM model of a modeling unit, the topology including a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are as follows:
s101, when each frame of audio is decoded, calculating the acoustic score and the language score of blank characters used by a self-jump path and the acoustic score and the language score of effective characters used by a transition path;
s102, comparing the scores of each path, and determining the highest score as a transmitting state score;
s103, performing sequence alignment according to the emission state score.
It should be noted that, as shown in fig. 2, each circle represents a state of the HMM: the dark circle represents the start state, the double circle represents the end state, and the circle between them represents the emission state. The edges between states represent the direction and weight of each jump; the arc-shaped edge indicates that the state can jump to itself.
Compared with the prior art, the topology provided by the application removes the blank character's HMM and adds a self-jump edge to the emission state of every other acoustic modeling unit's HMM, where the emission probability of the self-jump edge is the emission probability of the blank character. For the input audio ("hello"), the topology provided by the application completes the same sequence alignment, as follows. The emission state of each pronunciation unit's HMM has a self-jump edge, but the emission probability of the self-jump is that of the blank character, while the emission probability of the transfer to the end state is that of the valid character. When each frame of audio is decoded, the two paths out of each modeling unit's emission state, self-jump and transfer, are traversed: the self-jump path uses the acoustic and language scores of the blank character, and the transfer path uses the acoustic and language scores of the valid character. The score of each path is computed, and the higher one is taken as the score of that character's emission state. Sequence alignment is then performed according to the emission state score.
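The alignment of one unit's emission state can be sketched as follows, assuming each frame's acoustic and language scores have already been summed into one log score per symbol (all names and data are invented for illustration):

```python
BLANK = "<blank>"  # hypothetical label for the blank character

def align_unit(frames, unit):
    """Align frames to one unit's emission state: self-jump on blank,
    transfer toward the end state when the valid character wins."""
    alignment, total = [], 0.0
    for scores in frames:
        self_jump = scores[BLANK]  # self-jump path: blank character's score
        transfer = scores[unit]    # transfer path: valid character's score
        if self_jump >= transfer:
            alignment.append(BLANK)  # stay in the emission state
            total += self_jump
        else:
            alignment.append(unit)   # jump to the end state
            total += transfer
            break  # the end state then expands other units' start states
    return alignment, total
```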
In some embodiments, the decoding uses the Viterbi algorithm to calculate the acoustic and language scores of the blank character used by the self-jump path, and the acoustic and language scores of the modeling units other than the blank character used by the transfer path.
Preferably, comparing the scores of the paths and determining the highest-scoring path includes:
comparing the sum of the acoustic score and the language score of each path;
determining the path whose sum of acoustic and language scores is highest as the highest-scoring path.
Preferably, performing sequence alignment according to the highest-scoring path includes:
if the emission state score of the current frame is the score of the blank character, the frame is aligned with the blank character;
if the emission state score of the current frame is the score of a valid character, the frame is aligned with the modeling unit to which the valid character belongs.
In some embodiments, if the current frame is aligned with the blank character, this represents a self-jump of the modeling unit; if the current frame is aligned with a valid character, the HMM state of the modeling unit to which the valid character belongs jumps from the emission state to the end state, and the end state continues to expand to the start states of other modeling units, until decoding ends.
Preferably, the method comprises obtaining voice data and modeling the toned pinyin corresponding to the voice data using initials, finals, and tones to generate a plurality of modeling units.
As shown in fig. 4, the technical solution provided in the application folds the blank character's HMM into the Viterbi algorithm. If a frame is aligned with the blank character, this represents a self-jump of the modeling unit (e.g. ni3); if the frame is aligned with the modeling unit (ni3), the HMM state of the modeling unit jumps from the emission state to the end state, and the end state continues to expand the start states of other modeling units, until decoding ends. The algorithm used in the decoding process is still the Viterbi algorithm, and the alignment results are identical. Since in the prior art a blank character is connected between the modeling units of every non-blank character on every path, the number of blank-character HMMs during decoding is very large, yet the blank character has no output meaning; the improved scheme therefore greatly reduces the number of models in the decoding network without affecting the recognition result, and in turn greatly reduces the memory required during decoding.
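A greedy, beamless caricature of the per-frame decisions of fig. 4 (a real decoder uses the Viterbi algorithm with pruning, as stated above; the frame scores below are invented):

```python
BLANK = "<blank>"  # hypothetical label for the blank character

def greedy_decode(frames, units):
    """Per frame: pick blank (self-jump) if it beats the best valid unit,
    otherwise emit the best unit (transfer to end state, expand a new unit)."""
    output = []
    for scores in frames:
        best_unit = max(units, key=lambda u: scores[u])
        if scores[BLANK] >= scores[best_unit]:
            continue                # self-jump: frame aligns with blank
        output.append(best_unit)    # transfer: frame aligns with the unit
    return output
```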
An embodiment of the present application provides a decoding implementation apparatus, including:
the building module is used for providing a topology of the HMM model of a modeling unit, the topology including a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are as follows:
when each frame of audio is decoded, calculate the acoustic and language scores of the blank character used by the self-jump path and the acoustic and language scores of the valid character used by the transfer path;
compare the scores of the paths, and determine the highest score as the emission state score;
and perform sequence alignment according to the emission state score.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for executing the decoding implementation method provided by any one of the embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the present invention provides a decoding realization method and apparatus, comprising: providing a topology for the HMM model of a modeling unit, the topology comprising a start state, an emission state, and an end state; setting a self-jump edge on the emission state so that the emission state can jump to itself; the emission state thus includes a self-jump path and a transfer path, enabling the topology to complete sequence alignment. The steps by which the topology completes sequence alignment are: when each frame of audio is decoded, calculate the acoustic and language scores of the blank character used by the self-jump path and the acoustic and language scores of the valid character used by the transfer path; compare the scores of the paths and take the highest as the emission state score; and perform sequence alignment according to the emission state score. The invention can greatly reduce the number of models in the decoding network, and thereby greatly reduce the memory required during decoding.
It can be understood that the above-provided method embodiments correspond to the above-described apparatus embodiments, and corresponding specific details may be referred to each other and will not be described herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A decoding implementation method, characterized in that it comprises:
providing a topology of an HMM model of a modeling unit, the topology including a start state, an emission state, and an end state; setting a self-jump edge in the transmitting state for self-jump of the transmitting state; the transmit state includes a self-hop path and a transfer path such that the topology completes sequence alignment; the step of aligning the topology completion sequence is as follows:
when each frame of audio is decoded, calculating the acoustic score and the language score of blank characters used by the self-jump path and the acoustic score and the language score of effective characters used by the transfer path; comparing the scores of each path, and determining the highest score as the emission state score; performing sequence alignment according to the emission state score; comparing the scores of each path, and determining the path with the highest score:
comparing the sum of the acoustic score and the language score of each path;
the path for which the sum of the acoustic score and the language score is the highest score is determined as the path of the highest score.
2. The method of claim 1, wherein
the decoding calculates an acoustic score and a language score of a blank character used by the self-jump path and an acoustic score and a language score of a modeling unit outside the blank character used by the transfer path by using a Viterbi algorithm.
3. The method of claim 1, wherein performing sequence alignment according to the highest scoring path comprises:
if the transmission state score of the current frame is the score of the blank character, the frame is aligned with the blank character;
if the transmission status score of the current frame is a score of a valid character, it indicates that the frame is aligned with the modeling unit to which the valid character belongs.
4. The method of claim 3, wherein
if the current frame is aligned with the blank character, the self-jump of the modeling unit is represented, if the current frame is aligned with the valid character, the HMM state of the modeling unit to which the valid character belongs jumps from the transmitting state to the ending state, and the ending state of the HMM state continues to be expanded and connected with the starting states of other modeling units until decoding is finished.
5. The method according to any one of claims 1 to 4, wherein
and acquiring voice data, modeling the tone pinyin corresponding to the voice data by adopting initials, finals and tones, and generating a plurality of modeling units.
6. A decoding implementation apparatus, comprising:
the building module is used for providing a topological structure of the HMM model of the modeling unit, wherein the topological structure comprises a starting state, a transmitting state and an ending state; setting a self-jump edge in the transmitting state for self-jump of the transmitting state; the transmit state includes a self-hop path and a transfer path such that the topology completes sequence alignment; the step of aligning the topology completion sequence is as follows:
when each frame of audio is decoded, calculating the acoustic score and the language score of blank characters used by the self-jump path and the acoustic score and the language score of effective characters used by the transfer path; comparing the scores of each path, and determining the highest score as the emission state score;
performing sequence alignment according to the emission state score;
comparing the scores of each path, and determining the path with the highest score:
comparing the sum of the acoustic score and the language score of each path;
the path for which the sum of the acoustic score and the language score is the highest score is determined as the path of the highest score.
CN202111007250.4A 2021-08-30 2021-08-30 Decoding realization method and device Active CN113707137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007250.4A CN113707137B (en) 2021-08-30 2021-08-30 Decoding realization method and device


Publications (2)

Publication Number Publication Date
CN113707137A (en): 2021-11-26
CN113707137B (en): 2024-02-20

Family

ID=78657035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007250.4A Active CN113707137B (en) 2021-08-30 2021-08-30 Decoding realization method and device

Country Status (1)

Country Link
CN (1) CN113707137B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117357073B (en) * 2023-12-07 2024-04-05 北京清雷科技有限公司 Sleep stage method and device based on GMM-HMM model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007225931A (en) * 2006-02-23 2007-09-06 Advanced Telecommunication Research Institute International Speech recognition system and computer program
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
CN106710606A (en) * 2016-12-29 2017-05-24 百度在线网络技术(北京)有限公司 Method and device for treating voice based on artificial intelligence
CN108269568A (en) * 2017-01-03 2018-07-10 中国科学院声学研究所 A kind of acoustic training model method based on CTC
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN110136715A (en) * 2019-05-16 2019-08-16 北京百度网讯科技有限公司 Audio recognition method and device
CN112133285A (en) * 2020-08-31 2020-12-25 北京三快在线科技有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI114051B (en) * 2001-11-12 2004-07-30 Nokia Corp Procedure for compressing dictionary data

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007225931A (en) * 2006-02-23 2007-09-06 Advanced Telecommunication Research Institute International Speech recognition system and computer program
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
CN105529027A (en) * 2015-12-14 2016-04-27 百度在线网络技术(北京)有限公司 Voice identification method and apparatus
CN106710606A (en) * 2016-12-29 2017-05-24 百度在线网络技术(北京)有限公司 Method and device for treating voice based on artificial intelligence
CN108269568A (en) * 2017-01-03 2018-07-10 中国科学院声学研究所 A kind of acoustic training model method based on CTC
WO2018232591A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Sequence recognition processing
CN108305634A (en) * 2018-01-09 2018-07-20 深圳市腾讯计算机系统有限公司 Coding/decoding method, decoder and storage medium
CN110136715A (en) * 2019-05-16 2019-08-16 北京百度网讯科技有限公司 Audio recognition method and device
CN112133285A (en) * 2020-08-31 2020-12-25 北京三快在线科技有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Transformer-based continuous speech recognition for Vietnamese; Liu Jiawen; Qu Dan; Yang Xukui; Zhang Hao; Tang Jun; Journal of Information Engineering University; Vol. 21, No. 02; pp. 129-133 *
Research on several problems of sequence mapping based on encoder-decoder models; Hou Junfeng; Information Science & Technology, No. 01; pp. 1-95 *

Also Published As

Publication number Publication date
CN113707137A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN111145728B (en) Speech recognition model training method, system, mobile terminal and storage medium
WO2017101450A1 (en) Voice recognition method and device
WO2017076222A1 (en) Speech recognition method and apparatus
US7562010B1 (en) Generating confidence scores from word lattices
KR20220035222A (en) Speech recognition error correction method, related devices, and readable storage medium
CN105869629B (en) Audio recognition method and device
CN107123417A (en) Optimization method and system are waken up based on the customized voice that distinctive is trained
US20150019214A1 (en) Method and device for parallel processing in model training
US20210193121A1 (en) Speech recognition method, apparatus, and device, and storage medium
CN109448719A (en) Establishment of Neural Model method and voice awakening method, device, medium and equipment
CN111916058A (en) Voice recognition method and system based on incremental word graph re-scoring
CN108389575B (en) Audio data identification method and system
WO2015003436A1 (en) Method and device for parallel processing in model training
CN108710704A (en) Determination method, apparatus, electronic equipment and the storage medium of dialogue state
CN111243574B (en) Voice model adaptive training method, system, device and storage medium
CN113707137B (en) Decoding realization method and device
CN110705254A (en) Text sentence-breaking method and device, electronic equipment and storage medium
CN105845130A (en) Acoustic model training method and device for speech recognition
CN105989839A (en) Speech recognition method and speech recognition device
WO2021139233A1 (en) Method and apparatus for generating data extension mixed strategy, and computer device
JP2002215187A (en) Speech recognition method and device for the same
CN111386566A (en) Device control method, cloud device, intelligent device, computer medium and device
CN111081254A (en) Voice recognition method and device
CN111833852B (en) Acoustic model training method and device and computer readable storage medium
CN111128172B (en) Voice recognition method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant