CN106782513B

CN106782513B - Speech recognition realization method and system based on confidence level

Info

Publication number: CN106782513B
Application number: CN201710060942.2A
Authority: CN
Inventors: 俞凯; 陈哲怀
Original assignee: Shanghai Jiaotong University; Suzhou Speech Information Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2017-01-25
Filing date: 2017-01-25
Publication date: 2019-08-23
Anticipated expiration: 2037-01-25
Also published as: CN106782513A

Abstract

A confidence-based speech recognition method and system. Speech recognition with phone synchronous decoding is performed on user speech to obtain decoding information, from which a phone-synchronous word lattice acoustic information structure is generated. A confusion network is generated from this lattice structure to model the competitive relationship among candidate recognition results, yielding the confusion network competition probability. In parallel, an auxiliary search network based on a language model is used to construct the full search space of speech recognition; in combination with phone synchronous decoding, the search process over the generated full search space is recorded and paths are traced back through the entire search history, yielding a complete full-search-space probability. Finally, the confusion network competition probability and the full-search-space probability are fused to obtain the decision result of speech recognition. The invention can provide an accurate confidence measure for speech recognition results, thereby improving the user experience, and can substantially reduce the computation and memory consumed by confidence calculation.

Description

Speech recognition realization method and system based on confidence level

Technical field

The present invention relates to an accurate and efficient confidence measure (CM) technology for speech recognition, and specifically to a speech recognition method and system based on phone synchronous decoding, word lattices and confusion networks, and an auxiliary search space.

Background art

Speech recognition is an artificial intelligence technology that lets a machine convert a speech signal into the corresponding text or command through a process of recognition and understanding. Existing speech recognition technology still cannot be completely correct. A confidence measure is a technique for judging the reliability of a speech recognition system's own recognition results, and is generally given as a reliability score or a probability value for the recognition result.

Conventional speech recognition confidence techniques mainly comprise confidence measures based on predictor features (predictor features based CM) and confidence measures based on posterior probability (posterior based CM). Their drawbacks include the following: multiple predictor features are often not statistically independent of one another; combining multiple predictor features requires an additional model training stage, which hinders use across multiple scenarios; and a speech recognition system is designed to produce the correct text, so it has difficulty providing accurate posterior probabilities. Specifically, the filler-based posterior probability method is both inaccurate and in need of an additional model training stage, while the lattice-based posterior probability method does not construct the search space completely.

Summary of the invention

The prior art has the following defects: its characterization of the competing results in the decoding space is incomplete, which makes the resulting confidence inaccurate; it depends on retraining each model of the speech recognizer, which adds a large amount of extra processing; and constructing the decoding space is computationally expensive, which increases recognition time and harms the user experience. To address these defects, the present invention proposes a confidence-based speech recognition method and system that can provide an accurate confidence measure for speech recognition results, thereby improving the user experience, and can substantially reduce the computation and memory consumed by confidence calculation.

The present invention is achieved by the following technical solutions:

The present invention relates to a confidence-based speech recognition method. Speech recognition with phone synchronous decoding is performed on user speech to obtain decoding information, from which a phone-synchronous word lattice acoustic information structure is generated. A confusion network is generated from the word lattice acoustic information structure to model the competitive relationship among candidate recognition results, i.e. the confusion network competition probability. In parallel, an auxiliary search network based on a language model is used to construct the full search space of speech recognition; in combination with phone synchronous decoding, the search process over the generated full search space is recorded and paths are traced back through the entire search history, yielding a complete full-search-space probability. Finally, the confusion network competition probability and the full-search-space probability are fused to obtain the decision result of speech recognition.

Technical effect

Compared with the prior art, the speech recognition confidence technology proposed by the present invention, which is based on phone synchronous decoding, word lattices and confusion networks, and an auxiliary search space, differs from conventional methods mainly as follows:

System component	Conventional method	Present invention	Advantage
Word lattice generation	Frame-synchronous decoding	Phone synchronous decoding	More accurate and efficient generation process
Full search space construction	Based on filler or word lattice	Auxiliary search space	More comprehensive search space
Confidence calculation	Lattice posterior probability	Confusion network competition probability	More accurate recognition information

Brief description of the drawings

Fig. 1 is a schematic diagram of the system of the present invention;

Fig. 2 is a schematic diagram of the probability output of the embodiment;

In the figure, the vertical axis is the probability value and the horizontal axis is time;

Fig. 3 is a schematic diagram of speech recognition with phone synchronous decoding according to the present invention;

Fig. 4 is a schematic diagram of the phone-synchronous word lattice acoustic information structure;

Fig. 5 is a schematic diagram of the confusion network;

Fig. 6 is a schematic diagram of the generation process of the auxiliary search network;

Fig. 7 is a schematic diagram of the confidence discriminator.

Detailed description of embodiments

As shown in Fig. 1, the system of this embodiment comprises a speech recognition module, a word lattice generation module, a confusion network competition probability calculation module, a full-search-space probability calculation module, and a confidence discriminator, wherein: the phone synchronous decoding speech recognition module is connected to the word lattice generation module and transmits complete phoneme information; the phone-synchronous word lattice generation module constructs a compact acoustic information representation without information loss and outputs it to the confusion network competition probability calculation module; the confusion network competition probability calculation module extracts the competitive-relationship probability from the phoneme lattice; the full-search-space probability calculation module constructs the auxiliary search space from the phoneme information and then obtains the full-search-space probability; and the confidence discriminator fuses the full-search-space probability and the competitive-relationship probability into a confidence, which serves as the final decision on whether the recognition is correct.

The present invention relates to the speech recognition method of the above system, comprising the following steps:

Step 1) As shown in Fig. 3, speech recognition with frame-by-frame phone synchronous decoding is performed on the user speech to obtain decoding information, specifically:

1.1 A connectionist temporal classification (CTC) model is established so that acoustic modeling is more accurate;

1.2 The CTC model is modeled with a neural network, and its output probability distribution is characterized by sharp, single-peaked spikes;

1.3 During speech recognition decoding, the linguistic network search is performed to obtain decoding information only when a non-blank model output appears; otherwise the acoustic information of the current frame is discarded and decoding moves on to the next frame.
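The frame-skipping rule of step 1.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the blank symbol is assumed to sit at index 0, a frame is treated as "non-blank" when its most probable output is not the blank, and the names `BLANK` and `non_blank_frames` are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumption: the blank symbol occupies index 0 of each frame's output


def non_blank_frames(posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of frames on which a search step would be taken.

    A frame is kept only when its most probable output is not the blank
    symbol; every other frame is discarded without any network search.
    """
    kept = []
    for t, frame in enumerate(posteriors):
        best = max(range(len(frame)), key=lambda k: frame[k])
        if best != BLANK:
            kept.append(t)
    return kept


# Toy per-frame posteriors over (blank, phoneme A, phoneme B).
posteriors = [
    [0.97, 0.02, 0.01],  # blank dominates: skipped
    [0.05, 0.90, 0.05],  # phoneme A spike: searched
    [0.98, 0.01, 0.01],  # blank dominates: skipped
    [0.10, 0.05, 0.85],  # phoneme B spike: searched
]
kept = non_blank_frames(posteriors)
print(kept)  # [1, 3]
```

Because the peaky output leaves most frames dominated by the blank, only a small fraction of frames reach the linguistic search in this sketch.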

Step 2) The phone-synchronous word lattice acoustic information structure is generated from the decoding information obtained in step 1, specifically:

2.1 After the acoustic feature information of each frame is input, the CTC model outputs the occurrence probability of each phoneme in that frame.

The acoustic feature information is drawn from a variety of physical features used in speech recognition.

2.2 If the current acoustic feature information is a non-blank model frame, a weighted finite-state machine adapted to the acoustic modeling information is used to perform a linguistic information search on that frame's acoustic feature information, and the resulting phoneme information is stored in weighted finite-state machine form; otherwise the frame is discarded. Finally, the word lattice acoustic information structure is obtained through merging.

As shown in Fig. 4, the word lattice acoustic information structure is a phone-synchronous word lattice represented with a weighted finite-state machine. It is a very compact phoneme-level lattice that requires no pruning, with a compression ratio of 80% compared with the prior art. The phone-synchronous word lattice is formed by pairwise connecting all candidate acoustic output models between two different model output times.

Compared with the conventional method (frame-synchronous decoding), this structure reduces the theoretical search space by 90%, and the theoretical search network compression ratio is close to 100:1, so that the resulting speech recognition information is obtained accurately and efficiently.
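The pairwise connection of candidates between model output times can be illustrated with a small sketch. The data layout is an assumption made for illustration only: candidates are given as a plain dictionary keyed by the frame index of each non-blank output, arcs are plain tuples rather than a weighted finite-state machine, and the function name `build_phone_lattice` is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Node = Tuple[int, str]           # (output time, phoneme label)
Arc = Tuple[Node, Node, float]   # (source node, target node, log-probability)


def build_phone_lattice(candidates: Dict[int, List[Tuple[str, float]]]) -> List[Arc]:
    """Connect every candidate at one output time to every candidate at the next.

    `candidates` maps the frame index of each non-blank output to its
    candidate (phoneme, log-probability) pairs. Each arc carries the
    log-probability of the candidate it enters.
    """
    times = sorted(candidates)
    arcs: List[Arc] = []
    for t_prev, t_next in zip(times, times[1:]):
        for (p_prev, _), (p_next, lp_next) in product(candidates[t_prev],
                                                      candidates[t_next]):
            arcs.append(((t_prev, p_prev), (t_next, p_next), lp_next))
    return arcs


# Toy candidates at three non-blank output times (frames 1, 3 and 7).
candidates = {
    1: [("HH", -0.1), ("M", -2.5)],
    3: [("AE", -0.2), ("UW", -1.9)],
    7: [("V", -0.05)],
}
arcs = build_phone_lattice(candidates)
print(len(arcs))  # 6
```

The lattice in this sketch grows with the number of non-blank outputs rather than with the number of frames, which is the intuition behind the compactness described above.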

Step 3) A confusion network is generated from the word lattice acoustic information structure to model the competitive relationship among candidate recognition results, i.e. the competition probability of the confusion network, specifically:

3.1 Confusion network cluster markers are generated from the best decoding path;

3.2 The time boundaries and phoneme information of each candidate word are clustered and merged onto the confusion network cluster markers;

3.3 The best decoding path is extracted again from the confusion network obtained after clustering.

As shown in Fig. 5, the competitive relationship is represented by the confusion network (for example, HAVE and MOVE), and the competition probability is obtained from the competitive relationship among candidate recognition results, which is more accurate than the conventional lattice posterior probability.
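One way to picture a competition probability is to normalize the scores of the words that compete within a single confusion network cluster. The sketch below assumes that each competing word carries a log-score and that the competition probability is the normalized share of that score; the text does not give the exact formula, so this normalization and the name `competition_probabilities` are illustrative assumptions.

```python
import math
from typing import Dict


def competition_probabilities(bin_log_scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize the log-scores of the words competing in one cluster.

    The maximum is subtracted before exponentiating for numerical
    stability; the returned probabilities sum to 1 over the cluster.
    """
    m = max(bin_log_scores.values())
    exp_scores = {w: math.exp(s - m) for w, s in bin_log_scores.items()}
    total = sum(exp_scores.values())
    return {w: e / total for w, e in exp_scores.items()}


# Toy cluster with the two competing words mentioned for Fig. 5.
probs = competition_probabilities({"HAVE": -1.0, "MOVE": -2.0})
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

In this toy cluster the better-scoring word receives the larger share, and a cluster with close competitors yields a lower probability for the winner than a cluster with a single dominant candidate.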

Step 4) The full search space of speech recognition is constructed with an auxiliary search network built from an n-gram language model, and the full-search-space probability of the decoding process is calculated, as shown in Fig. 6, specifically:

4.1 A full pronunciation search space is constructed from the n-gram language model;

4.2 A pronunciation search space with context information is constructed from the context information of the full pronunciation search space itself;

4.3 The search-state modeling corresponding to the acoustic model is combined to obtain the final full search space;

4.4 The full search space is searched using the phoneme information to obtain candidate competing units;

4.5 The full-search-space probability is calculated from the speech recognition decoding probabilities of the candidate competing units.

The n-gram language model takes phonemes, characters, or words as its units.

As shown in Fig. 6, the auxiliary search network models the full pronunciation search space.
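Step 4.5 does not spell out how the decoding probabilities of the candidate competing units are combined. A minimal sketch, assuming the full-search-space probability is the hypothesis score normalized against the scores of all competitors found in the auxiliary search space, might look like this; the formula and the names `log_sum_exp` and `full_search_space_probability` are assumptions, not the patented calculation.

```python
import math
from typing import Sequence


def log_sum_exp(xs: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def full_search_space_probability(hyp_log_score: float,
                                  competitor_log_scores: Sequence[float]) -> float:
    """Share of the total score mass that belongs to the recognized hypothesis.

    Assumption: the competitors are the candidate competing units found by
    searching the auxiliary network, each with a decoding log-score.
    """
    denom = log_sum_exp([hyp_log_score, *competitor_log_scores])
    return math.exp(hyp_log_score - denom)


# Toy scores: one hypothesis and two weaker competitors.
p_full = full_search_space_probability(-10.0, [-12.0, -13.0])
print(round(p_full, 4))
```

With no competitors the value is 1, and it falls as the auxiliary search space yields more or stronger competing units.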

Step 5) In combination with phone synchronous decoding, the search process over the generated full search space is recorded, and paths are traced back through the entire search history to obtain the full-search-space probability. The confidence discriminator then combines the speech recognition result, the confusion network competition probability, and the full-search-space probability to obtain the final speech recognition result.

As shown in Fig. 7, the decision process of the confidence discriminator is as follows:

5.1 The confusion network competition probability and the full-search-space probability are fused by interpolation to obtain the confidence;

5.2 When the fused confidence is less than the threshold, the output of the speech recognition module is taken as the speech recognition result; otherwise recognition fails and the user is asked to provide the input again.
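Steps 5.1 and 5.2 can be sketched as a linear interpolation followed by a threshold test. The interpolation weight, the threshold value, and the function names are hypothetical. The accept condition follows step 5.2 exactly as it is worded in this text (fused value below the threshold); a system that treats larger fused values as more reliable would flip the comparison.

```python
from typing import Optional


def fuse_confidence(competition_prob: float,
                    full_space_prob: float,
                    weight: float = 0.5) -> float:
    """Linear interpolation of the two probabilities (step 5.1).

    `weight` is a hypothetical interpolation coefficient in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * competition_prob + (1.0 - weight) * full_space_prob


def decide(confidence: float, threshold: float, hypothesis: str) -> Optional[str]:
    """Return the recognizer output, or None to request new input (step 5.2).

    The comparison direction mirrors the wording of step 5.2 in this text.
    """
    return hypothesis if confidence < threshold else None


confidence = fuse_confidence(0.73, 0.84, weight=0.5)
result = decide(confidence, threshold=0.9, hypothesis="I HAVE IT")
print(result)
```

Setting `weight` to 1 or 0 reduces the fused value to the competition probability or the full-search-space probability alone, which is a convenient way to check each component separately.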

Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the present invention.

Claims

1. A confidence-based speech recognition method, characterized in that speech recognition with phone synchronous decoding is performed on user speech to obtain decoding information, from which a phone-synchronous word lattice acoustic information structure is generated, and a confusion network is generated from the word lattice acoustic information structure to model the competitive relationship among candidate recognition results, i.e. the confusion network competition probability; in parallel, an auxiliary search network based on a language model is used to construct the full search space of speech recognition and a complete full-search-space probability is calculated, wherein, in combination with phone synchronous decoding, the search process over the generated full search space is recorded and paths are traced back through the entire search history to obtain the full-search-space probability; finally, the confusion network competition probability and the full-search-space probability are fused to obtain the decision result of speech recognition;

The competitive relationship is obtained as follows:

3.3 The best decoding path is extracted again from the confusion network obtained after clustering; the final competitive relationship is represented by the confusion network, and the competition probability is obtained from the competitive relationship among candidate recognition results;

The full-search-space probability is obtained as follows:

2. The method according to claim 1, characterized in that the phone synchronous decoding performs speech recognition with frame-by-frame phone synchronous decoding on the user speech to obtain decoding information, specifically:

1.2 The connectionist temporal classification (CTC) model is modeled with a neural network, and its output probability distribution is characterized by sharp, single-peaked spikes;

1.3 During speech recognition decoding, the linguistic network search is performed to obtain decoding information only when a non-blank model output appears; otherwise the acoustic information of the current frame is discarded and decoding moves on to the next frame.

3. The method according to claim 1, characterized in that the word lattice acoustic information structure is obtained as follows:

2.1 After the acoustic feature information of each frame is input, the connectionist temporal classification (CTC) model outputs the occurrence probability of each phoneme in that frame;

2.2 If the current acoustic feature information is a non-blank model frame, a weighted finite-state machine adapted to the acoustic modeling information is used to perform a linguistic information search on that frame's acoustic feature information, and the resulting phoneme information is stored in weighted finite-state machine form; otherwise the frame is discarded. Finally, the word lattice acoustic information structure is obtained through merging.

4. The method according to claim 1 or 3, characterized in that the word lattice acoustic information structure is a phone-synchronous word lattice represented with a weighted finite-state machine; it is a very compact phoneme-level lattice that requires no pruning, and the phone-synchronous word lattice is formed by pairwise connecting all candidate acoustic output models between two different model output times.

5. A speech recognition system implementing the method of any one of the preceding claims, characterized by comprising: a speech recognition module, a word lattice generation module, a confusion network competition probability calculation module, a full-search-space probability calculation module, and a confidence discriminator, wherein: the phone synchronous decoding speech recognition module is connected to the word lattice generation module and transmits complete phoneme information; the phone-synchronous word lattice generation module constructs a compact acoustic information representation without information loss and outputs it to the confusion network competition probability calculation module; the confusion network competition probability calculation module extracts the competitive-relationship probability from the phoneme lattice; the full-search-space probability calculation module constructs the auxiliary search space from the phoneme information and then obtains the full-search-space probability; and the confidence discriminator fuses the full-search-space probability and the competitive-relationship probability into a confidence, which serves as the final decision on whether the recognition is correct.