CN113744723B - Method and system for real-time re-scoring of speech recognition

Method and system for real-time re-scoring of speech recognition

Info

Publication number
CN113744723B
Authority
CN
China
Prior art keywords
model
scoring
scoring model
score
recognition result
Prior art date
Legal status
Active
Application number
CN202111190697.XA
Other languages
Chinese (zh)
Other versions
CN113744723A (en)
Inventor
王金龙
徐欣康
胡新辉
谌明
Current Assignee
Hithink Royalflush Information Network Co Ltd
Original Assignee
Hithink Royalflush Information Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Hithink Royalflush Information Network Co Ltd
Priority to CN202111190697.XA
Publication of CN113744723A
Priority to US17/660,407
Application granted
Publication of CN113744723B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this specification provide a method and a system for real-time re-scoring in speech recognition. The method comprises: obtaining features of speech frames in speech material; obtaining candidate speech recognition results, based on the features of the speech frames, through a decoding model and a preset re-scoring model, wherein the preset re-scoring model is used to correct the scores of the decoding model's speech recognition results in real time; and determining a target speech recognition result based on the candidate speech recognition results.

Description

Method and system for real-time re-scoring of speech recognition
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and system for real-time re-scoring for speech recognition.
Background
In speech recognition, a recognition model is typically combined with a real-time re-scoring method to improve recognition quality. Real-time re-scoring requires building decoding networks on the fly and performing path search and computation over multiple decoding networks, which makes decoding slow; the decoding networks also occupy a large amount of memory. Under these constraints, it is necessary to obtain speech recognition results quickly while maintaining accuracy.
It is therefore desirable to provide a method of real-time re-scoring for speech recognition.
Disclosure of Invention
One of the embodiments of the present disclosure provides a method for real-time re-scoring in speech recognition. The method comprises: acquiring features of speech frames in speech material; obtaining candidate speech recognition results, based on the features of the speech frames, through a decoding model and a preset re-scoring model, wherein the preset re-scoring model is used to correct the scores of the decoding model's speech recognition results in real time; and determining a target speech recognition result based on the candidate speech recognition results.
One of the embodiments of the present description provides a system for real-time re-scoring in speech recognition. The system comprises a feature acquisition module, a candidate result acquisition module, and a target result determination module. The feature acquisition module is used to acquire features of speech frames in speech material. The candidate result acquisition module is used to obtain candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, where the preset re-scoring model is used to correct the scores of the decoding model's speech recognition results in real time. The target result determination module is used to determine a target speech recognition result based on the candidate speech recognition results.
One of the embodiments of the present specification provides a device for real-time re-scoring in speech recognition, which includes a processor configured to perform the method for real-time re-scoring in speech recognition described in this specification.
One of the embodiments of the present specification provides a computer-readable storage medium storing computer instructions that, when read by a computer, cause the computer to perform the method for real-time re-scoring in speech recognition described herein.
Drawings
The present specification is further illustrated by way of exemplary embodiments, which are described in detail with reference to the accompanying drawings. The embodiments are not limiting; in the drawings, like numerals denote like structures:
FIG. 1 is a schematic illustration of an application scenario of a system for real-time re-scoring of speech recognition according to some embodiments of the present description;
FIG. 2 is a schematic diagram of a system for real-time re-scoring of speech recognition according to some embodiments of the present description;
FIG. 3 is an exemplary flow chart of a method of real-time re-scoring of speech recognition according to some embodiments of the present description;
FIG. 4 is an exemplary flow chart of a method of generating a pre-set re-scoring model according to some embodiments of the present description;
FIG. 5 is a schematic diagram of a method of generating a re-scoring model according to some embodiments of the present description;
FIG. 6 is an exemplary diagram of a method of model traversal shown in accordance with some embodiments of the present description;
FIG. 7 is an exemplary diagram of a method of real-time re-scoring of speech recognition according to some embodiments of the present description;
FIG. 8 is an exemplary flow chart of a method of generating a preset re-scoring model according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit," and/or "module" as used herein are ways of distinguishing different components, elements, parts, portions, or assemblies at different levels. These words may be replaced by other expressions that achieve the same purpose.
As used in this specification and the claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are present; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that these operations are not necessarily performed precisely in the order shown. Rather, the steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Fig. 1 is a schematic view of an application scenario of a system for real-time re-scoring of speech recognition according to some embodiments of the present description. The system 100 for real-time re-scoring of speech recognition (hereinafter system 100) may include a server 110, a network 120, a storage device 130, a voice capture device 140, and a user 150.
Server 110 may be used to manage resources and process data and/or information from at least one component of the present system or external data sources (e.g., a cloud data center). In some embodiments, the server 110 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, server 110 may be local or remote. For example, server 110 may receive or obtain voice data collected by voice capture device 140 and/or information and/or data in storage device 130 via network 120. For another example, server 110 may be directly connected to voice capture device 140 and/or storage device 130 to access stored information and/or data. In some embodiments, server 110 may be implemented on a cloud platform or an on-board computer. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof. In some embodiments, server 110 may retrieve relevant data and/or information, such as speech data, language models, decoding models, scoring models, etc., from storage device 130 for performing the real-time re-scoring of speech recognition shown in some embodiments of the present description.
In some embodiments, server 110 may include a processing engine 112. The processing engine 112 may process information and/or data related to the real-time re-scoring of speech recognition to perform one or more of the functions described herein. In some embodiments, the server 110 may include models for the real-time re-scoring of speech recognition, e.g., a decoding model and a preset re-scoring model. In some embodiments, the processing engine 112 may decode the voice data through the decoding model to obtain a speech recognition result in text form, and re-score the speech recognition result in real time through the preset re-scoring model to obtain an optimal speech recognition result. In some embodiments, the processing engine 112 may generate the decoding model and/or the preset re-scoring model based on an existing language model. In some embodiments, the processing engine 112 may obtain the preset re-scoring model by combining multiple scoring models. In some embodiments, server 110 may send data and/or information such as the models generated by the processing engine 112 and the text-form speech recognition results to storage device 130 for storage. In some embodiments, the server 110 may send the text-form speech recognition results to the voice capture device 140, a user terminal, and/or an output device, etc., to provide feedback to the user 150 and/or present the results.
In some embodiments, processing engine 112 may include one or more processing engines (e.g., a single chip processing engine or a multi-chip processing engine). For example only, the processing engine 112 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an application specific instruction set processor (ASIP), a Graphics Processing Unit (GPU), a Physical Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.
The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of system 100 (e.g., server 110, storage device 130, voice capture device 140) may send information and/or data to other components of system 100 via network 120. For example, server 110 may obtain voice data collected by voice capture device 140 via network 120. As another example, server 110 and voice capture device 140 may retrieve data and/or information from storage device 130 and/or write data and/or information to storage device 130 via network 120. In some embodiments, network 120 may be any form of wired or wireless network, or any combination thereof. By way of example only, the network 120 may include a cable network, a wired network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, and the like, or any combination thereof.
The storage device 130 may store data and/or instructions. In some embodiments, the storage device 130 may store data acquired from the voice capture device 140, e.g., collected voice data. In some embodiments, the storage device 130 may store data and/or instructions that the server 110 executes or uses to accomplish the exemplary methods described herein, e.g., decoding models, scoring models, text-form speech recognition results, etc. In some embodiments, the storage device 130 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof. In some embodiments, storage device 130 may be implemented on a cloud platform. For example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, or the like, or any combination thereof.
In some embodiments, storage device 130 may be connected to network 120 to communicate with one or more components of system 100 (e.g., server 110, voice capture device 140). One or more components of system 100 may access data or instructions stored in storage device 130 via network 120. In some embodiments, storage device 130 may be directly connected to or in communication with one or more components of system 100 (e.g., server 110 and voice capture device 140). In some embodiments, the storage device 130 may be part of the server 110. In some embodiments, the storage device 130 may be integrated into the voice capture device 140.
The voice capture device 140 may capture voice data of the user 150 for use in obtaining speech recognition results, e.g., in text form. The voice capture device 140 may be any device and/or apparatus that can input and capture voice or that incorporates a voice input and capture module. In some embodiments, the voice capture device 140 may include a mobile device 140-1, a tablet computer 140-2, a laptop computer 140-3, a microphone 140-4, or the like, or any combination thereof. The mobile device 140-1 may be any mobile/handheld device capable of voice input and capture, such as a smart phone, a personal digital assistant, or a handheld smart terminal; the tablet computer 140-2 may be any smart tablet device capable of voice input and capture, e.g., an Android tablet or an iPad; the laptop computer 140-3 may be any notebook computer that incorporates a voice input module such as a microphone; the microphone 140-4 may be a standalone microphone or a microphone-integrated device such as a microphone-integrated headset or a microphone-integrated VR device. In some embodiments, the voice capture device 140 may include devices and/or modules for capturing voice data/material, such as a microphone 140-5 for capturing voice or a module for capturing voice data/material. In some embodiments, the voice capture device 140 may obtain voice data of the user 150, e.g., a conversation, through any voice input device (e.g., a microphone). In some embodiments, the voice capture device 140 may communicate with and/or be connected to the server 110 and/or the storage device 130 over the network 120. For example, the voice capture device 140 may provide the acquired voice data/material to the server 110 over the network 120. In some embodiments, the voice capture device 140 may be directly connected to the server 110 or integrated within it. In some embodiments, the voice capture device 140 may receive the text-form speech recognition results returned by the server 110 and present them to the user 150.
The user 150 may provide voice data for recognition. The user 150 may provide the voice data to the server 110 through the voice capture device 140, and the processing engine 112 recognizes the voice data through the models to obtain a speech recognition result in text form. In some embodiments, the user 150 may obtain the text-form speech recognition results of the server 110 through the voice capture device 140, a user terminal, or another device.
It should be noted that the system 100 is provided for illustrative purposes only and is not intended to limit the scope of the present application. Many modifications and variations will be apparent to those of ordinary skill in the art in light of the present description. For example, the system 100 may also include a voice database, a voice information source, and the like. As another example, the server 110 and the voice capture device 140 may be integrated into a single device. The system 100 may be implemented on other devices to realize similar or different functions. However, such variations and modifications do not depart from the scope of the present application.
FIG. 2 is a schematic diagram of a system for real-time re-scoring of speech recognition according to some embodiments of the present description. In some embodiments, the system 200 may include a feature acquisition module 210, a candidate result acquisition module 220, and a target result determination module 230.
The feature acquisition module 210 may be configured to acquire features of speech frames in the speech material. For more details on how to obtain features of a speech frame in speech material, see fig. 3 and its description.
The candidate result obtaining module 220 may be configured to obtain a candidate speech recognition result through the decoding model and the preset re-scoring model based on the features of the speech frame. In some embodiments, the pre-set re-scoring model is used to correct the score of the speech recognition result of the decoding model in real time.
The preset re-scoring model refers to a pre-designated model for re-scoring the speech recognition results. In some embodiments, the preset re-scoring model may include pre-saved scores. In some embodiments, the preset re-scoring model may include pre-saved correction values for the scores.
The corrected score is a score obtained by correcting the score of a speech recognition result of the decoding model. In some embodiments, the real-time correction may include obtaining the corrected score based on the score and a correction value. In some embodiments, the corrected score may be obtained by summing the score and the correction value, thereby correcting the score of the decoding model's speech recognition result in real time. For example, with a score of 0.5 and a correction value of -0.03, summing the two gives a corrected score of 0.47.
The target result determination module 230 may be configured to determine a target speech recognition result based on the candidate speech recognition result. For more details on how to determine a target speech recognition result based on candidate speech recognition results, see fig. 3 and its description.
In some embodiments, the preset re-scoring model may be a pre-generated model. In some embodiments, the system 200 may further include a language model acquisition module 240, a decoding model generation module 250, a scoring model generation module 260, and a re-scoring model generation module 270.
The language model acquisition module 240 may be used to acquire a first language model and a second language model. In some embodiments, the second language model may be obtained by training a preset language model. In some embodiments, the first language model may be obtained by clipping the second language model. For more details on how the first language model and the second language model are obtained, see fig. 4 and the description thereof.
The decoding model generation module 250 may be configured to generate a decoding model based on the first language model. For more details on how to generate a decoding model based on the first language model, see fig. 4 and its description.
The scoring model generation module 260 may be configured to generate a first scoring model and a second scoring model based on the first language model and the second language model. For more details on how to generate the first scoring model and the second scoring model based on the first language model and the second language model, see fig. 4 and the description thereof.
The re-scoring model generation module 270 may be configured to obtain the preset re-scoring model by combining the first scoring model and the second scoring model. For more details on how to obtain the preset re-scoring model by combining the first scoring model and the second scoring model, see fig. 4 and the description thereof.
In some embodiments, the re-scoring model generation module 270 may include a score acquisition unit 271, a model update unit 272, and a model determination unit 273.
The score acquisition unit 271 may be configured to obtain the second speech recognition result score by traversing the second scoring model and, based on that traversal, to traverse the first scoring model synchronously to obtain the first speech recognition result score.
In some embodiments, while traversing the second scoring model, features corresponding to features of the second scoring model may be determined in the first scoring model, so that the first scoring model is traversed synchronously. The corresponding features may be features common to the first and second scoring models, e.g., consistent arcs.
In some embodiments, corresponding arcs may be determined in the first scoring model based on the arcs in the second scoring model. In some embodiments, when an arc consistent with an arc in the second scoring model is found in the first scoring model, the consistent arc is determined to be the corresponding arc; when no consistent arc is found in the first scoring model, the consistent arc reached with the fewest back-off steps is determined to be the corresponding arc by backing off. For more details on traversal and back-off, see FIG. 6 and its description.
The model updating unit 272 may be configured to update the second scoring model based on the difference between the first speech recognition result score and the second speech recognition result score. For more details on how to update the second scoring model based on this difference, see fig. 5 and its description.
The model determination unit 273 may be used to determine the preset re-scoring model based on the updated second scoring model.
The correction value is the value based on which the score of a speech recognition result is corrected. In some embodiments, the correction value may be the difference between the speech recognition result scores of the first scoring model and the second scoring model. For example, if the first scoring model's score is 0.6 and the second scoring model's score is 0.63, the correction value may be -0.03 or 0.03. In some embodiments, the subtraction order of the first and second speech recognition result scores is fixed. For example, the correction value may be obtained by subtracting the second speech recognition result score from the first speech recognition result score; continuing the example above, the correction value is then -0.03.
FIG. 3 is an exemplary flow chart of a method of real-time re-scoring of speech recognition according to some embodiments of the present description. As shown in fig. 3, the process 300 includes the following steps.
In step 310, features of speech frames in the speech material are obtained. In some embodiments, step 310 may be performed by feature acquisition module 210.
Features refer to information contained in the speech material, such as loudness and pitch. In some embodiments, a feature may refer to a phoneme in the speech material.
The feature acquisition module 210 may acquire features of a speech frame in the speech material in a variety of ways and generate feature vectors.
In some embodiments, the feature acquisition module 210 may extract speech-frame features at regular intervals, generating a sequence of feature components of the speech; it may then use an acoustic model to compute the probability of each phoneme in every speech frame, generating a matrix. Each row of the matrix corresponds to a feature vector, each feature vector corresponds to one frame of speech, and each element in the matrix represents a phoneme probability; n frames of speech form the matrix, and each column corresponds to the same phoneme. The number of phonemes is fixed, e.g., 80. Each frame of speech has a probability value for every phoneme, and the probabilities of all phonemes in a frame sum to 1.
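As an illustration, this per-frame probability matrix can be sketched as follows (a minimal Python/NumPy sketch; the model interface, the 39-dimensional features, and the 80-phoneme inventory are assumptions for the example, not details fixed by this embodiment):

```python
import numpy as np

def frame_posteriors(features: np.ndarray, acoustic_model) -> np.ndarray:
    """Build the (n_frames x n_phonemes) matrix: one feature vector per row,
    each element a phoneme probability, each row summing to 1."""
    logits = np.stack([acoustic_model(f) for f in features])
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # row-wise softmax
    return exp / exp.sum(axis=1, keepdims=True)

# Stand-in acoustic model: 100 frames of 39-dim features, 80 phonemes.
rng = np.random.default_rng(0)
posteriors = frame_posteriors(rng.normal(size=(100, 39)),
                              lambda f: rng.normal(size=80))
assert np.allclose(posteriors.sum(axis=1), 1.0)  # each frame's phonemes sum to 1
```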
Step 320, obtaining candidate speech recognition results through the decoding model and the preset re-scoring model based on the features of the speech frame. In some embodiments, step 320 may be performed by candidate result acquisition module 220.
Candidate speech recognition results refer to a collection of at least one speech recognition result from which a target speech recognition result may be determined. The description of the target speech recognition result may be found below.
The candidate speech recognition results may include speech recognition results of the decoding model. In some embodiments, the candidate speech recognition results may include the decoding model's speech recognition results together with their scores corrected in real time.
The decoding model and the preset re-scoring model may be any models capable of implementing decoding and re-scoring. In some embodiments, the decoding model may be a decoding network HCLG. In some embodiments, the preset re-scoring model may be a weighted finite-state transducer (WFST).
The candidate result acquisition module 220 may input the matrix and/or feature vectors into the decoding model (e.g., the decoding network HCLG), producing a directed graph structure. The directed graph structure includes a plurality of arcs and nodes, and each arc has an input, an output, and a weight. Weights may be used to score sequences of arcs. A sequence of arcs is an ordered set of arcs, which may reflect an ordered set of words, e.g., "today is Monday" or "I am Zhang San". In some embodiments, the sequence of arcs may reflect the order of the overall speech recognition process; for example, the order of words recognized during speech recognition may be reflected by the order of arcs in the sequence. In some embodiments, the score may represent the confidence and/or accuracy of the speech recognition result: the higher the score, the higher the confidence and accuracy. The input of an arc is the ID of a jump between phonemes. When the output of an arc is 0, the arc carries no speech recognition result; when the output is not 0, the output corresponds to a speech recognition result, namely the sequence of words reflected by the sequence of arcs. The preset re-scoring model can re-score the speech recognition result output on the arcs to obtain the score corrected in real time, from which the candidate speech recognition results can be obtained. More details about re-scoring can be found elsewhere in this specification.
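As an illustration, the arc structure and the real-time score correction can be sketched as follows (a minimal Python sketch; the Arc type and the dictionary standing in for the preset re-scoring model are simplifications introduced here, not the patent's actual data structures):

```python
from dataclasses import dataclass

@dataclass
class Arc:
    ilabel: int        # input: ID of a jump between phonemes
    olabel: int        # output: 0 = no recognition result, else a word ID
    weight: float      # decoding-score contribution of this arc
    next_state: int

def rescore_path(path: list, correction: dict) -> float:
    """Score a sequence of arcs; whenever an arc outputs a word, add the
    pre-saved correction value for it (stand-in for the re-scoring model)."""
    score = 0.0
    for arc in path:
        score += arc.weight
        if arc.olabel != 0:                    # a word is output on this arc
            score += correction.get(arc.olabel, 0.0)
    return score

path = [Arc(3, 0, 0.2, 1), Arc(7, 42, 0.3, 2)]        # word 42 on the 2nd arc
print(round(rescore_path(path, {42: -0.03}), 2))      # 0.5 - 0.03 = 0.47
```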
Step 330, determining a target speech recognition result based on the candidate speech recognition result. In some embodiments, step 330 may be performed by the target result determination module 230.
The target speech recognition result is the finally determined speech recognition result with the highest accuracy and/or confidence, where accuracy and/or confidence may be expressed as a percentage or a score. For example, of two speech recognition results with accuracies of 80% and 90%, the one with 90% accuracy is determined to be the target speech recognition result. As another example, of three speech recognition results with scores of 0.8, 0.85, and 0.78, the one with a score of 0.85 is determined to be the target speech recognition result.
The target result determination module 230 may determine the target speech recognition result in a variety of ways. In some embodiments, the target result determination module 230 may obtain an optimal speech recognition result based on the score of the candidate speech recognition result, and determine the optimal speech recognition result as the target speech recognition result.
In some embodiments of the present disclosure, a pre-generated re-scoring model is used to re-score the decoding model's speech recognition results, which saves the time of generating the model and reduces resource usage during decoding. Because the preset re-scoring model stores correction values for the scores, the decoding model's scores can be corrected in real time with a simple computation, which speeds up re-scoring: paths no longer need to be searched and computed in multiple decoding models, and only the re-scoring model needs to be looked up. Memory usage is likewise reduced, since fewer decoding networks are needed.
FIG. 4 is an exemplary flow chart of a method of generating a pre-set re-scoring model according to some embodiments of the present description.
In some embodiments, the pre-determined weight score model may be a pre-generated model. As shown in fig. 4, the process 400 includes the following steps.
Step 410, a first language model and a second language model are obtained. In some embodiments, step 410 may be performed by language model acquisition module 240.
In some embodiments, the first language model and the second language model may be language models in an arpa format.
The language model acquisition module 240 may acquire the first language model and the second language model in various ways. In some embodiments, the language model acquisition module 240 may train a preset language model in the arpa format to obtain the second language model. The training may include, but is not limited to, supervised learning. In some embodiments, the language model acquisition module 240 may cut a smaller language model out of the second language model to serve as the first language model. The cutting can be done in various ways, for example by limiting the number of states and arcs, or by deleting unimportant states and arcs while retaining those with higher scores.
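A minimal sketch of such score-based cutting (Python; the flat dictionary mapping n-grams to scores is a toy stand-in for the real model structure):

```python
def clip_language_model(ngrams: dict, max_entries: int) -> dict:
    """Keep only the highest-scoring entries, deleting the unimportant ones,
    so that a smaller first language model is cut out of the second."""
    kept = sorted(ngrams.items(), key=lambda kv: kv[1], reverse=True)
    return dict(kept[:max_entries])

second = {("today",): -0.5, ("today", "is"): -0.9, ("is", "monday"): -2.4}
first = clip_language_model(second, 2)  # drops the lowest-scoring n-gram
print(first)
```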
Step 420 generates a decoding model based on the first language model. In some embodiments, step 420 may be performed by decoding model generation module 250.
The decoding model generation module 250 may generate the decoding model in a variety of ways. In some embodiments, the decoding model generation module 250 may generate the decoding model based on the first language model. In some embodiments, the decoding model generation module 250 may generate the decoding model HCLG.fst from the first language model together with an acoustic model and a dictionary file.
Step 430, generating a first scoring model and a second scoring model based on the first language model and the second language model. In some embodiments, step 430 may be performed by scoring model generation module 260.
The first scoring model and the second scoring model refer to models that may be used to score speech recognition results. In some embodiments, the first scoring model and the second scoring model may score the output of arcs in the directed graph structure (i.e., the speech recognition results).
In some embodiments, the first scoring model and the second scoring model may be any models that can be used for scoring. In some embodiments, the first scoring model and the second scoring model may be weighted finite-state transducers.
In some embodiments, the scale of the first scoring model and the scale of the second scoring model may be defined. For example, the scale of the first scoring model and the scale of the second scoring model are unequal. In some embodiments, the scale of the first scoring model is smaller than the scale of the second scoring model.
In some embodiments, there may be an association between the first scoring model and the second scoring model. For example, at least a portion is the same between the first scoring model and the second scoring model. In some embodiments, the first scoring model may be part of the second scoring model.
The scoring model generation module 260 may generate the first scoring model and the second scoring model in a variety of ways. In some embodiments, the scoring model generation module 260 may generate the first scoring model and the second scoring model by converting the first language model and the second language model into finite-state transducers (FSTs).
Step 440, obtaining the preset re-scoring model by combining the first scoring model and the second scoring model. In some embodiments, step 440 may be performed by the re-scoring model generation module 270.
The re-scoring model generation module 270 may generate the preset re-scoring model in a variety of ways. In some embodiments, the re-scoring model generation module 270 may obtain the preset re-scoring model by combining the first scoring model and the second scoring model. For more details on obtaining the preset re-scoring model by combining the first scoring model and the second scoring model, see fig. 5 and the description thereof.
In some embodiments of the present description, the first language model is obtained by clipping the second language model. This has two advantages: first, it improves the accuracy of the first language model, since the states and arcs with high scores are kept as the first language model when the second language model is cut; second, it reduces the workload, since no separate language model needs to be trained to serve as the first language model.
It should be noted that the above description of the processes 300, 400 is for illustration and description only, and is not intended to limit the scope of applicability of the present disclosure. Various modifications and changes to the processes 300, 400 may be made by those skilled in the art under the guidance of this specification. However, such modifications and variations are still within the scope of the present description. For example, step 320 may be incorporated into step 330.
FIG. 5 is a schematic diagram of a method 500 of generating a re-scoring model, according to some embodiments of the present description.
The speech recognition result score is the score obtained by scoring a speech recognition result with a scoring model. For example, if the first scoring model scores the speech recognition result as 0.6, then the first speech recognition result score is 0.6. Likewise, if the second scoring model scores the speech recognition result as 0.63, then the second speech recognition result score is 0.63.
The re-scoring model generation module 270 may obtain the preset re-scoring model in a number of ways. In some embodiments, it may do so by combining the first scoring model and the second scoring model. In some embodiments, the score acquisition unit 271 may obtain the second speech recognition result score by traversing the second scoring model and, synchronously with that traversal, traverse the first scoring model to obtain the first speech recognition result score; the model updating unit 272 may update the second scoring model based on the difference between the first and second speech recognition result scores; and the model determination unit 273 may determine the preset re-scoring model based on the updated second scoring model. For more details on traversal, see fig. 6 and its description.
The score acquisition unit 271 may perform the traversal in various ways, e.g., depth-first traversal or breadth-first traversal. In some embodiments, the traversal performed by the score acquisition unit 271 may be a recursive depth-first traversal.
The updating proceeds along with the traversal. In some embodiments, the model updating unit 272 may update the second scoring model while the score acquisition unit 271 traverses the first scoring model and the second scoring model.
In some embodiments, the difference between the speech recognition result scores of the first scoring model and the second scoring model may be determined as the correction value. For example, the second scoring model's score for a speech recognition result is subtracted from the first scoring model's score, and the resulting difference is used as the correction value.
The model updating unit 272 may update the second scoring model in various ways. For example, the model updating unit 272 may update the second scoring model based on the first and second speech recognition result scores. In some embodiments, the model updating unit 272 may replace the score stored in the second scoring model with the difference between the first score and the second score, i.e., the correction value, thereby completing the update of the second scoring model.
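A minimal sketch of this update (Python; flat dictionaries stand in for the first and second scoring models):

```python
def build_preset_model(scores_one: dict, scores_two: dict) -> dict:
    """For each recognition result, replace the second model's score with
    score one minus score two (the fixed subtraction order), yielding the
    correction values kept in the preset re-scoring model."""
    return {result: scores_one[result] - score_two
            for result, score_two in scores_two.items() if result in scores_one}

preset = build_preset_model({"monday": 0.6}, {"monday": 0.63})
print(round(preset["monday"], 2))   # -0.03
```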
In some embodiments of the present disclosure, updating with the difference and determining the updated second scoring model as the preset re-scoring model simplifies the generation of the re-scoring model, reduces the complexity of score computation during re-scoring, increases decoding speed, and reduces memory usage.
FIG. 6 is an exemplary diagram of a method 600 of model traversal according to some embodiments of the present description.
In some embodiments, the first scoring model and the second scoring model may be directed graph structures including a plurality of arcs and nodes. For more details on the arc, see the relevant description in step 320.
In some embodiments, arcs in the second scoring model have corresponding arcs in the first scoring model, i.e., arcs whose outputs have a particular relationship, e.g., the same or a similar speech recognition result.
Traversing a scoring model means visiting the nodes and arcs of the model to obtain all of its possible sequences of arcs. In some embodiments, the purpose of the traversal is to obtain all possible speech recognition results.
Synchronous traversal means that, as the second scoring model's sequences of arcs are traversed, the first scoring model is searched in step for the sequences of arcs corresponding to them. Because a sequence of arcs is an ordered set of arcs, finding the corresponding sequence is essentially finding the next corresponding arc of the sequence. For example, suppose the traversal reaches the sequence "today is Monday evening" in the second scoring model and the sequence "today is Monday" has already been found in the first scoring model; the next arc then needs to be found in the first scoring model so that the new sequence becomes "today is Monday evening". In some embodiments, the corresponding sequence may be a completely consistent sequence of arcs, or else the closest sequence. For example, for "today is Monday evening", if no completely consistent sequence can be found, "Monday evening" may be determined to be the corresponding sequence.
In some embodiments, the score acquisition unit 271 may determine, in the first scoring model, the arc corresponding to an arc in the second scoring model in various ways. In some embodiments, when an arc consistent with the arc in the second scoring model is found in the first scoring model, the score acquisition unit 271 determines the consistent arc to be the corresponding arc. For example, as shown in fig. 6, if the arc in the second scoring model is "today is Monday evening" (i.e., the output of the arc is "today is Monday evening"; the same convention is used below) and the score acquisition unit 271 finds a consistent arc "today is Monday evening" in the first scoring model, it determines that arc in the first scoring model to be the corresponding arc.
In some embodiments, when no arc consistent with the arc in the second scoring model is found in the first scoring model, the score acquisition unit 271 backs off and determines the consistent arc reached with the fewest back-off steps to be the corresponding arc. For example, as shown in fig. 6, the arc in the second scoring model is "today is Monday evening"; the score acquisition unit 271 has found the arc corresponding to "today is Monday" in the first scoring model but cannot find an arc extending it to "today is Monday evening". It therefore backs off once, to the arc corresponding to "Monday", and continues searching for "Monday evening"; it finds the arc "Monday evening" in the first scoring model, so this consistent arc, reached with the fewest back-off steps, is determined to be the corresponding arc.
In some embodiments, the score acquisition unit 271 may traverse the first scoring model synchronously based on the arcs in the second scoring model while traversing the second scoring model. For example, when the score acquisition unit 271 traverses to the arc "today is Monday evening" in the second scoring model, it determines in the first scoring model the arc corresponding to that arc.
The number of back-off steps is the number of words that must be removed when backing off. In some embodiments, words are removed sequentially from the front of the word sequence corresponding to the arc, one back-off step per removal.
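The back-off search for a corresponding arc can be sketched as follows (Python; the set of word sequences is a toy stand-in for the first scoring model's arcs):

```python
def find_corresponding(first_model: set, word_sequence: tuple):
    """Find the sequence in the first scoring model matching `word_sequence`,
    removing words from the front (one back-off step each) as few times as
    possible; returns the matched sequence and the number of back-off steps."""
    steps = 0
    while word_sequence and word_sequence not in first_model:
        word_sequence = word_sequence[1:]   # one back-off step
        steps += 1
    return word_sequence, steps

first = {("today", "is", "Monday"), ("Monday", "evening")}
print(find_corresponding(first, ("today", "is", "Monday", "evening")))
# (('Monday', 'evening'), 2): the consistent arc with the fewest back-off steps
```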
Fig. 7 is an exemplary schematic diagram of a method 700 of real-time re-scoring of speech recognition according to some embodiments of the present description.
As shown in fig. 7, the speech recognition process generates a feature sequence from the user's speech, processes the feature sequence with an acoustic model, and then performs a decoding search to obtain the recognition result. The decoding search requires the preset re-scoring model GF.fst and the decoding model HCLG.fst. In some embodiments, the preset re-scoring model GF.fst is generated in advance by the re-scoring model generation module 270 from the first scoring model G1.fst and the second scoring model G2.fst, and GF.fst stores, for each speech recognition result, the difference between its scores in G1.fst and G2.fst. For example, the re-scoring model generation module 270 obtains a difference of -0.03 from the first speech recognition result score of 0.6 and the second speech recognition result score of 0.63, and stores the difference in the preset re-scoring model GF.fst. When a speech recognition result is re-scored in real time, the candidate result acquisition module 220 may fetch the score corresponding to that result, i.e., the difference, directly from GF.fst and sum the decoding model HCLG.fst's score for the result with the corresponding score in GF.fst, thereby obtaining the final score of the speech recognition result.
In some embodiments of the present disclosure, re-scoring the decoding model HCLG.fst's speech recognition results with a preset re-scoring model GF.fst, first, speeds up decoding: previously a series of computations had to be performed in the first scoring model G1.fst and the second scoring model G2.fst simultaneously, whereas now only the single preset re-scoring model GF.fst needs to be looked up, with no further computation, so the decoding process is faster. Second, it reduces the number of decoding networks and the intermediate variables that must be kept, thereby reducing memory usage.
FIG. 8 is an exemplary flow chart of a method 800 of generating a preset re-scoring model according to some embodiments of the present description.
In some embodiments, each state in the first scoring model G1.fst and the second scoring model G2.fst has an additional attribute (e.g., its n-gram order) that is not stored directly in the data structures of G1.fst and G2.fst but is kept separately. As shown in fig. 8, in some embodiments the n-gram order of each state in G1.fst and G2.fst may be obtained, e.g., by a statistical method, and saved into Sn1 and Sn2, respectively.
As shown in fig. 8, in some embodiments a back-off model GBack.fst may be built from the first scoring model G1.fst. In G1.fst, an ordinary state has arcs corresponding to the inputs it can accept: given one of these inputs, a weight can be returned and a jump made to the next state; if the given input is not within the range these arcs accept, an error is returned. The back-off model GBack.fst extends G1.fst so that any input can be accepted. This relies on one assumption: if an input is not on any of a state's arcs, an arc that can accept the input must be reachable through some number of back-off operations. GBack.fst therefore allows continuous backing off until an arc meeting the requirement is found. The weights of the back-off steps are summed together with the weight of the satisfying arc, returned as the weight of the final arc, and the jump is made to that arc's next state.
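A sketch of such a back-off lookup (Python; the two dictionaries are toy stand-ins for GBack.fst's ordinary arcs and back-off arcs):

```python
def backoff_lookup(state: int, word: int, arcs: dict, backoff: dict):
    """Return (weight, next_state) for `word` from `state`, following
    back-off arcs until an accepting arc is found, and summing the
    back-off weights into the final arc's weight as described above."""
    total = 0.0
    while (state, word) not in arcs:        # input not acceptable here:
        w, state = backoff[state]           # take one back-off arc
        total += w
    w, next_state = arcs[(state, word)]
    return total + w, next_state

arcs = {(2, 7): (1.25, 5)}                  # state 2 accepts word 7
backoff = {3: (0.5, 2)}                     # state 3 backs off to state 2
print(backoff_lookup(3, 7, arcs, backoff))  # (1.75, 5)
```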
As shown in fig. 8, in some embodiments, when traversing the second scoring model G2.fst, the weights of the arcs may be modified using Sn1, Sn2, and the back-off model GBack.fst, and the resulting G2.fst with modified arc weights may be stored as the preset re-scoring model GF.fst.
In some embodiments, before the re-scoring model generation module 270 traverses the second scoring model G2.fst, the initial state of G2.fst and the initial state of the back-off model GBack.fst are obtained, and a recursive depth-first traversal is performed with these two states as the entry point. In the recursive function, it is first determined whether the current state of G2.fst has already been processed; if so, the function returns directly. If not, all arcs of the current state are traversed and their weights updated. Three cases arise. First, if the input on the arc is not 0, a speech recognition result is output. Assuming the weight of the arc is w2, the weight w1 corresponding to the input is queried from the back-off model GBack.fst, and the difference between w2 and w1 is saved to the original arc; the recursion then continues with the next state of the current arc and the next state obtained from GBack.fst. Second, if the input on the arc is 0 but the attributes of the two states are consistent (as judged by Sn1 and Sn2), both arcs are language-model back-off operations, and the operation of the first case applies: the weight is modified and the recursion continues with the next states of both. Third, if the input on the arc is 0 and the attributes of the states are not consistent, only the second scoring model G2.fst is backed off, and the recursion continues.
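The recursion and its three cases can be sketched as follows (Python; arcs are mutable [input, weight, next_state] lists, the dictionaries and the `lookup` helper are toy stand-ins for G2.fst, GBack.fst, Sn1, and Sn2, and the saved value follows the w2 - w1 order stated above; this illustrates the traversal, it is not the patent's actual implementation):

```python
def lookup(gb, state, ilabel):
    """GBack-style query: follow back-off arcs (input 0), summing their
    weights, until an arc accepting `ilabel` is found."""
    total = 0.0
    while True:
        arcs = {a[0]: a for a in gb[state]}
        if ilabel in arcs:
            _, w, nxt = arcs[ilabel]
            return total + w, nxt
        _, w, state = arcs[0]               # follow the back-off arc
        total += w

def rewrite(s2, sb, g2, gb, sn2, sn1, done=None):
    """Recursive depth-first pass over G2 that stores weight differences."""
    done = set() if done is None else done
    if s2 in done:                          # state already processed: return
        return
    done.add(s2)
    for arc in g2[s2]:
        ilabel, w2, n2 = arc
        if ilabel != 0:                     # case 1: a word is output
            w1, nb = lookup(gb, sb, ilabel)
            arc[1] = w2 - w1                # keep only the difference
            rewrite(n2, nb, g2, gb, sn2, sn1, done)
        else:
            back = {a[0]: a for a in gb[sb]}[0]   # GBack's back-off arc
            if sn2[n2] == sn1[back[2]]:     # case 2: both models back off
                arc[1] = w2 - back[1]
                rewrite(n2, back[2], g2, gb, sn2, sn1, done)
            else:                           # case 3: only G2 backs off
                rewrite(n2, sb, g2, gb, sn2, sn1, done)

# Two-state toy: the word arc's weight 0.63 becomes 0.63 - 0.60 = 0.03.
g2 = {0: [[7, 0.63, 1]], 1: []}
gb = {0: [[7, 0.60, 1]], 1: []}
rewrite(0, 0, g2, gb, sn2={0: 1, 1: 2}, sn1={0: 1, 1: 2})
print(round(g2[0][0][1], 2))                # 0.03
```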
The created model may be used for decoding. In some embodiments, during decoding, space may be allocated for each decoding path to store a state value, namely the corresponding GF.fst state. When a word is output on the decoding path, i.e., the output value on an arc is not 0, the corresponding weight can be fetched from GF.fst and added to the current decoding path, and the stored state value of the preset re-scoring model GF.fst is updated at the same time.
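A sketch of this decoding-time use (Python; the dictionary mapping (state, word) to (weight, next state) is a toy stand-in for GF.fst):

```python
def advance_path(path_score: float, gf_state: int, olabel: int, gf: dict):
    """Advance one decoding path past an arc: if a word is output
    (olabel != 0), add GF's stored weight to the path score and update
    the path's saved GF.fst state value."""
    if olabel != 0:
        delta, gf_state = gf[(gf_state, olabel)]
        path_score += delta
    return path_score, gf_state

gf = {(0, 42): (-0.03, 1)}
score, state = advance_path(0.5, 0, 42, gf)
print(round(score, 2), state)   # 0.47 1
```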
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated herein, various modifications, improvements, and adaptations of the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and are intended to fall within the spirit and scope of its exemplary embodiments.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, to simplify the presentation of this disclosure and thereby aid understanding of one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, inventive subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers used in the description of embodiments are qualified in some examples by the modifiers "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought by individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ ordinary rounding. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of this specification are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application, patent application publication, and other material, such as articles, books, specifications, publications, and documents, referred to in this specification is hereby incorporated by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as any documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is noted that if the description, definition, and/or use of a term in material attached to this specification is inconsistent or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims (10)

1. A method of real-time re-scoring for speech recognition, comprising:
acquiring features of speech frames in speech material;
obtaining candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, wherein the preset re-scoring model is used for correcting the scores of the speech recognition results of the decoding model in real time, the preset re-scoring model is a pre-generated model, and the pre-generation comprises:
acquiring a first language model and a second language model, wherein the second language model is obtained by training a preset language model, and the first language model is obtained by cutting the second language model;
generating the decoding model based on the first language model;
generating a first scoring model and a second scoring model based on the first language model and the second language model;
obtaining the preset re-scoring model by combining the first scoring model and the second scoring model, wherein the combining comprises:
obtaining a second voice recognition result score by traversing the second scoring model, and synchronously traversing the first scoring model based on the traversing of the second scoring model to obtain a first voice recognition result score;
updating the second scoring model based on a difference between the first voice recognition result score and the second voice recognition result score; and
determining the preset re-scoring model based on the updated second scoring model;
and determining a target voice recognition result based on the candidate voice recognition result.
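
As an illustrative sketch of the pre-generation in claim 1, a toy n-gram pipeline is given below; the dictionary-based models, the helper names (train_bigram_lm, prune_lm, build_rescoring_table), the pruning threshold, and the probability floor are all assumptions made for readability, not the patented implementation:

import math
from collections import defaultdict

def train_bigram_lm(sentences):
    # Toy stand-in for the trained "second language model":
    # maps (history, word) -> log-probability.
    pair_counts, hist_counts = defaultdict(int), defaultdict(int)
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for h, w in zip(toks, toks[1:]):
            pair_counts[(h, w)] += 1
            hist_counts[h] += 1
    return {(h, w): math.log(c / hist_counts[h])
            for (h, w), c in pair_counts.items()}

def prune_lm(second_lm, threshold=-0.5):
    # Toy stand-in for obtaining the "first language model" by pruning:
    # drop low-probability entries so the decoding-side model stays small.
    return {hw: lp for hw, lp in second_lm.items() if lp > threshold}

def build_rescoring_table(first_lm, second_lm):
    # Traverse the second (full) model and, for each entry, look up the
    # first (pruned) model; the pre-stored score differences play the role
    # of the preset re-scoring model applied at decode time.
    floor = math.log(1e-9)  # crude stand-in for backed-off scores
    return {hw: lp2 - first_lm.get(hw, floor)
            for hw, lp2 in second_lm.items()}

second = train_bigram_lm(["the cat sat", "the cat ran"])
first = prune_lm(second)
corrections = build_rescoring_table(first, second)

In this sketch the decoding model would be built from the small first model, while the corrections table carries exactly the information needed to recover the second model's scores during decoding.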
2. The method of claim 1, wherein the preset re-scoring model includes a pre-stored correction value for the score, and the real-time correction comprises:
obtaining a corrected score by summing the score and the correction value, wherein the corrected score is the score of the voice recognition result of the decoding model after the real-time correction.
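
Continuing the same toy sketch (the corrections table and its (history, word) key shape are assumptions carried over from above), the real-time correction of claim 2 reduces to one lookup and one addition per hypothesis step:

def correct_in_real_time(decoder_score, history, word, corrections):
    # Sum the decoding model's partial score with the pre-stored correction
    # value; entries the two models score identically need no correction.
    return decoder_score + corrections.get((history, word), 0.0)

Because the correction values are pre-stored when the re-scoring model is generated, the decode-time cost is constant per word, which is what makes correcting scores in real time feasible.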
3. The method of claim 1, wherein the synchronously traversing the first scoring model comprises:
determining, in the first scoring model, an arc corresponding to an arc in the second scoring model based on that arc, wherein
when an arc consistent with the arc in the second scoring model is found in the first scoring model, the consistent arc is determined as the corresponding arc; and
when no arc consistent with the arc in the second scoring model is found in the first scoring model, a consistent arc reached with the least number of back-off steps is determined, by backing off, as the corresponding arc.
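
The back-off matching of claim 3 can be pictured over a toy automaton in which each state maps arc labels to a (weight, next state) pair and a reserved "<bo>" label marks the back-off arc to a lower-order state; this encoding is an assumption for illustration, not how the patent specifies its scoring models:

def find_corresponding_arc(model, state, label):
    # Walk back-off arcs until an arc consistent with `label` is found,
    # counting the back-off steps taken; claim 3 selects the match
    # reachable with the fewest such steps.
    steps = 0
    while state is not None:
        arcs = model.get(state, {})
        if label in arcs:
            weight, next_state = arcs[label]
            return weight, next_state, steps
        backoff = arcs.get("<bo>")
        state = backoff[1] if backoff else None
        steps += 1
    raise KeyError("no consistent arc for label " + repr(label))

first_model = {
    "cat": {"<bo>": (0.0, "<uni>")},
    "<uni>": {"sat": (-2.3, "<uni>"), "cat": (-1.9, "<uni>")},
}
# The pruned model lacks a bigram arc for "sat" out of state "cat", so the
# lookup backs off once and matches at the unigram state:
print(find_corresponding_arc(first_model, "cat", "sat"))  # (-2.3, '<uni>', 1)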
4. The method of claim 2, wherein the correction value is the difference between the voice recognition result scores of the first scoring model and the second scoring model.
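
As an illustrative numeric instance of claim 4 (the numbers are invented for clarity): if the second (full) scoring model assigns a hypothesis a log-score of -1.2 while the first (pruned) scoring model assigns it -2.0, the pre-stored correction value is their difference, -1.2 - (-2.0) = 0.8, under the sign convention that makes summing 0.8 with the decoding model's score at run time reproduce the full model's score without traversing the full model during decoding.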
5. A system for real-time re-scoring of voice recognition, comprising a feature acquisition module, a candidate result acquisition module, and a target result determination module, wherein:
the feature acquisition module is used for acquiring characteristics of voice frames in voice materials;
the candidate result acquisition module is used for obtaining a candidate voice recognition result through a decoding model and a preset re-scoring model based on the characteristics of the voice frames, wherein the preset re-scoring model is used for correcting, in real time, the score of the voice recognition result of the decoding model, and the preset re-scoring model is a pre-generated model; the system further comprises a language model acquisition module, a decoding model generation module, a scoring model generation module, and a re-scoring model generation module, wherein:
the language model acquisition module is used for acquiring a first language model and a second language model, the second language model being obtained by training a preset language model, and the first language model being obtained by pruning the second language model;
the decoding model generation module is used for generating the decoding model based on the first language model;
the scoring model generation module is used for generating a first scoring model and a second scoring model based on the first language model and the second language model;
the re-scoring model generation module is used for obtaining the preset re-scoring model by combining the first scoring model and the second scoring model, and comprises a score acquisition unit, a model updating unit, and a model determining unit, wherein:
the score acquisition unit is used for obtaining a second voice recognition result score by traversing the second scoring model, and synchronously traversing the first scoring model based on the traversing of the second scoring model to obtain a first voice recognition result score;
the model updating unit is used for updating the second scoring model based on a difference between the first voice recognition result score and the second voice recognition result score;
the model determining unit is used for determining the preset re-scoring model based on the updated second scoring model; and
the target result determination module is used for determining a target voice recognition result based on the candidate voice recognition result.
6. The system of claim 5, wherein the preset re-scoring model includes a pre-stored correction value for the score, and the real-time correction comprises:
obtaining a corrected score by summing the score and the correction value, wherein the corrected score is the score of the voice recognition result of the decoding model after the real-time correction.
7. The system of claim 5, wherein the synchronously traversing the first scoring model comprises:
determining, in the first scoring model, an arc corresponding to an arc in the second scoring model based on that arc, wherein
when an arc consistent with the arc in the second scoring model is found in the first scoring model, the consistent arc is determined as the corresponding arc; and
when no arc consistent with the arc in the second scoring model is found in the first scoring model, a consistent arc reached with the least number of back-off steps is determined, by backing off, as the corresponding arc.
8. The system of claim 6, wherein the correction value is the difference between the voice recognition result scores of the first scoring model and the second scoring model.
9. An apparatus for real-time re-scoring of voice recognition, comprising a processor configured to perform the method of real-time re-scoring of voice recognition of any one of claims 1-4.
10. A computer readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of real-time re-scoring of voice recognition according to any one of claims 1-4.
CN202111190697.XA 2021-10-13 2021-10-13 Method and system for real-time re-scoring of voice recognition Active CN113744723B (en)

Priority Applications (2)

Application Number (Publication) Priority Date Filing Date Title
CN202111190697.XA CN113744723B (en) 2021-10-13 2021-10-13 Method and system for real-time re-scoring of voice recognition
US17/660,407 US20230115271A1 (en) 2021-10-13 2022-04-23 Systems and methods for speech recognition

Publications (2)

Publication Number Publication Date
CN113744723A CN113744723A (en) 2021-12-03
CN113744723B true CN113744723B (en) 2024-01-30

Family

ID=78726666

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105513589A (en) * 2015-12-18 2016-04-20 百度在线网络技术(北京)有限公司 Speech recognition method and speech recognition device
CN105654945A (en) * 2015-10-29 2016-06-08 乐视致新电子科技(天津)有限公司 Training method of language model, apparatus and equipment thereof
US9966066B1 (en) * 2016-02-03 2018-05-08 Nvoq Incorporated System and methods for combining finite state transducer based speech recognizers
CN110473527A (en) * 2019-09-17 2019-11-19 浙江核新同花顺网络信息股份有限公司 A kind of method and system of speech recognition
US10943583B1 (en) * 2017-07-20 2021-03-09 Amazon Technologies, Inc. Creation of language models for speech recognition
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant