CN110797026A - Voice recognition method, device and storage medium - Google Patents

Voice recognition method, device and storage medium

Info

Publication number
CN110797026A
CN110797026A
Authority
CN
China
Prior art keywords
model
score
recognition result
language
vocabulary
Prior art date
Legal status
Pending
Application number
CN201910880013.5A
Other languages
Chinese (zh)
Inventor
康跃腾
付彦喆
王朋飞
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910880013.5A priority Critical patent/CN110797026A/en
Publication of CN110797026A publication Critical patent/CN110797026A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a voice recognition method, apparatus, and storage medium. The voice recognition method includes the following steps: extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions; when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions; and determining a final recognition result according to the first score and the second score. The embodiments of the present application improve the accuracy and efficiency of voice recognition.

Description

Voice recognition method, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
Key technologies of speech technology include Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes. Speech recognition for a specific domain can currently be achieved in several ways. First, domain-related corpora are combined with general corpora to retrain the language model and resynthesize the decoding graph HCLG. Second, in an end-to-end speech recognition system, domain-related audio and text are added and the relevant models are retrained on large-scale corpora. However, in actual scenarios the whole model then needs to be retrained and redeployed, which makes speech recognition inefficient; and because domain-related corpora are scarce, recognition accuracy is low.
Disclosure of Invention
The embodiments of the present application provide a voice recognition method, a voice recognition apparatus, and a storage medium, which can improve the accuracy and efficiency of speech recognition.
In a first aspect, an embodiment of the present application provides a speech recognition method, including:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
Wherein the vocabulary of the first model includes a special identifier, and the loading of the second model according to the domain type to which the language vocabulary belongs includes:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
inputting the extracted features of the voice signal stream into a first model to obtain a first recognition result, and determining a first score of the first recognition result comprises:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
the inputting the language vocabulary into the second model to obtain a second recognition result, and the determining a second score of the second recognition result comprises:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
Wherein the determining a final recognition result according to the first score and the second score comprises:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model; the method further comprises the following steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the processing module is used for extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
the processing module is further configured to, when the first model cannot recognize a language vocabulary in the speech signal stream, load a second model according to a domain type to which the language vocabulary belongs, input the language vocabulary to the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used to recognize a vocabulary of a domain utterance;
and the determining module is used for determining a final recognition result according to the first score and the second score.
The processing module is further configured to determine a special identifier corresponding to the domain type to which the language vocabulary belongs, and to search for the second model among a plurality of preset domain models according to the special identifier and load it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
the processing module is further configured to recognize the voice signal stream according to the first common language model to obtain a plurality of first recognition results, where one first recognition result corresponds to one first score; and to rank the plurality of first recognition results according to the first re-scoring language model and select the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
the processing module is further configured to recognize the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; and to rank the second recognition results according to the second re-scoring language model and select the second recognition result corresponding to the highest second score.
Wherein the determining module is further configured to calculate a weighted average of the first score and the second score; and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model;
the processing module is further configured to replace the recognition result for the vocabulary to be enhanced, recognized by the first model, with the recognition result for the vocabulary to be enhanced, recognized by the second model, and obtain the final recognition result.
In a third aspect, an embodiment of the present application provides a speech recognition device, including: a processor, a memory, and a communication bus, wherein the communication bus is used for realizing connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the speech recognition method provided in the first aspect.
In one possible design, the speech recognition device provided by the application may include a module for performing the corresponding behavior in the method. The modules may be software and/or hardware.
Yet another aspect of the embodiments of the present application provides a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the method of the above-mentioned aspects.
Yet another aspect of the embodiments of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
By implementing the embodiments of the present application, features of a voice signal stream are extracted and input into a first model to obtain a first recognition result, and a first score of the first recognition result is determined, wherein the first model is used for recognizing vocabulary of general expressions; when the first model cannot recognize a language vocabulary in the voice signal stream, a second model is loaded according to the domain type to which the language vocabulary belongs, the language vocabulary is input into the second model to obtain a second recognition result, and a second score of the second recognition result is determined, wherein the second model is used for recognizing vocabulary of domain expressions; and a final recognition result is determined according to the first score and the second score. By dynamically loading the domain model, the efficiency and accuracy of voice recognition are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application;
fig. 4 is a schematic diagram of a replace operation according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another speech recognition method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application. The speech recognition system comprises a ROOT model and domain models. The ROOT model is used for recognizing vocabulary of general expressions and comprises a ROOT model (base) and a ROOT model (big): the ROOT model (base) comprises an acoustic model and a decoding graph HCLG, which are used for recognizing the extracted features of the voice signal stream, and the ROOT model (big) is used for ranking and re-scoring the recognition results. The domain models are used for recognizing vocabulary of domain expressions and may include domain models (OOV), i.e., domain model 1, domain model 2, ..., domain model n, as well as domain re-scoring (rescore) models. If the vocabulary of a certain domain needs to be recognized, the corresponding domain model can be loaded. Each domain model includes an acoustic model and a decoding graph OOV-HCLG, which are used for recognizing vocabulary of that domain, and the domain re-scoring model can be used for ranking and re-scoring the recognition results. The speech recognition system can be applied to cloud speech recognition assistants, WeChat mini programs, and the like.
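The structure in fig. 1 can be summarized with the following minimal sketch; all class and field names are illustrative assumptions rather than names drawn from the described implementation, and the sketch only shows how the components relate.

```python
# Illustrative sketch of the fig. 1 architecture; not a real decoder.
from dataclasses import dataclass, field

@dataclass
class RootModel:
    acoustic_model: object   # acoustic model A, shared across domains
    decoding_graph: object   # HCLG compiled from the 2-gram G11 (base)
    rescore_lm: object       # 5-gram G12 used for re-scoring (big)

@dataclass
class DomainModel:
    decoding_graph: object   # OOV-HCLG for the domain vocabulary
    rescore_graph: object    # rescore-HCLG for domain re-scoring

@dataclass
class SpeechRecognitionSystem:
    root: RootModel
    domains: dict = field(default_factory=dict)  # domain type -> DomainModel
```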
As shown in fig. 2, fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application. The steps in the embodiments of the present application include at least:
s201, extracting the characteristics of the voice signal flow, inputting the characteristics into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing the vocabulary of the general expression.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model. A plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, where one first recognition result corresponds to one first score; the plurality of first recognition results are then ranked according to the first re-scoring language model, and the first recognition result corresponding to the highest first score is selected.
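As a minimal sketch of this two-pass selection, assuming a generic rescore_lm object with a score(text) method (an illustrative interface, not any particular toolkit's API):

```python
# Re-rank the first-pass n-best list with the re-scoring language model
# and keep the hypothesis with the highest re-scored value.
def pick_best_hypothesis(nbest, rescore_lm):
    """nbest: list of (text, first_pass_score) pairs from the first pass."""
    rescored = [(text, rescore_lm.score(text)) for text, _ in nbest]
    return max(rescored, key=lambda pair: pair[1])
```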
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model A and a language model G1 from audio and corpora of the general domain, respectively. The language model G1 is split into a 2-gram model G11 (the first common language model) and a 5-gram model G12 (the first re-scoring language model). A decoding graph HCLG can be generated from G11; speech recognition of general expressions is performed using the acoustic model and the decoding graph HCLG, and the recognition results of the acoustic model and the decoding graph HCLG are then re-scored by G12.
For example, the corpus "I want to know which hall the vermilion plaque faces" is subjected to word segmentation. Each single word, such as "I", "want", "know", "vermilion plaque", "faces", or "which hall", is a 1-gram; a pair of adjacent words, such as "vermilion plaque / faces", is a 2-gram; and a sequence of five adjacent words, such as "I / want / know / vermilion plaque / faces", is a 5-gram. The foregoing examples illustrate only some of the n-grams. G11 includes the 1-grams and 2-grams, and G12 includes the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
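The split of G1 into G11 and G12 can be illustrated with a small counting sketch; the English tokens below merely stand in for the segmented Chinese corpus, and the code is an illustration rather than part of the described implementation.

```python
from collections import Counter

def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = ["I", "want", "know", "vermilion-plaque", "faces", "which-hall"]
g11 = count_ngrams(tokens, 2)  # 1-grams and 2-grams -> decoding graph HCLG
g12 = count_ngrams(tokens, 5)  # 1-grams through 5-grams -> re-scoring model
```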
S202, when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
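A simplified sketch of this fallback-and-replace step follows; in the real system the substitution happens inside the WFST search via the replace operation, and domain_decoder is a hypothetical stand-in for decoding against the pre-specified OOV-HCLG.

```python
OOV_TOKEN = "$OOV"  # 1-gram state fallback identifier of the ROOT model

def splice_domain_result(first_pass_tokens, audio_span, domain_decoder):
    """Replace each $OOV placeholder with the domain decoder's result."""
    spliced = []
    for token in first_pass_tokens:
        if token == OOV_TOKEN:
            # Search the domain decoding graph for the unrecognized span.
            spliced.extend(domain_decoder.decode(audio_span))
        else:
            spliced.append(token)
    return spliced
```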
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, when a new domain model is loaded, the previously loaded domain models are not released.
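That hot-loading behaviour amounts to a never-evicting cache, sketched below; load_domain_model is a hypothetical loader standing in for the real deployment mechanism.

```python
_loaded_domain_models = {}  # domain type -> loaded model; never evicted

def load_domain_model(domain_type):
    """Hypothetical stub: deserialize OOV-HCLG and rescore-HCLG for a domain."""
    raise NotImplementedError

def get_domain_model(domain_type):
    # The first query for a domain triggers loading; later queries reuse the
    # resident model, and loading a new domain releases none of the others.
    if domain_type not in _loaded_domain_models:
        _loaded_domain_models[domain_type] = load_domain_model(domain_type)
    return _loaded_domain_models[domain_type]
```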
S203, determining a final recognition result according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the final recognition result is determined according to the weighted average. For example, the recognition result with the highest weighted average may be selected as the final recognition result. Let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
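In code, the combination above is simply the following; the default weight is an illustrative choice within the stated ranges.

```python
def final_score(score1, score2, w1=0.95):
    """Weighted combination of the ROOT-model and domain-model scores."""
    w2 = 1.0 - w1  # the scheme requires W1 + W2 = 1
    return w1 * score1 + w2 * score2

# Example: pick the candidate with the highest combined score.
candidates = [("which shop", -12.0, -9.5), ("which hall", -12.3, -7.1)]
best = max(candidates, key=lambda c: final_score(c[1], c[2]))
```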
In the embodiments of the present application, by dynamically loading domain models, unregistered vocabulary in a domain scenario can be recognized efficiently, specified vocabulary in the original model can be recognized more strongly, and recognition of general expressions in voice recognition is not affected. Model training in the whole process takes only minutes, and the hot-loading approach is combined with the speech recognition model of the general domain, so that the accuracy and efficiency of voice recognition are improved.
As shown in fig. 5, fig. 5 is a schematic flowchart of another speech recognition method provided in the embodiment of the present application. The steps in the embodiments of the present application include at least:
s501, extracting the characteristics of the voice signal flow, inputting the characteristics into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing the vocabulary of the general expression.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated according to a binary language model, and the first re-scoring language model is generated according to a quinary language model; a plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, wherein one first recognition result corresponds to one first score; and then, sequencing the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model a and a language model G1 from audio and corpus of a general domain, respectively, where the language model G1 is split into 2-gram G11 (a first common language model) and 5-gram G12 (a first reprinting language model), a decoding graph HCLG may be generated from G11, speech recognition of the general term is performed using the acoustic model and the decoding graph HCLG, and then recognition results of the acoustic model and the decoding graph HCLG are reprinted by G12.
For example, the corpus "i want to know which hall a vermilion patch is facing" is subjected to the word segmentation processing. I think/know/vermilion patch/on/which hall/upward is 1-gram. I want to know that the jade tablet is 2-gram on/which hall. I want to know which palace/i want to know the 5-gram on the jade tablet. The foregoing examples illustrate only some of the words. Wherein G11 includes 1-grams and 2-grams, and G12 includes 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
S502, when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, previously loaded domain models are not released when new domain models are loaded.
S503, determining a final score according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the weighted average is taken as the final score. For example, let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
S504, replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
In a specific implementation, the recognition result with the highest final score can be selected, and within that recognition result, the words marked by the to-be-enhanced identifier <s>word</s> are replaced to obtain the final recognition result. For example, if "shop" is included in the recognition result, "shop" is replaced with "hall".
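A sketch of this replacement step follows; the <s>word</s> tagging format and the shop-to-hall mapping are taken from the example above, and the helper itself is illustrative.

```python
import re

ENHANCEMENTS = {"shop": "hall"}  # first-pass word -> domain-model word

def apply_enhancements(text):
    """Swap each <s>...</s>-tagged word for its domain-model recognition."""
    def swap(match):
        word = match.group(1)
        return ENHANCEMENTS.get(word, word)
    return re.sub(r"<s>(.*?)</s>", swap, text)

print(apply_enhancements("which <s>shop</s> does the plaque face"))
# -> which hall does the plaque face
```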
As shown in fig. 6, fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. The device in the embodiment of the application at least comprises:
the processing module 601 is configured to extract features of a speech signal stream, input the features into a first model to obtain a first recognition result, and determine a first score of the first recognition result, where the first model is used to recognize a vocabulary of a general utterance.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated according to a binary language model, and the first re-scoring language model is generated according to a quinary language model; a plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, wherein one first recognition result corresponds to one first score; and then, sequencing the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model a and a language model G1 from audio and corpus of a general domain, respectively, where the language model G1 is split into 2-gram G11 (a first common language model) and 5-gram G12 (a first reprinting language model), a decoding graph HCLG may be generated from G11, speech recognition of the general term is performed using the acoustic model and the decoding graph HCLG, and then recognition results of the acoustic model and the decoding graph HCLG are reprinted by G12.
For example, the corpus "i want to know which hall a vermilion patch is facing" is subjected to the word segmentation processing. I think/know/vermilion patch/on/which hall/upward is 1-gram. I want to know that the jade tablet is 2-gram on/which hall. I want to know which palace/i want to know the 5-gram on the jade tablet. The foregoing examples illustrate only some of the words. Wherein G11 includes 1-grams and 2-grams, and G12 includes 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
The processing module 601 is further configured to, when the first model cannot recognize a language vocabulary in the voice signal stream, load a second model according to the domain type to which the language vocabulary belongs, input the language vocabulary into the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, previously loaded domain models are not released when new domain models are loaded.
A determining module 602, configured to determine a final recognition result according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the weighted average is taken as the final score. For example, the recognition result with the highest weighted average may be selected as the final recognition result. Let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
Optionally, the recognition result for the vocabulary to be enhanced recognized by the second model is substituted for the recognition result for the vocabulary to be enhanced recognized by the first model, to obtain the final recognition result. Specifically, the recognition result with the highest final score can be selected, and within that recognition result, the words marked by the to-be-enhanced identifier <s>word</s> are replaced to obtain the final recognition result. For example, if "shop" is included in the recognition result, "shop" is replaced with "hall".
In the embodiments of the present application, by dynamically loading domain models, unregistered vocabulary in a domain scenario can be recognized efficiently, specified vocabulary in the original model can be recognized more strongly, and recognition of general expressions in voice recognition is not affected. Model training in the whole process takes only minutes, and the hot-loading approach is combined with the speech recognition model of the general domain, so that the accuracy and efficiency of voice recognition are improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application. As shown, the apparatus may include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704.
The processor 701 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, transistor logic, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean there is only one bus or one type of bus. The communication bus 704 is used to enable communication among these components. In this embodiment, the communication interface 702 of the device is used for signaling or data communication with other node devices. The memory 703 may include volatile memory, such as non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), or magnetoresistive random access memory (MRAM), and may further include non-volatile memory, such as at least one magnetic disk storage device, electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid state disk (SSD). The memory 703 may optionally be at least one storage device located remotely from the processor 701. A set of program codes is stored in the memory 703, and the processor 701 executes the program in the memory 703:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
Wherein the vocabulary of the first model includes a special identifier,
optionally, the processor 701 is further configured to perform the following operation steps:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
optionally, the processor 701 is further configured to perform the following operation steps:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
optionally, the processor 701 is further configured to perform the following operation steps:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
Optionally, the processor 701 is further configured to perform the following operation steps:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model;
optionally, the processor 701 is further configured to perform the following operation steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
Further, the processor may cooperate with the memory and the communication interface to perform the operations performed by the speech recognition device in the embodiments of the above application.
The present application further provides a chip system, where the chip system includes a processor, and is configured to support a network device or a terminal device to implement the functions involved in any of the foregoing embodiments, such as generating or processing data and/or information involved in the foregoing methods. In one possible design, the system-on-chip may further include a memory for program instructions and data necessary for the speech recognition device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Embodiments of the present application further provide a processor, coupled to the memory, for performing any of the methods and functions related to the voice recognition device in any of the embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when executed on a computer, cause the computer to perform any of the methods and functions related to the voice recognition device in any of the above embodiments.
Embodiments of the present application further provide an apparatus for performing any method and function related to a speech recognition device in any of the foregoing embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partially realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above embodiments further describe the objects, technical solutions, and advantages of the present application in detail. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (10)

1. A method of speech recognition, the method comprising:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
2. The method of claim 1, wherein the vocabulary of the first model includes a special identifier, and wherein loading the second model according to a domain type to which the language vocabulary belongs comprises:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
3. The method of claim 1, wherein the first model comprises a first common language model generated from a bigram (2-gram) language model and a first re-scoring language model generated from a five-gram (5-gram) language model;
inputting the extracted features of the voice signal stream into a first model to obtain a first recognition result, and determining a first score of the first recognition result comprises:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
4. The method of claim 1, wherein the second model comprises a second domain language model and a second re-scoring language model, the second domain language model being generated from a unigram (1-gram) language model;
the inputting the language vocabulary into the second model to obtain a second recognition result, and the determining a second score of the second recognition result comprises:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
5. The method of any of claims 1-4, wherein said determining a final recognition result based on the first score and the second score comprises:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
6. The method of claim 1, wherein the second domain language model comprises a vocabulary to be enhanced for the first model; the method further comprises the following steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
7. A speech recognition apparatus, characterized in that the apparatus comprises:
the processing module is used for extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
the processing module is further configured to, when the first model cannot recognize a language vocabulary in the speech signal stream, load a second model according to a domain type to which the language vocabulary belongs, input the language vocabulary to the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used to recognize a vocabulary of a domain utterance;
and the determining module is used for determining a final recognition result according to the first score and the second score.
8. The apparatus of claim 7,
the processing module is further configured to determine a special identifier corresponding to the domain type to which the language vocabulary belongs, and to search for the second model among a plurality of preset domain models according to the special identifier and load it.
9. The apparatus of claim 7 or 8,
the determining module is further configured to calculate a weighted average of the first score and the second score; and determining the final recognition result according to the weighted average value.
10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 6.
CN201910880013.5A 2019-09-17 2019-09-17 Voice recognition method, device and storage medium Pending CN110797026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910880013.5A CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910880013.5A CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Publications (1)

Publication Number Publication Date
CN110797026A true CN110797026A (en) 2020-02-14

Family

ID=69427269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910880013.5A Pending CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110797026A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320902A (en) * 2000-03-14 2001-11-07 索尼公司 Voice identifying device and method, and recording medium
CN1725295A (en) * 2004-07-22 2006-01-25 索尼株式会社 Speech processing apparatus, speech processing method, program, and recording medium
CN1979638A (en) * 2005-12-02 2007-06-13 中国科学院自动化研究所 Method for correcting error of voice identification result
CN108415898A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The word figure of deep learning language model beats again a point method and system
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN109215630A (en) * 2018-11-14 2019-01-15 北京羽扇智信息科技有限公司 Real-time speech recognition method, apparatus, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508478A (en) * 2020-04-08 2020-08-07 北京字节跳动网络技术有限公司 Speech recognition method and device
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium
CN112735380A (en) * 2020-12-28 2021-04-30 苏州思必驰信息科技有限公司 Scoring method and voice recognition method for re-scoring language model
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment
CN112885336B (en) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system and electronic equipment
CN113299280A (en) * 2021-05-12 2021-08-24 山东浪潮科学研究院有限公司 Professional vocabulary speech recognition method based on Kaldi
WO2022267451A1 (en) * 2021-06-24 2022-12-29 平安科技(深圳)有限公司 Automatic speech recognition method based on neural network, device, and readable storage medium

Similar Documents

Publication Publication Date Title
CN110797026A (en) Voice recognition method, device and storage medium
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
US8180641B2 (en) Sequential speech recognition with two unequal ASR systems
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in an end-to-end model
CN113811946A (en) End-to-end automatic speech recognition of digital sequences
EP3405912A1 (en) Analyzing textual data
CN111292740B (en) Speech recognition system and method thereof
CN112016275A (en) Intelligent error correction method and system for voice recognition text and electronic equipment
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
JP2020004382A (en) Method and device for voice interaction
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
CN114154487A (en) Text automatic error correction method and device, electronic equipment and storage medium
CN114999463B (en) Voice recognition method, device, equipment and medium
CN112331229A (en) Voice detection method, device, medium and computing equipment
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
Sokolov et al. Neural machine translation for multilingual grapheme-to-phoneme conversion
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
US11126797B2 (en) Toxic vector mapping across languages
US20170031893A1 (en) Set-based Parsing for Computer-Implemented Linguistic Analysis
CN111326144A (en) Voice data processing method, device, medium and computing equipment
WO2022203773A1 (en) Lookup-table recurrent language model
CN111063337B (en) Large-scale voice recognition method and system capable of rapidly updating language model
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
US20100145677A1 (en) System and Method for Making a User Dependent Language Model

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK
Ref legal event code: DE
Ref document number: 40021011
Country of ref document: HK

SE01 Entry into force of request for substantive examination