CN110797026A - Voice recognition method, device and storage medium - Google Patents

Voice recognition method, device and storage medium

Info

Publication number
CN110797026A
CN110797026A
Authority
CN
China
Prior art keywords
model
score
recognition result
language
vocabulary
Prior art date
Legal status
Pending
Application number
CN201910880013.5A
Other languages
Chinese (zh)
Inventor
康跃腾
付彦喆
王朋飞
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910880013.5A priority Critical patent/CN110797026A/en
Publication of CN110797026A publication Critical patent/CN110797026A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a voice recognition method, apparatus, and storage medium. The voice recognition method includes the following steps: extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions; when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions; and determining a final recognition result according to the first score and the second score. The embodiments of the present application improve the accuracy and efficiency of voice recognition.

Description

Voice recognition method, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
Key technologies of speech technology include Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes. Speech recognition for a specific domain can currently be achieved in several ways. First, domain-related corpora are combined with general corpora to retrain the language model and resynthesize the decoding graph HCLG. Second, in an end-to-end speech recognition system, domain-related audio and text are added and the relevant models are retrained on large-scale corpora. However, in actual scenarios the whole model then needs to be retrained and redeployed, which makes speech recognition inefficient; and because domain-related corpora are scarce, recognition accuracy is low.
Disclosure of Invention
The embodiments of the present application provide a voice recognition method, a voice recognition apparatus, and a storage medium, which can improve the accuracy and efficiency of speech recognition.
In a first aspect, an embodiment of the present application provides a speech recognition method, including:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
Wherein the vocabulary of the first model includes a special identifier, and the loading of the second model according to the domain type to which the language vocabulary belongs includes:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
inputting the extracted features of the voice signal stream into a first model to obtain a first recognition result, and determining a first score of the first recognition result comprises:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
the inputting the language vocabulary into the second model to obtain a second recognition result, and the determining a second score of the second recognition result comprises:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
Wherein the determining a final recognition result according to the first score and the second score comprises:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model; the method further comprises the following steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the processing module is used for extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
the processing module is further configured to, when the first model cannot recognize a language vocabulary in the speech signal stream, load a second model according to a domain type to which the language vocabulary belongs, input the language vocabulary to the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used to recognize a vocabulary of a domain utterance;
and the determining module is used for determining a final recognition result according to the first score and the second score.
The processing module is further configured to determine a special identifier corresponding to the domain type to which the language vocabulary belongs, and to search for the second model among a plurality of preset domain models according to the special identifier and load it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
the processing module is further configured to recognize the voice signal stream according to the first common language model to obtain a plurality of first recognition results, where one first recognition result corresponds to one first score; and to rank the plurality of first recognition results according to the first re-scoring language model and select the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
the processing module is further configured to recognize the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; and to rank the second recognition results according to the second re-scoring language model and select the second recognition result corresponding to the highest second score.
Wherein the determining module is further configured to calculate a weighted average of the first score and the second score; and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model;
the processing module is further configured to replace the recognition result for the vocabulary to be enhanced, recognized by the first model, with the recognition result for the vocabulary to be enhanced, recognized by the second model, and obtain the final recognition result.
In a third aspect, an embodiment of the present application provides a speech recognition device, including: a processor, a memory, and a communication bus, wherein the communication bus is used for realizing connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the speech recognition method provided in the first aspect.
In one possible design, the speech recognition device provided by the application may include a module for performing the corresponding behavior in the method. The modules may be software and/or hardware.
Yet another aspect of the embodiments of the present application provides a computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the method of the above-mentioned aspects.
Yet another aspect of the embodiments of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of the above-described aspects.
By implementing the embodiments of the present application, features of a voice signal stream are extracted and input into a first model to obtain a first recognition result, and a first score of the first recognition result is determined, wherein the first model is used for recognizing vocabulary of general expressions; when the first model cannot recognize a language vocabulary in the voice signal stream, a second model is loaded according to the domain type to which the language vocabulary belongs, the language vocabulary is input into the second model to obtain a second recognition result, and a second score of the second recognition result is determined, wherein the second model is used for recognizing vocabulary of domain expressions; and a final recognition result is determined according to the first score and the second score. By dynamically loading the domain model, the efficiency and accuracy of voice recognition are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application;
fig. 4 is a schematic diagram of a replace operation according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another speech recognition method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a speech recognition system according to an embodiment of the present application. The speech recognition system comprises a ROOT model and domain models. The ROOT model is used for recognizing vocabulary of general expressions and comprises a ROOT model (base) and a ROOT model (big): the ROOT model (base) comprises an acoustic model and a decoding graph HCLG, which are used for recognizing the extracted features of the voice signal stream, and the ROOT model (big) is used for ranking and re-scoring the recognition results. The domain models are used for recognizing vocabulary of domain expressions and may include domain models (OOV), i.e., domain model 1, domain model 2, ..., domain model n, as well as domain re-scoring (rescore) models. If the vocabulary of a certain domain needs to be recognized, the corresponding domain model can be loaded. Each domain model includes an acoustic model and a decoding graph OOV-HCLG, which are used for recognizing vocabulary of that domain, and the domain re-scoring model can be used for ranking and re-scoring the recognition results. The speech recognition system can be applied to cloud speech recognition assistants, WeChat mini programs, and the like.
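The structure in fig. 1 can be summarized with the following minimal sketch; all class and field names are illustrative assumptions rather than names drawn from the described implementation, and the sketch only shows how the components relate.

```python
# Illustrative sketch of the fig. 1 architecture; not a real decoder.
from dataclasses import dataclass, field

@dataclass
class RootModel:
    acoustic_model: object   # acoustic model A, shared across domains
    decoding_graph: object   # HCLG compiled from the 2-gram G11 (base)
    rescore_lm: object       # 5-gram G12 used for re-scoring (big)

@dataclass
class DomainModel:
    decoding_graph: object   # OOV-HCLG for the domain vocabulary
    rescore_graph: object    # rescore-HCLG for domain re-scoring

@dataclass
class SpeechRecognitionSystem:
    root: RootModel
    domains: dict = field(default_factory=dict)  # domain type -> DomainModel
```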
As shown in fig. 2, fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present application. The steps in the embodiments of the present application include at least:
s201, extracting the characteristics of the voice signal flow, inputting the characteristics into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing the vocabulary of the general expression.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model. A plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, where one first recognition result corresponds to one first score; the plurality of first recognition results are then ranked according to the first re-scoring language model, and the first recognition result corresponding to the highest first score is selected.
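As a minimal sketch of this two-pass selection, assuming a generic rescore_lm object with a score(text) method (an illustrative interface, not any particular toolkit's API):

```python
# Re-rank the first-pass n-best list with the re-scoring language model
# and keep the hypothesis with the highest re-scored value.
def pick_best_hypothesis(nbest, rescore_lm):
    """nbest: list of (text, first_pass_score) pairs from the first pass."""
    rescored = [(text, rescore_lm.score(text)) for text, _ in nbest]
    return max(rescored, key=lambda pair: pair[1])
```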
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model A and a language model G1 from audio and corpora of the general domain, respectively. The language model G1 is split into a 2-gram model G11 (the first common language model) and a 5-gram model G12 (the first re-scoring language model). A decoding graph HCLG can be generated from G11; speech recognition of general expressions is performed using the acoustic model and the decoding graph HCLG, and the recognition results of the acoustic model and the decoding graph HCLG are then re-scored by G12.
For example, the corpus "I want to know which hall the vermilion plaque faces" is subjected to word segmentation. Each single word, such as "I", "want", "know", "vermilion plaque", "faces", or "which hall", is a 1-gram; a pair of adjacent words, such as "vermilion plaque / faces", is a 2-gram; and a sequence of five adjacent words, such as "I / want / know / vermilion plaque / faces", is a 5-gram. The foregoing examples illustrate only some of the n-grams. G11 includes the 1-grams and 2-grams, and G12 includes the 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
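The split of G1 into G11 and G12 can be illustrated with a small counting sketch; the English tokens below merely stand in for the segmented Chinese corpus, and the code is an illustration rather than part of the described implementation.

```python
from collections import Counter

def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = ["I", "want", "know", "vermilion-plaque", "faces", "which-hall"]
g11 = count_ngrams(tokens, 2)  # 1-grams and 2-grams -> decoding graph HCLG
g12 = count_ngrams(tokens, 5)  # 1-grams through 5-grams -> re-scoring model
```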
S202, when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
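A simplified sketch of this fallback-and-replace step follows; in the real system the substitution happens inside the WFST search via the replace operation, and domain_decoder is a hypothetical stand-in for decoding against the pre-specified OOV-HCLG.

```python
OOV_TOKEN = "$OOV"  # 1-gram state fallback identifier of the ROOT model

def splice_domain_result(first_pass_tokens, audio_span, domain_decoder):
    """Replace each $OOV placeholder with the domain decoder's result."""
    spliced = []
    for token in first_pass_tokens:
        if token == OOV_TOKEN:
            # Search the domain decoding graph for the unrecognized span.
            spliced.extend(domain_decoder.decode(audio_span))
        else:
            spliced.append(token)
    return spliced
```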
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, when a new domain model is loaded, the previously loaded domain models are not released.
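That hot-loading behaviour amounts to a never-evicting cache, sketched below; load_domain_model is a hypothetical loader standing in for the real deployment mechanism.

```python
_loaded_domain_models = {}  # domain type -> loaded model; never evicted

def load_domain_model(domain_type):
    """Hypothetical stub: deserialize OOV-HCLG and rescore-HCLG for a domain."""
    raise NotImplementedError

def get_domain_model(domain_type):
    # The first query for a domain triggers loading; later queries reuse the
    # resident model, and loading a new domain releases none of the others.
    if domain_type not in _loaded_domain_models:
        _loaded_domain_models[domain_type] = load_domain_model(domain_type)
    return _loaded_domain_models[domain_type]
```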
S203, determining a final recognition result according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the final recognition result is determined according to the weighted average. For example, the recognition result with the highest weighted average may be selected as the final recognition result. Let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
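In code, the combination above is simply the following; the default weight is an illustrative choice within the stated ranges.

```python
def final_score(score1, score2, w1=0.95):
    """Weighted combination of the ROOT-model and domain-model scores."""
    w2 = 1.0 - w1  # the scheme requires W1 + W2 = 1
    return w1 * score1 + w2 * score2

# Example: pick the candidate with the highest combined score.
candidates = [("which shop", -12.0, -9.5), ("which hall", -12.3, -7.1)]
best = max(candidates, key=lambda c: final_score(c[1], c[2]))
```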
In the embodiments of the present application, by dynamically loading domain models, unregistered vocabulary in a domain scenario can be recognized efficiently, specified vocabulary in the original model can be recognized more strongly, and recognition of general expressions in voice recognition is not affected. Model training in the whole process takes only minutes, and the hot-loading approach is combined with the speech recognition model of the general domain, so that the accuracy and efficiency of voice recognition are improved.
As shown in fig. 5, fig. 5 is a schematic flowchart of another speech recognition method provided in the embodiment of the present application. The steps in the embodiments of the present application include at least:
s501, extracting the characteristics of the voice signal flow, inputting the characteristics into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing the vocabulary of the general expression.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated according to a binary language model, and the first re-scoring language model is generated according to a quinary language model; a plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, wherein one first recognition result corresponds to one first score; and then, sequencing the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model a and a language model G1 from audio and corpus of a general domain, respectively, where the language model G1 is split into 2-gram G11 (a first common language model) and 5-gram G12 (a first reprinting language model), a decoding graph HCLG may be generated from G11, speech recognition of the general term is performed using the acoustic model and the decoding graph HCLG, and then recognition results of the acoustic model and the decoding graph HCLG are reprinted by G12.
For example, the corpus "i want to know which hall a vermilion patch is facing" is subjected to the word segmentation processing. I think/know/vermilion patch/on/which hall/upward is 1-gram. I want to know that the jade tablet is 2-gram on/which hall. I want to know which palace/i want to know the 5-gram on the jade tablet. The foregoing examples illustrate only some of the words. Wherein G11 includes 1-grams and 2-grams, and G12 includes 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
S502, when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, previously loaded domain models are not released when new domain models are loaded.
S503, determining a final score according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the weighted average is taken as the final score. For example, let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
S504, replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
In a specific implementation, the recognition result with the highest final score can be selected, and within that recognition result, the words marked by the to-be-enhanced identifier <s>word</s> are replaced to obtain the final recognition result. For example, if "shop" is included in the recognition result, "shop" is replaced with "hall".
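A sketch of this replacement step follows; the <s>word</s> tagging format and the shop-to-hall mapping are taken from the example above, and the helper itself is illustrative.

```python
import re

ENHANCEMENTS = {"shop": "hall"}  # first-pass word -> domain-model word

def apply_enhancements(text):
    """Swap each <s>...</s>-tagged word for its domain-model recognition."""
    def swap(match):
        word = match.group(1)
        return ENHANCEMENTS.get(word, word)
    return re.sub(r"<s>(.*?)</s>", swap, text)

print(apply_enhancements("which <s>shop</s> does the plaque face"))
# -> which hall does the plaque face
```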
As shown in fig. 6, fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application. The device in the embodiment of the application at least comprises:
the processing module 601 is configured to extract features of a speech signal stream, input the features into a first model to obtain a first recognition result, and determine a first score of the first recognition result, where the first model is used to recognize a vocabulary of a general utterance.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated according to a binary language model, and the first re-scoring language model is generated according to a quinary language model; a plurality of first recognition results can be obtained by recognizing the voice signal stream according to the first common language model, wherein one first recognition result corresponds to one first score; and then, sequencing the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
As shown in fig. 1, the first model may be a ROOT model, which obtains an acoustic model a and a language model G1 from audio and corpus of a general domain, respectively, where the language model G1 is split into 2-gram G11 (a first common language model) and 5-gram G12 (a first reprinting language model), a decoding graph HCLG may be generated from G11, speech recognition of the general term is performed using the acoustic model and the decoding graph HCLG, and then recognition results of the acoustic model and the decoding graph HCLG are reprinted by G12.
For example, the corpus "i want to know which hall a vermilion patch is facing" is subjected to the word segmentation processing. I think/know/vermilion patch/on/which hall/upward is 1-gram. I want to know that the jade tablet is 2-gram on/which hall. I want to know which palace/i want to know the 5-gram on the jade tablet. The foregoing examples illustrate only some of the words. Wherein G11 includes 1-grams and 2-grams, and G12 includes 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.
The processing module 601 is further configured to, when the first model cannot recognize a language vocabulary in the voice signal stream, load a second model according to the domain type to which the language vocabulary belongs, input the language vocabulary into the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used for recognizing vocabulary of domain expressions.
In a specific implementation, a special identifier corresponding to the domain type to which the language vocabulary belongs can be determined, and the second model can be searched for among a plurality of preset domain models according to the special identifier and loaded.
For example, as shown in fig. 3, fig. 3 is a schematic diagram of a state fallback according to an embodiment of the present application. After the features of the signal stream are extracted, the signal stream enters the ROOT model for online decoding. Fallback is required if the ROOT model cannot recognize a certain domain vocabulary or vocabularies. Here, $OOV denotes a 1-gram state fallback identifier: the $OOV identifier is included in the fallback state of the ROOT language model, and if the 1-gram fallback state is reached during the decoding graph search, the search continues in a pre-specified domain decoding graph OOV-HCLG, and the search result is then substituted back in via the replace operation of a Weighted Finite-State Transducer (WFST). As shown in fig. 4, fig. 4 is a schematic diagram of the replace operation provided by an embodiment of the present application, in which the $OOV edge is replaced with the search result.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model. The language vocabulary can be recognized according to the second domain language model to obtain a plurality of second recognition results, where one second recognition result corresponds to one second score; the second recognition results are then ranked according to the second re-scoring language model, and the second recognition result corresponding to the highest second score is selected.
As shown in fig. 1, the second model may be a domain model. The domain model is divided into two parts, OOV-HCLG and rescore-HCLG. The language model of OOV-HCLG is a 1-gram model composed of the word set ${oov} of words not registered in the ROOT model for the domain scenario and the set ${<s>word</s>} of words to be enhanced in the ROOT model. For example, words such as endgate/Xiyan/Yongkang left gate all belong to the word set ${oov} not registered in the ROOT model; such words from the historical field are unusual words. As another example, "hall" is easily recognized as "shop" (the two are homophones in Chinese, both pronounced diàn), and thus "hall" is a word to be enhanced.
The language model of rescore-HCLG is generated from the word set ${oov} and specified example sentences, and is used for re-scoring the recognition results of the OOV-HCLG language model and enhancing recognition of domain expressions. When decoding is carried out through the ROOT model (base) and the language model of OOV-HCLG, the 5-gram language model of the ROOT model (big) and the rescore-HCLG of the domain model are used synchronously to perform re-scoring, respectively.
It should be noted that models of different domains can be integrated into one speech recognition system, and the domain models are hot-loaded in real time according to the independent retrieval (query) requests of each domain; that is, previously loaded domain models are not released when new domain models are loaded.
A determining module 602, configured to determine a final recognition result according to the first score and the second score.
In a specific implementation, a weighted average of the first score and the second score may be calculated, and the weighted average is taken as the final score. For example, the recognition result with the highest weighted average may be selected as the final recognition result. Let the first score and the second score be score1 and score2, with weights W1 and W2 set respectively, where W1 + W2 = 1; the final score is final_score = W1 * score1 + W2 * score2. W1 may take values in the range 0.9 to 0.99, and W2 in the range 0.01 to 0.1.
Optionally, the recognition result for the vocabulary to be enhanced recognized by the second model is substituted for the recognition result for the vocabulary to be enhanced recognized by the first model, to obtain the final recognition result. Specifically, the recognition result with the highest final score can be selected, and within that recognition result, the words marked by the to-be-enhanced identifier <s>word</s> are replaced to obtain the final recognition result. For example, if "shop" is included in the recognition result, "shop" is replaced with "hall".
In the embodiments of the present application, by dynamically loading domain models, unregistered vocabulary in a domain scenario can be recognized efficiently, specified vocabulary in the original model can be recognized more strongly, and recognition of general expressions in voice recognition is not affected. Model training in the whole process takes only minutes, and the hot-loading approach is combined with the speech recognition model of the general domain, so that the accuracy and efficiency of voice recognition are improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech recognition device according to an embodiment of the present application. As shown, the apparatus may include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704.
The processor 701 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, transistor logic, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 704 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean there is only one bus or one type of bus. The communication bus 704 is used to enable communication among these components. In this embodiment, the communication interface 702 of the device is used for signaling or data communication with other node devices. The memory 703 may include volatile memory, such as non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), or magnetoresistive random access memory (MRAM), and may further include non-volatile memory, such as at least one magnetic disk storage device, electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid state disk (SSD). The memory 703 may optionally be at least one storage device located remotely from the processor 701. A set of program codes is stored in the memory 703, and the processor 701 executes the program in the memory 703:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
Wherein the vocabulary of the first model includes a special identifier,
optionally, the processor 701 is further configured to perform the following operation steps:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
The first model comprises a first common language model and a first re-scoring language model, wherein the first common language model is generated from a bigram (2-gram) language model, and the first re-scoring language model is generated from a five-gram (5-gram) language model;
optionally, the processor 701 is further configured to perform the following operation steps:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
The second model comprises a second domain language model and a second re-scoring language model, and the second domain language model is generated from a unigram (1-gram) language model;
optionally, the processor 701 is further configured to perform the following operation steps:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
Optionally, the processor 701 is further configured to perform the following operation steps:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
Wherein the second domain language model comprises a vocabulary to be enhanced of the first model;
optionally, the processor 701 is further configured to perform the following operation steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
Further, the processor may cooperate with the memory and the communication interface to perform the operations performed by the speech recognition device in the embodiments of the above application.
The present application further provides a chip system, where the chip system includes a processor, and is configured to support a network device or a terminal device to implement the functions involved in any of the foregoing embodiments, such as generating or processing data and/or information involved in the foregoing methods. In one possible design, the system-on-chip may further include a memory for program instructions and data necessary for the speech recognition device. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Embodiments of the present application further provide a processor, coupled to the memory, for performing any of the methods and functions related to the voice recognition device in any of the embodiments.
Embodiments of the present application further provide a computer program product containing instructions, which when executed on a computer, cause the computer to perform any of the methods and functions related to the voice recognition device in any of the above embodiments.
Embodiments of the present application further provide an apparatus for performing any method and function related to a speech recognition device in any of the foregoing embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partially realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above embodiments further describe the objects, technical solutions, and advantages of the present application in detail. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (10)

1. A method of speech recognition, the method comprising:
extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
when the first model cannot recognize a language vocabulary in the voice signal stream, loading a second model according to the domain type to which the language vocabulary belongs, inputting the language vocabulary into the second model to obtain a second recognition result, and determining a second score of the second recognition result, wherein the second model is used for recognizing vocabulary of domain expressions;
and determining a final recognition result according to the first score and the second score.
2. The method of claim 1, wherein the vocabulary of the first model includes a special identifier, and wherein loading the second model according to a domain type to which the language vocabulary belongs comprises:
determining a special identifier corresponding to the field type to which the language vocabulary belongs;
and searching for the second model among a plurality of preset domain models according to the special identifier, and loading it.
3. The method of claim 1, wherein the first model comprises a first common language model generated from a bigram (2-gram) language model and a first re-scoring language model generated from a five-gram (5-gram) language model;
inputting the extracted features of the voice signal stream into a first model to obtain a first recognition result, and determining a first score of the first recognition result comprises:
recognizing the voice signal stream according to the first common language model to obtain a plurality of first recognition results, wherein one first recognition result corresponds to one first score;
and ranking the plurality of first recognition results according to the first re-scoring language model, and selecting the first recognition result corresponding to the highest first score.
4. The method of claim 1, wherein the second model comprises a second domain language model and a second re-scoring language model, the second domain language model being generated from a unigram (1-gram) language model;
the inputting the language vocabulary into the second model to obtain a second recognition result, and the determining a second score of the second recognition result comprises:
recognizing the language vocabulary according to the second domain language model to obtain a plurality of second recognition results, wherein one second recognition result corresponds to one second score;
and ranking the second recognition results according to the second re-scoring language model, and selecting the second recognition result corresponding to the highest second score.
5. The method of any of claims 1-4, wherein said determining a final recognition result based on the first score and the second score comprises:
calculating a weighted average of the first score and the second score;
and determining the final recognition result according to the weighted average value.
6. The method of claim 1, wherein the second domain language model comprises a vocabulary to be enhanced for the first model; the method further comprises the following steps:
and replacing the recognition result for the vocabulary to be enhanced recognized by the first model with the recognition result for the vocabulary to be enhanced recognized by the second model, to obtain the final recognition result.
7. A speech recognition apparatus, characterized in that the apparatus comprises:
the processing module is used for extracting features of a voice signal stream, inputting the features into a first model to obtain a first recognition result, and determining a first score of the first recognition result, wherein the first model is used for recognizing vocabulary of general expressions;
the processing module is further configured to, when the first model cannot recognize a language vocabulary in the speech signal stream, load a second model according to a domain type to which the language vocabulary belongs, input the language vocabulary to the second model to obtain a second recognition result, and determine a second score of the second recognition result, where the second model is used to recognize a vocabulary of a domain utterance;
and the determining module is used for determining a final recognition result according to the first score and the second score.
8. The apparatus of claim 7,
the processing module is further configured to determine a special identifier corresponding to the domain type to which the language vocabulary belongs, and to search for the second model among a plurality of preset domain models according to the special identifier and load it.
9. The apparatus of claim 7 or 8,
the determining module is further configured to calculate a weighted average of the first score and the second score; and determining the final recognition result according to the weighted average value.
10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1 to 6.
CN201910880013.5A 2019-09-17 2019-09-17 Voice recognition method, device and storage medium Pending CN110797026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910880013.5A CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910880013.5A CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Publications (1)

Publication Number Publication Date
CN110797026A true CN110797026A (en) 2020-02-14

Family

ID=69427269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910880013.5A Pending CN110797026A (en) 2019-09-17 2019-09-17 Voice recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110797026A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1320902A (en) * 2000-03-14 2001-11-07 索尼公司 Voice identifying device and method, and recording medium
CN1725295A (en) * 2004-07-22 2006-01-25 索尼株式会社 Speech processing apparatus, speech processing method, program, and recording medium
CN1979638A (en) * 2005-12-02 2007-06-13 中国科学院自动化研究所 Method for correcting error of voice identification result
CN108415898A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The word figure of deep learning language model beats again a point method and system
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN109215630A (en) * 2018-11-14 2019-01-15 北京羽扇智信息科技有限公司 Real-time speech recognition method, apparatus, equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508478A (en) * 2020-04-08 2020-08-07 北京字节跳动网络技术有限公司 Speech recognition method and device
CN112562640A (en) * 2020-12-01 2021-03-26 北京声智科技有限公司 Multi-language speech recognition method, device, system and computer readable storage medium
CN112562640B (en) * 2020-12-01 2024-04-12 北京声智科技有限公司 Multilingual speech recognition method, device, system, and computer-readable storage medium
CN112735380A (en) * 2020-12-28 2021-04-30 苏州思必驰信息科技有限公司 Scoring method and voice recognition method for re-scoring language model
CN112885336A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system, and electronic equipment
CN112885336B (en) * 2021-01-29 2024-02-02 深圳前海微众银行股份有限公司 Training and recognition method and device of voice recognition system and electronic equipment
CN113299280A (en) * 2021-05-12 2021-08-24 山东浪潮科学研究院有限公司 Professional vocabulary speech recognition method based on Kaldi
WO2022267451A1 (en) * 2021-06-24 2022-12-29 平安科技(深圳)有限公司 Automatic speech recognition method based on neural network, device, and readable storage medium

Similar Documents

Publication Publication Date Title
CN110797026A (en) Voice recognition method, device and storage medium
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
US8180641B2 (en) Sequential speech recognition with two unequal ASR systems
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in an end-to-end model
CN113811946A (en) End-to-end automatic speech recognition of digital sequences
EP3405912A1 (en) Analyzing textual data
CN111292740B (en) Speech recognition system and method thereof
CN112016275A (en) Intelligent error correction method and system for voice recognition text and electronic equipment
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
JP2020004382A (en) Method and device for voice interaction
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
CN114154487A (en) Text automatic error correction method and device, electronic equipment and storage medium
CN114999463B (en) Voice recognition method, device, equipment and medium
CN112331229A (en) Voice detection method, device, medium and computing equipment
CN112562640A (en) Multi-language speech recognition method, device, system and computer readable storage medium
Sokolov et al. Neural machine translation for multilingual grapheme-to-phoneme conversion
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
US11126797B2 (en) Toxic vector mapping across languages
US20170031893A1 (en) Set-based Parsing for Computer-Implemented Linguistic Analysis
CN111326144A (en) Voice data processing method, device, medium and computing equipment
WO2022203773A1 (en) Lookup-table recurrent language model
CN111063337B (en) Large-scale voice recognition method and system capable of rapidly updating language model
CN114783405B (en) Speech synthesis method, device, electronic equipment and storage medium
US20100145677A1 (en) System and Method for Making a User Dependent Language Model

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK
Ref legal event code: DE
Ref document number: 40021011
Country of ref document: HK

SE01 Entry into force of request for substantive examination