WO2024124697A1 - Speech recognition method, apparatus and device, and storage medium - Google Patents



Publication number
WO2024124697A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/078636
Other languages
French (fr)
Chinese (zh)
Inventor
潘嘉
王孟之
万根顺
刘聪
刘庆峰
Original Assignee
科大讯飞股份有限公司
科大讯飞(苏州)科技有限公司
Application filed by 科大讯飞股份有限公司 and 科大讯飞(苏州)科技有限公司
Publication of WO2024124697A1

Definitions

  • domain speech recognition technology has been widely used, covering all areas of human-computer interaction.
  • the core difficulty of domain speech recognition lies in the existence of a large number of domain-specific entity words.
  • Domain-specific entity words, especially low-frequency words, usually appear infrequently in the training data of speech recognition models, and domain-specific entity vocabulary is constantly updated. For example, in voice navigation applications, new company names and place names continue to appear.
  • the above characteristics of domain-specific entity words determine that in practical applications, the speech recognition system needs to be continuously updated to achieve a high accuracy rate in domain speech recognition.
  • the existing technology is not stable in improving the recognition accuracy of newly added domain entity words.
  • the improvement in recognition accuracy is highly dependent on the constructed training corpus.
  • the recognition accuracy is usually improved very little when the context is changed.
  • this application is proposed to provide a speech recognition method, device, equipment and storage medium to ensure the recognition accuracy of newly appeared domain entity words without updating the speech recognition model.
  • the specific solution is as follows:
  • a speech recognition method comprising:
  • the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final recognition text.
  • the above process of obtaining a preliminary recognized text based on the speech to be recognized, obtaining entity word characters corresponding to the entity word category label, replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters, and obtaining the final recognized text is achieved through a preconfigured speech recognition model.
  • it also includes:
  • the syllable or phoneme corresponding to the domain entity word is determined, and the correspondence between the domain entity word and the syllable or phoneme is added to the preset pronunciation dictionary, and the domain entity word is added to the language model.
  • the language model is a language model constructed based on entity words in various fields.
  • the speech recognition model includes an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to decode, using characters as modeling units and based on the acoustic coding features, to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder uses characters as modeling units and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features, including:
  • the first-level decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the first-level decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the primary decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the primary decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, including:
  • the first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame, and obtains the acoustic coding feature c_t when decoding the t-th character. Based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded, until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, including:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the training process of the speech recognition model includes:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
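The joint training objective above can be sketched as a weighted combination of the two decoders' losses. This is an illustrative sketch only: the cross-entropy form and the weight `alpha` are assumptions, not details stated in the patent.

```python
# Hypothetical sketch: combine the first-level decoder's loss (characters/
# labels) and the second-level decoder's loss (syllables/phonemes) into one
# training objective.
import math

def cross_entropy(predicted_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(predicted_probs[target_index])

def combined_loss(probs_dec1, target_dec1, probs_dec2, target_dec2, alpha=0.5):
    loss1 = cross_entropy(probs_dec1, target_dec1)  # first loss function
    loss2 = cross_entropy(probs_dec2, target_dec2)  # second loss function
    return alpha * loss1 + (1 - alpha) * loss2

# toy 3-class output distributions for each decoder
loss = combined_loss([0.7, 0.2, 0.1], 0, [0.1, 0.8, 0.1], 1)
```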
  • the step of replacing the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text includes:
  • a speech recognition method comprising:
  • obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • utilizing the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word character corresponding to the entity word category label is obtained;
  • the entity word character replaces the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
  • the primary identification module comprises:
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to use characters as modeling units and decode based on the acoustic coding features to obtain preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary recognition module includes: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain entity word characters corresponding to the entity word category label.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a preliminary recognition text determination unit used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the final recognition text determination unit is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a speech recognition model processing unit is used to use the primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech fragment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • a speech recognition device comprising: a memory and a processor
  • the memory is used to store programs
  • the processor is used to execute the program to implement the various steps of the speech recognition method described above.
  • a storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the various steps of the speech recognition method described above are implemented.
  • the present application divides the speech recognition process into two stages.
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters can be obtained based on the speech to be recognized.
  • in the second recognition stage, based on the speech segment corresponding to the entity word category label in the speech to be recognized and the preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final output recognition text.
  • the entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words can be directly recognized as characters, which can greatly reduce the probability that low-frequency entity words or newly appeared entity words in the same category label are mistakenly recognized as characters under non-category labels, and improve the recognition accuracy of entity words.
  • the entity words corresponding to the entity word category label are predicted in combination with the pronunciation dictionary and the language model. When a new domain entity word appears, the newly appeared domain entity word can be added to the preset pronunciation dictionary and language model, so that the recognition accuracy of the newly appeared domain entity word can be guaranteed.
  • FIG1 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • FIG2 illustrates a schematic diagram of the structure of a speech recognition model
  • FIG3 illustrates a schematic diagram of a two-stage decoding process of a speech recognition model
  • FIG4 illustrates a schematic diagram of a process for determining a decoded character by combining a pronunciation dictionary and a language model
  • FIG5 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • FIG6 is a flow chart of another speech recognition method provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of another speech recognition device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • the present application provides a speech recognition solution that can be applied to various scenarios for speech recognition, especially for domain entity word speech recognition scenarios, and can ensure a high recognition accuracy rate for newly emerging domain entity words.
  • the present application solution can be implemented based on a terminal with data processing capabilities, which can be a mobile phone, computer, server, cloud, etc.
  • the speech recognition method of the present application may include the following steps:
  • Step S100: Acquire the speech to be recognized.
  • Step S110: Obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the speech to be recognized.
  • the entity word category label is the pre-set field category to which the entity word belongs, such as labels such as person name, place name, organization name, singer, song, drug name, film and television drama name, etc.
  • the speech recognition method provided in this embodiment divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words under the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the process of identifying and obtaining the preliminary recognized text in this step can adopt an end-to-end modeling method, that is, taking characters as modeling units, and obtaining the preliminary recognized text based on the decoding of the speech to be recognized.
  • other modeling methods can also be adopted, such as taking syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain the preliminary recognized text.
  • Step S120: Combining a preset pronunciation dictionary and a language model, obtain the entity word characters corresponding to the entity word category labels, replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters, and obtain the final recognition text.
  • the entity word characters corresponding to the entity word category labels can be obtained by decoding based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types.
  • syllables or phonemes can be selected as modeling units, and the entity word characters corresponding to the entity word category labels in the speech to be recognized can be obtained by decoding based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the corresponding pronunciation dictionary may be a syllable pronunciation dictionary, which includes the correspondence between syllables and characters.
  • the corresponding pronunciation dictionary may be a phoneme pronunciation dictionary, which includes the correspondence between phonemes and characters.
  • the corresponding phonemes or syllables can be decoded based on the speech segments corresponding to the entity word category labels in the speech to be recognized. Further, the candidate characters corresponding to the phonemes or syllables are determined in combination with the pronunciation dictionary, the probability score of each candidate character is determined in combination with the language model, and the candidate character with the highest probability score is selected as the entity word character corresponding to the entity word category label.
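The candidate-selection step described above can be sketched as follows. The dictionary entries, the stand-in characters (Latin letters in place of Chinese characters), and the language-model scores are invented placeholders, and the greedy per-syllable selection is a simplification of the joint scoring a real decoder would perform.

```python
# Hypothetical sketch: map decoded syllables to candidate characters via a
# pronunciation dictionary, then pick the candidate with the highest
# language-model score. All data below is made up for illustration.

pronunciation_dict = {            # syllable -> candidate characters
    "ping2": ["A", "B"],
    "guo3": ["C", "D", "E"],
}
lm_score = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 3.0, "E": 1.5}

def syllables_to_characters(syllables):
    chars = []
    for s in syllables:
        candidates = pronunciation_dict[s]
        # select the candidate character with the highest LM score
        chars.append(max(candidates, key=lambda ch: lm_score[ch]))
    return "".join(chars)

entity_word = syllables_to_characters(["ping2", "guo3"])
```

A production decoder would score whole character sequences with the language model rather than each syllable independently; the per-syllable maximum is used here only to keep the sketch short.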
  • with the speech recognition method introduced in this embodiment, when facing a newly appearing entity word, it is only necessary to update the pronunciation dictionary and the language model, that is, it is only necessary to expand the decoding path, and there is no need to iteratively update the speech recognition model.
  • the solution has better scalability, lower learning cost, and will not cause catastrophic forgetting problems caused by updating the speech recognition model.
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the correspondence between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
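The update path for a newly added domain entity word can be sketched as below. The data structures and the example entry are assumptions for illustration, not the patent's implementation; the point is that only the dictionary and the language-model vocabulary change, not the model weights.

```python
# Hypothetical sketch: registering a new domain entity word by updating the
# pronunciation dictionary and the language-model vocabulary only.

pronunciation_dict = {}     # entity word -> its syllable (or phoneme) sequence
language_model_vocab = set()  # stand-in for the language model's entity words

def add_domain_entity_word(word, syllables):
    pronunciation_dict[word] = syllables   # update the pronunciation dictionary
    language_model_vocab.add(word)         # update the language model

# example placeholder entry
add_domain_entity_word("new_place_name", ["syl1", "syl2"])
```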
  • the second recognition stage of the speech recognition method of the present application, in combination with the pronunciation dictionary and the language model, obtains the entity word characters corresponding to the entity word category label based on the speech fragments corresponding to the entity word category label; that is, the second recognition stage only needs to perform entity word recognition, and does not need to perform non-entity word recognition.
  • the language model can be configured as a domain entity language model.
  • the domain entity language model can be constructed based on various domain entity words, that is, the domain entity language model only contains entity words.
  • the above embodiment introduces a method for performing speech recognition in a two-stage recognition manner, wherein the above two speech recognition stages can be implemented by a variety of different means.
  • the first stage speech recognition can be performed through a pre-trained speech recognition system, that is, the speech to be recognized is input into the pre-trained speech recognition system, and the decoding output is a preliminary recognition text consisting of entity word category labels and remaining non-entity word characters.
  • the second-stage speech recognition can be performed through a pre-trained speech recognition model. That is, the speech recognition model can use syllables or phonemes as modeling units, combined with a preset pronunciation dictionary and language model, to model the speech fragments corresponding to the entity word category labels in the speech to be recognized, and obtain the entity word characters corresponding to the entity word category labels.
  • the entity word characters corresponding to the entity word category labels obtained in the second recognition stage replace the corresponding entity word category labels in the preliminary recognition text obtained in the first recognition stage to obtain the final recognition text.
  • the embodiment of the present application also provides another optional implementation method of the above two-stage speech recognition.
  • the present application can pre-train a speech recognition model.
  • the recognition model implements the above two-stage speech recognition process.
  • a speech recognition model may be pre-trained, and the speech recognition model may be configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • entity word characters corresponding to the entity word category labels are obtained based on the speech fragments corresponding to the entity word category labels.
  • the corresponding entity word category labels in the preliminary recognition text are replaced by the entity word characters to obtain the recognition text that is finally output.
  • the speech to be recognized is input into the speech recognition model configured as above to obtain the final recognition text output by the speech recognition model.
  • the speech recognition model divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words in the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types. Specifically, when a new domain entity word appears, the newly emerging domain entity word can be added to the preset pronunciation dictionary and language model, so as to ensure the recognition accuracy of the newly emerging domain entity word.
  • the speech recognition model introduced in this embodiment may include an encoder, a first-level decoder (Decoder1), a second-level decoder (Decoder2), and an output layer (not shown in FIG. 2).
  • Encoder used to encode the input speech to be recognized to obtain acoustic coding features.
  • the input of the encoder can be the speech features of the speech to be recognized, such as the amplitude spectrum features, etc. Taking the amplitude spectrum features as an example, it can be the log filter bank energy (LFBE).
  • the encoder is used to extract the representation of the speech features of the speech to be recognized.
  • the encoder encodes the speech features of the speech to be recognized to obtain acoustic coding features.
  • the encoder may adopt a convolutional neural network, LSTM or Transformer structure.
  • a first-level decoder is used to decode, using characters as modeling units and based on the acoustic coding features, to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the first-level decoder can use characters as modeling units for end-to-end modeling.
  • syllables or phonemes can be used as modeling units, and then the pronunciation dictionary and language model are combined to obtain preliminary recognition text.
  • in the example speech recognition model of FIG. 2, only the character modeling unit is used as an example for explanation.
  • the first-level decoder uses characters as modeling units for decoding, and can directly decode to obtain preliminary recognition text. For entity words contained in the speech to be recognized, the first-level decoder decodes them into entity word category labels. For non-entity words in the speech to be recognized, the first-level decoder decodes them normally into corresponding characters, and finally obtains the preliminary recognition text output by the first-level decoder.
  • one entity word can correspond to one entity word category label.
  • the same number of entity word category labels can be obtained according to the number of characters contained in the entity word, that is, one entity word can correspond to multiple identical entity word category labels.
  • the preliminary recognition text output by the first-level decoder can be: listen to <singer>'s song.
  • the preliminary recognition text output by the first-level decoder can also be: listen to <singer><singer><singer>'s song, that is, one <singer> label for each character of the entity word.
  • the first-level decoder in this embodiment can adopt a network structure with attention mechanism and autoregression, such as transformer, LSTM and other network structures.
  • the first-level decoder uses characters as modeling units, and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • the first-level decoder, when decoding, can refer to the acoustic coding feature c_t and the real-time state feature of the first-level decoder at the same time, where the real-time state feature of the first-level decoder can be understood as the contextual language feature; that is, when the first-level decoder decodes word by word, it can refer to the characters decoded at the previous moment to guide the decoding process at the current moment, taking into account the contextual association of the text, so that the decoding result is more accurate.
  • a secondary decoder is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features of the speech fragments corresponding to the entity word category labels, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category labels.
  • the second-level decoder decodes the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label. It can be understood that if the second-level decoder uses syllables as modeling units, the syllable corresponding to the entity word category label is decoded here; if the second-level decoder uses phonemes as modeling units, the phoneme corresponding to the entity word category label is decoded here.
  • the secondary decoder can weaken the modeling of context relevance, that is, different from the way in which the primary decoder decodes by referring to both the acoustic coding features and the real-time state features of the primary decoder, the secondary decoder can decode based only on the acoustic coding features.
  • after the secondary decoder decodes and obtains the syllable or phoneme corresponding to the entity word category label, it can convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label.
  • <device name> refers to the category label of the entity word.
  • the second-level decoder dec2 further decodes each entity word category label output by the first-level decoder to obtain the corresponding syllable: yu3se4shou3ji1.
  • Each syllable is converted into a corresponding character through the pronunciation dictionary and language model: ⁇ .
  • the characters corresponding to "yu3" in the syllable pronunciation dictionary include "⁇", "⁇", and "⁇".
  • the language model scores of the three characters in the language model are 2, 3, and 1, respectively. Therefore, the character "⁇" with the highest language model score is selected as the final decoding character corresponding to "yu3".
  • An output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final output recognition text.
  • the initial recognition text is: I bought a <device name><device name><device name><device name>.
  • the entity word characters obtained by the secondary decoder are "⁇".
  • the entity word category labels in the initial recognition text are replaced with "⁇".
  • the final output recognition text is: I bought a ⁇.
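The output layer's replacement step above can be sketched as a string substitution that collapses a run of identical category labels into the entity word decoded by the secondary decoder. The label "<device name>" comes from the example; the replacement word "PhoneX" is a hypothetical placeholder for the decoded entity word characters.

```python
# Sketch of the output layer: replace one or more consecutive copies of an
# entity word category label with the decoded entity word characters.
import re

def replace_labels(preliminary_text, label, entity_chars):
    # collapse a run of identical labels into a single entity word
    pattern = "(?:" + re.escape(label) + ")+"
    return re.sub(pattern, entity_chars, preliminary_text)

text = "I bought a <device name><device name><device name><device name>."
final_text = replace_labels(text, "<device name>", "PhoneX")
```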
  • the process of decoding the above-mentioned first-level decoder to obtain the preliminary recognition text is introduced.
  • the first-level decoder can use characters as modeling units, and can directly decode characters during decoding.
  • the first-level decoder can use a network structure with an attention mechanism, and can refer to acoustic information and language information at the same time during decoding to improve decoding accuracy.
  • the first-level decoder takes the attention degree of each frame's acoustic coding feature when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame, and obtains the acoustic coding feature c_t when decoding the t-th character.
  • based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded, until all characters are decoded to obtain the preliminary recognition text consisting of the entity word category labels and the remaining non-entity word characters.
  • the output of the first-level decoder can be calculated by referring to the following formula:
  • dec_1 = argmax(W_1[d_t ; c_t])
  • y_{t-1} is the previously decoded character
  • c_{t-1} is the acoustic coding feature when the first-level decoder decodes the previous character
  • d_t is the state feature of the first-level decoder when decoding the t-th character (here the first-level decoder adopts the LSTM structure as an example for explanation), which can also be understood as the language feature when decoding the t-th character
  • a_tj is the normalized attention to the acoustic coding feature of the j-th frame when decoding the t-th character
  • W_q, W_k, V, W_1 are network parameters
  • h_j is the acoustic coding feature of the j-th frame
  • dec_1 is the output of the first-level decoder.
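The dec_1 computation above can be illustrated with a toy numeric sketch: the attention weights a_tj (assumed here to be already computed and normalized) give the context vector c_t as a weighted sum of frame features h_j, which is concatenated with the state d_t and projected by W_1 before the argmax. All values are invented for illustration.

```python
# Toy sketch of dec_1 = argmax(W_1[d_t ; c_t]) with c_t = sum_j a_tj * h_j.
# h: per-frame acoustic features; a_t: normalized attention weights;
# d_t: decoder state; W1: output projection (one row per vocabulary unit).

def dec1_output(h, a_t, d_t, W1):
    dim = len(h[0])
    # attention-weighted sum of frame features -> context vector c_t
    c_t = [sum(a_t[j] * h[j][k] for j in range(len(h))) for k in range(dim)]
    v = d_t + c_t                       # concatenation [d_t ; c_t]
    logits = [sum(w * x for w, x in zip(row, v)) for row in W1]
    return max(range(len(logits)), key=lambda i: logits[i])  # argmax

unit = dec1_output([[1.0, 0.0], [0.0, 1.0]],        # two frames of features
                   [0.5, 0.5],                      # attention weights
                   [1.0, 0.0],                      # decoder state d_t
                   [[1.0, 0.0, 0.0, 0.0],           # W_1 rows = 2-unit vocab
                    [0.0, 0.0, 2.0, 2.0]])
```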
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the second-level decoder needs to weaken the dependence between adjacent characters, and can therefore rely on acoustic coding features alone.
  • specifically, the secondary decoder can use only the acoustic coding features at the position where the entity word category label was decoded to decode the syllable or phoneme corresponding to that label.
  • the output of the secondary decoder is expressed as:
  • dec_2 = argmax(W_2 c_t)
  • dec_2 represents the output of the secondary decoder, and W_2 is a network parameter.
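The second-level decoding step can be sketched in the same style. Unlike the first-level decoder, it uses only the acoustic context c_t, with no language-state input; W_2 and the syllable inventory size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_syllables, d_enc = 6, 8                 # toy syllable inventory and feature size
c_t = rng.normal(size=d_enc)              # acoustic feature at the label position
W_2 = rng.normal(size=(d_enc, n_syllables))

# dec_2 = argmax(W_2 c_t): the most likely syllable/phoneme id, decoded
# purely from acoustics, without conditioning on surrounding characters.
dec2 = int(np.argmax(c_t @ W_2))
```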
  • the training process of the above-mentioned speech recognition model is further introduced.
  • This embodiment illustrates an optional training process of a speech recognition model, which may include the following steps:
  • domain entity words such as names of people, places, institutions, singers, songs, etc.
  • named entity recognition (NER)
  • the entity words obtained above are limited and it is difficult to cover all categories of entity words. Therefore, the entity words and their category labels obtained above can be used to train a classification neural network model to determine the category of entity words contained in the input text.
  • the classification neural network can use a model with context modeling capabilities, such as Transformer, LSTM, etc.
  • the category labels of entity words in the training corpus text can be determined.
  • the training corpus text may be the recognition text corresponding to the collected training speech.
  • the training sample labels of the first-level decoder are constructed, that is, the entity words in the recognition text are replaced with the corresponding entity word category labels to obtain the edited recognition text.
  • for example, the recognized text is "I bought an Apple phone", which contains the entity word "Apple"; the corresponding category label is <Organization Name>, so the edited recognition text can be "I bought a <Organization Name> mobile phone".
  • the secondary decoder can perform secondary decoding on each entity word category label to obtain the corresponding characters, and these characters together constitute the complete entity word.
  • the corresponding entity words in the recognized text are replaced with the category labels of the entity words to obtain the edited recognized text, which may specifically include:
  • the recognized text after editing may be "I bought a <Organization Name><Organization Name> mobile phone", with one category label per character of the entity word.
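The label-editing step for the first-level training targets can be sketched as below. The Chinese sentence mirrors the patent's "Apple phone" example; the label name "<ORG>" and the span format are illustrative assumptions.

```python
def edit_recognition_text(chars, entities):
    """chars: list of characters of the recognized text.
    entities: list of (start, end, label) spans, end exclusive.
    Returns the edited character list where every character of an entity
    word is replaced by that entity's category label (one label per
    character, so the label count matches the character count)."""
    out = list(chars)
    for start, end, label in entities:
        for i in range(start, end):
            out[i] = label
    return out

chars = list("我买了苹果手机")          # "I bought an Apple phone"
entities = [(3, 5, "<ORG>")]           # "苹果" (Apple) -> organization label
edited = edit_recognition_text(chars, entities)
# edited == ['我', '买', '了', '<ORG>', '<ORG>', '手', '机']
```

The per-character replacement keeps the edited text the same length as the original, so the first-level decoder's targets stay aligned with the speech frames.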
  • a first loss function can be calculated by a cross entropy loss function, and the first loss function represents the decoding loss of the primary decoder.
  • a second loss function can be calculated through a cross-entropy loss function based on the entity word characters corresponding to the entity word category labels output by the secondary decoder and the original entity words corresponding to the entity word category labels in the recognized text.
  • the second loss function represents the decoding loss of the secondary decoder.
  • the total loss function is calculated by combining the first loss function and the second loss function, and the network parameters of the speech recognition model are trained based on the total loss function until the training end condition is met, thereby obtaining the final trained speech recognition model.
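The loss combination above can be sketched numerically. The cross-entropy form follows the text; the equal weighting of the two losses and all shapes are assumptions (the source only states that the two losses are combined).

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy over a sequence of logit rows."""
    logits = logits - logits.max(axis=1, keepdims=True)      # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_ids)), target_ids])

rng = np.random.default_rng(2)
dec1_logits = rng.normal(size=(7, 10))       # first-level decoder outputs
dec1_targets = rng.integers(0, 10, size=7)   # edited-text character/label ids
dec2_logits = rng.normal(size=(2, 6))        # second-level decoder outputs
dec2_targets = rng.integers(0, 6, size=2)    # entity syllable/phoneme ids

loss1 = cross_entropy(dec1_logits, dec1_targets)   # first loss function
loss2 = cross_entropy(dec2_logits, dec2_targets)   # second loss function
total_loss = loss1 + loss2                          # combined training loss
```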
  • FIG. 5 is a schematic diagram of the structure of a speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • a preliminary recognition text determination unit 12 is used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the preliminary recognition text determination unit may use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods may be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then converting the decoded syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain preliminary recognition text.
  • the final recognition text determination unit 13 is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • the above-mentioned final recognition text determination unit can use syllables or phonemes as modeling units, and based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, model the entity word characters corresponding to the entity word category labels.
  • the processing of the preliminary recognition text determination unit and the final recognition text determination unit can be implemented by a model processing unit, which is used to process the speech data to be recognized using a preconfigured speech recognition model to obtain the recognition text output by the model, wherein the speech recognition model is configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • entity word characters corresponding to the entity word category labels are modeled. The entity word characters are used to replace the corresponding entity word category labels in the preliminary recognition text to obtain the final output recognition text.
  • the device of the present application may further include:
  • the pronunciation dictionary and language model updating unit is used to determine the syllable or phoneme corresponding to the domain entity word when a newly added domain entity word is obtained, and add the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and add the domain entity word to the language model.
  • the above-mentioned language model may be a language model constructed based on entity words in various fields.
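The update path for a newly added domain entity word, as described above, touches only the pronunciation dictionary and the entity language model, never the recognition model itself. A minimal sketch, with purely illustrative data structures:

```python
# Toy stand-ins for the preset pronunciation dictionary and the entity
# language model's vocabulary; real systems would use richer structures.
pronunciation_dict = {"苹果": ["ping2", "guo3"]}   # word -> syllables
entity_lm_vocab = {"苹果"}

def add_domain_entity_word(word, syllables):
    """Register a newly added domain entity word: add its pronunciation to
    the dictionary and the word itself to the entity language model,
    without retraining or updating the speech recognition model."""
    pronunciation_dict[word] = syllables
    entity_lm_vocab.add(word)

add_domain_entity_word("讯飞", ["xun4", "fei1"])
```

Because the acoustic model and decoders are untouched, this avoids the catastrophic-forgetting risk of iterative model updates that the background section describes.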
  • the speech recognition model may include an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to decode the characters as modeling units based on the acoustic coding features to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder may adopt a network structure with an attention mechanism and autoregression. Specifically, taking characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, decoding to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, the process may include:
  • the first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs weighted summation on the acoustic coding features of each frame, and obtains the acoustic coding feature c t when decoding the t-th character. Based on the acoustic coding feature c t when decoding the t-th character and the state feature d t of the first-level decoder when decoding the t-th character, the t-th character is decoded until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, which may include:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the device of the present application may further include:
  • the model training unit is used to train the speech recognition model.
  • the training process may include:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
  • the model training unit replaces the corresponding entity word in the recognized text with the category label of the entity word to obtain the edited recognized text, which may include:
  • another speech recognition method is further provided. As shown in FIG. 6 , the method may include:
  • Step S200 Acquire speech to be recognized.
  • Step S210: using the primary recognition module of the pre-configured speech recognition model, obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the remaining non-entity word characters.
  • a speech recognition model is pre-configured.
  • the speech recognition model includes a two-level framework, namely a primary recognition module and a secondary recognition module.
  • the first-level recognition module can decode the input speech to be recognized to obtain a preliminary recognition text composed of entity word category labels and other non-entity word characters.
  • the primary recognition module can use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods can also be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain preliminary recognition text.
  • Step S220 using the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, obtain the entity word character corresponding to the entity word category label.
  • the secondary recognition module in the speech recognition model can use syllables or phonemes as modeling units to perform secondary decoding on the speech segments corresponding to the entity word category labels decoded by the primary recognition module to obtain decoded syllables or phonemes. Further, in combination with a preset pronunciation dictionary and language model, the decoded syllables or phonemes are converted into character form to obtain entity word characters corresponding to the entity word category labels.
  • Step S230 Replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • the pre-configured speech recognition model includes a two-level recognition module.
  • the first-level recognition module recognizes the entity words in the speech to be recognized as corresponding category labels, and directly recognizes non-entity words as characters to obtain an initial recognition text. In this way, the probability of low-frequency entity words or newly appeared entity words in the same category label being mistakenly recognized as characters not under the category label can be greatly reduced, and the recognition accuracy of entity words can be improved.
  • the second-level recognition module only needs to recognize the entity words corresponding to the entity word category label. Finally, the entity word characters recognized by the second-level recognition module replace the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
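The final assembly described above (map the decoded syllables back to an entity word via the pronunciation dictionary, then substitute it for the category label in the preliminary text) can be sketched as follows. The reverse-dictionary lookup is a simplification: in a real system the language model would disambiguate among homophonous entries. All data here is illustrative.

```python
# Illustrative reverse pronunciation dictionary: syllable sequence -> word.
pronunciation_dict = {("ping2", "guo3"): "苹果"}

def finalize(preliminary, label, syllables):
    """Replace the first occurrence of the entity word category label in the
    preliminary recognition text with the entity word looked up from the
    second-level module's decoded syllables."""
    word = pronunciation_dict[tuple(syllables)]
    return preliminary.replace(label, word, 1)

final_text = finalize("我买了<ORG>手机", "<ORG>", ["ping2", "guo3"])
# final_text == "我买了苹果手机"  ("I bought an Apple phone")
```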
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the corresponding relationship between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
  • the primary recognition module may include: an encoder and a primary decoder.
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features.
  • the first-level decoder is used to use characters as modeling units and decode the initial recognition text composed of entity word category labels and other non-entity word characters based on the acoustic coding features.
  • the first-level decoder can adopt a network structure with an attention mechanism and autoregression, that is, when the first-level decoder decodes, it can use characters as modeling units, and decode to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • the specific implementation process can refer to the relevant introduction above, which will not be repeated here.
  • the secondary recognition module may include: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
  • the secondary recognition module may also include an output layer, which is used to replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters obtained by the secondary recognition module, and output the final recognition text.
  • the speech recognition device provided in an embodiment of the present application is described below.
  • the speech recognition device described below and the second speech recognition method described above can be referenced to each other.
  • FIG. 7 is a schematic diagram of the structure of another speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • the to-be-recognized speech acquisition unit 21 is used to acquire the to-be-recognized speech
  • the speech recognition model processing unit 22 is used to use the primary recognition module of the preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, and the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • composition structure of the speech recognition model can refer to the introduction of the aforementioned speech recognition method part, which will not be repeated here.
  • FIG. 8 shows a hardware structure block diagram of a speech recognition device.
  • the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4;
  • the processor 1 may be a central processing unit CPU, or an application specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used to: implement each step of the speech recognition method introduced in the above embodiments.
  • An embodiment of the present application further provides a storage medium, which can store a program suitable for execution by a processor, wherein the program is used to implement the various steps of the speech recognition method introduced in the aforementioned embodiments.


Abstract

Disclosed in the present application are a speech recognition method, apparatus and device, and a storage medium. The method comprises: on the basis of a speech to be recognized, obtaining a preliminary recognized text consisting of an entity word category label and characters of the remaining non-entity words; further, on the basis of a speech segment corresponding to the entity word category label and a preset pronunciation dictionary and a language model, obtaining entity word characters corresponding to the entity word category label; and replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters so as to obtain a final recognized text. Thus, when a new domain entity word appears, only the pronunciation dictionary and the language model need to be updated, and iterative updating of a speech recognition model is not needed, thus making the learning cost lower, avoiding the catastrophic forgetting problem caused by updating speech recognition models, and ensuring the recognition accuracy of newly-appearing domain entity words.

Description

Speech recognition method, device, equipment and storage medium

Technical Field

This application claims priority to a domestic application filed with the China Patent Office on December 12, 2022, with application number 202211589720.7 and invention name "Speech Recognition Method, Device, Equipment and Storage Medium", the entire contents of which are incorporated by reference in this application.
Background Technique

With the development of artificial intelligence and deep learning, speech recognition technology has been widely used, covering all areas of human-computer interaction. The core difficulty of domain speech recognition lies in the existence of a large number of domain-specific entity words. Domain-specific entity words, especially low-frequency ones, usually appear rarely in the training data of speech recognition models, and the domain-specific entity vocabulary is constantly updated; for example, in voice navigation applications, new company names and place names keep appearing. These characteristics of domain-specific entity words mean that, in practical applications, the speech recognition system needs to be continuously updated to maintain a high accuracy rate in domain speech recognition.

In order to meet the recognition rate requirements for newly emerging domain-specific entity words, existing methods usually need to record or synthesize sentences containing the domain-specific entity words and use them to update the speech recognition model. For example: first, rules or a trained context-expansion model are used to construct a large number of different context texts from the text of the current domain entity word; for instance, after a new song A appears, context texts such as "Play me song A" and "I want to listen to the new song A" need to be constructed. Next, a speech synthesis model synthesizes the speech corresponding to these texts, and data augmentation operations such as adding noise, adding reverberation, and timbre conversion are applied to the speech. Finally, this corpus is used to update the current speech recognition model through iterative learning. The resulting new model can usually improve the recognition accuracy of the newly added domain entity words.

However, this approach also has disadvantages, for example:

First, the existing technology requires continuous update learning of the speech recognition model, so the whole process is time-consuming, labor-intensive and costly.

Second, the improvement that the existing technology achieves in the recognition accuracy of newly added domain entity words is unstable. The magnitude of the improvement is highly dependent on the constructed training corpus; for context phrasings that were not constructed, the improvement in recognition accuracy is usually very limited.

Third, it is difficult for the existing technology to achieve incremental learning, that is, it is difficult to ensure that the recognition accuracy of the updated speech recognition model on existing domain entity words does not decrease. Catastrophic forgetting has long been an open problem in machine learning: since the speech recognition model is updated on a new corpus, it will to some extent forget the previous training data, and especially after multiple updates the catastrophic forgetting problem becomes particularly serious, learning the new while forgetting the old.
Summary of the Invention

In view of the above problems, this application is proposed to provide a speech recognition method, apparatus, device and storage medium, so as to ensure the recognition accuracy of newly appearing domain entity words without the need to update the speech recognition model. The specific solution is as follows:

In a first aspect, a speech recognition method is provided, comprising:

acquiring speech to be recognized;

obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the characters of the remaining non-entity words;

obtaining, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text.

Preferably, the processes of obtaining the preliminary recognition text based on the speech to be recognized, obtaining the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text are implemented by a preconfigured speech recognition model.

Preferably, the method further includes:

when a newly added domain entity word is obtained, determining the syllable or phoneme corresponding to the domain entity word, adding the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and adding the domain entity word to the language model.
Preferably, the language model is a language model constructed based on entity words in various fields.

Preferably, the speech recognition model includes an encoder, a first-level decoder, a second-level decoder and an output layer;

the encoder is used to encode the input speech to be recognized to obtain acoustic coding features;

the first-level decoder is used to take characters as modeling units and, based on the acoustic coding features, decode to obtain a preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words;

the second-level decoder is used to take syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label;

the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final output recognition text.

Preferably, the process in which the first-level decoder, taking characters as modeling units and based on the acoustic coding features, decodes to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words includes:

the first-level decoder, taking characters as modeling units, decodes to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words based on the acoustic coding features and the real-time state features of the first-level decoder.

Preferably, the process in which the first-level decoder, taking characters as modeling units, decodes to obtain the preliminary recognition text based on the acoustic coding features and the real-time state features of the first-level decoder includes:

the first-level decoder takes characters as modeling units, uses the attention paid to each frame of acoustic coding features when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame to obtain the acoustic coding feature c_t when decoding the t-th character, and decodes the t-th character based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, until all characters are decoded to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words.

Preferably, the process in which the second-level decoder, taking syllables or phonemes as modeling units, decodes the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label includes:

the second-level decoder takes syllables or phonemes as modeling units and decodes the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features when the first-level decoder decodes the entity word category label.
优选地,所述语音识别模型的训练过程,包括:Preferably, the training process of the speech recognition model includes:
获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签;Obtaining training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words;
利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本;Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text;
将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符;Input the training speech into the speech recognition model to obtain the preliminary recognition text output by the first-level decoder and the entity word characters corresponding to the entity word category labels output by the second-level decoder;
基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数;Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels;
结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。The network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
Preferably, replacing the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text includes:
determining the number of characters contained in each entity word, and replacing the entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
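This editing step can be sketched as follows (the angle-bracket label format and the annotation dictionary are assumptions made for illustration):

```python
def edit_recognized_text(text, entities):
    """Replace each annotated entity word with one category label per
    character, e.g. the two-character entity "张三" of class 歌手 (singer)
    becomes "<歌手><歌手>".

    `entities` maps each entity word in `text` to its category label;
    this annotation format is a simplifying assumption.
    """
    for word, label in entities.items():
        text = text.replace(word, label * len(word))
    return text

# "听张三的歌" -> "听<歌手><歌手>的歌"
edited = edit_recognized_text("听张三的歌", {"张三": "<歌手>"})
```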
In a second aspect, a speech recognition method is provided, including:
obtaining speech to be recognized;
obtaining, by a primary recognition module of a preconfigured speech recognition model, a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words;
obtaining, by a secondary recognition module of the speech recognition model, the entity word characters corresponding to each entity word category label, based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model;
replacing the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the final recognized text.
Preferably, the primary recognition module includes:
an encoder and a primary decoder;
the encoder is configured to encode the speech to be recognized to obtain acoustic coding features;
the primary decoder is configured to use characters as modeling units and, based on the acoustic coding features, decode a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Preferably, the secondary recognition module includes a secondary decoder, configured to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to each entity word category label based on the acoustic coding features of the speech segment corresponding to that label, and convert the syllables or phonemes into characters by means of a preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category label.
In a third aspect, a speech recognition apparatus is provided, including:
a to-be-recognized speech obtaining unit, configured to obtain speech to be recognized;
a preliminary recognized text determination unit, configured to obtain a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words;
a final recognized text determination unit, configured to obtain the entity word characters corresponding to each entity word category label based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model, and to replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters, obtaining the final recognized text.
In a fourth aspect, a speech recognition apparatus is provided, including:
a to-be-recognized speech obtaining unit, configured to obtain speech to be recognized;
a speech recognition model processing unit, configured to: obtain, by a primary recognition module of a preconfigured speech recognition model, a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words; obtain, by a secondary recognition module of the speech recognition model, the entity word characters corresponding to each entity word category label, based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model; and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the final recognized text.
In a fifth aspect, a speech recognition device is provided, including a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech recognition method described above.
In a sixth aspect, a storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the speech recognition method described above are implemented.
By means of the above technical solutions, the present application divides the speech recognition process into two stages. In the first recognition stage, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words is obtained from the speech to be recognized. In the second recognition stage, the entity word characters corresponding to each entity word category label are obtained based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model, and the corresponding entity word category labels in the preliminary recognized text are replaced with the entity word characters to obtain the finally output recognized text. Because the first recognition stage recognizes the entity words in the speech as their category labels while non-entity words are recognized directly as characters, the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label is greatly reduced, improving the recognition accuracy of entity words. In the second recognition stage, the entity words corresponding to the entity word category labels are predicted by combining the pronunciation dictionary and the language model.
When a new domain entity word appears, it can be added to the preset pronunciation dictionary and language model, which guarantees the recognition accuracy for newly appearing domain entity words.
Furthermore, when new domain entity words appear, the present solution only needs to update the pronunciation dictionary and the language model; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
BRIEF DESCRIPTION OF THE DRAWINGS
Various other advantages and benefits will become apparent to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The accompanying drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech recognition model;
FIG. 3 is a schematic diagram of the two-stage decoding process of a speech recognition model;
FIG. 4 is a schematic diagram of a process of determining decoded characters by combining a pronunciation dictionary and a language model;
FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another speech recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The present application provides a speech recognition solution applicable to various speech recognition scenarios, in particular domain entity word speech recognition, where it guarantees a high recognition accuracy for newly appearing domain entity words.
The solution of the present application can be implemented on a terminal with data processing capability, such as a mobile phone, a computer, a server, or a cloud platform.
Referring to FIG. 1, the speech recognition method of the present application may include the following steps:
Step S100: obtain speech to be recognized.
Step S110: obtain, based on the speech to be recognized, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Here, an entity word category label is a preset domain category to which an entity word belongs, such as person name, place name, organization name, singer, song, drug name, or film and television title.
The speech recognition method provided in this embodiment divides the speech recognition process into two stages. In the first recognition stage, the entity words in the speech to be recognized are recognized as their corresponding category labels, while non-entity words are recognized directly as characters, yielding the preliminary recognized text. This greatly reduces the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label, improving the recognition accuracy of entity words.
The preliminary recognized text in this step can be obtained by end-to-end modeling, that is, with characters as modeling units, the preliminary recognized text is decoded directly from the speech to be recognized. Alternatively, other modeling schemes may be used: for example, with syllables or phonemes as modeling units, the corresponding syllables or phonemes are decoded from the speech to be recognized and then converted into characters by a preset pronunciation dictionary and language model, yielding the preliminary recognized text.
Step S120: combine a preset pronunciation dictionary and language model to model the entity word characters corresponding to each entity word category label, and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters, obtaining the final recognized text.
Specifically, in the second recognition stage, the entity word characters corresponding to an entity word category label can be modeled from the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model. The pronunciation dictionary and language model may cover both existing and newly appearing entity words of all types.
In this embodiment, the second recognition stage may choose syllables or phonemes as modeling units, modeling the entity word characters corresponding to the entity word category label from the speech segment corresponding to that label in the speech to be recognized together with the preset pronunciation dictionary and language model.
When syllables are used as modeling units, the corresponding pronunciation dictionary may be a syllable pronunciation dictionary containing the correspondence between syllables and characters. When phonemes are used as modeling units, the corresponding pronunciation dictionary may be a phoneme pronunciation dictionary containing the correspondence between phonemes and characters.
With syllables or phonemes as modeling units, the corresponding syllables or phonemes can be modeled from the speech segment corresponding to the entity word category label in the speech to be recognized. Then, the candidate characters corresponding to each syllable or phoneme are determined from the pronunciation dictionary, the probability score of each candidate character is determined by the language model, and the candidate character with the highest probability score is selected as the entity word character corresponding to the entity word category label.
With the speech recognition method of this embodiment, a newly appearing entity word only requires updating the pronunciation dictionary and language model, that is, only the decoding paths need to be extended; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
Further, to guarantee the recognition accuracy of this method for newly added domain entity words, when a new domain entity word is obtained, its syllables or phonemes can be determined, and the correspondence between the new domain entity word and its syllables or phonemes is added to the preset pronunciation dictionary, completing the update of the pronunciation dictionary. At the same time, the new domain entity word is added to the preset language model, completing the update of the language model.
Optionally, in the second recognition stage of the speech recognition method of the present application, the entity word characters corresponding to an entity word category label are modeled from the speech segment corresponding to that label, in combination with the pronunciation dictionary and language model; that is, the second recognition stage only needs to recognize entity words, not non-entity words. To improve the accuracy of entity word recognition, the language model in this embodiment may be configured as a domain entity language model; specifically, the domain entity language model may be built from the entity words of each domain, i.e., it contains only entity words.
On this basis, when this domain entity language model is used to convert syllables or phonemes into characters, it never converts to the characters of non-entity words, greatly improving the accuracy of entity word recognition.
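A minimal sketch of such an update, assuming the decoding resources are held as in-memory mappings (the data structures and the flat unigram score are illustrative, not the application's actual storage format):

```python
def add_domain_entity(word, syllables, lexicon, lm_unigrams, score=1.0):
    """Register a newly appearing domain entity word.

    `lexicon` maps each syllable to the characters it can realize;
    `lm_unigrams` holds per-character language-model scores. Only these
    decoding resources are extended; the acoustic model and the two
    decoders are left untouched.
    """
    for ch, syl in zip(word, syllables):
        lexicon.setdefault(syl, set()).add(ch)
        lm_unigrams[ch] = max(lm_unigrams.get(ch, 0.0), score)

# Register the new entity word "宇色手机" used in the later example.
lexicon, lm = {}, {}
add_domain_entity("宇色手机", ["yu3", "se4", "shou3", "ji1"], lexicon, lm)
```

Because only the dictionary and language-model entries change, the decoding paths are extended without retraining the speech recognition model.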
The above embodiment describes a speech recognition method based on two-stage recognition; the two recognition stages can be implemented by a variety of different means.
For example, in the first recognition stage, a pre-trained speech recognition system may perform the first-stage recognition: the speech to be recognized is input into the pre-trained system, which decodes and outputs a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
In the second recognition stage, a pre-trained speech recognition model may perform the second-stage recognition: with syllables or phonemes as modeling units and in combination with a preset pronunciation dictionary and language model, the model obtains the entity word characters corresponding to each entity word category label from the speech segment corresponding to that label in the speech to be recognized.
Finally, the entity word characters obtained in the second recognition stage replace the corresponding entity word category labels in the preliminary recognized text obtained in the first recognition stage, yielding the final recognized text.
In addition, an embodiment of the present application provides another optional implementation of the above two-stage speech recognition. Specifically, a single speech recognition model may be pre-trained to carry out both stages of the speech recognition process.
Specifically, in this embodiment a speech recognition model may be pre-trained and configured as follows:
decode, from the speech to be recognized, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words; with syllables or phonemes as modeling units and in combination with a preset pronunciation dictionary and language model, model the entity word characters corresponding to each entity word category label from the speech segment corresponding to that label; and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the finally output recognized text.
On this basis, the implementation of steps S110-S120 above includes:
inputting the speech to be recognized into the speech recognition model configured as above, and obtaining the final recognized text output by the model.
Here, the speech recognition model divides the speech recognition process into two stages. The first recognition stage recognizes the entity words in the speech to be recognized as their corresponding category labels, while non-entity words are recognized directly as characters, yielding the preliminary recognized text. This greatly reduces the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label, improving the recognition accuracy of entity words.
In the second recognition stage, with syllables or phonemes as modeling units, the pronunciation dictionary and language model are combined to predict the entity word characters corresponding to the entity word category labels. The pronunciation dictionary and language model may cover both existing and newly appearing entity words of all types; specifically, when a new domain entity word appears, it can be added to the preset pronunciation dictionary and language model, guaranteeing the recognition accuracy for newly appearing domain entity words.
When speech recognition is performed with the speech recognition model provided in this embodiment, a newly appearing entity word only requires updating the pronunciation dictionary and language model, that is, only the decoding paths need to be extended; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
Some embodiments of the present application describe the structure of the above speech recognition model.
As shown in FIG. 2, the speech recognition model of this embodiment may include an encoder (Encoder), a primary decoder (Decoder1), a secondary decoder (Decoder2), and an output layer (not shown in FIG. 2), where:
1. The encoder is used to encode the input speech to be recognized to obtain acoustic coding features.
Specifically, the input of the encoder may be the speech features of the speech to be recognized, such as amplitude spectrum features; taking amplitude spectrum features as an example, these may be log filter bank energies (LFBE). The encoder extracts a representation of the speech features of the speech to be recognized: it encodes the speech features and obtains the acoustic coding features.
The encoder in this embodiment may adopt a convolutional neural network, LSTM, or Transformer structure. It can be expressed as h = f(x), where h = [h1, h2, ..., hT] are the acoustic coding features output by the encoder, x = [x1, x2, ..., xT] are the input speech features of the speech to be recognized, and T is the number of frames of the speech to be recognized.
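At the level of shapes, h = f(x) maps one feature vector per frame to one acoustic encoding per frame. The toy frame-wise transform below (a single tanh layer with assumed toy dimensions) only illustrates these shapes, not the CNN/LSTM/Transformer structures actually proposed:

```python
import math
import random

random.seed(0)
T, d_in, d_model = 5, 8, 4   # frames, LFBE dim, encoder dim (assumed toy sizes)
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]     # x = [x1..xT]
W = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_in)]

# h = f(x): one acoustic encoding vector h_t per input frame x_t
h = [[math.tanh(sum(xt[i] * W[i][j] for i in range(d_in)))
      for j in range(d_model)]
     for xt in x]
assert len(h) == T and len(h[0]) == d_model
```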
2. The primary decoder is used to take characters as modeling units and, based on the acoustic coding features, decode a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Optionally, the primary decoder may perform end-to-end modeling with characters as modeling units. Alternatively, syllables or phonemes may be used as modeling units, with a pronunciation dictionary and language model then producing the preliminary recognized text. The speech recognition model illustrated in FIG. 2 takes character modeling units as an example.
Decoding with characters as modeling units, the primary decoder can directly produce the preliminary recognized text: the entity words contained in the speech to be recognized are decoded as entity word category labels, while the non-entity words are decoded normally as their corresponding characters, finally yielding the preliminary recognized text output by the primary decoder.
It should be noted that one entity word may correspond to a single entity word category label; alternatively, a number of identical category labels equal to the number of characters in the entity word may be produced, that is, one entity word may correspond to multiple identical entity word category labels.
Take the speech to be recognized "听张三的歌" ("play Zhang San's songs") as an example, where "张三" (Zhang San) is an entity word with the category label 歌手 (singer). When the speech recognition model of the present case recognizes this speech, the preliminary recognized text output by the primary decoder may be "听<歌手>的歌"; alternatively, with one label per character, it may be "听<歌手><歌手>的歌".
Optionally, the primary decoder in this embodiment may adopt an autoregressive network structure with an attention mechanism, such as a Transformer or LSTM. On this basis, when decoding, the primary decoder takes characters as modeling units and decodes the preliminary recognized text, consisting of entity word category labels and the characters of the remaining non-entity words, based on both the acoustic coding features and the real-time state features of the primary decoder.
Specifically, during decoding the primary decoder may refer simultaneously to the acoustic coding feature ct and to its own real-time state features, which can be understood as contextual language features; that is, when decoding character by character, the primary decoder may refer to the character decoded at the previous step to guide the decoding at the current step. Taking the contextual relationships of the text into account makes the decoding result more accurate.
3. The secondary decoder is used to take syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to each entity word category label based on the acoustic coding features of the speech segment corresponding to that label, and convert the syllables or phonemes into characters by means of a preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category label.
Specifically, for the parts decoded by the primary decoder as entity word category labels, the secondary decoder decodes the syllables or phonemes corresponding to each label based on the acoustic coding features of the speech segment corresponding to that label. It will be understood that if the secondary decoder uses syllables as modeling units, the syllables corresponding to the entity word category label are decoded here; if it uses phonemes as modeling units, the phonemes corresponding to the label are decoded.
In this embodiment, considering that many domain entity words are coined words, such as company names, there is no obvious correlation between the adjacent characters of a domain entity word. The secondary decoder can therefore weaken the modeling of contextual correlation: unlike the primary decoder, which decodes with reference to both the acoustic coding features and its own real-time state features, the secondary decoder may decode based on the acoustic coding features alone.
After the secondary decoder has decoded the syllables or phonemes corresponding to an entity word category label, it can convert them into characters by means of the preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the label.
结合图3所示:As shown in Figure 3:
假设输入的待识别语音为“我买了一个宇色手机”。其中,“宇色手机”是一个新增的领域实体词。则一级解码器dec1解码后结果如图3所示:我买了一个<设备名><设备名><设备名><设备名>。Assume that the input speech to be recognized is "I bought a Yuse mobile phone". Among them, "Yuse mobile phone" is a newly added domain entity word. The decoding result of the first-level decoder dec1 is shown in Figure 3: I bought a <device name><device name><device name><device name>.
其中<设备名>指代实体词的类别标签。Where <device name> refers to the category label of the entity word.
二级解码器dec2对于一级解码器输出的每个实体词类别标签,进一步解码得到对应的音节:yu3se4shou3ji1。The second-level decoder dec2 further decodes each entity word category label output by the first-level decoder to obtain the corresponding syllable: yu3se4shou3ji1.
每个音节通过发音词典及语言模型转换为对应的字符:宇色手机。Each syllable is converted into a corresponding character through the pronunciation dictionary and language model: 宇色手机.
进一步结合图4,以音节“yu3”为例,介绍其转换为字符“宇”的过程。Further in conjunction with FIG. 4 , taking the syllable “yu3” as an example, the process of converting it into the character “宇” is introduced.
音节发音词典中“yu3”对应的字符包括“雨”、“宇”、“语”。语言模型中该三个字符各自的语言模型得分依次为2、3、1。因此,选取语言模型得分最高的字符“宇”作为“yu3”对应的最终解码字符。The characters corresponding to "yu3" in the syllable pronunciation dictionary include "雨", "宇", and "语". The language model scores of the three characters in the language model are 2, 3, and 1, respectively. Therefore, the character "宇" with the highest language model score is selected as the final decoding character corresponding to "yu3".
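以下为上述"yu3"→"宇"转换过程的一个简化示意(假设性的词典与得分,非实际系统数据)。The selection step above can be sketched in Python (a minimal illustration with the toy dictionary and scores from the example; `pron_dict`, `lm_score`, and `syllable_to_char` are hypothetical names, not part of the original disclosure):

```python
# Pronunciation dictionary: syllable -> candidate characters (toy data)
pron_dict = {"yu3": ["雨", "宇", "语"]}

# Language model scores for each candidate (toy values from the example)
lm_score = {"雨": 2, "宇": 3, "语": 1}

def syllable_to_char(syllable):
    """Return the candidate character with the highest language model score."""
    candidates = pron_dict[syllable]
    return max(candidates, key=lambda ch: lm_score[ch])

print(syllable_to_char("yu3"))  # 宇
```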
4、输出层,用于利用所述实体词替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。4. An output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word to obtain the final output recognition text.
仍以上述例子进行说明:Let’s use the above example to illustrate:
初步识别文本为:我买了一个<设备名><设备名><设备名><设备名>。二级解码器得到的实体词字符为“宇色手机”。利用“宇色手机”替换掉初步识别文本中的实体词类别标签,得到最终输出的识别文本为:我买了一个宇色手机。The initial recognition text is: I bought a <device name><device name><device name><device name>. The entity word characters obtained by the secondary decoder are "宇色手机". The entity word category label in the initial recognition text is replaced with "宇色手机", and the final output recognition text is: I bought a 宇色手机.
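上述标签替换步骤可以示意如下(假设性实现,仅演示把连续的同一实体词类别标签整体替换为实体词字符)。The replacement step above can be sketched as follows (a hypothetical helper, assuming consecutive identical labels are replaced as a whole by the decoded entity word characters):

```python
import re

def replace_entity_labels(prelim_text, label, entity_chars):
    """Replace a run of consecutive identical entity category labels
    with the entity word characters decoded by the secondary decoder."""
    return re.sub("(?:%s)+" % re.escape(label), entity_chars, prelim_text)

prelim = "我买了一个<设备名><设备名><设备名><设备名>"
print(replace_entity_labels(prelim, "<设备名>", "宇色手机"))
# 我买了一个宇色手机
```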
本申请的一些实施例中,对上述一级解码器解码得到初步识别文本的过程进行介绍。In some embodiments of the present application, the process of decoding the above-mentioned first-level decoder to obtain the preliminary recognition text is introduced.
一级解码器可以采用字符为建模单元,解码时可以直接解码得到字符。一级解码器可以采用带有注意力机制的网络结构,解码时可以同时参考声学信息和语言信息,提升解码准确度。The first-level decoder can use characters as modeling units and directly decode characters. It can adopt a network structure with an attention mechanism, so that both acoustic information and linguistic information are referenced during decoding, improving decoding accuracy.
具体地,一级解码器以解码第t个字符时对每一帧声学编码特征的关注度为权重,对各帧声学编码特征进行加权求和,得到解码第t个字符时的声学编码特征ct,基于解码第t个字符时的声学编码特征ct及解码第t个字符时一级解码器的状态特征dt,解码第t个字符,直至全部解码后得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。Specifically, the first-level decoder takes its attention to each frame of acoustic coding features when decoding the t-th character as weights and performs a weighted sum over the frame-wise acoustic coding features to obtain the acoustic coding feature ct for the t-th character; based on ct and the state feature dt of the first-level decoder when decoding the t-th character, the t-th character is decoded, until decoding completes and a preliminary recognition text consisting of entity word category labels and the remaining non-entity-word characters is obtained.
具体地,一级解码器的输出可以参考下述公式计算:Specifically, the output of the first-level decoder can be calculated by referring to the following formula:
dt = LSTM([yt-1; ct-1])

etj = VT tanh(Wq dt + Wk hj)

atj = exp(etj) / Σj exp(etj)

ct = Σj atj hj

dec1 = argmax(W1[dt; ct])
其中,yt-1是上一个解码的字符,ct-1是一级解码器解码上一个字符时的声学编码特征,dt是解码第t个字符时一级解码器的状态特征(此处以一级解码器采用LSTM结构为例进行说明),也可以理解为解码第t个字符时的语言特征,atj是解码第t个字符时对第j帧声学编码特征的归一化后的关注度,Wq,Wk,V,W1是网络参数,hj是第j帧声学编码特征,dec1是一级解码器的输出。Among them, y t-1 is the last decoded character, c t-1 is the acoustic coding feature when the first-level decoder decodes the previous character, d t is the state feature of the first-level decoder when decoding the t-th character (here the first-level decoder adopts the LSTM structure as an example for explanation), which can also be understood as the language feature when decoding the t-th character, a tj is the normalized attention to the acoustic coding feature of the j-th frame when decoding the t-th character, W q , W k , V, W 1 are network parameters, h j is the acoustic coding feature of the j-th frame, and dec 1 is the output of the first-level decoder.
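上述注意力计算可用如下纯Python示意(玩具维度与手工设定的权重,仅为假设性演示,非实际网络实现)。The attention computation in the formulas above can be sketched in plain Python (a toy illustration with hypothetical dimensions and hand-picked weights, not the actual network):

```python
import math

def attention_step(d_t, H, W_q, W_k, V):
    """etj = V^T tanh(Wq d_t + Wk h_j); atj = softmax_j(etj);
    c_t = sum_j atj * h_j  (pure-Python sketch with toy dimensions)."""
    def matvec(M, v):
        return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]
    q = matvec(W_q, d_t)                      # Wq d_t
    e = []
    for h_j in H:                             # one score per frame j
        k = matvec(W_k, h_j)                  # Wk h_j
        e.append(sum(v * math.tanh(qi + ki) for v, qi, ki in zip(V, q, k)))
    m = max(e)
    a = [math.exp(x - m) for x in e]
    s = sum(a)
    a = [x / s for x in a]                    # normalised attention weights
    c_t = [sum(a_j * h_j[i] for a_j, h_j in zip(a, H))
           for i in range(len(H[0]))]         # weighted sum of frame features
    return a, c_t

# Toy example: 3 frames of 2-dim acoustic features, 2-dim decoder state
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_t = [0.5, -0.5]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
V = [1.0, 1.0]
a, c_t = attention_step(d_t, H, W_q, W_k, V)
print(round(sum(a), 6))  # 1.0
```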
在上述基础上,进一步介绍二级解码器解码得到实体词类别标签对应的音节或音素的过程。Based on the above, the process of obtaining the syllable or phoneme corresponding to the entity word category label by the secondary decoder is further introduced.
二级解码器以音节或音素为建模单元,基于一级解码器解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素。The secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
具体地,区别于一级解码器解码时同时使用声学编码特征和解码器实时状态特征,二级解码器需要弱化前后字符间的关联性,因此可以仅使用声学编码特征。Specifically, unlike the first-level decoder which uses both acoustic coding features and decoder real-time state features during decoding, the second-level decoder needs to weaken the correlation between previous and next characters, and therefore can use only acoustic coding features.
基于此,二级解码器可以仅使用解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素,二级解码器的输出表示为:Based on this, the secondary decoder can only use the acoustic coding features when decoding the entity word category label to decode the syllable or phoneme corresponding to the entity word category label. The output of the secondary decoder is expressed as:
dec2 = argmax(W2 ct)
其中,dec2表示二级解码器的输出,W2是网络参数。Among them, dec 2 represents the output of the secondary decoder and W 2 is the network parameter.
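上述dec2 = argmax(W2 ct)可以示意如下(假设性的玩具参数,仅演示仅基于声学编码特征取最高得分建模单元)。The formula above can be sketched as follows (hypothetical toy parameters; it only demonstrates taking the highest-scoring modeling unit from the acoustic coding feature alone):

```python
def secondary_decode(c_t, W2, units):
    """dec2 = argmax(W2 c_t): pick the syllable/phoneme unit whose row of
    W2 gives the highest score on the acoustic coding feature c_t alone."""
    scores = [sum(w * c for w, c in zip(row, c_t)) for row in W2]
    return units[scores.index(max(scores))]

# Toy example: 3 candidate syllables, 2-dim acoustic coding feature
units = ["yu3", "se4", "shou3"]
W2 = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_t = [1.0, 0.2]
print(secondary_decode(c_t, W2, units))  # yu3
```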
在本申请的一些实施例中,进一步对上述语音识别模型的训练过程进行介绍。In some embodiments of the present application, the training process of the above-mentioned speech recognition model is further introduced.
本实施例中示例了语音识别模型的一种可选训练过程,可以包括如下步骤:This embodiment illustrates an optional training process of a speech recognition model, which may include the following steps:
S1、获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签。S1. Obtain training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words.
具体地,在语音识别实际应用中会面临各种各样的领域实体词,如人名、地名、机构名、歌手、歌曲等等。首先要根据实际语音识别场景,获取领域实体词的所有类别,组成类别标签集。进一步,获取已有的实体词,并对已有的实体词确定其对应的类别标签。实体词的获取可以通过命名实体识别NER工具对文本数据进行NER识别之后得到。Specifically, practical speech recognition applications face a wide variety of domain entity words, such as person names, place names, organization names, singers, songs, and so on. First, all categories of domain entity words are obtained according to the actual speech recognition scenario to form a category label set. Further, existing entity words are obtained, and their corresponding category labels are determined. Entity words can be obtained by performing named entity recognition (NER) on text data with an NER tool.
需要说明的是,部分实体词存在多种含义,也即可以对应多个不同的类别标签,如“苹果”可以对应到“水果”和“机构名”两个类别标签。It should be noted that some entity words have multiple meanings, that is, they can correspond to multiple different category labels. For example, "apple" can correspond to two category labels: "fruit" and "institution name".
鉴于人力工作有限,上述得到的实体词有限,难以覆盖所有类别的实体词。因此,可以利用上述整理得到的实体词及其类别标签,训练一个分类神经网络模型,用于确定输入文本中包含的实体词类别。该分类神经网络可以采用具有上下文建模能力的模型,如Transformer、LSTM等。Given the limited human effort, the entity words obtained above are limited and it is difficult to cover all categories of entity words. Therefore, the entity words and their category labels obtained above can be used to train a classification neural network model to determine the category of entity words contained in the input text. The classification neural network can use a model with context modeling capabilities, such as Transformer, LSTM, etc.
基于训练后的分类神经网络模型,可以确定训练语料文本中的实体词的类别标签。Based on the trained classification neural network model, the category labels of entity words in the training corpus text can be determined.
其中,训练语料文本可以是收集的训练语音对应的识别文本。The training corpus text may be the recognition text corresponding to the collected training speech.
S2、利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。S2. Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text.
具体地,为了训练语音识别模型中的一级解码器,需要构造一级解码器的训练样本标签,也即将识别文本中的实体词利用对应的实体词类别标签替换掉,得到编辑后识别文本。Specifically, in order to train the first-level decoder in the speech recognition model, it is necessary to construct the training sample labels of the first-level decoder, that is, to replace the entity words in the recognition text with the corresponding entity word category labels to obtain the edited recognition text.
示例如,识别文本为"我买了一个苹果手机",其中包含的实体词为"苹果",对应的类别标签为<机构名>,则编辑后识别文本可以是"我买了一个<机构名>手机"。For example, if the recognized text is "I bought an Apple phone" and the entity word it contains is "Apple" with the corresponding category label <organization name>, the edited recognized text can be "I bought an <organization name> phone".
此外,为了便于二级解码器解码得到准确的实体词,我们希望一级解码器的输出中对于实体词类别标签,其数量等于实体词包含字符的数量。基于此,二级解码器可以对每个实体词类别标签进行二级解码,得到对应的字符,由各个字符组成完整的实体词。In addition, in order to facilitate the secondary decoder to decode the accurate entity words, we hope that the number of entity word category labels in the output of the first decoder is equal to the number of characters contained in the entity word. Based on this, the secondary decoder can perform secondary decoding on each entity word category label to obtain the corresponding characters, and each character constitutes a complete entity word.
为此,在对识别文本进行编辑时,利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本的过程,具体可以包括:To this end, when editing the recognized text, the corresponding entity words in the recognized text are replaced with the category labels of the entity words to obtain the edited recognized text, which may specifically include:
确定实体词包含的字符数量,并以同等数量的实体词类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。Determine the number of characters contained in the entity word, and replace the corresponding entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
仍以上述识别文本为例,编辑后识别文本可以是“我买了一个<机构名><机构名>手机”。Still taking the above-mentioned recognized text as an example, the recognized text after editing may be "I bought a <organization name> <organization name> mobile phone".
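上述按字符数量替换标签的过程可以示意如下(假设性实现,`edit_recognized_text`为演示用名称)。The equal-count label replacement described above can be sketched as follows (a hypothetical helper; `edit_recognized_text` is an illustrative name):

```python
def edit_recognized_text(text, entity, label):
    """Replace each character of the entity word with one copy of its
    category label, so the label count equals the entity character count."""
    return text.replace(entity, label * len(entity))

text = "我买了一个苹果手机"
print(edit_recognized_text(text, "苹果", "<机构名>"))
# 我买了一个<机构名><机构名>手机
```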
S3、将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符。S3. Input the training speech into a speech recognition model to obtain a preliminary recognition text output by the first-level decoder and entity word characters corresponding to the entity word category label output by the second-level decoder.
S4、基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数。S4. Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels.
具体地,可以基于一级解码器输出的初步识别文本和编辑后识别文本,通过交叉熵损失函数计算第一损失函数,该第一损失函数表示一级解码器的解码损失。Specifically, based on the preliminary recognition text and the edited recognition text output by the primary decoder, a first loss function can be calculated by a cross entropy loss function, and the first loss function represents the decoding loss of the primary decoder.
进一步,可以基于二级解码器输出的实体词类别标签对应的实体词字符,以及实体词类别标签在识别文本中对应的原始实体词,通过交叉熵损失函数计算第二损失函数,该第二损失函数表示二级解码器的解码损失。Furthermore, a second loss function can be calculated through a cross-entropy loss function based on the entity word characters corresponding to the entity word category labels output by the secondary decoder and the original entity words corresponding to the entity word category labels in the recognized text. The second loss function represents the decoding loss of the secondary decoder.
S5、结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。S5. Combine the first loss function and the second loss function to train the network parameters of the speech recognition model until the training end condition is met.
具体地,结合第一损失函数和第二损失函数计算总损失函数,并基于总损失函数训练语音识别模型的网络参数,直至满足训练结束条件为止,得到最终训练后的语音识别模型。 Specifically, the total loss function is calculated by combining the first loss function and the second loss function, and the network parameters of the speech recognition model are trained based on the total loss function until the training end condition is met, thereby obtaining the final trained speech recognition model.
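上述两路损失的合并方式可以示意如下(假设性的玩具实现:以单token交叉熵相加、加权求和作为总损失的一种可能形式,权重w1、w2为演示用参数)。The loss combination above can be sketched as follows (a hypothetical toy implementation: per-token cross-entropy summed per decoder, with a weighted sum as one possible form of the total loss; `w1` and `w2` are illustrative parameters):

```python
import math

def cross_entropy(probs, target_idx):
    """Single-token cross-entropy: -log p(target)."""
    return -math.log(probs[target_idx])

def total_loss(dec1_probs, dec1_targets, dec2_probs, dec2_targets,
               w1=1.0, w2=1.0):
    """Combine the first-decoder loss (vs. the edited recognized text) and
    the second-decoder loss (vs. the original entity words)."""
    l1 = sum(cross_entropy(p, t) for p, t in zip(dec1_probs, dec1_targets))
    l2 = sum(cross_entropy(p, t) for p, t in zip(dec2_probs, dec2_targets))
    return w1 * l1 + w2 * l2

# Toy output distributions over a 3-symbol vocabulary
dec1_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
dec2_probs = [[0.6, 0.3, 0.1]]
loss = total_loss(dec1_probs, [0, 1], dec2_probs, [0])
print(loss)
```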
下面对本申请实施例提供的语音识别装置进行描述,下文描述的语音识别装置与上文描述的语音识别方法可相互对应参照。The following is a description of a speech recognition device provided in an embodiment of the present application. The speech recognition device described below and the speech recognition method described above can be referenced to each other.
参见图5,图5为本申请实施例公开的一种语音识别装置结构示意图。See FIG. 5 , which is a schematic diagram of the structure of a speech recognition device disclosed in an embodiment of the present application.
如图5所示,该装置可以包括:As shown in FIG5 , the device may include:
待识别语音获取单元11,用于获取待识别语音;A to-be-recognized speech acquisition unit 11, configured to acquire the speech to be recognized;
初步识别文本确定单元12,用于基于所述待识别语音得到初步识别文本,所述初步识别文本包括实体词类别标签及其余非实体词的字符;A preliminary recognition text determination unit 12 is used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
其中,初步识别文本确定单元可以采用字符为建模单元,基于待识别语音解码得到初步识别文本。除此之外,还可以采用其他建模方式,如以音节或音素为建模单元,基于待识别语音解码得到对应的音节或音素,进而结合预设的发音词典和语言模型,将解码的音节或音素转换为字符,得到初步识别文本。The preliminary recognition text determination unit may use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized. In addition, other modeling methods may be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then converting the decoded syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain preliminary recognition text.
最终识别文本确定单元13,用于基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,得到所述实体词类别标签对应的实体词字符,由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终的识别文本。The final recognition text determination unit 13 is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
具体地,上述最终识别文本确定单元可以以音节或音素为建模单元,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,建模得到所述实体词类别标签对应的实体词字符。Specifically, the above-mentioned final recognition text determination unit can use syllables or phonemes as modeling units, and based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, model the entity word characters corresponding to the entity word category labels.
可选的,上述初步识别文本确定单元和最终识别文本确定单元的处理过程,可以通过模型处理单元实现,该模型处理单元用于利用预配置的语音识别模型处理所述待识别语音数据,得到模型输出的识别文本,其中,所述语音识别模型被配置为:Optionally, the processing of the preliminary recognition text determination unit and the final recognition text determination unit can be implemented by a model processing unit, which is used to process the speech data to be recognized using a preconfigured speech recognition model to obtain the recognition text output by the model, wherein the speech recognition model is configured as follows:
基于待识别语音解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本,以音节或音素为建模单元,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,建模得到所述实体词类别标签对应的实体词字符,由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。Decode the speech to be recognized to obtain a preliminary recognition text consisting of entity word category labels and the remaining non-entity-word characters; using syllables or phonemes as modeling units, obtain the entity word characters corresponding to the entity word category labels based on the speech segments corresponding to those labels in the speech to be recognized together with the preset pronunciation dictionary and language model; and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final output recognition text.
可选的,本申请的装置还可以包括:Optionally, the device of the present application may further include:
发音词典及语言模型更新单元,用于在获取到新增的领域实体词时,确定所述领域实体词对应的音节或音素,并将所述领域实体词与音节或音素的对应关系添加到所述预设的发音词典中,以及,将所述领域实体词添加到所述语言模型中。The pronunciation dictionary and language model updating unit is used to determine the syllable or phoneme corresponding to the domain entity word when a newly added domain entity word is obtained, and add the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and add the domain entity word to the language model.
可选的,上述所述语言模型可以是基于各领域实体词所构建的语言模型。Optionally, the above-mentioned language model may be a language model constructed based on entity words in various fields.
可选的,上述语音识别模型可以包括编码器、一级解码器、二级解码器及输出层;Optionally, the speech recognition model may include an encoder, a primary decoder, a secondary decoder and an output layer;
所述编码器,用于对输入的待识别语音进行编码,得到声学编码特征;The encoder is used to encode the input speech to be recognized to obtain acoustic coding features;
所述一级解码器,用于以字符为建模单元,基于所述声学编码特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本;The first-level decoder is used to decode the characters as modeling units based on the acoustic coding features to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
所述二级解码器,用于以音节或音素为建模单元,基于实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素,并结合预设的发音词典及语言模型将音节或音素转换为字符,得到实体词类别标签对应的实体词字符;The secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
所述输出层,用于利用所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。The output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
可选的,上述一级解码器可以采用带有注意力机制和自回归的网络结构,具体地:以字符为建模单元,基于所述声学编码特征及一级解码器的实时状态特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本,该过程可以包括:Optionally, the first-level decoder may adopt a network structure with an attention mechanism and autoregression. Specifically, taking characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, decoding to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, the process may include:
一级解码器以字符为建模单元,以解码第t个字符时对每一帧声学编码特征的关注度为权重,对各帧声学编码特征进行加权求和,得到解码第t个字符时的声学编码特征ct,基于解码第t个字符时的声学编码特征ct及解码第t个字符时一级解码器的状态特征dt,解码第t个字符,直至全部解码后得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。 The first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs weighted summation on the acoustic coding features of each frame, and obtains the acoustic coding feature c t when decoding the t-th character. Based on the acoustic coding feature c t when decoding the t-th character and the state feature d t of the first-level decoder when decoding the t-th character, the t-th character is decoded until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
可选的,上述二级解码器以音节或音素为建模单元,基于实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素的过程,可以包括:Optionally, the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, which may include:
二级解码器以音节或音素为建模单元,基于一级解码器解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素。The secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
可选的,本申请的装置还可以包括:Optionally, the device of the present application may further include:
模型训练单元,用于训练语音识别模型,该训练过程可以包括:The model training unit is used to train the speech recognition model. The training process may include:
获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签;Obtaining training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words;
利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本;Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text;
将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符;Input the training speech into the speech recognition model to obtain the preliminary recognition text output by the first-level decoder and the entity word characters corresponding to the entity word category labels output by the second-level decoder;
基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数;Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels;
结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。The network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
可选的,上述模型训练单元利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本的过程,可以包括:Optionally, the model training unit replaces the corresponding entity word in the recognized text with the category label of the entity word to obtain the edited recognized text, which may include:
确定实体词包含的字符数量,并以同等数量的实体词类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。Determine the number of characters contained in the entity word, and replace the corresponding entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
在本申请的一些实施例中,进一步提供了另一种语音识别方法,参照图6所示,该方法可以包括:In some embodiments of the present application, another speech recognition method is further provided. As shown in FIG. 6 , the method may include:
步骤S200、获取待识别语音。Step S200: Acquire speech to be recognized.
步骤S210、利用预配置的语音识别模型的一级识别模块,基于所述待识别语音得到初步识别文本,所述初步识别文本包括实体词类别标签及其余非实体词的字符。Step S210: using the primary recognition module of the preconfigured speech recognition model, obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the remaining non-entity-word characters.
本实施例提供的语音识别方法中,预先配置了语音识别模型,该语音识别模型包括两级框架,分别为一级识别模块和二级识别模块。In the speech recognition method provided in this embodiment, a speech recognition model is pre-configured. The speech recognition model includes a two-level framework, namely a primary recognition module and a secondary recognition module.
其中,一级识别模块可以基于输入的待识别语音进行解码,得到实体词类别标签及其余非实体词的字符组成的初步识别文本。Among them, the first-level recognition module can decode the input speech to be recognized to obtain a preliminary recognition text composed of entity word category labels and other non-entity word characters.
具体地,一级识别模块可以以字符为建模单元,基于待识别语音解码得到初步识别文本。除此之外,还可以采用其他建模方式,如以音节或音素为建模单元,基于待识别语音解码得到对应的音节或音素,进而结合预设的发音词典和语言模型,将解码的音节或音素转换为字符,得到初步识别文本。Specifically, the primary recognition module can use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized. In addition, other modeling methods can also be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain preliminary recognition text.
步骤S220、利用所述语音识别模型的二级识别模块,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,得到实体词类别标签对应的实体词字符。Step S220, using the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, obtain the entity word character corresponding to the entity word category label.
其中,语音识别模型中的二级识别模块,可以以音节或音素为建模单元,对一级识别模块解码的实体词类别标签对应的语音片段进行二级解码,得到解码后的音节或音素。进一步,结合预设的发音词典及语言模型,将解码后的音节或音素转换为字符形式,得到实体词类别标签对应的实体词字符。Among them, the secondary recognition module in the speech recognition model can use syllables or phonemes as modeling units to perform secondary decoding on the speech segments corresponding to the entity word category labels decoded by the primary recognition module to obtain decoded syllables or phonemes. Further, in combination with a preset pronunciation dictionary and language model, the decoded syllables or phonemes are converted into character form to obtain entity word characters corresponding to the entity word category labels.
步骤S230、由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终的识别文本。Step S230: Replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
本申请实施例提供的语音识别方法,预先配置的语音识别模型包含两级识别模块,一级识别模块,将待识别语音中的实体词识别为对应的类别标签,非实体词直接识别为字符,得到初始识别文本。这样可以大幅降低同类别标签中低频实体词或新出现的实体词被错误识别为非该类别标签下的字符的概率,提升实体词的识别准确率。二级识别模块只需要识别实体词类别标签对应的实体词即可。最终由二级识别模块识别的实体词字符替换掉初步识别文本中对应的实体词类别标签,得到最终的识别文本。The speech recognition method provided in the embodiment of the present application, the pre-configured speech recognition model includes a two-level recognition module. The first-level recognition module recognizes the entity words in the speech to be recognized as corresponding category labels, and directly recognizes non-entity words as characters to obtain an initial recognition text. In this way, the probability of low-frequency entity words or newly appeared entity words in the same category label being mistakenly recognized as characters not under the category label can be greatly reduced, and the recognition accuracy of entity words can be improved. The second-level recognition module only needs to recognize the entity words corresponding to the entity word category label. Finally, the entity word characters recognized by the second-level recognition module replace the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
采用本实施例的语音识别模型进行语音识别时,在面对新出现的实体词时,只需要通过更新发音词典及语言模型即可,也即只需要对解码路径进行扩展即可,无需迭代更新语音识别模型。方案的扩展性更好、学习成本更低,且不会出现由于更新语音识别模型导致的灾难性遗忘问题。When the speech recognition model of this embodiment is used for speech recognition, a newly appearing entity word only requires updating the pronunciation dictionary and the language model, i.e., only extending the decoding paths, without iteratively updating the speech recognition model. The solution is more extensible, has a lower learning cost, and avoids the catastrophic forgetting problem caused by updating the speech recognition model.
进一步地,为了保证语音识别模型对新增的领域实体词的识别准确率,当获取到新增的领域实体词时,可以确定该新增的领域实体词的音节或音素,进而将新增的领域实体词与其音节或音素的对应关系添加到预设的发音词典中,以完成对发音词典的更新。同时,将新增的领域实体词添加到预设的语言模型中,以完成对语言模型的更新。Furthermore, in order to ensure the recognition accuracy of the speech recognition model for the newly added domain entity words, when the newly added domain entity words are obtained, the syllables or phonemes of the newly added domain entity words can be determined, and then the corresponding relationship between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary. At the same time, the newly added domain entity words are added to the preset language model to complete the update of the language model.
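上述词典与语言模型的更新步骤可以示意如下(假设性实现:以字典和集合分别代表发音词典与语言模型词表,仅演示"扩展解码路径、无需重训模型"的流程)。The dictionary and language model update described above can be sketched as follows (a hypothetical implementation: a dict stands in for the pronunciation dictionary and a set for the language model vocabulary, only to illustrate extending the decoding paths without retraining the model):

```python
def add_domain_entity(pron_dict, lm_vocab, word, syllables):
    """Register a newly added domain entity word by extending the
    pronunciation dictionary and the language model vocabulary; the
    acoustic/decoder networks themselves are left untouched."""
    pron_dict[word] = syllables          # word -> syllable sequence
    lm_vocab.add(word)                   # make the word scorable by the LM
    return pron_dict, lm_vocab

pron_dict, lm_vocab = {}, set()
add_domain_entity(pron_dict, lm_vocab, "宇色手机",
                  ["yu3", "se4", "shou3", "ji1"])
print("宇色手机" in lm_vocab)  # True
```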
结合图2示例的语音识别模型,对上述语音识别模型的结构进行介绍。In conjunction with the speech recognition model shown in FIG2 , the structure of the speech recognition model is introduced.
其中,一级识别模块可以包括:编码器和一级解码器。The primary recognition module may include: an encoder and a primary decoder.
编码器,用于对所述待识别语音进行编码,得到声学编码特征。The encoder is used to encode the speech to be recognized to obtain acoustic coding features.
一级解码器,用于以字符为建模单元,基于所述声学编码特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。The first-level decoder is used to use characters as modeling units and decode the initial recognition text composed of entity word category labels and other non-entity word characters based on the acoustic coding features.
其中,一级解码器可以采用带有注意力机制和自回归的网络结构,也即一级解码器解码时,可以以字符为建模单元,基于所述声学编码特征及一级解码器的实时状态特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。具体实现过程可以参照前文相关介绍,此处不再赘述。Among them, the first-level decoder can adopt a network structure with an attention mechanism and autoregression, that is, when the first-level decoder decodes, it can use characters as modeling units, and decode to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder. The specific implementation process can refer to the relevant introduction in the previous article, which will not be repeated here.
二级识别模块可以包括:二级解码器,用于以音节或音素为建模单元,基于所述实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素,并结合预设的发音词典及语言模型将音节或音素转换为字符,得到实体词类别标签对应的实体词字符。The secondary recognition module may include: a secondary decoder, configured to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features of the speech segments corresponding to those labels, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category labels.
进一步地,二级识别模块还可以包括输出层,用于利用二级识别模块得到的实体词字符替换掉初步识别文本中对应的实体词类别标签,输出最终的识别文本。Furthermore, the secondary recognition module may also include an output layer, which is used to replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters obtained by the secondary recognition module, and output the final recognition text.
对于二级识别模块的处理过程,可以参照前文相关介绍,此处不再赘述。 For the processing process of the secondary recognition module, please refer to the relevant introduction in the previous article, which will not be repeated here.
下面对本申请实施例提供的语音识别装置进行描述,下文描述的语音识别装置与上文描述的第二种语音识别方法可相互对应参照。The speech recognition device provided in an embodiment of the present application is described below. The speech recognition device described below and the second speech recognition method described above can be referenced to each other.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of another speech recognition apparatus disclosed in an embodiment of the present application. As shown in FIG. 7, the apparatus may include:
a to-be-recognized speech acquisition unit 21, configured to acquire speech to be recognized; and
a speech recognition model processing unit 22, configured to: use a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text including entity word category labels and the characters of the remaining non-entity words; use a secondary recognition module of the speech recognition model to obtain, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label; and replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text.
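Taken together, the processing unit's two-stage flow can be sketched as below. The `primary` and `secondary` callables and the label set are stand-ins for the patent's actual modules, assumed only for illustration:

```python
def recognize(speech, primary, secondary, labels):
    """Two-stage recognition sketch: the primary module yields tokens mixing
    plain characters with entity word category labels; the secondary module
    re-recognizes each labelled speech segment into entity word characters.
    """
    preliminary = primary(speech)
    final = [secondary(speech, tok) if tok in labels else tok
             for tok in preliminary]
    return "".join(final)
```

The point of the split is that new domain entity words only require updating the dictionary and language model behind `secondary`, not retraining the primary module.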
For the composition of the speech recognition model, reference may be made to the introduction in the foregoing speech recognition method section, which is not repeated here.
The two different speech recognition apparatuses provided in the embodiments of the present application may be applied to a speech recognition device, such as a terminal: a mobile phone, a computer, etc. Optionally, FIG. 8 shows a block diagram of the hardware structure of the speech recognition device. Referring to FIG. 8, the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like. The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The memory stores a program, and the processor may call the program stored in the memory, the program being used to implement the steps of the speech recognition method introduced in the foregoing embodiments. Optionally, for the refined and extended functions of the program, reference may be made to the description above.
An embodiment of the present application further provides a storage medium that may store a program suitable for execution by a processor, the program being used to implement the steps of the speech recognition method introduced in the foregoing embodiments. Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments; the embodiments may be combined as needed, and for the same or similar parts, reference may be made to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

  1. A speech recognition method, comprising:
    acquiring speech to be recognized;
    obtaining a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; and
    obtaining, based on a speech segment corresponding to an entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  2. The method according to claim 1, wherein the processes of obtaining the preliminary recognition text based on the speech to be recognized, obtaining the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text are implemented by a preconfigured speech recognition model.
  3. The method according to claim 1, further comprising:
    when a newly added domain entity word is acquired, determining the syllable or phoneme corresponding to the domain entity word, adding the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and adding the domain entity word to the language model.
  4. The method according to claim 1, wherein the language model is a language model constructed based on entity words in various domains.
  5. The method according to claim 2, wherein the speech recognition model comprises an encoder, a first-level decoder, a second-level decoder, and an output layer;
    the encoder is configured to encode the input speech to be recognized to obtain acoustic coding features;
    the first-level decoder is configured to use characters as modeling units and decode, based on the acoustic coding features, to obtain a preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words;
    the second-level decoder is configured to use syllables or phonemes as modeling units, decode, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label; and
    the output layer is configured to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the finally output recognition text.
  6. The method according to claim 5, wherein the process of the first-level decoder using characters as modeling units and decoding, based on the acoustic coding features, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words comprises:
    decoding, by the first-level decoder using characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  7. The method according to claim 6, wherein the decoding, by the first-level decoder using characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words comprises:
    performing, by the first-level decoder using characters as modeling units, a weighted summation of the frames of acoustic coding features, with the attention paid to each frame of acoustic coding features when decoding the t-th character as the weights, to obtain an acoustic coding feature c_t for decoding the t-th character, and decoding the t-th character based on the acoustic coding feature c_t and a state feature d_t of the first-level decoder when decoding the t-th character, until all characters are decoded to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  8. The method according to claim 5, wherein the process of the second-level decoder using syllables or phonemes as modeling units and decoding, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label comprises:
    decoding, by the second-level decoder using syllables or phonemes as modeling units, based on the acoustic coding features used when the first-level decoder decodes the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label.
  9. The method according to claim 5, wherein the training process of the speech recognition model comprises:
    acquiring training speech and a corresponding recognition text, the recognition text being annotated with category labels of entity words;
    replacing the corresponding entity words in the recognition text with the category labels of the entity words to obtain an edited recognition text;
    inputting the training speech into the speech recognition model to obtain a preliminary recognition text output by the first-level decoder and entity word characters, corresponding to the entity word category labels, output by the second-level decoder;
    determining a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determining a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels; and
    training the network parameters of the speech recognition model by combining the first loss function and the second loss function until a training end condition is satisfied.
  10. The method according to claim 9, wherein the replacing the corresponding entity words in the recognition text with the category labels of the entity words to obtain the edited recognition text comprises:
    determining the number of characters contained in an entity word, and replacing the corresponding entity word in the recognition text with the same number of entity word category labels to obtain the edited recognition text.
  11. A speech recognition method, comprising:
    acquiring speech to be recognized;
    using a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words;
    using a secondary recognition module of the speech recognition model to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label; and
    replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  12. The method according to claim 11, wherein the primary recognition module comprises:
    an encoder and a first-level decoder;
    the encoder is configured to encode the speech to be recognized to obtain acoustic coding features; and
    the first-level decoder is configured to use characters as modeling units and decode, based on the acoustic coding features, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  13. The method according to claim 12, wherein the secondary recognition module comprises:
    a secondary encoder configured to use syllables or phonemes as modeling units, decode, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
  14. A speech recognition apparatus, comprising:
    a to-be-recognized speech acquisition unit, configured to acquire speech to be recognized;
    a preliminary recognition text determination unit, configured to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; and
    a final recognition text determination unit, configured to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label, and to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  15. A speech recognition apparatus, comprising:
    a to-be-recognized speech acquisition unit, configured to acquire speech to be recognized; and
    a speech recognition model processing unit, configured to: use a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; use a secondary recognition module of the speech recognition model to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label; and replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  16. A speech recognition device, comprising: a memory and a processor;
    the memory is configured to store a program; and
    the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1 to 11.
  17. A storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps of the speech recognition method according to any one of claims 1 to 11 are implemented.
PCT/CN2023/078636 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium WO2024124697A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211589720.7 2022-12-12
CN202211589720.7A CN115910070A (en) 2022-12-12 2022-12-12 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024124697A1

Family

ID=86476490

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078636 WO2024124697A1 (en) 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115910070A (en)
WO (1) WO2024124697A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150066361A (en) * 2013-12-06 2015-06-16 주식회사 케이티 Method and system for automatic word spacing of voice recognition using named entity recognition
CN112257449A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN112347768A (en) * 2020-10-12 2021-02-09 出门问问(苏州)信息科技有限公司 Entity identification method and device
US10997223B1 (en) * 2017-06-28 2021-05-04 Amazon Technologies, Inc. Subject-specific data set for named entity resolution
CN113656561A (en) * 2021-10-20 2021-11-16 腾讯科技(深圳)有限公司 Entity word recognition method, apparatus, device, storage medium and program product
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113990293A (en) * 2021-10-19 2022-01-28 京东科技信息技术有限公司 Voice recognition method and device, storage medium and electronic equipment
CN115048940A (en) * 2022-06-23 2022-09-13 之江实验室 Chinese financial text data enhancement method based on entity word attribute characteristics and translation


Also Published As

Publication number Publication date
CN115910070A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN112712804B (en) Speech recognition method, system, medium, computer device, terminal and application
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
US10176804B2 (en) Analyzing textual data
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN107195296B (en) Voice recognition method, device, terminal and system
WO2021139108A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
Watts Unsupervised learning for text-to-speech synthesis
CN114580382A (en) Text error correction method and device
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
WO2004034378A1 (en) Language model creation/accumulation device, speech recognition device, language model creation method, and speech recognition method
Kadyan et al. Refinement of HMM model parameters for Punjabi automatic speech recognition (PASR) system
CN111462748B (en) Speech recognition processing method and device, electronic equipment and storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN112686041A (en) Pinyin marking method and device
CN116775873A (en) Multi-mode dialogue emotion recognition method
WO2024124697A1 (en) Speech recognition method, apparatus and device, and storage medium
WO2023123892A1 (en) Construction method for information prediction module, information prediction method, and related device
Kurian et al. Connected digit speech recognition system for Malayalam language
CN115132170A (en) Language classification method and device and computer readable storage medium
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
Kolehmainen et al. Personalization for bert-based discriminative speech recognition rescoring