WO2024124697A1 - Speech recognition method, apparatus and device, and storage medium - Google Patents



Publication number
WO2024124697A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/078636
Other languages
French (fr)
Chinese (zh)
Inventor
潘嘉
王孟之
万根顺
刘聪
刘庆峰
Original Assignee
科大讯飞股份有限公司
科大讯飞(苏州)科技有限公司
Application filed by 科大讯飞股份有限公司 and 科大讯飞(苏州)科技有限公司
Publication of WO2024124697A1

Definitions

  • domain speech recognition technology has been widely used, covering all areas of human-computer interaction.
  • the core difficulty of domain speech recognition lies in the existence of a large number of domain-specific entity words.
  • Domain-specific entity words, especially low-frequency words, usually appear infrequently in the training data of speech recognition models, and domain-specific entity vocabulary is constantly updated. For example, in voice navigation applications, new company names and place names continue to appear.
  • the above characteristics of domain-specific entity words determine that in practical applications, the speech recognition system needs to be continuously updated to achieve a high accuracy rate in domain speech recognition.
  • the existing technology is not stable in improving the recognition accuracy of newly added domain entity words.
  • the improvement in recognition accuracy is highly dependent on the constructed training corpus.
  • the recognition accuracy is usually improved very little when the context is changed.
  • this application is proposed to provide a speech recognition method, device, equipment and storage medium to ensure the recognition accuracy of newly appeared domain entity words without updating the speech recognition model.
  • the specific solution is as follows:
  • a speech recognition method comprising:
  • the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final recognition text.
  • the above process of obtaining a preliminary recognized text based on the speech to be recognized, obtaining entity word characters corresponding to the entity word category label, replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters, and obtaining the final recognized text is achieved through a preconfigured speech recognition model.
  • it also includes:
  • the syllable or phoneme corresponding to the domain entity word is determined, and the correspondence between the domain entity word and the syllable or phoneme is added to the preset pronunciation dictionary, and the domain entity word is added to the language model.
  • the language model is a language model constructed based on entity words in various fields.
  • the speech recognition model includes an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to decode, using characters as modeling units and based on the acoustic coding features, to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder uses characters as modeling units and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features, including:
  • the first-level decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the first-level decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the primary decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the primary decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, including:
  • the first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame, and obtains the acoustic coding feature c_t when decoding the t-th character. Based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded, until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, including:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the training process of the speech recognition model includes:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
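The joint training objective above can be sketched as a weighted combination of the two decoders' losses. This is an illustrative sketch only: the cross-entropy form and the weight `alpha` are assumptions, not details stated in the patent.

```python
# Hypothetical sketch: combine the first-level decoder's loss (characters/
# labels) and the second-level decoder's loss (syllables/phonemes) into one
# training objective.
import math

def cross_entropy(predicted_probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(predicted_probs[target_index])

def combined_loss(probs_dec1, target_dec1, probs_dec2, target_dec2, alpha=0.5):
    loss1 = cross_entropy(probs_dec1, target_dec1)  # first loss function
    loss2 = cross_entropy(probs_dec2, target_dec2)  # second loss function
    return alpha * loss1 + (1 - alpha) * loss2

# toy 3-class output distributions for each decoder
loss = combined_loss([0.7, 0.2, 0.1], 0, [0.1, 0.8, 0.1], 1)
```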
  • the step of replacing the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text includes:
  • a speech recognition method comprising:
  • obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • utilizing the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word character corresponding to the entity word category label is obtained;
  • the entity word character replaces the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
  • the primary identification module comprises:
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to use characters as modeling units and decode based on the acoustic coding features to obtain preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary recognition module includes: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain entity word characters corresponding to the entity word category label.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a preliminary recognition text determination unit used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the final recognition text determination unit is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a speech recognition model processing unit is used to use the primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech fragment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • a speech recognition device comprising: a memory and a processor
  • the memory is used to store programs
  • the processor is used to execute the program to implement the various steps of the speech recognition method described above.
  • a storage medium on which a computer program is stored.
  • the computer program is executed by a processor, the various steps of the speech recognition method described above are implemented.
  • the present application divides the speech recognition process into two stages.
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters can be obtained based on the speech to be recognized.
  • in the second recognition stage, based on the speech segment corresponding to the entity word category label in the speech to be recognized and the preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final output recognition text.
  • the entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words can be directly recognized as characters, which can greatly reduce the probability that low-frequency entity words or newly appeared entity words in the same category label are mistakenly recognized as characters under non-category labels, and improve the recognition accuracy of entity words.
  • the entity words corresponding to the entity word category label are predicted in combination with the pronunciation dictionary and the language model. When a new domain entity word appears, the newly appeared domain entity word can be added to the preset pronunciation dictionary and language model, so that the recognition accuracy of the newly appeared domain entity word can be guaranteed.
  • FIG1 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • FIG2 illustrates a schematic diagram of the structure of a speech recognition model
  • FIG3 illustrates a schematic diagram of a two-stage decoding process of a speech recognition model
  • FIG4 illustrates a schematic diagram of a process for determining a decoded character by combining a pronunciation dictionary and a language model
  • FIG5 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • FIG6 is a flow chart of another speech recognition method provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of another speech recognition device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • the present application provides a speech recognition solution that can be applied to various scenarios for speech recognition, especially for domain entity word speech recognition scenarios, and can ensure a high recognition accuracy rate for newly emerging domain entity words.
  • the present application solution can be implemented based on a terminal with data processing capabilities, which can be a mobile phone, computer, server, cloud, etc.
  • the speech recognition method of the present application may include the following steps:
  • Step S100: Acquire the speech to be recognized.
  • Step S110: Obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the speech to be recognized.
  • the entity word category label is the pre-set field category to which the entity word belongs, such as labels such as person name, place name, organization name, singer, song, drug name, film and television drama name, etc.
  • the speech recognition method provided in this embodiment divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words under the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the process of identifying and obtaining the preliminary recognized text in this step can adopt an end-to-end modeling method, that is, taking characters as modeling units, and obtaining the preliminary recognized text based on the decoding of the speech to be recognized.
  • other modeling methods can also be adopted, such as taking syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain the preliminary recognized text.
  • Step S120: Combining a preset pronunciation dictionary and a language model, obtain the entity word characters corresponding to the entity word category labels, replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters, and obtain the final recognition text.
  • the entity word characters corresponding to the entity word category labels can be obtained by decoding based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types.
  • syllables or phonemes can be selected as modeling units, and the entity word characters corresponding to the entity word category labels in the speech to be recognized can be obtained by decoding based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the corresponding pronunciation dictionary may be a syllable pronunciation dictionary, which includes the correspondence between syllables and characters.
  • the corresponding pronunciation dictionary may be a phoneme pronunciation dictionary, which includes the correspondence between phonemes and characters.
  • the corresponding phonemes or syllables can be decoded based on the speech segments corresponding to the entity word category labels in the speech to be recognized. Further, the candidate characters corresponding to the phonemes or syllables are determined in combination with the pronunciation dictionary, the probability score of each candidate character is determined in combination with the language model, and the candidate character with the highest probability score is selected as the entity word character corresponding to the entity word category label.
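The candidate-selection step described above can be sketched as follows. The dictionary entries, the stand-in characters (Latin letters in place of Chinese characters), and the language-model scores are invented placeholders, and the greedy per-syllable selection is a simplification of the joint scoring a real decoder would perform.

```python
# Hypothetical sketch: map decoded syllables to candidate characters via a
# pronunciation dictionary, then pick the candidate with the highest
# language-model score. All data below is made up for illustration.

pronunciation_dict = {            # syllable -> candidate characters
    "ping2": ["A", "B"],
    "guo3": ["C", "D", "E"],
}
lm_score = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 3.0, "E": 1.5}

def syllables_to_characters(syllables):
    chars = []
    for s in syllables:
        candidates = pronunciation_dict[s]
        # select the candidate character with the highest LM score
        chars.append(max(candidates, key=lambda ch: lm_score[ch]))
    return "".join(chars)

entity_word = syllables_to_characters(["ping2", "guo3"])
```

A production decoder would score whole character sequences with the language model rather than each syllable independently; the per-syllable maximum is used here only to keep the sketch short.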
  • with the speech recognition method introduced in this embodiment, when facing a newly appearing entity word, it is only necessary to update the pronunciation dictionary and the language model, that is, it is only necessary to expand the decoding path, and there is no need to iteratively update the speech recognition model.
  • the solution has better scalability, lower learning cost, and will not cause catastrophic forgetting problems caused by updating the speech recognition model.
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the correspondence between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
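The update path for a newly added domain entity word can be sketched as below. The data structures and the example entry are assumptions for illustration, not the patent's implementation; the point is that only the dictionary and the language-model vocabulary change, not the model weights.

```python
# Hypothetical sketch: registering a new domain entity word by updating the
# pronunciation dictionary and the language-model vocabulary only.

pronunciation_dict = {}     # entity word -> its syllable (or phoneme) sequence
language_model_vocab = set()  # stand-in for the language model's entity words

def add_domain_entity_word(word, syllables):
    pronunciation_dict[word] = syllables   # update the pronunciation dictionary
    language_model_vocab.add(word)         # update the language model

# example placeholder entry
add_domain_entity_word("new_place_name", ["syl1", "syl2"])
```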
  • the second recognition stage of the speech recognition method of the present application, in combination with the pronunciation dictionary and the language model, obtains the entity word characters corresponding to the entity word category label based on the speech fragments corresponding to the entity word category label; that is, the second recognition stage only needs to perform entity word recognition, and does not need to perform non-entity word recognition.
  • the language model can be configured as a domain entity language model.
  • the domain entity language model can be constructed based on various domain entity words, that is, the domain entity language model only contains entity words.
  • the above embodiment introduces a method for performing speech recognition in a two-stage recognition manner, wherein the above two speech recognition stages can be implemented by a variety of different means.
  • the first stage speech recognition can be performed through a pre-trained speech recognition system, that is, the speech to be recognized is input into the pre-trained speech recognition system, and the decoding output is a preliminary recognition text consisting of entity word category labels and remaining non-entity word characters.
  • the second-stage speech recognition can be performed through a pre-trained speech recognition model. That is, the speech recognition model can use syllables or phonemes as modeling units, combined with a preset pronunciation dictionary and language model, to model the speech fragments corresponding to the entity word category labels in the speech to be recognized, and obtain the entity word characters corresponding to the entity word category labels.
  • the entity word characters corresponding to the entity word category labels obtained in the second recognition stage replace the corresponding entity word category labels in the preliminary recognition text obtained in the first recognition stage to obtain the final recognition text.
  • the embodiment of the present application also provides another optional implementation method of the above two-stage speech recognition.
  • the present application can pre-train a speech recognition model.
  • the recognition model implements the above two-stage speech recognition process.
  • a speech recognition model may be pre-trained, and the speech recognition model may be configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • entity word characters corresponding to the entity word category labels are obtained based on the speech fragments corresponding to the entity word category labels.
  • the corresponding entity word category labels in the preliminary recognition text are replaced by the entity word characters to obtain the recognition text that is finally output.
  • the speech to be recognized is input into the speech recognition model configured as above to obtain the final recognition text output by the speech recognition model.
  • the speech recognition model divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words in the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types. Specifically, when a new domain entity word appears, the newly emerging domain entity word can be added to the preset pronunciation dictionary and language model, so as to ensure the recognition accuracy of the newly emerging domain entity word.
  • the speech recognition model introduced in this embodiment may include an encoder, a first-level decoder (Decoder1), a second-level decoder (Decoder2), and an output layer (not shown in FIG. 2).
  • Encoder used to encode the input speech to be recognized to obtain acoustic coding features.
  • the input of the encoder can be the speech features of the speech to be recognized, such as the amplitude spectrum features, etc. Taking the amplitude spectrum features as an example, it can be the log filter bank energy (LFBE).
  • the encoder is used to extract the representation of the speech features of the speech to be recognized.
  • the encoder encodes the speech features of the speech to be recognized to obtain acoustic coding features.
  • the encoder may adopt a convolutional neural network, LSTM or Transformer structure.
  • a first-level decoder is used to decode, using characters as modeling units and based on the acoustic coding features, to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the first-level decoder can use characters as modeling units for end-to-end modeling.
  • syllables or phonemes can be used as modeling units, and then the pronunciation dictionary and language model are combined to obtain preliminary recognition text.
  • in the example speech recognition model of FIG. 2, only the character modeling unit is used as an example for explanation.
  • the first-level decoder uses characters as modeling units for decoding, and can directly decode to obtain preliminary recognition text. For entity words contained in the speech to be recognized, the first-level decoder decodes them into entity word category labels. For non-entity words in the speech to be recognized, the first-level decoder decodes them normally into corresponding characters, and finally obtains the preliminary recognition text output by the first-level decoder.
  • one entity word can correspond to one entity word category label.
  • the same number of entity word category labels can be obtained according to the number of characters contained in the entity word, that is, one entity word can correspond to multiple identical entity word category labels.
  • the preliminary recognition text output by the first-level decoder can be: listen to <singer>'s song.
  • the preliminary recognition text output by the first-level decoder can also be: listen to <singer><singer><singer>'s song, that is, one <singer> label for each character of the entity word.
  • the first-level decoder in this embodiment can adopt a network structure with attention mechanism and autoregression, such as transformer, LSTM and other network structures.
  • the first-level decoder uses characters as modeling units, and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • the first-level decoder, when decoding, can refer to the acoustic coding feature c_t and the real-time state feature of the first-level decoder at the same time, where the real-time state feature of the first-level decoder can be understood as the contextual language feature; that is, when the first-level decoder decodes word by word, it can refer to the characters decoded at the previous moment to guide the decoding process at the current moment, taking into account the contextual association of the text, so that the decoding result is more accurate.
  • a secondary decoder is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features of the speech fragments corresponding to the entity word category labels, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category labels.
  • the second-level decoder decodes the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label. It can be understood that if the second-level decoder uses syllables as modeling units, the syllable corresponding to the entity word category label is decoded here; if the second-level decoder uses phonemes as modeling units, the phoneme corresponding to the entity word category label is decoded here.
  • the secondary decoder can weaken the modeling of context relevance, that is, different from the way in which the primary decoder decodes by referring to both the acoustic coding features and the real-time state features of the primary decoder, the secondary decoder can decode based only on the acoustic coding features.
  • after the secondary decoder decodes and obtains the syllable or phoneme corresponding to the entity word category label, it can convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label.
  • <device name> refers to the category label of the entity word.
  • the second-level decoder dec2 further decodes each entity word category label output by the first-level decoder to obtain the corresponding syllable: yu3se4shou3ji1.
  • Each syllable is converted into a corresponding character through the pronunciation dictionary and language model: ⁇ .
  • the characters corresponding to "yu3" in the syllable pronunciation dictionary include "⁇", "⁇", and "⁇".
  • the language model scores of the three characters in the language model are 2, 3, and 1, respectively. Therefore, the character "⁇" with the highest language model score is selected as the final decoding character corresponding to "yu3".
  • An output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final output recognition text.
  • the initial recognition text is: I bought a <device name><device name><device name><device name>.
  • the entity word characters obtained by the secondary decoder are "⁇".
  • the entity word category labels in the initial recognition text are replaced with "⁇".
  • the final output recognition text is: I bought a ⁇.
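The output layer's replacement step above can be sketched as a string substitution that collapses a run of identical category labels into the entity word decoded by the secondary decoder. The label "<device name>" comes from the example; the replacement word "PhoneX" is a hypothetical placeholder for the decoded entity word characters.

```python
# Sketch of the output layer: replace one or more consecutive copies of an
# entity word category label with the decoded entity word characters.
import re

def replace_labels(preliminary_text, label, entity_chars):
    # collapse a run of identical labels into a single entity word
    pattern = "(?:" + re.escape(label) + ")+"
    return re.sub(pattern, entity_chars, preliminary_text)

text = "I bought a <device name><device name><device name><device name>."
final_text = replace_labels(text, "<device name>", "PhoneX")
```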
  • the process of decoding the above-mentioned first-level decoder to obtain the preliminary recognition text is introduced.
  • the first-level decoder can use characters as modeling units, and can directly decode characters during decoding.
  • the first-level decoder can use a network structure with an attention mechanism, and can refer to acoustic information and language information at the same time during decoding to improve decoding accuracy.
  • the first-level decoder takes the attention degree of each frame's acoustic coding feature when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame, and obtains the acoustic coding feature c_t when decoding the t-th character.
  • based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded, until all characters are decoded to obtain the preliminary recognition text consisting of the entity word category labels and the remaining non-entity word characters.
  • the output of the first-level decoder can be calculated by referring to the following formula:
  • dec_1 = argmax(W_1[d_t ; c_t])
  • y_{t-1} is the previously decoded character
  • c_{t-1} is the acoustic coding feature when the first-level decoder decodes the previous character
  • d_t is the state feature of the first-level decoder when decoding the t-th character (here the first-level decoder adopts the LSTM structure as an example for explanation), which can also be understood as the language feature when decoding the t-th character
  • a_tj is the normalized attention to the acoustic coding feature of the j-th frame when decoding the t-th character
  • W_q, W_k, V, W_1 are network parameters
  • h_j is the acoustic coding feature of the j-th frame
  • dec_1 is the output of the first-level decoder.
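The dec_1 computation above can be illustrated with a toy numeric sketch: the attention weights a_tj (assumed here to be already computed and normalized) give the context vector c_t as a weighted sum of frame features h_j, which is concatenated with the state d_t and projected by W_1 before the argmax. All values are invented for illustration.

```python
# Toy sketch of dec_1 = argmax(W_1[d_t ; c_t]) with c_t = sum_j a_tj * h_j.
# h: per-frame acoustic features; a_t: normalized attention weights;
# d_t: decoder state; W1: output projection (one row per vocabulary unit).

def dec1_output(h, a_t, d_t, W1):
    dim = len(h[0])
    # attention-weighted sum of frame features -> context vector c_t
    c_t = [sum(a_t[j] * h[j][k] for j in range(len(h))) for k in range(dim)]
    v = d_t + c_t                       # concatenation [d_t ; c_t]
    logits = [sum(w * x for w, x in zip(row, v)) for row in W1]
    return max(range(len(logits)), key=lambda i: logits[i])  # argmax

unit = dec1_output([[1.0, 0.0], [0.0, 1.0]],        # two frames of features
                   [0.5, 0.5],                      # attention weights
                   [1.0, 0.0],                      # decoder state d_t
                   [[1.0, 0.0, 0.0, 0.0],           # W_1 rows = 2-unit vocab
                    [0.0, 0.0, 2.0, 2.0]])
```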
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the second-level decoder needs to weaken the dependence between adjacent characters, and can therefore rely on acoustic coding features alone.
  • specifically, the secondary decoder can use only the acoustic coding features at the position where the entity word category label was decoded to decode the syllable or phoneme corresponding to that label.
  • the output of the secondary decoder is expressed as:
  • dec_2 = argmax(W_2 c_t)
  • dec_2 represents the output of the secondary decoder, and W_2 is a network parameter.
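The second-level decoding step can be sketched in the same style. Unlike the first-level decoder, it uses only the acoustic context c_t, with no language-state input; W_2 and the syllable inventory size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_syllables, d_enc = 6, 8                 # toy syllable inventory and feature size
c_t = rng.normal(size=d_enc)              # acoustic feature at the label position
W_2 = rng.normal(size=(d_enc, n_syllables))

# dec_2 = argmax(W_2 c_t): the most likely syllable/phoneme id, decoded
# purely from acoustics, without conditioning on surrounding characters.
dec2 = int(np.argmax(c_t @ W_2))
```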
  • the training process of the above-mentioned speech recognition model is further introduced.
  • This embodiment illustrates an optional training process of a speech recognition model, which may include the following steps:
  • domain entity words such as names of people, places, institutions, singers, songs, etc.
  • named entity recognition (NER)
  • the entity words obtained above are limited and it is difficult to cover all categories of entity words. Therefore, the entity words and their category labels obtained above can be used to train a classification neural network model to determine the category of entity words contained in the input text.
  • the classification neural network can use a model with context modeling capabilities, such as Transformer, LSTM, etc.
  • the category labels of entity words in the training corpus text can be determined.
  • the training corpus text may be the recognition text corresponding to the collected training speech.
  • the training sample labels of the first-level decoder are constructed, that is, the entity words in the recognition text are replaced with the corresponding entity word category labels to obtain the edited recognition text.
  • for example, the recognized text is "I bought an Apple phone", which contains the entity word "Apple"; the corresponding category label is <Organization Name>, so the edited recognition text can be "I bought a <Organization Name> mobile phone".
  • the secondary decoder can perform secondary decoding on each entity word category label to obtain the corresponding characters, and these characters together constitute the complete entity word.
  • the corresponding entity words in the recognized text are replaced with the category labels of the entity words to obtain the edited recognized text, which may specifically include:
  • the recognized text after editing may be "I bought a <Organization Name><Organization Name> mobile phone", with one category label per character of the entity word.
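The label-editing step for the first-level training targets can be sketched as below. The Chinese sentence mirrors the patent's "Apple phone" example; the label name "<ORG>" and the span format are illustrative assumptions.

```python
def edit_recognition_text(chars, entities):
    """chars: list of characters of the recognized text.
    entities: list of (start, end, label) spans, end exclusive.
    Returns the edited character list where every character of an entity
    word is replaced by that entity's category label (one label per
    character, so the label count matches the character count)."""
    out = list(chars)
    for start, end, label in entities:
        for i in range(start, end):
            out[i] = label
    return out

chars = list("我买了苹果手机")          # "I bought an Apple phone"
entities = [(3, 5, "<ORG>")]           # "苹果" (Apple) -> organization label
edited = edit_recognition_text(chars, entities)
# edited == ['我', '买', '了', '<ORG>', '<ORG>', '手', '机']
```

The per-character replacement keeps the edited text the same length as the original, so the first-level decoder's targets stay aligned with the speech frames.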
  • a first loss function can be calculated by a cross entropy loss function, and the first loss function represents the decoding loss of the primary decoder.
  • a second loss function can be calculated through a cross-entropy loss function based on the entity word characters corresponding to the entity word category labels output by the secondary decoder and the original entity words corresponding to the entity word category labels in the recognized text.
  • the second loss function represents the decoding loss of the secondary decoder.
  • the total loss function is calculated by combining the first loss function and the second loss function, and the network parameters of the speech recognition model are trained based on the total loss function until the training end condition is met, thereby obtaining the final trained speech recognition model.
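The loss combination above can be sketched numerically. The cross-entropy form follows the text; the equal weighting of the two losses and all shapes are assumptions (the source only states that the two losses are combined).

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy over a sequence of logit rows."""
    logits = logits - logits.max(axis=1, keepdims=True)      # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(target_ids)), target_ids])

rng = np.random.default_rng(2)
dec1_logits = rng.normal(size=(7, 10))       # first-level decoder outputs
dec1_targets = rng.integers(0, 10, size=7)   # edited-text character/label ids
dec2_logits = rng.normal(size=(2, 6))        # second-level decoder outputs
dec2_targets = rng.integers(0, 6, size=2)    # entity syllable/phoneme ids

loss1 = cross_entropy(dec1_logits, dec1_targets)   # first loss function
loss2 = cross_entropy(dec2_logits, dec2_targets)   # second loss function
total_loss = loss1 + loss2                          # combined training loss
```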
  • FIG. 5 is a schematic diagram of the structure of a speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • a preliminary recognition text determination unit 12 is used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the preliminary recognition text determination unit may use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods may be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then converting the decoded syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain preliminary recognition text.
  • the final recognition text determination unit 13 is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • the above-mentioned final recognition text determination unit can use syllables or phonemes as modeling units, and based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, model the entity word characters corresponding to the entity word category labels.
  • the processing of the preliminary recognition text determination unit and the final recognition text determination unit can be implemented by a model processing unit, which is used to process the speech data to be recognized using a preconfigured speech recognition model to obtain the recognition text output by the model, wherein the speech recognition model is configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • entity word characters corresponding to the entity word category labels are modeled. The entity word characters are used to replace the corresponding entity word category labels in the preliminary recognition text to obtain the final output recognition text.
  • the device of the present application may further include:
  • the pronunciation dictionary and language model updating unit is used to determine the syllable or phoneme corresponding to the domain entity word when a newly added domain entity word is obtained, and add the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and add the domain entity word to the language model.
  • the above-mentioned language model may be a language model constructed based on entity words in various fields.
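The update path for a newly added domain entity word, as described above, touches only the pronunciation dictionary and the entity language model, never the recognition model itself. A minimal sketch, with purely illustrative data structures:

```python
# Toy stand-ins for the preset pronunciation dictionary and the entity
# language model's vocabulary; real systems would use richer structures.
pronunciation_dict = {"苹果": ["ping2", "guo3"]}   # word -> syllables
entity_lm_vocab = {"苹果"}

def add_domain_entity_word(word, syllables):
    """Register a newly added domain entity word: add its pronunciation to
    the dictionary and the word itself to the entity language model,
    without retraining or updating the speech recognition model."""
    pronunciation_dict[word] = syllables
    entity_lm_vocab.add(word)

add_domain_entity_word("讯飞", ["xun4", "fei1"])
```

Because the acoustic model and decoders are untouched, this avoids the catastrophic-forgetting risk of iterative model updates that the background section describes.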
  • the speech recognition model may include an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to decode the characters as modeling units based on the acoustic coding features to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder may adopt a network structure with an attention mechanism and autoregression. Specifically, taking characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, decoding to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, the process may include:
  • the first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs weighted summation on the acoustic coding features of each frame, and obtains the acoustic coding feature c t when decoding the t-th character. Based on the acoustic coding feature c t when decoding the t-th character and the state feature d t of the first-level decoder when decoding the t-th character, the t-th character is decoded until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, which may include:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the device of the present application may further include:
  • the model training unit is used to train the speech recognition model.
  • the training process may include:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
  • the model training unit replaces the corresponding entity word in the recognized text with the category label of the entity word to obtain the edited recognized text, which may include:
  • another speech recognition method is further provided. As shown in FIG. 6 , the method may include:
  • Step S200 Acquire speech to be recognized.
  • Step S210: using the primary recognition module of the pre-configured speech recognition model, obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the remaining non-entity word characters.
  • a speech recognition model is pre-configured.
  • the speech recognition model includes a two-level framework, namely a primary recognition module and a secondary recognition module.
  • the first-level recognition module can decode the input speech to be recognized to obtain a preliminary recognition text composed of entity word category labels and other non-entity word characters.
  • the primary recognition module can use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods can also be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain preliminary recognition text.
  • Step S220 using the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, obtain the entity word character corresponding to the entity word category label.
  • the secondary recognition module in the speech recognition model can use syllables or phonemes as modeling units to perform secondary decoding on the speech segments corresponding to the entity word category labels decoded by the primary recognition module to obtain decoded syllables or phonemes. Further, in combination with a preset pronunciation dictionary and language model, the decoded syllables or phonemes are converted into character form to obtain entity word characters corresponding to the entity word category labels.
  • Step S230 Replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • the pre-configured speech recognition model includes a two-level recognition module.
  • the first-level recognition module recognizes the entity words in the speech to be recognized as corresponding category labels, and directly recognizes non-entity words as characters to obtain an initial recognition text. In this way, the probability of low-frequency entity words or newly appeared entity words in the same category label being mistakenly recognized as characters not under the category label can be greatly reduced, and the recognition accuracy of entity words can be improved.
  • the second-level recognition module only needs to recognize the entity words corresponding to the entity word category label. Finally, the entity word characters recognized by the second-level recognition module replace the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
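The final assembly described above (map the decoded syllables back to an entity word via the pronunciation dictionary, then substitute it for the category label in the preliminary text) can be sketched as follows. The reverse-dictionary lookup is a simplification: in a real system the language model would disambiguate among homophonous entries. All data here is illustrative.

```python
# Illustrative reverse pronunciation dictionary: syllable sequence -> word.
pronunciation_dict = {("ping2", "guo3"): "苹果"}

def finalize(preliminary, label, syllables):
    """Replace the first occurrence of the entity word category label in the
    preliminary recognition text with the entity word looked up from the
    second-level module's decoded syllables."""
    word = pronunciation_dict[tuple(syllables)]
    return preliminary.replace(label, word, 1)

final_text = finalize("我买了<ORG>手机", "<ORG>", ["ping2", "guo3"])
# final_text == "我买了苹果手机"  ("I bought an Apple phone")
```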
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the corresponding relationship between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
  • the primary recognition module may include: an encoder and a primary decoder.
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features.
  • the first-level decoder is used to use characters as modeling units and decode the initial recognition text composed of entity word category labels and other non-entity word characters based on the acoustic coding features.
  • the first-level decoder can adopt a network structure with an attention mechanism and autoregression, that is, when the first-level decoder decodes, it can use characters as modeling units, and decode to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • the specific implementation process can refer to the relevant introduction above, which will not be repeated here.
  • the secondary recognition module may include: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
  • the secondary recognition module may also include an output layer, which is used to replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters obtained by the secondary recognition module, and output the final recognition text.
  • the speech recognition device provided in an embodiment of the present application is described below.
  • the speech recognition device described below and the second speech recognition method described above can be referenced to each other.
  • FIG. 7 is a schematic diagram of the structure of another speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • the to-be-recognized speech acquisition unit 21 is used to acquire the to-be-recognized speech
  • the speech recognition model processing unit 22 is used to use the primary recognition module of the preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, and the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • composition structure of the speech recognition model can refer to the introduction of the aforementioned speech recognition method part, which will not be repeated here.
  • FIG. 8 shows a hardware structure block diagram of a speech recognition device.
  • the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4;
  • the processor 1 may be a central processing unit CPU, or an application specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present invention, etc.;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used to: implement each step of the speech recognition method introduced in the above embodiments.
  • An embodiment of the present application further provides a storage medium, which can store a program suitable for execution by a processor, wherein the program is used to implement the various steps of the speech recognition method introduced in the aforementioned embodiments.


Abstract

Disclosed in the present application are a speech recognition method, apparatus and device, and a storage medium. The method comprises: on the basis of a speech to be recognized, obtaining a preliminary recognized text consisting of an entity word category label and characters of the remaining non-entity words; further, on the basis of a speech segment corresponding to the entity word category label and a preset pronunciation dictionary and a language model, obtaining entity word characters corresponding to the entity word category label; and replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters so as to obtain a final recognized text. Thus, when a new domain entity word appears, only the pronunciation dictionary and the language model need to be updated, and iterative updating of a speech recognition model is not needed, thus making the learning cost lower, avoiding the catastrophic forgetting problem caused by updating speech recognition models, and ensuring the recognition accuracy of newly-appearing domain entity words.

Description

Speech recognition method, device, equipment and storage medium

Technical Field

This application claims priority to a domestic application filed with the China Patent Office on December 12, 2022, with application number 202211589720.7 and invention name "Speech Recognition Method, Device, Equipment and Storage Medium", the entire contents of which are incorporated by reference in this application.
Background Technique

With the development of artificial intelligence and deep learning, speech recognition technology has been widely used, covering all areas of human-computer interaction. The core difficulty of domain speech recognition lies in the existence of a large number of domain-specific entity words. Domain-specific entity words, especially low-frequency ones, usually appear rarely in the training data of speech recognition models, and the domain-specific entity vocabulary is constantly updated; for example, in voice navigation applications, new company names and place names keep appearing. These characteristics of domain-specific entity words mean that, in practical applications, the speech recognition system needs to be continuously updated to maintain a high accuracy rate in domain speech recognition.

In order to meet the recognition rate requirements for newly emerging domain-specific entity words, existing methods usually need to record or synthesize sentences containing the domain-specific entity words and use them to update the speech recognition model. For example: first, rules or a trained context-expansion model are used to construct a large number of different context texts from the text of the current domain entity word; for instance, after a new song A appears, context texts such as "Play me song A" and "I want to listen to the new song A" need to be constructed. Next, a speech synthesis model synthesizes the speech corresponding to these texts, and data augmentation operations such as adding noise, adding reverberation, and timbre conversion are applied to the speech. Finally, this corpus is used to update the current speech recognition model through iterative learning. The resulting new model can usually improve the recognition accuracy of the newly added domain entity words.

However, this approach also has disadvantages, for example:

First, the existing technology requires continuous update learning of the speech recognition model, so the whole process is time-consuming, labor-intensive and costly.

Second, the improvement that the existing technology achieves in the recognition accuracy of newly added domain entity words is unstable. The magnitude of the improvement is highly dependent on the constructed training corpus; for context phrasings that were not constructed, the improvement in recognition accuracy is usually very limited.

Third, it is difficult for the existing technology to achieve incremental learning, that is, it is difficult to ensure that the recognition accuracy of the updated speech recognition model on existing domain entity words does not decrease. Catastrophic forgetting has long been an open problem in machine learning: since the speech recognition model is updated on a new corpus, it will to some extent forget the previous training data, and especially after multiple updates the catastrophic forgetting problem becomes particularly serious, learning the new while forgetting the old.
Summary of the Invention

In view of the above problems, this application is proposed to provide a speech recognition method, apparatus, device and storage medium, so as to ensure the recognition accuracy of newly appearing domain entity words without the need to update the speech recognition model. The specific solution is as follows:

In a first aspect, a speech recognition method is provided, comprising:

acquiring speech to be recognized;

obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the characters of the remaining non-entity words;

obtaining, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text.

Preferably, the processes of obtaining the preliminary recognition text based on the speech to be recognized, obtaining the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text are implemented by a preconfigured speech recognition model.

Preferably, the method further includes:

when a newly added domain entity word is obtained, determining the syllable or phoneme corresponding to the domain entity word, adding the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and adding the domain entity word to the language model.
Preferably, the language model is a language model constructed based on entity words in various fields.

Preferably, the speech recognition model includes an encoder, a first-level decoder, a second-level decoder and an output layer;

the encoder is used to encode the input speech to be recognized to obtain acoustic coding features;

the first-level decoder is used to take characters as modeling units and, based on the acoustic coding features, decode to obtain a preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words;

the second-level decoder is used to take syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label;

the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final output recognition text.

Preferably, the process in which the first-level decoder, taking characters as modeling units and based on the acoustic coding features, decodes to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words includes:

the first-level decoder, taking characters as modeling units, decodes to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words based on the acoustic coding features and the real-time state features of the first-level decoder.

Preferably, the process in which the first-level decoder, taking characters as modeling units, decodes to obtain the preliminary recognition text based on the acoustic coding features and the real-time state features of the first-level decoder includes:

the first-level decoder takes characters as modeling units, uses the attention paid to each frame of acoustic coding features when decoding the t-th character as the weight, performs a weighted summation of the acoustic coding features of each frame to obtain the acoustic coding feature c_t when decoding the t-th character, and decodes the t-th character based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, until all characters are decoded to obtain the preliminary recognition text consisting of entity word category labels and the characters of the remaining non-entity words.

Preferably, the process in which the second-level decoder, taking syllables or phonemes as modeling units, decodes the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label includes:

the second-level decoder takes syllables or phonemes as modeling units and decodes the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features when the first-level decoder decodes the entity word category label.
优选地,所述语音识别模型的训练过程,包括:Preferably, the training process of the speech recognition model includes:
获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签;Obtaining training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words;
利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本;Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text;
将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符;Input the training speech into the speech recognition model to obtain the preliminary recognition text output by the first-level decoder and the entity word characters corresponding to the entity word category labels output by the second-level decoder;
基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数;Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels;
结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。The network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
Preferably, replacing the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text includes:
determining the number of characters contained in each entity word, and replacing the entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
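This editing step can be sketched as follows (the angle-bracket label format and the annotation dictionary are assumptions made for illustration):

```python
def edit_recognized_text(text, entities):
    """Replace each annotated entity word with one category label per
    character, e.g. the two-character entity "张三" of class 歌手 (singer)
    becomes "<歌手><歌手>".

    `entities` maps each entity word in `text` to its category label;
    this annotation format is a simplifying assumption.
    """
    for word, label in entities.items():
        text = text.replace(word, label * len(word))
    return text

# "听张三的歌" -> "听<歌手><歌手>的歌"
edited = edit_recognized_text("听张三的歌", {"张三": "<歌手>"})
```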
In a second aspect, a speech recognition method is provided, including:
obtaining speech to be recognized;
obtaining, by a primary recognition module of a preconfigured speech recognition model, a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words;
obtaining, by a secondary recognition module of the speech recognition model, the entity word characters corresponding to each entity word category label, based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model;
replacing the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the final recognized text.
Preferably, the primary recognition module includes:
an encoder and a primary decoder;
the encoder is configured to encode the speech to be recognized to obtain acoustic coding features;
the primary decoder is configured to use characters as modeling units and, based on the acoustic coding features, decode a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Preferably, the secondary recognition module includes a secondary decoder, configured to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to each entity word category label based on the acoustic coding features of the speech segment corresponding to that label, and convert the syllables or phonemes into characters by means of a preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category label.
In a third aspect, a speech recognition apparatus is provided, including:
a to-be-recognized speech obtaining unit, configured to obtain speech to be recognized;
a preliminary recognized text determination unit, configured to obtain a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words;
a final recognized text determination unit, configured to obtain the entity word characters corresponding to each entity word category label based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model, and to replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters, obtaining the final recognized text.
In a fourth aspect, a speech recognition apparatus is provided, including:
a to-be-recognized speech obtaining unit, configured to obtain speech to be recognized;
a speech recognition model processing unit, configured to: obtain, by a primary recognition module of a preconfigured speech recognition model, a preliminary recognized text based on the speech to be recognized, the preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words; obtain, by a secondary recognition module of the speech recognition model, the entity word characters corresponding to each entity word category label, based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model; and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the final recognized text.
In a fifth aspect, a speech recognition device is provided, including a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech recognition method described above.
In a sixth aspect, a storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the speech recognition method described above are implemented.
By means of the above technical solutions, the present application divides the speech recognition process into two stages. In the first recognition stage, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words is obtained from the speech to be recognized. In the second recognition stage, the entity word characters corresponding to each entity word category label are obtained based on the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model, and the corresponding entity word category labels in the preliminary recognized text are replaced with the entity word characters to obtain the finally output recognized text. Because the first recognition stage recognizes the entity words in the speech as their category labels while non-entity words are recognized directly as characters, the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label is greatly reduced, improving the recognition accuracy of entity words. In the second recognition stage, the entity words corresponding to the entity word category labels are predicted by combining the pronunciation dictionary and the language model.
When a new domain entity word appears, it can be added to the preset pronunciation dictionary and language model, which guarantees the recognition accuracy for newly appearing domain entity words.
Furthermore, when new domain entity words appear, the present solution only needs to update the pronunciation dictionary and the language model; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
BRIEF DESCRIPTION OF THE DRAWINGS
Various other advantages and benefits will become apparent to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The accompanying drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech recognition model;
FIG. 3 is a schematic diagram of the two-stage decoding process of a speech recognition model;
FIG. 4 is a schematic diagram of a process of determining decoded characters by combining a pronunciation dictionary and a language model;
FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of another speech recognition method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another speech recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a speech recognition device provided by an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The present application provides a speech recognition solution applicable to various speech recognition scenarios, in particular domain entity word speech recognition, where it guarantees a high recognition accuracy for newly appearing domain entity words.
The solution of the present application can be implemented on a terminal with data processing capability, such as a mobile phone, a computer, a server, or a cloud platform.
Referring to FIG. 1, the speech recognition method of the present application may include the following steps:
Step S100: obtain speech to be recognized.
Step S110: obtain, based on the speech to be recognized, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Here, an entity word category label is a preset domain category to which an entity word belongs, such as person name, place name, organization name, singer, song, drug name, or film and television title.
The speech recognition method provided in this embodiment divides the speech recognition process into two stages. In the first recognition stage, the entity words in the speech to be recognized are recognized as their corresponding category labels, while non-entity words are recognized directly as characters, yielding the preliminary recognized text. This greatly reduces the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label, improving the recognition accuracy of entity words.
The preliminary recognized text in this step can be obtained by end-to-end modeling, that is, with characters as modeling units, the preliminary recognized text is decoded directly from the speech to be recognized. Alternatively, other modeling schemes may be used: for example, with syllables or phonemes as modeling units, the corresponding syllables or phonemes are decoded from the speech to be recognized and then converted into characters by a preset pronunciation dictionary and language model, yielding the preliminary recognized text.
Step S120: combine a preset pronunciation dictionary and language model to model the entity word characters corresponding to each entity word category label, and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters, obtaining the final recognized text.
Specifically, in the second recognition stage, the entity word characters corresponding to an entity word category label can be modeled from the speech segment corresponding to that label in the speech to be recognized together with a preset pronunciation dictionary and language model. The pronunciation dictionary and language model may cover both existing and newly appearing entity words of all types.
In this embodiment, the second recognition stage may choose syllables or phonemes as modeling units, modeling the entity word characters corresponding to the entity word category label from the speech segment corresponding to that label in the speech to be recognized together with the preset pronunciation dictionary and language model.
When syllables are used as modeling units, the corresponding pronunciation dictionary may be a syllable pronunciation dictionary containing the correspondence between syllables and characters. When phonemes are used as modeling units, the corresponding pronunciation dictionary may be a phoneme pronunciation dictionary containing the correspondence between phonemes and characters.
With syllables or phonemes as modeling units, the corresponding syllables or phonemes can be modeled from the speech segment corresponding to the entity word category label in the speech to be recognized. Then, the candidate characters corresponding to each syllable or phoneme are determined from the pronunciation dictionary, the probability score of each candidate character is determined by the language model, and the candidate character with the highest probability score is selected as the entity word character corresponding to the entity word category label.
With the speech recognition method of this embodiment, a newly appearing entity word only requires updating the pronunciation dictionary and language model, that is, only the decoding paths need to be extended; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
Further, to guarantee the recognition accuracy of this method for newly added domain entity words, when a new domain entity word is obtained, its syllables or phonemes can be determined, and the correspondence between the new domain entity word and its syllables or phonemes is added to the preset pronunciation dictionary, completing the update of the pronunciation dictionary. At the same time, the new domain entity word is added to the preset language model, completing the update of the language model.
Optionally, in the second recognition stage of the speech recognition method of the present application, the entity word characters corresponding to an entity word category label are modeled from the speech segment corresponding to that label, in combination with the pronunciation dictionary and language model; that is, the second recognition stage only needs to recognize entity words, not non-entity words. To improve the accuracy of entity word recognition, the language model in this embodiment may be configured as a domain entity language model; specifically, the domain entity language model may be built from the entity words of each domain, i.e., it contains only entity words.
On this basis, when this domain entity language model is used to convert syllables or phonemes into characters, it never converts to the characters of non-entity words, greatly improving the accuracy of entity word recognition.
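A minimal sketch of such an update, assuming the decoding resources are held as in-memory mappings (the data structures and the flat unigram score are illustrative, not the application's actual storage format):

```python
def add_domain_entity(word, syllables, lexicon, lm_unigrams, score=1.0):
    """Register a newly appearing domain entity word.

    `lexicon` maps each syllable to the characters it can realize;
    `lm_unigrams` holds per-character language-model scores. Only these
    decoding resources are extended; the acoustic model and the two
    decoders are left untouched.
    """
    for ch, syl in zip(word, syllables):
        lexicon.setdefault(syl, set()).add(ch)
        lm_unigrams[ch] = max(lm_unigrams.get(ch, 0.0), score)

# Register the new entity word "宇色手机" used in the later example.
lexicon, lm = {}, {}
add_domain_entity("宇色手机", ["yu3", "se4", "shou3", "ji1"], lexicon, lm)
```

Because only the dictionary and language-model entries change, the decoding paths are extended without retraining the speech recognition model.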
The above embodiment describes a speech recognition method based on two-stage recognition; the two recognition stages can be implemented by a variety of different means.
For example, in the first recognition stage, a pre-trained speech recognition system may perform the first-stage recognition: the speech to be recognized is input into the pre-trained system, which decodes and outputs a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
In the second recognition stage, a pre-trained speech recognition model may perform the second-stage recognition: with syllables or phonemes as modeling units and in combination with a preset pronunciation dictionary and language model, the model obtains the entity word characters corresponding to each entity word category label from the speech segment corresponding to that label in the speech to be recognized.
Finally, the entity word characters obtained in the second recognition stage replace the corresponding entity word category labels in the preliminary recognized text obtained in the first recognition stage, yielding the final recognized text.
In addition, an embodiment of the present application provides another optional implementation of the above two-stage speech recognition. Specifically, a single speech recognition model may be pre-trained to carry out both stages of the speech recognition process.
Specifically, in this embodiment a speech recognition model may be pre-trained and configured as follows:
decode, from the speech to be recognized, a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words; with syllables or phonemes as modeling units and in combination with a preset pronunciation dictionary and language model, model the entity word characters corresponding to each entity word category label from the speech segment corresponding to that label; and replace the corresponding entity word category labels in the preliminary recognized text with the entity word characters to obtain the finally output recognized text.
On this basis, the implementation of steps S110-S120 above includes:
inputting the speech to be recognized into the speech recognition model configured as above, and obtaining the final recognized text output by the model.
Here, the speech recognition model divides the speech recognition process into two stages. The first recognition stage recognizes the entity words in the speech to be recognized as their corresponding category labels, while non-entity words are recognized directly as characters, yielding the preliminary recognized text. This greatly reduces the probability that low-frequency or newly appearing entity words of a given category are misrecognized as characters outside that category label, improving the recognition accuracy of entity words.
In the second recognition stage, with syllables or phonemes as modeling units, the pronunciation dictionary and language model are combined to predict the entity word characters corresponding to the entity word category labels. The pronunciation dictionary and language model may cover both existing and newly appearing entity words of all types; specifically, when a new domain entity word appears, it can be added to the preset pronunciation dictionary and language model, guaranteeing the recognition accuracy for newly appearing domain entity words.
When speech recognition is performed with the speech recognition model provided in this embodiment, a newly appearing entity word only requires updating the pronunciation dictionary and language model, that is, only the decoding paths need to be extended; the speech recognition model does not need to be iteratively updated. The solution therefore scales better, has a lower learning cost, and avoids the catastrophic forgetting caused by updating the speech recognition model.
Some embodiments of the present application describe the structure of the above speech recognition model.
As shown in FIG. 2, the speech recognition model of this embodiment may include an encoder (Encoder), a primary decoder (Decoder1), a secondary decoder (Decoder2), and an output layer (not shown in FIG. 2), where:
1. The encoder is used to encode the input speech to be recognized to obtain acoustic coding features.
Specifically, the input of the encoder may be the speech features of the speech to be recognized, such as amplitude spectrum features; taking amplitude spectrum features as an example, these may be log filter bank energies (LFBE). The encoder extracts a representation of the speech features of the speech to be recognized: it encodes the speech features and obtains the acoustic coding features.
The encoder in this embodiment may adopt a convolutional neural network, LSTM, or Transformer structure. It can be expressed as h = f(x), where h = [h1, h2, ..., hT] are the acoustic coding features output by the encoder, x = [x1, x2, ..., xT] are the input speech features of the speech to be recognized, and T is the number of frames of the speech to be recognized.
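At the level of shapes, h = f(x) maps one feature vector per frame to one acoustic encoding per frame. The toy frame-wise transform below (a single tanh layer with assumed toy dimensions) only illustrates these shapes, not the CNN/LSTM/Transformer structures actually proposed:

```python
import math
import random

random.seed(0)
T, d_in, d_model = 5, 8, 4   # frames, LFBE dim, encoder dim (assumed toy sizes)
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(T)]     # x = [x1..xT]
W = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_in)]

# h = f(x): one acoustic encoding vector h_t per input frame x_t
h = [[math.tanh(sum(xt[i] * W[i][j] for i in range(d_in)))
      for j in range(d_model)]
     for xt in x]
assert len(h) == T and len(h[0]) == d_model
```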
2. The primary decoder is used to take characters as modeling units and, based on the acoustic coding features, decode a preliminary recognized text consisting of entity word category labels and the characters of the remaining non-entity words.
Optionally, the primary decoder may perform end-to-end modeling with characters as modeling units. Alternatively, syllables or phonemes may be used as modeling units, with a pronunciation dictionary and language model then producing the preliminary recognized text. The speech recognition model illustrated in FIG. 2 takes character modeling units as an example.
Decoding with characters as modeling units, the primary decoder can directly produce the preliminary recognized text: the entity words contained in the speech to be recognized are decoded as entity word category labels, while the non-entity words are decoded normally as their corresponding characters, finally yielding the preliminary recognized text output by the primary decoder.
It should be noted that one entity word may correspond to a single entity word category label; alternatively, a number of identical category labels equal to the number of characters in the entity word may be produced, that is, one entity word may correspond to multiple identical entity word category labels.
Take the speech to be recognized "听张三的歌" ("play Zhang San's songs") as an example, where "张三" (Zhang San) is an entity word with the category label 歌手 (singer). When the speech recognition model of the present case recognizes this speech, the preliminary recognized text output by the primary decoder may be "听<歌手>的歌"; alternatively, with one label per character, it may be "听<歌手><歌手>的歌".
Optionally, the primary decoder in this embodiment may adopt an autoregressive network structure with an attention mechanism, such as a Transformer or LSTM. On this basis, when decoding, the primary decoder takes characters as modeling units and decodes the preliminary recognized text, consisting of entity word category labels and the characters of the remaining non-entity words, based on both the acoustic coding features and the real-time state features of the primary decoder.
Specifically, during decoding the primary decoder may refer simultaneously to the acoustic coding feature ct and to its own real-time state features, which can be understood as contextual language features; that is, when decoding character by character, the primary decoder may refer to the character decoded at the previous step to guide the decoding at the current step. Taking the contextual relationships of the text into account makes the decoding result more accurate.
3. The secondary decoder is used to take syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to each entity word category label based on the acoustic coding features of the speech segment corresponding to that label, and convert the syllables or phonemes into characters by means of a preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category label.
Specifically, for the parts decoded by the primary decoder as entity word category labels, the secondary decoder decodes the syllables or phonemes corresponding to each label based on the acoustic coding features of the speech segment corresponding to that label. It will be understood that if the secondary decoder uses syllables as modeling units, the syllables corresponding to the entity word category label are decoded here; if it uses phonemes as modeling units, the phonemes corresponding to the label are decoded.
In this embodiment, considering that many domain entity words are coined words, such as company names, there is no obvious correlation between the adjacent characters of a domain entity word. The secondary decoder can therefore weaken the modeling of contextual correlation: unlike the primary decoder, which decodes with reference to both the acoustic coding features and its own real-time state features, the secondary decoder may decode based on the acoustic coding features alone.
After the secondary decoder has decoded the syllables or phonemes corresponding to an entity word category label, it can convert them into characters by means of the preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the label.
结合图3所示:As shown in Figure 3:
假设输入的待识别语音为“我买了一个宇色手机”。其中,“宇色手机”是一个新增的领域实体词。则一级解码器dec1解码后结果如图3所示:我买了一个<设备名><设备名><设备名><设备名>。Assume that the input speech to be recognized is "I bought a Yuse mobile phone". Among them, "Yuse mobile phone" is a newly added domain entity word. The decoding result of the first-level decoder dec1 is shown in Figure 3: I bought a <device name><device name><device name><device name>.
其中<设备名>指代实体词的类别标签。Where <device name> refers to the category label of the entity word.
二级解码器dec2对于一级解码器输出的每个实体词类别标签,进一步解码得到对应的音节:yu3se4shou3ji1。The second-level decoder dec2 further decodes each entity word category label output by the first-level decoder to obtain the corresponding syllable: yu3se4shou3ji1.
每个音节通过发音词典及语言模型转换为对应的字符:宇色手机。Each syllable is converted into a corresponding character through the pronunciation dictionary and language model: 宇色手机.
进一步结合图4,以音节“yu3”为例,介绍其转换为字符“宇”的过程。Further in conjunction with FIG. 4 , taking the syllable “yu3” as an example, the process of converting it into the character “宇” is introduced.
音节发音词典中“yu3”对应的字符包括“雨”、“宇”、“语”。语言模型中该三个字符各自的语言模型得分依次为2、3、1。因此,选取语言模型得分最高的字符“宇”作为“yu3”对应的最终解码字符。The characters corresponding to "yu3" in the syllable pronunciation dictionary include "雨", "宇", and "语". The language model scores of the three characters in the language model are 2, 3, and 1, respectively. Therefore, the character "宇" with the highest language model score is selected as the final decoding character corresponding to "yu3".
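以下为上述"yu3"→"宇"转换过程的一个简化示意(假设性的词典与得分,非实际系统数据)。The selection step above can be sketched in Python (a minimal illustration with the toy dictionary and scores from the example; `pron_dict`, `lm_score`, and `syllable_to_char` are hypothetical names, not part of the original disclosure):

```python
# Pronunciation dictionary: syllable -> candidate characters (toy data)
pron_dict = {"yu3": ["雨", "宇", "语"]}

# Language model scores for each candidate (toy values from the example)
lm_score = {"雨": 2, "宇": 3, "语": 1}

def syllable_to_char(syllable):
    """Return the candidate character with the highest language model score."""
    candidates = pron_dict[syllable]
    return max(candidates, key=lambda ch: lm_score[ch])

print(syllable_to_char("yu3"))  # 宇
```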
4、输出层,用于利用所述实体词替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。4. An output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word to obtain the final output recognition text.
仍以上述例子进行说明:Let’s use the above example to illustrate:
初步识别文本为:我买了一个<设备名><设备名><设备名><设备名>。二级解码器得到的实体词字符为“宇色手机”。利用“宇色手机”替换掉初步识别文本中的实体词类别标签,得到最终输出的识别文本为:我买了一个宇色手机。The initial recognition text is: I bought a <device name><device name><device name><device name>. The entity word characters obtained by the secondary decoder are "宇色手机". The entity word category label in the initial recognition text is replaced with "宇色手机", and the final output recognition text is: I bought a 宇色手机.
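上述标签替换步骤可以示意如下(假设性实现,仅演示把连续的同一实体词类别标签整体替换为实体词字符)。The replacement step above can be sketched as follows (a hypothetical helper, assuming consecutive identical labels are replaced as a whole by the decoded entity word characters):

```python
import re

def replace_entity_labels(prelim_text, label, entity_chars):
    """Replace a run of consecutive identical entity category labels
    with the entity word characters decoded by the secondary decoder."""
    return re.sub("(?:%s)+" % re.escape(label), entity_chars, prelim_text)

prelim = "我买了一个<设备名><设备名><设备名><设备名>"
print(replace_entity_labels(prelim, "<设备名>", "宇色手机"))
# 我买了一个宇色手机
```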
本申请的一些实施例中,对上述一级解码器解码得到初步识别文本的过程进行介绍。In some embodiments of the present application, the process of decoding the above-mentioned first-level decoder to obtain the preliminary recognition text is introduced.
一级解码器可以采用字符为建模单元,解码时可以直接解码得到字符。一级解码器可以采用带有注意力机制的网络结构,解码时可以同时参考声学信息和语言信息,提升解码准确度。The first-level decoder can use characters as modeling units and directly decode characters. It can adopt a network structure with an attention mechanism, so that both acoustic information and linguistic information are referenced during decoding, improving decoding accuracy.
具体地,一级解码器以解码第t个字符时对每一帧声学编码特征的关注度为权重,对各帧声学编码特征进行加权求和,得到解码第t个字符时的声学编码特征ct,基于解码第t个字符时的声学编码特征ct及解码第t个字符时一级解码器的状态特征dt,解码第t个字符,直至全部解码后得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。Specifically, the first-level decoder takes its attention to each frame of acoustic coding features when decoding the t-th character as weights and performs a weighted sum over the frame-wise acoustic coding features to obtain the acoustic coding feature ct for the t-th character; based on ct and the state feature dt of the first-level decoder when decoding the t-th character, the t-th character is decoded, until decoding completes and a preliminary recognition text consisting of entity word category labels and the remaining non-entity-word characters is obtained.
具体地,一级解码器的输出可以参考下述公式计算:Specifically, the output of the first-level decoder can be calculated by referring to the following formula:
dt = LSTM([yt-1; ct-1])

etj = VT tanh(Wq dt + Wk hj)

atj = exp(etj) / Σj exp(etj)

ct = Σj atj hj

dec1 = argmax(W1[dt; ct])
其中,yt-1是上一个解码的字符,ct-1是一级解码器解码上一个字符时的声学编码特征,dt是解码第t个字符时一级解码器的状态特征(此处以一级解码器采用LSTM结构为例进行说明),也可以理解为解码第t个字符时的语言特征,atj是解码第t个字符时对第j帧声学编码特征的归一化后的关注度,Wq,Wk,V,W1是网络参数,hj是第j帧声学编码特征,dec1是一级解码器的输出。Among them, y t-1 is the last decoded character, c t-1 is the acoustic coding feature when the first-level decoder decodes the previous character, d t is the state feature of the first-level decoder when decoding the t-th character (here the first-level decoder adopts the LSTM structure as an example for explanation), which can also be understood as the language feature when decoding the t-th character, a tj is the normalized attention to the acoustic coding feature of the j-th frame when decoding the t-th character, W q , W k , V, W 1 are network parameters, h j is the acoustic coding feature of the j-th frame, and dec 1 is the output of the first-level decoder.
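上述注意力计算可用如下纯Python示意(玩具维度与手工设定的权重,仅为假设性演示,非实际网络实现)。The attention computation in the formulas above can be sketched in plain Python (a toy illustration with hypothetical dimensions and hand-picked weights, not the actual network):

```python
import math

def attention_step(d_t, H, W_q, W_k, V):
    """etj = V^T tanh(Wq d_t + Wk h_j); atj = softmax_j(etj);
    c_t = sum_j atj * h_j  (pure-Python sketch with toy dimensions)."""
    def matvec(M, v):
        return [sum(m_i * v_i for m_i, v_i in zip(row, v)) for row in M]
    q = matvec(W_q, d_t)                      # Wq d_t
    e = []
    for h_j in H:                             # one score per frame j
        k = matvec(W_k, h_j)                  # Wk h_j
        e.append(sum(v * math.tanh(qi + ki) for v, qi, ki in zip(V, q, k)))
    m = max(e)
    a = [math.exp(x - m) for x in e]
    s = sum(a)
    a = [x / s for x in a]                    # normalised attention weights
    c_t = [sum(a_j * h_j[i] for a_j, h_j in zip(a, H))
           for i in range(len(H[0]))]         # weighted sum of frame features
    return a, c_t

# Toy example: 3 frames of 2-dim acoustic features, 2-dim decoder state
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_t = [0.5, -0.5]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
V = [1.0, 1.0]
a, c_t = attention_step(d_t, H, W_q, W_k, V)
print(round(sum(a), 6))  # 1.0
```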
在上述基础上,进一步介绍二级解码器解码得到实体词类别标签对应的音节或音素的过程。Based on the above, the process of obtaining the syllable or phoneme corresponding to the entity word category label by the secondary decoder is further introduced.
二级解码器以音节或音素为建模单元,基于一级解码器解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素。The secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
具体地,区别于一级解码器解码时同时使用声学编码特征和解码器实时状态特征,二级解码器需要弱化前后字符间的关联性,因此可以仅使用声学编码特征。Specifically, unlike the first-level decoder which uses both acoustic coding features and decoder real-time state features during decoding, the second-level decoder needs to weaken the correlation between previous and next characters, and therefore can use only acoustic coding features.
基于此,二级解码器可以仅使用解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素,二级解码器的输出表示为:Based on this, the secondary decoder can only use the acoustic coding features when decoding the entity word category label to decode the syllable or phoneme corresponding to the entity word category label. The output of the secondary decoder is expressed as:
dec2 = argmax(W2 ct)
其中,dec2表示二级解码器的输出,W2是网络参数。Among them, dec 2 represents the output of the secondary decoder and W 2 is the network parameter.
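上述dec2 = argmax(W2 ct)可以示意如下(假设性的玩具参数,仅演示仅基于声学编码特征取最高得分建模单元)。The formula above can be sketched as follows (hypothetical toy parameters; it only demonstrates taking the highest-scoring modeling unit from the acoustic coding feature alone):

```python
def secondary_decode(c_t, W2, units):
    """dec2 = argmax(W2 c_t): pick the syllable/phoneme unit whose row of
    W2 gives the highest score on the acoustic coding feature c_t alone."""
    scores = [sum(w * c for w, c in zip(row, c_t)) for row in W2]
    return units[scores.index(max(scores))]

# Toy example: 3 candidate syllables, 2-dim acoustic coding feature
units = ["yu3", "se4", "shou3"]
W2 = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_t = [1.0, 0.2]
print(secondary_decode(c_t, W2, units))  # yu3
```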
在本申请的一些实施例中,进一步对上述语音识别模型的训练过程进行介绍。In some embodiments of the present application, the training process of the above-mentioned speech recognition model is further introduced.
本实施例中示例了语音识别模型的一种可选训练过程,可以包括如下步骤:This embodiment illustrates an optional training process of a speech recognition model, which may include the following steps:
S1、获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签。S1. Obtain training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words.
具体地,在语音识别实际应用中会面临各种各样的领域实体词,如人名、地名、机构名、歌手、歌曲等等。首先要根据实际语音识别场景,获取领域实体词的所有类别,组成类别标签集。进一步,获取已有的实体词,并对已有的实体词确定其对应的类别标签。实体词的获取可以通过命名实体识别NER工具对文本数据进行NER识别之后得到。Specifically, practical speech recognition applications face a wide variety of domain entity words, such as person names, place names, organization names, singers, songs, and so on. First, all categories of domain entity words are obtained according to the actual speech recognition scenario to form a category label set. Further, existing entity words are obtained, and their corresponding category labels are determined. Entity words can be obtained by performing named entity recognition (NER) on text data with an NER tool.
需要说明的是,部分实体词存在多种含义,也即可以对应多个不同的类别标签,如“苹果”可以对应到“水果”和“机构名”两个类别标签。It should be noted that some entity words have multiple meanings, that is, they can correspond to multiple different category labels. For example, "apple" can correspond to two category labels: "fruit" and "institution name".
鉴于人力工作有限,上述得到的实体词有限,难以覆盖所有类别的实体词。因此,可以利用上述整理得到的实体词及其类别标签,训练一个分类神经网络模型,用于确定输入文本中包含的实体词类别。该分类神经网络可以采用具有上下文建模能力的模型,如Transformer、LSTM等。Given the limited human effort, the entity words obtained above are limited and it is difficult to cover all categories of entity words. Therefore, the entity words and their category labels obtained above can be used to train a classification neural network model to determine the category of entity words contained in the input text. The classification neural network can use a model with context modeling capabilities, such as Transformer, LSTM, etc.
基于训练后的分类神经网络模型,可以确定训练语料文本中的实体词的类别标签。Based on the trained classification neural network model, the category labels of entity words in the training corpus text can be determined.
其中,训练语料文本可以是收集的训练语音对应的识别文本。The training corpus text may be the recognition text corresponding to the collected training speech.
S2、利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。S2. Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text.
具体地,为了训练语音识别模型中的一级解码器,需要构造一级解码器的训练样本标签,也即将识别文本中的实体词利用对应的实体词类别标签替换掉,得到编辑后识别文本。Specifically, in order to train the first-level decoder in the speech recognition model, it is necessary to construct the training sample labels of the first-level decoder, that is, to replace the entity words in the recognition text with the corresponding entity word category labels to obtain the edited recognition text.
示例如,识别文本为"我买了一个苹果手机",其中包含的实体词为"苹果",对应的类别标签为<机构名>,则编辑后识别文本可以是"我买了一个<机构名>手机"。For example, if the recognized text is "I bought an Apple phone" and the entity word it contains is "Apple" with the corresponding category label <organization name>, the edited recognized text can be "I bought an <organization name> phone".
此外,为了便于二级解码器解码得到准确的实体词,我们希望一级解码器的输出中对于实体词类别标签,其数量等于实体词包含字符的数量。基于此,二级解码器可以对每个实体词类别标签进行二级解码,得到对应的字符,由各个字符组成完整的实体词。In addition, in order to facilitate the secondary decoder to decode the accurate entity words, we hope that the number of entity word category labels in the output of the first decoder is equal to the number of characters contained in the entity word. Based on this, the secondary decoder can perform secondary decoding on each entity word category label to obtain the corresponding characters, and each character constitutes a complete entity word.
为此,在对识别文本进行编辑时,利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本的过程,具体可以包括:To this end, when editing the recognized text, the corresponding entity words in the recognized text are replaced with the category labels of the entity words to obtain the edited recognized text, which may specifically include:
确定实体词包含的字符数量,并以同等数量的实体词类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。Determine the number of characters contained in the entity word, and replace the corresponding entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
仍以上述识别文本为例,编辑后识别文本可以是“我买了一个<机构名><机构名>手机”。Still taking the above-mentioned recognized text as an example, the recognized text after editing may be "I bought a <organization name> <organization name> mobile phone".
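上述按字符数量替换标签的过程可以示意如下(假设性实现,`edit_recognized_text`为演示用名称)。The equal-count label replacement described above can be sketched as follows (a hypothetical helper; `edit_recognized_text` is an illustrative name):

```python
def edit_recognized_text(text, entity, label):
    """Replace each character of the entity word with one copy of its
    category label, so the label count equals the entity character count."""
    return text.replace(entity, label * len(entity))

text = "我买了一个苹果手机"
print(edit_recognized_text(text, "苹果", "<机构名>"))
# 我买了一个<机构名><机构名>手机
```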
S3、将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符。S3. Input the training speech into a speech recognition model to obtain a preliminary recognition text output by the first-level decoder and entity word characters corresponding to the entity word category label output by the second-level decoder.
S4、基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数。S4. Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels.
具体地,可以基于一级解码器输出的初步识别文本和编辑后识别文本,通过交叉熵损失函数计算第一损失函数,该第一损失函数表示一级解码器的解码损失。Specifically, based on the preliminary recognition text and the edited recognition text output by the primary decoder, a first loss function can be calculated by a cross entropy loss function, and the first loss function represents the decoding loss of the primary decoder.
进一步,可以基于二级解码器输出的实体词类别标签对应的实体词字符,以及实体词类别标签在识别文本中对应的原始实体词,通过交叉熵损失函数计算第二损失函数,该第二损失函数表示二级解码器的解码损失。Furthermore, a second loss function can be calculated through a cross-entropy loss function based on the entity word characters corresponding to the entity word category labels output by the secondary decoder and the original entity words corresponding to the entity word category labels in the recognized text. The second loss function represents the decoding loss of the secondary decoder.
S5、结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。S5. Combine the first loss function and the second loss function to train the network parameters of the speech recognition model until the training end condition is met.
具体地,结合第一损失函数和第二损失函数计算总损失函数,并基于总损失函数训练语音识别模型的网络参数,直至满足训练结束条件为止,得到最终训练后的语音识别模型。 Specifically, the total loss function is calculated by combining the first loss function and the second loss function, and the network parameters of the speech recognition model are trained based on the total loss function until the training end condition is met, thereby obtaining the final trained speech recognition model.
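上述两路损失的合并方式可以示意如下(假设性的玩具实现:以单token交叉熵相加、加权求和作为总损失的一种可能形式,权重w1、w2为演示用参数)。The loss combination above can be sketched as follows (a hypothetical toy implementation: per-token cross-entropy summed per decoder, with a weighted sum as one possible form of the total loss; `w1` and `w2` are illustrative parameters):

```python
import math

def cross_entropy(probs, target_idx):
    """Single-token cross-entropy: -log p(target)."""
    return -math.log(probs[target_idx])

def total_loss(dec1_probs, dec1_targets, dec2_probs, dec2_targets,
               w1=1.0, w2=1.0):
    """Combine the first-decoder loss (vs. the edited recognized text) and
    the second-decoder loss (vs. the original entity words)."""
    l1 = sum(cross_entropy(p, t) for p, t in zip(dec1_probs, dec1_targets))
    l2 = sum(cross_entropy(p, t) for p, t in zip(dec2_probs, dec2_targets))
    return w1 * l1 + w2 * l2

# Toy output distributions over a 3-symbol vocabulary
dec1_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
dec2_probs = [[0.6, 0.3, 0.1]]
loss = total_loss(dec1_probs, [0, 1], dec2_probs, [0])
print(loss)
```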
下面对本申请实施例提供的语音识别装置进行描述,下文描述的语音识别装置与上文描述的语音识别方法可相互对应参照。The following is a description of a speech recognition device provided in an embodiment of the present application. The speech recognition device described below and the speech recognition method described above can be referenced to each other.
参见图5,图5为本申请实施例公开的一种语音识别装置结构示意图。See FIG. 5 , which is a schematic diagram of the structure of a speech recognition device disclosed in an embodiment of the present application.
如图5所示,该装置可以包括:As shown in FIG5 , the device may include:
待识别语音获取单元11,用于获取待识别语音;A to-be-recognized speech acquisition unit 11, configured to acquire the speech to be recognized;
初步识别文本确定单元12,用于基于所述待识别语音得到初步识别文本,所述初步识别文本包括实体词类别标签及其余非实体词的字符;A preliminary recognition text determination unit 12 is used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
其中,初步识别文本确定单元可以采用字符为建模单元,基于待识别语音解码得到初步识别文本。除此之外,还可以采用其他建模方式,如以音节或音素为建模单元,基于待识别语音解码得到对应的音节或音素,进而结合预设的发音词典和语言模型,将解码的音节或音素转换为字符,得到初步识别文本。The preliminary recognition text determination unit may use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized. In addition, other modeling methods may be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then converting the decoded syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain preliminary recognition text.
最终识别文本确定单元13,用于基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,得到所述实体词类别标签对应的实体词字符,由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终的识别文本。The final recognition text determination unit 13 is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
具体地,上述最终识别文本确定单元可以以音节或音素为建模单元,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,建模得到所述实体词类别标签对应的实体词字符。Specifically, the above-mentioned final recognition text determination unit can use syllables or phonemes as modeling units, and based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, model the entity word characters corresponding to the entity word category labels.
可选的,上述初步识别文本确定单元和最终识别文本确定单元的处理过程,可以通过模型处理单元实现,该模型处理单元用于利用预配置的语音识别模型处理所述待识别语音数据,得到模型输出的识别文本,其中,所述语音识别模型被配置为:Optionally, the processing of the preliminary recognition text determination unit and the final recognition text determination unit can be implemented by a model processing unit, which is used to process the speech data to be recognized using a preconfigured speech recognition model to obtain the recognition text output by the model, wherein the speech recognition model is configured as follows:
基于待识别语音解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本,以音节或音素为建模单元,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,建模得到所述实体词类别标签对应的实体词字符,由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。Decode the speech to be recognized to obtain a preliminary recognition text consisting of entity word category labels and the remaining non-entity-word characters; using syllables or phonemes as modeling units, obtain the entity word characters corresponding to the entity word category labels based on the speech segments corresponding to those labels in the speech to be recognized together with the preset pronunciation dictionary and language model; and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final output recognition text.
可选的,本申请的装置还可以包括:Optionally, the device of the present application may further include:
发音词典及语言模型更新单元,用于在获取到新增的领域实体词时,确定所述领域实体词对应的音节或音素,并将所述领域实体词与音节或音素的对应关系添加到所述预设的发音词典中,以及,将所述领域实体词添加到所述语言模型中。The pronunciation dictionary and language model updating unit is used to determine the syllable or phoneme corresponding to the domain entity word when a newly added domain entity word is obtained, and add the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and add the domain entity word to the language model.
可选的,上述所述语言模型可以是基于各领域实体词所构建的语言模型。Optionally, the above-mentioned language model may be a language model constructed based on entity words in various fields.
可选的,上述语音识别模型可以包括编码器、一级解码器、二级解码器及输出层;Optionally, the speech recognition model may include an encoder, a primary decoder, a secondary decoder and an output layer;
所述编码器,用于对输入的待识别语音进行编码,得到声学编码特征;The encoder is used to encode the input speech to be recognized to obtain acoustic coding features;
所述一级解码器,用于以字符为建模单元,基于所述声学编码特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本;The first-level decoder is used to decode the characters as modeling units based on the acoustic coding features to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
所述二级解码器,用于以音节或音素为建模单元,基于实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素,并结合预设的发音词典及语言模型将音节或音素转换为字符,得到实体词类别标签对应的实体词字符;The secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
所述输出层,用于利用所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终输出的识别文本。The output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
可选的,上述一级解码器可以采用带有注意力机制和自回归的网络结构,具体地:以字符为建模单元,基于所述声学编码特征及一级解码器的实时状态特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本,该过程可以包括:Optionally, the first-level decoder may adopt a network structure with an attention mechanism and autoregression. Specifically, taking characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, decoding to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, the process may include:
一级解码器以字符为建模单元,以解码第t个字符时对每一帧声学编码特征的关注度为权重,对各帧声学编码特征进行加权求和,得到解码第t个字符时的声学编码特征ct,基于解码第t个字符时的声学编码特征ct及解码第t个字符时一级解码器的状态特征dt,解码第t个字符,直至全部解码后得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。 The first-level decoder uses characters as modeling units, takes the attention degree of each frame of acoustic coding features when decoding the t-th character as the weight, performs weighted summation on the acoustic coding features of each frame, and obtains the acoustic coding feature c t when decoding the t-th character. Based on the acoustic coding feature c t when decoding the t-th character and the state feature d t of the first-level decoder when decoding the t-th character, the t-th character is decoded until all characters are decoded to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
可选的,上述二级解码器以音节或音素为建模单元,基于实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素的过程,可以包括:Optionally, the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, which may include:
二级解码器以音节或音素为建模单元,基于一级解码器解码实体词类别标签时的声学编码特征,解码得到实体词类别标签对应的音节或音素。The secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
可选的,本申请的装置还可以包括:Optionally, the device of the present application may further include:
模型训练单元,用于训练语音识别模型,该训练过程可以包括:The model training unit is used to train the speech recognition model. The training process may include:
获取训练语音及对应的识别文本,所述识别文本中标注有实体词的类别标签;Obtaining training speech and corresponding recognition text, wherein the recognition text is annotated with category labels of entity words;
利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本;Use the category label of the entity word to replace the corresponding entity word in the recognized text to obtain the edited recognized text;
将所述训练语音输入语音识别模型,得到一级解码器输出的初步识别文本,以及二级解码器输出的实体词类别标签对应的实体词字符;Input the training speech into the speech recognition model to obtain the preliminary recognition text output by the first-level decoder and the entity word characters corresponding to the entity word category labels output by the second-level decoder;
基于一级解码器输出的初步识别文本及所述编辑后识别文本确定第一损失函数,基于二级解码器输出的实体词类别标签对应的实体词字符及实体词类别标签对应的原始实体词确定第二损失函数;Determine a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determine a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels;
结合所述第一损失函数和所述第二损失函数,训练语音识别模型的网络参数,直至满足训练结束条件为止。The network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
可选的,上述模型训练单元利用实体词的类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本的过程,可以包括:Optionally, the model training unit replaces the corresponding entity word in the recognized text with the category label of the entity word to obtain the edited recognized text, which may include:
确定实体词包含的字符数量,并以同等数量的实体词类别标签替换掉识别文本中对应的实体词,得到编辑后识别文本。Determine the number of characters contained in the entity word, and replace the corresponding entity word in the recognized text with an equal number of entity word category labels to obtain the edited recognized text.
在本申请的一些实施例中,进一步提供了另一种语音识别方法,参照图6所示,该方法可以包括:In some embodiments of the present application, another speech recognition method is further provided. As shown in FIG. 6 , the method may include:
步骤S200、获取待识别语音。Step S200: Acquire speech to be recognized.
步骤S210、利用预配置的语音识别模型的一级识别模块,基于所述待识别语音得到初步识别文本,所述初步识别文本包括实体词类别标签及其余非实体词的字符。Step S210: using the primary recognition module of the preconfigured speech recognition model, obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the remaining non-entity-word characters.
本实施例提供的语音识别方法中,预先配置了语音识别模型,该语音识别模型包括两级框架,分别为一级识别模块和二级识别模块。In the speech recognition method provided in this embodiment, a speech recognition model is pre-configured. The speech recognition model includes a two-level framework, namely a primary recognition module and a secondary recognition module.
其中,一级识别模块可以基于输入的待识别语音进行解码,得到实体词类别标签及其余非实体词的字符组成的初步识别文本。Among them, the first-level recognition module can decode the input speech to be recognized to obtain a preliminary recognition text composed of entity word category labels and other non-entity word characters.
具体地,一级识别模块可以以字符为建模单元,基于待识别语音解码得到初步识别文本。除此之外,还可以采用其他建模方式,如以音节或音素为建模单元,基于待识别语音解码得到对应的音节或音素,进而结合预设的发音词典和语言模型,将解码的音节或音素转换为字符,得到初步识别文本。Specifically, the primary recognition module can use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized. In addition, other modeling methods can also be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain preliminary recognition text.
步骤S220、利用所述语音识别模型的二级识别模块,基于所述待识别语音中所述实体词类别标签对应的语音片段和预设的发音词典及语言模型,得到实体词类别标签对应的实体词字符。Step S220, using the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, obtain the entity word character corresponding to the entity word category label.
其中,语音识别模型中的二级识别模块,可以以音节或音素为建模单元,对一级识别模块解码的实体词类别标签对应的语音片段进行二级解码,得到解码后的音节或音素。进一步,结合预设的发音词典及语言模型,将解码后的音节或音素转换为字符形式,得到实体词类别标签对应的实体词字符。Among them, the secondary recognition module in the speech recognition model can use syllables or phonemes as modeling units to perform secondary decoding on the speech segments corresponding to the entity word category labels decoded by the primary recognition module to obtain decoded syllables or phonemes. Further, in combination with a preset pronunciation dictionary and language model, the decoded syllables or phonemes are converted into character form to obtain entity word characters corresponding to the entity word category labels.
步骤S230、由所述实体词字符替换掉所述初步识别文本中对应的实体词类别标签,得到最终的识别文本。Step S230: Replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
本申请实施例提供的语音识别方法,预先配置的语音识别模型包含两级识别模块,一级识别模块,将待识别语音中的实体词识别为对应的类别标签,非实体词直接识别为字符,得到初始识别文本。这样可以大幅降低同类别标签中低频实体词或新出现的实体词被错误识别为非该类别标签下的字符的概率,提升实体词的识别准确率。二级识别模块只需要识别实体词类别标签对应的实体词即可。最终由二级识别模块识别的实体词字符替换掉初步识别文本中对应的实体词类别标签,得到最终的识别文本。The speech recognition method provided in the embodiment of the present application, the pre-configured speech recognition model includes a two-level recognition module. The first-level recognition module recognizes the entity words in the speech to be recognized as corresponding category labels, and directly recognizes non-entity words as characters to obtain an initial recognition text. In this way, the probability of low-frequency entity words or newly appeared entity words in the same category label being mistakenly recognized as characters not under the category label can be greatly reduced, and the recognition accuracy of entity words can be improved. The second-level recognition module only needs to recognize the entity words corresponding to the entity word category label. Finally, the entity word characters recognized by the second-level recognition module replace the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
采用本实施例的语音识别模型进行语音识别时,在面对新出现的实体词时,只需要通过更新发音词典及语言模型即可,也即只需要对解码路径进行扩展即可,无需迭代更新语音识别模型。方案的扩展性更好、学习成本更低,且不会出现由于更新语音识别模型导致的灾难性遗忘问题。When the speech recognition model of this embodiment is used for speech recognition, a newly appearing entity word only requires updating the pronunciation dictionary and the language model, i.e., only extending the decoding paths, without iteratively updating the speech recognition model. The solution is more extensible, has a lower learning cost, and avoids the catastrophic forgetting problem caused by updating the speech recognition model.
进一步地,为了保证语音识别模型对新增的领域实体词的识别准确率,当获取到新增的领域实体词时,可以确定该新增的领域实体词的音节或音素,进而将新增的领域实体词与其音节或音素的对应关系添加到预设的发音词典中,以完成对发音词典的更新。同时,将新增的领域实体词添加到预设的语言模型中,以完成对语言模型的更新。Furthermore, in order to ensure the recognition accuracy of the speech recognition model for the newly added domain entity words, when the newly added domain entity words are obtained, the syllables or phonemes of the newly added domain entity words can be determined, and then the corresponding relationship between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary. At the same time, the newly added domain entity words are added to the preset language model to complete the update of the language model.
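上述词典与语言模型的更新步骤可以示意如下(假设性实现:以字典和集合分别代表发音词典与语言模型词表,仅演示"扩展解码路径、无需重训模型"的流程)。The dictionary and language model update described above can be sketched as follows (a hypothetical implementation: a dict stands in for the pronunciation dictionary and a set for the language model vocabulary, only to illustrate extending the decoding paths without retraining the model):

```python
def add_domain_entity(pron_dict, lm_vocab, word, syllables):
    """Register a newly added domain entity word by extending the
    pronunciation dictionary and the language model vocabulary; the
    acoustic/decoder networks themselves are left untouched."""
    pron_dict[word] = syllables          # word -> syllable sequence
    lm_vocab.add(word)                   # make the word scorable by the LM
    return pron_dict, lm_vocab

pron_dict, lm_vocab = {}, set()
add_domain_entity(pron_dict, lm_vocab, "宇色手机",
                  ["yu3", "se4", "shou3", "ji1"])
print("宇色手机" in lm_vocab)  # True
```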
结合图2示例的语音识别模型,对上述语音识别模型的结构进行介绍。In conjunction with the speech recognition model shown in FIG2 , the structure of the speech recognition model is introduced.
其中,一级识别模块可以包括:编码器和一级解码器。The primary recognition module may include: an encoder and a primary decoder.
编码器,用于对所述待识别语音进行编码,得到声学编码特征。The encoder is used to encode the speech to be recognized to obtain acoustic coding features.
一级解码器,用于以字符为建模单元,基于所述声学编码特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。The first-level decoder is used to use characters as modeling units and decode the initial recognition text composed of entity word category labels and other non-entity word characters based on the acoustic coding features.
其中,一级解码器可以采用带有注意力机制和自回归的网络结构,也即一级解码器解码时,可以以字符为建模单元,基于所述声学编码特征及一级解码器的实时状态特征,解码得到由实体词类别标签及其余非实体词的字符组成的初步识别文本。具体实现过程可以参照前文相关介绍,此处不再赘述。Among them, the first-level decoder can adopt a network structure with an attention mechanism and autoregression, that is, when the first-level decoder decodes, it can use characters as modeling units, and decode to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder. The specific implementation process can refer to the relevant introduction in the previous article, which will not be repeated here.
二级识别模块可以包括:二级解码器,用于以音节或音素为建模单元,基于所述实体词类别标签对应的语音片段的声学编码特征,解码得到实体词类别标签对应的音节或音素,并结合预设的发音词典及语言模型将音节或音素转换为字符,得到实体词类别标签对应的实体词字符。The secondary recognition module may include: a secondary decoder, configured to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features of the speech segments corresponding to those labels, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category labels.
进一步地,二级识别模块还可以包括输出层,用于利用二级识别模块得到的实体词字符替换掉初步识别文本中对应的实体词类别标签,输出最终的识别文本。Furthermore, the secondary recognition module may also include an output layer, which is used to replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters obtained by the secondary recognition module, and output the final recognition text.
对于二级识别模块的处理过程,可以参照前文相关介绍,此处不再赘述。 For the processing process of the secondary recognition module, please refer to the relevant introduction in the previous article, which will not be repeated here.
下面对本申请实施例提供的语音识别装置进行描述,下文描述的语音识别装置与上文描述的第二种语音识别方法可相互对应参照。The speech recognition device provided in an embodiment of the present application is described below. The speech recognition device described below and the second speech recognition method described above can be referenced to each other.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of another speech recognition apparatus disclosed in an embodiment of the present application. As shown in FIG. 7, the apparatus may include:
a to-be-recognized speech acquisition unit 21, configured to acquire speech to be recognized; and
a speech recognition model processing unit 22, configured to: use a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text including entity word category labels and the characters of the remaining non-entity words; use a secondary recognition module of the speech recognition model to obtain, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label; and replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text.
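Taken together, the processing unit's two-stage flow can be sketched as below. The `primary` and `secondary` callables and the label set are stand-ins for the patent's actual modules, assumed only for illustration:

```python
def recognize(speech, primary, secondary, labels):
    """Two-stage recognition sketch: the primary module yields tokens mixing
    plain characters with entity word category labels; the secondary module
    re-recognizes each labelled speech segment into entity word characters.
    """
    preliminary = primary(speech)
    final = [secondary(speech, tok) if tok in labels else tok
             for tok in preliminary]
    return "".join(final)
```

The point of the split is that new domain entity words only require updating the dictionary and language model behind `secondary`, not retraining the primary module.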
For the composition of the speech recognition model, reference may be made to the introduction in the foregoing speech recognition method section, which is not repeated here.
The two different speech recognition apparatuses provided in the embodiments of the present application may be applied to a speech recognition device, such as a terminal: a mobile phone, a computer, etc. Optionally, FIG. 8 shows a block diagram of the hardware structure of the speech recognition device. Referring to FIG. 8, the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4.
In this embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like. The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory.
The memory stores a program, and the processor may call the program stored in the memory, the program being used to implement the steps of the speech recognition method introduced in the foregoing embodiments. Optionally, for the refined and extended functions of the program, reference may be made to the description above.
An embodiment of the present application further provides a storage medium that may store a program suitable for execution by a processor, the program being used to implement the steps of the speech recognition method introduced in the foregoing embodiments. Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments; the embodiments may be combined as needed, and for the same or similar parts, reference may be made to one another.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (17)

  1. A speech recognition method, comprising:
    acquiring speech to be recognized;
    obtaining a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; and
    obtaining, based on a speech segment corresponding to an entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  2. The method according to claim 1, wherein the processes of obtaining the preliminary recognition text based on the speech to be recognized, obtaining the entity word characters corresponding to the entity word category label, and replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final recognition text are implemented by a preconfigured speech recognition model.
  3. The method according to claim 1, further comprising:
    when a newly added domain entity word is acquired, determining the syllable or phoneme corresponding to the domain entity word, adding the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and adding the domain entity word to the language model.
  4. The method according to claim 1, wherein the language model is a language model constructed based on entity words in various domains.
  5. The method according to claim 2, wherein the speech recognition model comprises an encoder, a first-level decoder, a second-level decoder, and an output layer;
    the encoder is configured to encode the input speech to be recognized to obtain acoustic coding features;
    the first-level decoder is configured to use characters as modeling units and decode, based on the acoustic coding features, to obtain a preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words;
    the second-level decoder is configured to use syllables or phonemes as modeling units, decode, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label; and
    the output layer is configured to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the finally output recognition text.
  6. The method according to claim 5, wherein the process of the first-level decoder using characters as modeling units and decoding, based on the acoustic coding features, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words comprises:
    decoding, by the first-level decoder using characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  7. The method according to claim 6, wherein the decoding, by the first-level decoder using characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words comprises:
    performing, by the first-level decoder using characters as modeling units, a weighted summation of the frames of acoustic coding features, with the attention paid to each frame of acoustic coding features when decoding the t-th character as the weights, to obtain an acoustic coding feature c_t for decoding the t-th character, and decoding the t-th character based on the acoustic coding feature c_t and a state feature d_t of the first-level decoder when decoding the t-th character, until all characters are decoded to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  8. The method according to claim 5, wherein the process of the second-level decoder using syllables or phonemes as modeling units and decoding, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label comprises:
    decoding, by the second-level decoder using syllables or phonemes as modeling units, based on the acoustic coding features used when the first-level decoder decodes the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label.
  9. The method according to claim 5, wherein the training process of the speech recognition model comprises:
    acquiring training speech and a corresponding recognition text, the recognition text being annotated with category labels of entity words;
    replacing the corresponding entity words in the recognition text with the category labels of the entity words to obtain an edited recognition text;
    inputting the training speech into the speech recognition model to obtain a preliminary recognition text output by the first-level decoder and entity word characters, corresponding to the entity word category labels, output by the second-level decoder;
    determining a first loss function based on the preliminary recognition text output by the first-level decoder and the edited recognition text, and determining a second loss function based on the entity word characters corresponding to the entity word category labels output by the second-level decoder and the original entity words corresponding to the entity word category labels; and
    training the network parameters of the speech recognition model by combining the first loss function and the second loss function until a training end condition is satisfied.
  10. The method according to claim 9, wherein the replacing the corresponding entity words in the recognition text with the category labels of the entity words to obtain the edited recognition text comprises:
    determining the number of characters contained in an entity word, and replacing the corresponding entity word in the recognition text with the same number of entity word category labels to obtain the edited recognition text.
  11. A speech recognition method, comprising:
    acquiring speech to be recognized;
    using a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words;
    using a secondary recognition module of the speech recognition model to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label; and
    replacing the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  12. The method according to claim 11, wherein the primary recognition module comprises:
    an encoder and a first-level decoder;
    the encoder is configured to encode the speech to be recognized to obtain acoustic coding features; and
    the first-level decoder is configured to use characters as modeling units and decode, based on the acoustic coding features, to obtain the preliminary recognition text composed of entity word category labels and the characters of the remaining non-entity words.
  13. The method according to claim 12, wherein the secondary recognition module comprises:
    a secondary encoder configured to use syllables or phonemes as modeling units, decode, based on the acoustic coding features of the speech segment corresponding to the entity word category label, to obtain the syllables or phonemes corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with the preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
  14. A speech recognition apparatus, comprising:
    a to-be-recognized speech acquisition unit, configured to acquire speech to be recognized;
    a preliminary recognition text determination unit, configured to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; and
    a final recognition text determination unit, configured to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label, and to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  15. A speech recognition apparatus, comprising:
    a to-be-recognized speech acquisition unit, configured to acquire speech to be recognized; and
    a speech recognition model processing unit, configured to: use a primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, the preliminary recognition text comprising entity word category labels and the characters of the remaining non-entity words; use a secondary recognition module of the speech recognition model to obtain, based on a speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, entity word characters corresponding to the entity word category label; and replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain a final recognition text.
  16. A speech recognition device, comprising: a memory and a processor;
    the memory is configured to store a program; and
    the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1 to 11.
  17. A storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps of the speech recognition method according to any one of claims 1 to 11 are implemented.
PCT/CN2023/078636 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium WO2024124697A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211589720.7 2022-12-12
CN202211589720.7A CN115910070A (en) 2022-12-12 2022-12-12 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2024124697A1

Family

ID=86476490

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078636 WO2024124697A1 (en) 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115910070A (en)
WO (1) WO2024124697A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150066361A (en) * 2013-12-06 2015-06-16 주식회사 케이티 Method and system for automatic word spacing of voice recognition using named entity recognition
CN112257449A (en) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN112347768A (en) * 2020-10-12 2021-02-09 出门问问(苏州)信息科技有限公司 Entity identification method and device
US10997223B1 (en) * 2017-06-28 2021-05-04 Amazon Technologies, Inc. Subject-specific data set for named entity resolution
CN113656561A (en) * 2021-10-20 2021-11-16 腾讯科技(深圳)有限公司 Entity word recognition method, apparatus, device, storage medium and program product
CN113821592A (en) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN113990293A (en) * 2021-10-19 2022-01-28 京东科技信息技术有限公司 Voice recognition method and device, storage medium and electronic equipment
CN115048940A (en) * 2022-06-23 2022-09-13 之江实验室 Chinese financial text data enhancement method based on entity word attribute characteristics and translation


Also Published As

Publication number Publication date
CN115910070A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN112712804B (en) Speech recognition method, system, medium, computer device, terminal and application
WO2021232725A1 (en) Voice interaction-based information verification method and apparatus, and device and computer storage medium
US10176804B2 (en) Analyzing textual data
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN107195296B (en) Voice recognition method, device, terminal and system
WO2021139108A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
Watts Unsupervised learning for text-to-speech synthesis
CN114580382A (en) Text error correction method and device
KR102041621B1 (en) System for providing artificial intelligence based dialogue type corpus analyze service, and building method therefor
WO2004034378A1 (en) Language model creation/accumulation device, speech recognition device, language model creation method, and speech recognition method
Kadyan et al. Refinement of HMM model parameters for Punjabi automatic speech recognition (PASR) system
CN111462748B (en) Speech recognition processing method and device, electronic equipment and storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN112686041A (en) Pinyin marking method and device
CN116775873A (en) Multi-mode dialogue emotion recognition method
WO2024124697A1 (en) Speech recognition method, apparatus and device, and storage medium
WO2023123892A1 (en) Construction method for information prediction module, information prediction method, and related device
Kurian et al. Connected digit speech recognition system for Malayalam language
CN115132170A (en) Language classification method and device and computer readable storage medium
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
Kolehmainen et al. Personalization for bert-based discriminative speech recognition rescoring