WO2024124697A1 - Speech recognition method, apparatus and device, and storage medium

Speech recognition method, apparatus and device, and storage medium

Info

Publication number
WO2024124697A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity word
speech
characters
recognized
entity
Prior art date
Application number
PCT/CN2023/078636
Other languages
English (en)
Chinese (zh)
Inventor
潘嘉
王孟之
万根顺
刘聪
刘庆峰
Original Assignee
科大讯飞股份有限公司
科大讯飞(苏州)科技有限公司
Priority date
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 and 科大讯飞(苏州)科技有限公司
Publication of WO2024124697A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture

Definitions

  • domain speech recognition technology has been widely used, covering all areas of human-computer interaction.
  • the core difficulty of domain speech recognition lies in the existence of a large number of domain-specific entity words.
  • Domain-specific entity words, especially low-frequency ones, usually appear only rarely in the training data of speech recognition models, and the domain-specific entity vocabulary is constantly updated. For example, in voice navigation applications, new company names and place names continue to appear.
  • the above characteristics of domain-specific entity words determine that in practical applications, the speech recognition system needs to be continuously updated to achieve a high accuracy rate in domain speech recognition.
  • the existing technology is not stable in improving the recognition accuracy of newly added domain entity words.
  • the improvement in recognition accuracy is highly dependent on the constructed training corpus.
  • the recognition accuracy is usually improved very little when the context is changed.
  • this application is proposed to provide a speech recognition method, device, equipment and storage medium to ensure the recognition accuracy of newly appeared domain entity words without updating the speech recognition model.
  • the specific solution is as follows:
  • a speech recognition method comprising:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained based on the speech to be recognized; then, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final recognition text.
  • the above process of obtaining a preliminary recognized text based on the speech to be recognized, obtaining entity word characters corresponding to the entity word category label, replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters, and obtaining the final recognized text is achieved through a preconfigured speech recognition model.
  • optionally, when a newly added domain entity word is obtained, the method also includes:
  • the syllable or phoneme corresponding to the domain entity word is determined, and the correspondence between the domain entity word and the syllable or phoneme is added to the preset pronunciation dictionary, and the domain entity word is added to the language model.
  • the language model is a language model constructed based on entity words in various fields.
  • the speech recognition model includes an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used, with characters as modeling units, to decode the acoustic coding features into a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder uses characters as modeling units and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features, including:
  • the first-level decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the first-level decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the primary decoder uses characters as modeling units, and based on the acoustic coding features and the real-time state features of the primary decoder, decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, including:
  • the first-level decoder uses characters as modeling units, takes the attention paid to each frame of acoustic coding features when decoding the t-th character as the weight, and performs a weighted summation over the acoustic coding features of all frames to obtain the acoustic coding feature c_t for decoding the t-th character. Based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded; this is repeated until all characters are decoded, yielding a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, including:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the training process of the speech recognition model includes:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
  • the step of replacing the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text includes: replacing each entity word with the same number of identical entity word category labels as the number of characters contained in that entity word.
  • a speech recognition method comprising:
  • obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • utilizing the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, the entity word character corresponding to the entity word category label is obtained;
  • the entity word character replaces the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
  • the primary recognition module comprises:
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used to use characters as modeling units and decode based on the acoustic coding features to obtain preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary recognition module includes: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain entity word characters corresponding to the entity word category label.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a preliminary recognition text determination unit used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the final recognition text determination unit is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • a speech recognition device comprising:
  • a to-be-recognized speech acquisition unit, used for acquiring the speech to be recognized;
  • a speech recognition model processing unit is used to use the primary recognition module of a preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech fragment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • a speech recognition device comprising: a memory and a processor
  • the memory is used to store programs
  • the processor is used to execute the program to implement the various steps of the speech recognition method described above.
  • a storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the various steps of the speech recognition method described above are implemented.
  • the present application divides the speech recognition process into two stages.
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters can be obtained based on the speech to be recognized.
  • in the second recognition stage, based on the speech segment corresponding to the entity word category label in the speech to be recognized and the preset pronunciation dictionary and language model, the entity word characters corresponding to the entity word category label are obtained, and the corresponding entity word category label in the preliminary recognition text is replaced by the entity word characters to obtain the final output recognition text.
  • in this way, the entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words can be directly recognized as characters, which can greatly reduce the probability that low-frequency or newly appeared entity words under the same category label are mistakenly recognized as other, unrelated characters, and improve the recognition accuracy of entity words.
  • the entity words corresponding to the entity word category label are predicted in combination with the pronunciation dictionary and the language model. When a new domain entity word appears, the newly appeared domain entity word can be added to the preset pronunciation dictionary and language model, so that the recognition accuracy of the newly appeared domain entity word can be guaranteed.
  • FIG1 is a flow chart of a speech recognition method provided by an embodiment of the present application.
  • FIG2 illustrates a schematic diagram of the structure of a speech recognition model
  • FIG3 illustrates a schematic diagram of a two-stage decoding process of a speech recognition model
  • FIG4 illustrates a schematic diagram of a process for determining a decoded character by combining a pronunciation dictionary and a language model
  • FIG5 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • FIG6 is a flow chart of another speech recognition method provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of another speech recognition device provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a speech recognition device provided in an embodiment of the present application.
  • the present application provides a speech recognition solution that can be applied to various scenarios for speech recognition, especially for domain entity word speech recognition scenarios, and can ensure a high recognition accuracy rate for newly emerging domain entity words.
  • the present application solution can be implemented based on a terminal with data processing capabilities, which can be a mobile phone, computer, server, cloud, etc.
  • the speech recognition method of the present application may include the following steps:
  • Step S100 Acquire speech to be recognized.
  • Step S110 obtaining a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the speech to be recognized.
  • the entity word category label is the pre-set field category to which the entity word belongs, such as labels such as person name, place name, organization name, singer, song, drug name, film and television drama name, etc.
  • the speech recognition method provided in this embodiment divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words under the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the process of identifying and obtaining the preliminary recognized text in this step can adopt an end-to-end modeling method, that is, taking characters as modeling units, and obtaining the preliminary recognized text based on the decoding of the speech to be recognized.
  • other modeling methods can also be adopted, such as taking syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain the preliminary recognized text.
  • Step S120 combining a preset pronunciation dictionary and a language model, obtaining the entity word characters corresponding to the entity word category labels, replacing the corresponding entity word category labels in the preliminary recognition text with the entity word characters, and obtaining a final recognition text.
  • the entity word characters corresponding to the entity word category labels can be obtained based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types.
  • syllables or phonemes can be selected as modeling units, and the entity word characters corresponding to the entity word category labels can be decoded based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model.
  • the corresponding pronunciation dictionary may be a syllable pronunciation dictionary, which includes the correspondence between syllables and characters.
  • the corresponding pronunciation dictionary may be a phoneme pronunciation dictionary, which includes the correspondence between phonemes and characters.
  • the corresponding phonemes or syllables can be decoded based on the speech segments corresponding to the entity word category labels in the speech to be recognized. Further, the candidate characters corresponding to the phonemes or syllables are determined in combination with the pronunciation dictionary, the probability score of each candidate character is determined in combination with the language model, and the candidate character with the highest probability score is selected as the entity word character corresponding to the entity word category label.
  • the speech recognition method introduced in this embodiment when facing a newly appearing entity word, it is only necessary to update the pronunciation dictionary and the language model, that is, it is only necessary to expand the decoding path, and there is no need to iteratively update the speech recognition model.
  • the solution has better scalability, lower learning cost, and will not cause catastrophic forgetting problems caused by updating the speech recognition model.
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the correspondence between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
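  • a minimal sketch of this update step is shown below; the data structures (a word-to-syllables dictionary and a unigram count table standing in for the language model) and the example entry are illustrative assumptions, not the patent's actual implementation.

```python
# preset pronunciation dictionary: entity word -> syllable sequence
pronunciation_dict = {"手机": ["shou3", "ji1"]}
# toy domain-entity language model: unigram counts over entity words
entity_lm_counts = {"手机": 1}

def add_domain_entity_word(word: str, syllables: list[str]) -> None:
    """Register a newly appeared domain entity word without retraining the model."""
    # 1) add the word-to-syllable correspondence to the pronunciation dictionary
    pronunciation_dict[word] = syllables
    # 2) add the word to the domain-entity language model
    entity_lm_counts[word] = entity_lm_counts.get(word, 0) + 1

# usage: only the decoding resources are extended; the speech recognition
# model itself is not updated (hypothetical new place name)
add_domain_entity_word("安和桥", ["an1", "he2", "qiao2"])
```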
  • the second recognition stage of the speech recognition method of the present application in combination with the pronunciation dictionary and the language model, obtains the entity word characters corresponding to the entity word category label based on the modeling of the speech fragments corresponding to the entity word category label, that is, the second recognition stage only needs to perform entity word recognition, and does not need to perform non-entity word recognition.
  • the language model can be configured as a domain entity language model.
  • the domain entity language model can be constructed based on various domain entity words, that is, the domain entity language model only contains entity words.
  • the above embodiment introduces a method for performing speech recognition in a two-stage recognition manner, wherein the above two speech recognition stages can be implemented by a variety of different means.
  • the first stage speech recognition can be performed through a pre-trained speech recognition system, that is, the speech to be recognized is input into the pre-trained speech recognition system, and the decoding output is a preliminary recognition text consisting of entity word category labels and remaining non-entity word characters.
  • the second-stage speech recognition can be performed through a pre-trained speech recognition model. That is, the speech recognition model can use syllables or phonemes as modeling units and, in combination with a preset pronunciation dictionary and language model, decode the speech fragments corresponding to the entity word category labels in the speech to be recognized to obtain the entity word characters corresponding to the entity word category labels.
  • the entity word characters corresponding to the entity word category labels obtained in the second recognition stage replace the corresponding entity word category labels in the preliminary recognition text obtained in the first recognition stage to obtain the final recognition text.
  • the embodiment of the present application also provides another optional implementation method of the above two-stage speech recognition.
  • the present application can pre-train a speech recognition model.
  • the recognition model implements the above two-stage speech recognition process.
  • a speech recognition model may be pre-trained, and the speech recognition model may be configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • entity word characters corresponding to the entity word category labels are obtained based on the speech fragments corresponding to the entity word category labels.
  • the corresponding entity word category labels in the preliminary recognition text are replaced by the entity word characters to obtain the recognition text that is finally output.
  • the speech to be recognized is input into the speech recognition model configured as above to obtain the final recognition text output by the speech recognition model.
  • the speech recognition model divides the speech recognition process into two stages.
  • entity words in the speech to be recognized are recognized as corresponding category labels, and non-entity words are directly recognized as characters to obtain the initial recognition text. This can greatly reduce the probability that low-frequency entity words or newly appeared entity words in the same category label are mistakenly recognized as characters not under the category label, and improve the recognition accuracy of entity words.
  • the pronunciation dictionary and language model can include existing and newly emerging entity words of various types. Specifically, when a new domain entity word appears, the newly emerging domain entity word can be added to the preset pronunciation dictionary and language model, so as to ensure the recognition accuracy of the newly emerging domain entity word.
  • the speech recognition model introduced in this embodiment may include an encoder, a first-level decoder Decoder1, a second-level decoder Decoder2 and an output layer (not shown in FIG. 2).
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features.
  • the input of the encoder can be the speech features of the speech to be recognized, such as the amplitude spectrum features, etc. Taking the amplitude spectrum features as an example, it can be the log filter bank energy (LFBE).
  • the encoder is used to extract the representation of the speech features of the speech to be recognized.
  • the encoder encodes the speech features of the speech to be recognized to obtain acoustic coding features.
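  • as an illustration of this feature extraction step, the following sketch computes log filter bank energies with torchaudio's Kaldi-compatible fbank; the toolkit, the file name and the 40-dimensional configuration are assumptions, since the patent does not prescribe a specific implementation.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load a waveform and compute log filter bank energy (LFBE) features,
# which serve as the encoder's input speech features
waveform, sample_rate = torchaudio.load("utterance.wav")   # hypothetical file
lfbe = kaldi.fbank(
    waveform,
    num_mel_bins=40,                 # 40-dimensional filter bank, a common choice
    sample_frequency=sample_rate,
)
print(lfbe.shape)                    # (num_frames, 40)
```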
  • the encoder may adopt a convolutional neural network, LSTM or Transformer structure.
  • the first-level decoder is used, with characters as modeling units, to decode the acoustic coding features into a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the first-level decoder can use characters as modeling units for end-to-end modeling.
  • syllables or phonemes can be used as modeling units, and then the pronunciation dictionary and language model are combined to obtain preliminary recognition text.
  • in the example speech recognition model of FIG. 2, only character modeling units are used, as an example for explanation.
  • the first-level decoder uses characters as modeling units for decoding, and can directly decode to obtain preliminary recognition text. For entity words contained in the speech to be recognized, the first-level decoder decodes them into entity word category labels. For non-entity words in the speech to be recognized, the first-level decoder decodes them normally into corresponding characters, and finally obtains the preliminary recognition text output by the first-level decoder.
  • one entity word can correspond to one entity word category label.
  • the same number of entity word category labels can be obtained according to the number of characters contained in the entity word, that is, one entity word can correspond to multiple identical entity word category labels.
  • when one entity word corresponds to one entity word category label, the preliminary recognition text output by the first-level decoder can be: listen to <singer>'s song.
  • when one entity word corresponds to multiple identical entity word category labels (one per character), the preliminary recognition text output by the first-level decoder can be: listen to <singer><singer>'s song.
  • the first-level decoder in this embodiment can adopt a network structure with attention mechanism and autoregression, such as transformer, LSTM and other network structures.
  • the first-level decoder uses characters as modeling units, and decodes to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • when decoding, the first-level decoder can refer to the acoustic coding feature c_t and the real-time state feature of the first-level decoder at the same time. The real-time state feature of the first-level decoder can be understood as the contextual language feature: when the first-level decoder decodes character by character, it can refer to the characters decoded at previous moments to guide the decoding at the current moment, taking the contextual association of the text into account, so that the decoding result is more accurate.
  • a secondary decoder is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features of the speech fragments corresponding to the entity word category labels, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category labels.
  • the second-level decoder decodes the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label. It can be understood that if the second-level decoder uses syllables as modeling units, the syllable corresponding to the entity word category label is decoded here; if the second-level decoder uses phonemes as modeling units, the phoneme corresponding to the entity word category label is decoded here.
  • the secondary decoder can weaken the modeling of context relevance, that is, different from the way in which the primary decoder decodes by referring to both the acoustic coding features and the real-time state features of the primary decoder, the secondary decoder can decode based only on the acoustic coding features.
  • after the secondary decoder decodes and obtains the syllable or phoneme corresponding to the entity word category label, it can convert the syllable or phoneme into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
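  • for concreteness, the following is a structural sketch (in PyTorch) of the encoder / two-decoder arrangement of FIG. 2; the layer types, sizes and vocabulary counts are illustrative assumptions rather than the patent's prescribed configuration.

```python
import torch
import torch.nn as nn

class TwoStageASRModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_chars=6000, num_units=1400):
        super().__init__()
        # encoder: speech features -> frame-wise acoustic coding features h_j
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        # first-level decoder: autoregressive over characters, including the
        # entity word category labels (LSTM cell, as in the patent's example)
        self.char_embed = nn.Embedding(num_chars, hidden)
        self.decoder1_cell = nn.LSTMCell(2 * hidden, hidden)  # input [emb(y_{t-1}); c_{t-1}]
        self.char_output = nn.Linear(2 * hidden, num_chars)   # W_1 applied to [d_t; c_t]
        # second-level decoder: syllable/phoneme units, acoustic features only
        self.unit_output = nn.Linear(hidden, num_units)       # W_2 applied to c_t

    def encode(self, feats):
        # feats: (batch, frames, feat_dim) -> (batch, frames, hidden)
        h, _ = self.encoder(feats)
        return h

model = TwoStageASRModel()
acoustic = model.encode(torch.randn(1, 120, 40))  # 120 frames of 40-dim features
print(acoustic.shape)                             # torch.Size([1, 120, 256])
```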
  • here, <device name> refers to the category label of the entity word.
  • the second-level decoder dec2 further decodes each entity word category label output by the first-level decoder to obtain the corresponding syllable: yu3se4shou3ji1.
  • each syllable is then converted into a corresponding character through the pronunciation dictionary and language model.
  • for example, the syllable "yu3" corresponds to three candidate characters in the syllable pronunciation dictionary.
  • the language model scores of these three candidate characters are 2, 3 and 1, respectively; therefore, the candidate with the highest language model score (3) is selected as the final decoded character corresponding to "yu3".
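  • the conversion just described can be sketched as below; the candidate characters are placeholders, the scores mirror the 2/3/1 example above, and a real system would score whole candidate sequences with the language model rather than choosing greedily per syllable.

```python
# syllable pronunciation dictionary: syllable -> candidate characters (placeholders)
syllable_dict = {"yu3": ["candidate_A", "candidate_B", "candidate_C"]}
# language model scores for the three candidates, as in the example above
lm_score = {"candidate_A": 2, "candidate_B": 3, "candidate_C": 1}

def syllable_to_char(syllable: str) -> str:
    candidates = syllable_dict[syllable]                         # dictionary lookup
    return max(candidates, key=lambda ch: lm_score.get(ch, 0))   # language model scoring

print(syllable_to_char("yu3"))   # -> "candidate_B", the highest-scoring candidate (3)
```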
  • an output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word characters to obtain the final output recognition text.
  • the initial recognition text is: I bought a <device name><device name><device name><device name>.
  • the entity word characters obtained by the secondary decoder are the four characters decoded from the syllables yu3se4shou3ji1.
  • the entity word category labels in the initial recognition text are replaced with these entity word characters to obtain the final output recognition text: "I bought a" followed by the decoded device name.
  • the process by which the above-mentioned first-level decoder decodes to obtain the preliminary recognition text is introduced below.
  • the first-level decoder can use characters as modeling units, and can directly decode characters during decoding.
  • the first-level decoder can use a network structure with an attention mechanism, and can refer to acoustic information and language information at the same time during decoding to improve decoding accuracy.
  • the first-level decoder takes the attention paid to each frame of acoustic coding features when decoding the t-th character as the weight, and performs a weighted summation over the acoustic coding features of all frames to obtain the acoustic coding feature c_t for decoding the t-th character. Based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded; this is repeated until all characters are decoded, yielding the preliminary recognition text consisting of the entity word category labels and the remaining non-entity word characters.
  • the output of the first-level decoder can be calculated by referring to the following formulas (the attention is written here in a query-key-value form, matching the parameters W_q, W_k and V defined below):

    d_t = LSTM(d_{t-1}, [y_{t-1}; c_{t-1}])
    e_{tj} = (W_q d_t)^T (W_k h_j)
    a_{tj} = exp(e_{tj}) / Σ_{j'} exp(e_{tj'})
    c_t = Σ_j a_{tj} (V h_j)
    dec_1 = argmax(W_1 [d_t; c_t])

  • y_{t-1} is the last decoded character
  • c_{t-1} is the acoustic coding feature when the first-level decoder decoded the previous character
  • d_t is the state feature of the first-level decoder when decoding the t-th character (here the first-level decoder adopts the LSTM structure as an example), which can also be understood as the language feature when decoding the t-th character
  • a_{tj} is the normalized attention paid to the acoustic coding feature of the j-th frame when decoding the t-th character
  • W_q, W_k, V and W_1 are network parameters
  • h_j is the acoustic coding feature of the j-th frame
  • dec_1 is the output of the first-level decoder.
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the second-level decoder needs to weaken the correlation between previous and next characters, and therefore can use only acoustic coding features.
  • the secondary decoder can only use the acoustic coding features when decoding the entity word category label to decode the syllable or phoneme corresponding to the entity word category label.
  • the output of the secondary decoder is expressed as:

    dec_2 = argmax(W_2 c_t)

  • where dec_2 represents the output of the secondary decoder and W_2 is a network parameter.
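  • as a minimal sketch of the two decoding outputs above, the following function computes c_t by attention over the frame-wise acoustic coding features and then the dec_1 and dec_2 argmax outputs; the tensor shapes and the query-key-value attention form are assumptions for illustration, not the patent's exact implementation.

```python
import torch

def decode_step(enc, d_t, W_q, W_k, V, W1, W2):
    # enc: (T, hidden) frame-wise acoustic coding features h_j
    # d_t: (hidden,) state feature of the first-level decoder at step t
    q = W_q @ d_t                       # projected query
    e = (enc @ W_k.T) @ q               # e_tj: attention energy per frame
    a = torch.softmax(e, dim=0)         # a_tj: normalized attention weights
    c_t = a @ (enc @ V.T)               # weighted sum -> acoustic feature c_t
    dec1 = torch.argmax(W1 @ torch.cat([d_t, c_t]))  # dec_1 = argmax(W_1 [d_t; c_t])
    dec2 = torch.argmax(W2 @ c_t)                    # dec_2 = argmax(W_2 c_t)
    return dec1, dec2

# toy shapes: 5 frames, hidden size 8, 10 characters, 7 syllables
enc, d_t = torch.randn(5, 8), torch.randn(8)
W_q, W_k, V = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
W1, W2 = torch.randn(10, 16), torch.randn(7, 8)
print(decode_step(enc, d_t, W_q, W_k, V, W1, W2))
```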
  • the training process of the above-mentioned speech recognition model is further introduced.
  • This embodiment illustrates an optional training process of a speech recognition model, which may include the following steps:
  • domain entity words, such as names of people, places, institutions, singers and songs, can be collected together with their category labels, for example by means of named entity recognition (NER).
  • the entity words obtained above are limited and it is difficult to cover all categories of entity words. Therefore, the entity words and their category labels obtained above can be used to train a classification neural network model to determine the category of entity words contained in the input text.
  • the classification neural network can use a model with context modeling capabilities, such as Transformer, LSTM, etc.
  • the category labels of entity words in the training corpus text can be determined.
  • the training corpus text may be the recognition text corresponding to the collected training speech.
  • the training sample labels of the first-level decoder can then be constructed, that is, the entity words in the recognition text are replaced with the corresponding entity word category labels to obtain the edited recognition text.
  • the recognized text is "I bought an Apple phone", which contains the entity word "Apple The corresponding category label is ⁇ Organization Name>, then the edited recognition text can be "I bought a ⁇ Organization Name> mobile phone”.
  • the secondary decoder can perform secondary decoding on each entity word category label to obtain the corresponding character, and these characters together constitute a complete entity word.
  • the corresponding entity words in the recognized text are replaced with the category labels of the entity words to obtain the edited recognized text, which may specifically include:
  • in this case, the recognized text after editing may be "I bought an <organization name><organization name> mobile phone", with one label for each character of the entity word.
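  • a sketch of this label-editing step follows; the helper name and the per-character option are written to match the two labeling schemes described above, and the example sentence stands in for the "I bought an Apple mobile phone" example.

```python
def edit_recognition_text(text: str, entities: dict[str, str], per_char: bool = True) -> str:
    """Replace each entity word with its category label, optionally repeated
    once per character of the entity word (the second labeling scheme above)."""
    for word, label in entities.items():
        replacement = label * len(word) if per_char else label
        text = text.replace(word, replacement)
    return text

# the 2-character entity word is replaced by two identical category labels
print(edit_recognition_text("我买了一个苹果手机", {"苹果": "<organization name>"}))
# -> 我买了一个<organization name><organization name>手机
```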
  • a first loss function can be calculated through a cross-entropy loss function based on the preliminary recognition text output by the primary decoder and the edited recognition text serving as labels; the first loss function represents the decoding loss of the primary decoder.
  • a second loss function can be calculated through a cross-entropy loss function based on the entity word characters corresponding to the entity word category labels output by the secondary decoder and the original entity words corresponding to the entity word category labels in the recognized text.
  • the second loss function represents the decoding loss of the secondary decoder.
  • the total loss function is calculated by combining the first loss function and the second loss function, and the network parameters of the speech recognition model are trained based on the total loss function until the training end condition is met, thereby obtaining the final trained speech recognition model.
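  • the combined objective can be sketched as follows; the equal weighting between the two cross-entropy terms is an assumption, since the patent only states that the two losses are combined.

```python
import torch.nn.functional as F

def total_loss(dec1_logits, dec1_targets, dec2_logits, dec2_targets, alpha=0.5):
    # first loss: decoding loss of the primary decoder
    # (characters plus entity word category labels vs. the edited recognition text)
    loss1 = F.cross_entropy(dec1_logits, dec1_targets)
    # second loss: decoding loss of the secondary decoder
    # (entity word syllables/phonemes vs. the original entity words)
    loss2 = F.cross_entropy(dec2_logits, dec2_targets)
    return alpha * loss1 + (1 - alpha) * loss2
```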
  • FIG. 5 is a schematic diagram of the structure of a speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • a to-be-recognized speech acquisition unit 11, used for acquiring the speech to be recognized;
  • a preliminary recognition text determination unit 12, used to obtain preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and other non-entity word characters;
  • the preliminary recognition text determination unit may use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods may be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then converting the decoded syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain preliminary recognition text.
  • the final recognition text determination unit 13 is used to obtain the entity word characters corresponding to the entity word category labels in the speech to be recognized based on the speech segments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, and replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters to obtain the final recognition text.
  • the above-mentioned final recognition text determination unit can use syllables or phonemes as modeling units, and based on the speech fragments corresponding to the entity word category labels in the speech to be recognized and the preset pronunciation dictionary and language model, model the entity word characters corresponding to the entity word category labels.
  • the processing of the preliminary recognition text determination unit and the final recognition text determination unit can be implemented by a model processing unit, which is used to process the speech data to be recognized using a preconfigured speech recognition model to obtain the recognition text output by the model, wherein the speech recognition model is configured as follows:
  • a preliminary recognition text consisting of entity word category labels and other non-entity word characters is obtained.
  • Syllables or phonemes are used as modeling units.
  • the entity word characters corresponding to the entity word category labels are obtained, and the entity word characters are used to replace the corresponding entity word category labels in the preliminary recognition text to obtain the final output recognition text.
  • the device of the present application may further include:
  • the pronunciation dictionary and language model updating unit is used to determine the syllable or phoneme corresponding to the domain entity word when a newly added domain entity word is obtained, and add the correspondence between the domain entity word and the syllable or phoneme to the preset pronunciation dictionary, and add the domain entity word to the language model.
  • the above-mentioned language model may be a language model constructed based on entity words in various fields.
  • the speech recognition model may include an encoder, a primary decoder, a secondary decoder and an output layer;
  • the encoder is used to encode the input speech to be recognized to obtain acoustic coding features
  • the first-level decoder is used, with characters as modeling units, to decode the acoustic coding features into a preliminary recognition text consisting of entity word category labels and other non-entity word characters;
  • the secondary decoder is used to decode the syllable or phoneme corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label using the syllable or phoneme as the modeling unit, and convert the syllable or phoneme into a character in combination with a preset pronunciation dictionary and language model to obtain the entity word character corresponding to the entity word category label;
  • the output layer is used to replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final output recognition text.
  • the first-level decoder may adopt a network structure with an attention mechanism and autoregression. Specifically, taking characters as modeling units, based on the acoustic coding features and the real-time state features of the first-level decoder, decoding to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters, the process may include:
  • the first-level decoder uses characters as modeling units, takes the attention paid to each frame of acoustic coding features when decoding the t-th character as the weight, and performs a weighted summation over the acoustic coding features of all frames to obtain the acoustic coding feature c_t for decoding the t-th character. Based on the acoustic coding feature c_t and the state feature d_t of the first-level decoder when decoding the t-th character, the t-th character is decoded; this is repeated until all characters are decoded, yielding a preliminary recognition text consisting of entity word category labels and other non-entity word characters.
  • the secondary decoder uses syllables or phonemes as modeling units, and based on the acoustic coding features of the speech segment corresponding to the entity word category label, decodes the syllables or phonemes corresponding to the entity word category label, which may include:
  • the secondary decoder uses syllables or phonemes as modeling units, and decodes the syllables or phonemes corresponding to the entity word category labels based on the acoustic coding features when the primary decoder decodes the entity word category labels.
  • the device of the present application may further include:
  • the model training unit is used to train the speech recognition model.
  • the training process may include:
  • the network parameters of the speech recognition model are trained by combining the first loss function and the second loss function until the training end condition is met.
  • the model training unit replaces the corresponding entity words in the recognized text with the category labels of the entity words to obtain the edited recognized text, which may include: replacing each entity word with the same number of identical entity word category labels as the number of characters contained in that entity word.
  • another speech recognition method is further provided. As shown in FIG. 6 , the method may include:
  • Step S200 Acquire speech to be recognized.
  • Step S210 using the primary recognition module of the pre-configured speech recognition model, obtaining a preliminary recognition text based on the speech to be recognized, wherein the preliminary recognition text includes entity word category labels and the remaining non-entity word characters.
  • a speech recognition model is pre-configured.
  • the speech recognition model includes a two-level framework, namely a primary recognition module and a secondary recognition module.
  • the first-level recognition module can decode the input speech to be recognized to obtain a preliminary recognition text composed of entity word category labels and other non-entity word characters.
  • the primary recognition module can use characters as modeling units, and obtain preliminary recognition text based on the decoding of the speech to be recognized.
  • other modeling methods can also be used, such as using syllables or phonemes as modeling units, obtaining corresponding syllables or phonemes based on the decoding of the speech to be recognized, and then combining the preset pronunciation dictionary and language model to convert the decoded syllables or phonemes into characters to obtain preliminary recognition text.
  • Step S220 using the secondary recognition module of the speech recognition model, based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model, obtain the entity word character corresponding to the entity word category label.
  • the secondary recognition module in the speech recognition model can use syllables or phonemes as modeling units to perform secondary decoding on the speech segments corresponding to the entity word category labels decoded by the primary recognition module to obtain decoded syllables or phonemes. Further, in combination with a preset pronunciation dictionary and language model, the decoded syllables or phonemes are converted into character form to obtain entity word characters corresponding to the entity word category labels.
  • Step S230 Replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • the pre-configured speech recognition model includes a two-level recognition module.
  • the first-level recognition module recognizes the entity words in the speech to be recognized as corresponding category labels, and directly recognizes non-entity words as characters to obtain an initial recognition text. In this way, the probability of low-frequency entity words or newly appeared entity words in the same category label being mistakenly recognized as characters not under the category label can be greatly reduced, and the recognition accuracy of entity words can be improved.
  • the second-level recognition module only needs to recognize the entity words corresponding to the entity word category label. Finally, the entity word characters recognized by the second-level recognition module replace the corresponding entity word category label in the preliminary recognition text to obtain the final recognition text.
  • the syllables or phonemes of the newly added domain entity words can be determined, and then the corresponding relationship between the newly added domain entity words and their syllables or phonemes is added to the preset pronunciation dictionary to complete the update of the pronunciation dictionary.
  • the newly added domain entity words are added to the preset language model to complete the update of the language model.
  • the primary recognition module may include: an encoder and a primary decoder.
  • the encoder is used to encode the speech to be recognized to obtain acoustic coding features.
  • the first-level decoder is used, with characters as modeling units, to decode the acoustic coding features into the initial recognition text composed of entity word category labels and other non-entity word characters.
  • the first-level decoder can adopt a network structure with an attention mechanism and autoregression, that is, when the first-level decoder decodes, it can use characters as modeling units, and decode to obtain a preliminary recognition text consisting of entity word category labels and other non-entity word characters based on the acoustic coding features and the real-time state features of the first-level decoder.
  • the specific implementation process can refer to the relevant introduction in the previous article, which will not be repeated here.
  • the secondary recognition module may include: a secondary decoder, which is used to use syllables or phonemes as modeling units, decode the syllables or phonemes corresponding to the entity word category label based on the acoustic coding features of the speech segment corresponding to the entity word category label, and convert the syllables or phonemes into characters in combination with a preset pronunciation dictionary and language model to obtain the entity word characters corresponding to the entity word category label.
  • the secondary recognition module may also include an output layer, which is used to replace the corresponding entity word category labels in the preliminary recognition text with the entity word characters obtained by the secondary recognition module, and output the final recognition text.
  • the speech recognition device provided in an embodiment of the present application is described below.
  • the speech recognition device described below and the second speech recognition method described above can be referenced to each other.
  • FIG. 7 is a schematic diagram of the structure of another speech recognition device disclosed in an embodiment of the present application.
  • the device may include:
  • the to-be-recognized speech acquisition unit 21 is used to acquire the to-be-recognized speech
  • the speech recognition model processing unit 22 is used to use the primary recognition module of the preconfigured speech recognition model to obtain a preliminary recognition text based on the speech to be recognized, and the preliminary recognition text includes an entity word category label and other non-entity word characters; use the secondary recognition module of the speech recognition model to obtain the entity word characters corresponding to the entity word category label in the speech to be recognized based on the speech segment corresponding to the entity word category label in the speech to be recognized and a preset pronunciation dictionary and language model; replace the corresponding entity word category label in the preliminary recognition text with the entity word character to obtain the final recognition text.
  • the composition and structure of the speech recognition model can be found in the introduction in the aforementioned speech recognition method section, and will not be repeated here.
  • FIG8 shows a hardware structure block diagram of a speech recognition device.
  • the hardware structure of the speech recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the number of the processor 1, the communication interface 2, the memory 3, and the communication bus 4 is at least one, and the processor 1, the communication interface 2, and the memory 3 communicate with each other through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory;
  • the memory stores a program
  • the processor can call the program stored in the memory, and the program is used to: implement each step of the speech recognition method introduced in the above embodiments.
  • An embodiment of the present application further provides a storage medium, which can store a program suitable for execution by a processor, wherein the program is used to implement the various steps of the speech recognition method introduced in the aforementioned embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The present application discloses a speech recognition method, apparatus and device, and a storage medium. The method comprises: obtaining, on the basis of speech to be recognized, a preliminary recognized text consisting of an entity word category label and the remaining non-entity word characters; further, on the basis of the speech segment corresponding to the entity word category label together with a preset pronunciation dictionary and language model, obtaining the entity word characters corresponding to the entity word category label; and replacing the corresponding entity word category label in the preliminary recognized text with the entity word characters so as to obtain a final recognized text. Thus, when a new domain entity word appears, only the pronunciation dictionary and the language model need to be updated, and iterative updating of a speech recognition model is not required, which lowers the learning cost, avoids the catastrophic forgetting problem caused by updating speech recognition models, and ensures the recognition accuracy of newly appearing domain entity words.
PCT/CN2023/078636 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium WO2024124697A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211589720.7A CN115910070A (zh) 2022-12-12 2022-12-12 语音识别方法、装置、设备及存储介质
CN202211589720.7 2022-12-12

Publications (1)

Publication Number Publication Date
WO2024124697A1 true WO2024124697A1 (fr) 2024-06-20

Family

ID=86476490

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/078636 WO2024124697A1 (fr) 2022-12-12 2023-02-28 Speech recognition method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN115910070A (fr)
WO (1) WO2024124697A1 (fr)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150066361A (ko) * 2013-12-06 2015-06-16 주식회사 케이티 개체명 인식을 이용한 음성인식 띄어쓰기 보정 방법 및 시스템
CN112257449A (zh) * 2020-11-13 2021-01-22 腾讯科技(深圳)有限公司 命名实体识别方法、装置、计算机设备和存储介质
CN112347768A (zh) * 2020-10-12 2021-02-09 出门问问(苏州)信息科技有限公司 一种实体识别方法及装置
US10997223B1 (en) * 2017-06-28 2021-05-04 Amazon Technologies, Inc. Subject-specific data set for named entity resolution
CN113656561A (zh) * 2021-10-20 2021-11-16 腾讯科技(深圳)有限公司 实体词识别方法、装置、设备、存储介质及程序产品
CN113821592A (zh) * 2021-06-23 2021-12-21 腾讯科技(深圳)有限公司 一种数据处理方法、装置、设备以及存储介质
CN113990293A (zh) * 2021-10-19 2022-01-28 京东科技信息技术有限公司 语音识别方法及装置、存储介质、电子设备
CN115048940A (zh) * 2022-06-23 2022-09-13 之江实验室 基于实体词属性特征和回译的中文金融文本数据增强方法

Also Published As

Publication number Publication date
CN115910070A (zh) 2023-04-04

Similar Documents

Publication Publication Date Title
CN112712804B (zh) 语音识别方法、系统、介质、计算机设备、终端及应用
WO2021232725A1 (fr) Procédé et appareil de vérification d'informations basée sur une interaction vocale et dispositif et support de stockage sur ordinateur
CN111883110B (zh) 语音识别的声学模型训练方法、系统、设备及介质
US10176804B2 (en) Analyzing textual data
CN107195296B (zh) 一种语音识别方法、装置、终端及系统
WO2021139108A1 (fr) Appareil et procédé de reconnaissance intelligente d'émotions, dispositif électronique et support d'enregistrement
CN109887484B (zh) 一种基于对偶学习的语音识别与语音合成方法及装置
KR102041621B1 (ko) 인공지능 음성인식 기반 기계학습의 대규모 말뭉치 구축을 위한 대화형 말뭉치 분석 서비스 제공 시스템 및 구축 방법
Watts Unsupervised learning for text-to-speech synthesis
CN114580382A (zh) 文本纠错方法以及装置
WO2004034378A1 (fr) Dispositif d'accumulation/creation de modele de langage, dispositif de reconnaissance vocale, procede de creation de modele de langage et procede de reconnaissance vocale
CN111462748B (zh) 语音识别处理方法、装置、电子设备及存储介质
Kadyan et al. Refinement of HMM model parameters for Punjabi automatic speech recognition (PASR) system
CN111930914A (zh) 问题生成方法和装置、电子设备以及计算机可读存储介质
WO2023245389A1 (fr) Procédé de gestion de chanson, appareil, dispositif électronique et support de stockage
CN111508466A (zh) 一种文本处理方法、装置、设备及计算机可读存储介质
CN103885924A (zh) 一种领域自适应的公开课字幕自动生成系统及方法
Mei et al. A particular character speech synthesis system based on deep learning
CN112686041A (zh) 一种拼音标注方法及装置
CN116775873A (zh) 一种多模态对话情感识别方法
  • WO2024124697A1 (fr) Speech recognition method, apparatus and device, and storage medium
WO2023123892A1 (fr) Procédé de construction pour module de prédiction d'informations, procédé de prédiction d'informations et dispositif associé
Kurian et al. Connected digit speech recognition system for Malayalam language
CN115132170A (zh) 语种分类方法、装置及计算机可读存储介质
Dua et al. A review on Gujarati language based automatic speech recognition (ASR) systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23901915

Country of ref document: EP

Kind code of ref document: A1