CN114822519A - Chinese speech recognition error correction method and device and electronic equipment - Google Patents


Info

Publication number
CN114822519A
CN114822519A (application CN202110058472.2A)
Authority
CN
China
Prior art keywords
probability
pinyin
chinese character
information
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110058472.2A
Other languages
Chinese (zh)
Inventor
尹旭贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202110058472.2A priority Critical patent/CN114822519A/en
Publication of CN114822519A publication Critical patent/CN114822519A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a Chinese speech recognition error correction method, device, and electronic equipment, applicable to the speech recognition field within artificial intelligence, which can realize Chinese speech recognition and error correction. The method comprises: acquiring voice data and processing it with a Chinese speech recognition error correction model to obtain a corrected Chinese speech recognition result. The Chinese speech recognition error correction model is a neural-network-based model comprising two submodels, an acoustic model and a first text error correction model, and optionally also a language model. In the embodiments of the application, position expansion weighting is adopted to form a first mixed batch input that fuses pinyin information and Chinese character information, so that the Chinese speech recognition error correction model can fully utilize the information in the speech data and achieve a good error correction effect.

Description

Chinese speech recognition error correction method and device and electronic equipment
Technical Field
The application belongs to the technical field of speech recognition, and particularly relates to a Chinese speech recognition error correction method, a Chinese speech recognition error correction device, and electronic equipment.
Background
Human-computer interaction technology based on Automatic Speech Recognition (ASR) is a very important technology in the field of terminal Artificial Intelligence (AI), and is widely applied in various electronic devices (e.g., mobile phones, tablet computers, desktop computers) to improve the efficiency of human-computer interaction between users and these devices. For example, an intelligent voice assistant can recognize a user's speech, understand the user's intention, and automatically execute the corresponding operation; a voice input method can recognize a user's speech and directly transcribe it into text, omitting the laborious process of typing. The efficiency of speech-based human-computer interaction depends to a large extent on the quality of speech recognition.
As a branch of the speech recognition field, Chinese speech recognition generally faces the following technical problems. On one hand, one pronunciation in Chinese can correspond to many Chinese characters, making Chinese speech recognition prone to homophone errors; on the other hand, pronunciation habits caused by dialects make it prone to confusable-sound errors. Therefore, error correction must be performed on Chinese speech recognition results to obtain more accurate recognition.
Disclosure of Invention
The invention provides a Chinese speech recognition error correction method, a device, and electronic equipment, which can realize Chinese speech recognition and error correction.
In a first aspect, the present invention provides a Chinese speech recognition and error correction method, comprising the following steps. First, pinyin information and first Chinese character information of voice data are acquired. The voice data may be user speech received and recorded in real time through a sound-pickup module of the electronic device, or a segment of voice data already stored on the device. The pinyin information and the first Chinese character information are both derived from the voice data: the pinyin information may be obtained from the voice data and the first Chinese character information then obtained from the pinyin information; the first Chinese character information may be obtained from the voice data and the pinyin information then obtained from the Chinese character information; or the pinyin information and the first Chinese character information may each be obtained directly from the voice data. Next, the pinyin information and the first Chinese character information are fused to obtain mixed information. The fused mixed information contains both the information in the pinyin information and the information in the first Chinese character information, and the two may carry the same or different weights. Then, a text error correction model is applied to the mixed information to obtain second Chinese character information. The text error correction model may be a neural-network-based model, trained in advance on speech data prepared for model training, so that the trained model can process the speech-derived input and output an error correction result.
Because pinyin and Chinese character information are fused in the mixed information, the text error correction model can comprehensively use both for error correction, obtaining a better error correction effect. The resulting second Chinese character information contains the corrected speech recognition result. Finally, the corrected speech recognition result is output, completing Chinese speech recognition error correction.
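As a minimal sketch of the flow described above (not the patent's implementation), the fusion and a trivial decode can be illustrated with NumPy; the vocabulary layout, tokens, and probability values below are illustrative assumptions:

```python
import numpy as np

# Toy shared vocabulary: pinyin tokens first, then Chinese characters.
# The layout, tokens, and probability values are illustrative assumptions.
vocab = ["jin1", "tian1", "今", "天", "经"]
CHAR_START = 2  # index where character fields begin in this toy vocabulary

# Hypothetical per-position probability rows: pinyin rows are nonzero only at
# pinyin fields, character rows only at character fields.
pinyin = np.array([[0.9, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.8, 0.0, 0.0, 0.0]])
chars = np.array([[0.0, 0.0, 0.6, 0.0, 0.4],
                  [0.0, 0.0, 0.0, 0.9, 0.0]])

# Equal-weight fusion for illustration; the patent's position expansion
# weighting would instead vary the weights per position.
mixed = 0.5 * pinyin + 0.5 * chars

# A trivial decode: pick the best character field at each position. In the
# real method the mixed tensor is fed to the neural text error correction model.
decoded = [vocab[CHAR_START + int(row[CHAR_START:].argmax())] for row in mixed]
```

Here the mixed matrix carries both field types at once, which is what lets a downstream model consult the pinyin evidence when a character is doubtful.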
In a first implementation manner of the first aspect, the pinyin information includes pinyin probabilities and the first Chinese character information includes Chinese character probabilities. That is, probability values represent the likelihood of the different pinyins in the pinyin information and of the different Chinese characters in the first Chinese character information: a pinyin or Chinese character the voice data is more likely to correspond to receives a higher probability value, and a less likely one receives a lower value, so both kinds of information are represented quantitatively. Each probability value may be a fraction between 0 and 1. The step of fusing the pinyin information and the first Chinese character information then specifically comprises: performing weighted fusion of the pinyin probabilities in the pinyin information and the Chinese character probabilities in the first Chinese character information to obtain mixed information containing a plurality of sub-mixed information. The mixed information thus contains the pinyin information and the first Chinese character information simultaneously, so the text error correction model can use both for error correction and obtain a better error correction effect.
In a second implementation manner of the first aspect, before the weighted fusion of the pinyin probabilities and the Chinese character probabilities, the method further includes: determining the positions in the first Chinese character information where the Chinese character probability is smaller than a threshold, and performing the weighted fusion according to those positions. The threshold may be preset, for example 0.9. If the probability of one or more Chinese characters in the first Chinese character information is smaller than the threshold, the corresponding characters may be erroneous, so different weighting strategies can be applied to them specifically, realizing targeted adjustment of the weighted probabilities and a better error correction effect.
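The threshold check in this implementation can be sketched as follows; the probability values and the 0.9 threshold are illustrative:

```python
import numpy as np

def find_suspect_positions(char_probs, threshold=0.9):
    """Return positions whose best Chinese-character probability is below
    the threshold; the characters there are treated as possibly erroneous."""
    return [i for i, row in enumerate(char_probs) if row.max() < threshold]

# Each row holds the character probabilities for one position (toy values).
probs = np.array([[0.05, 0.95],   # confident: best probability 0.95
                  [0.55, 0.45],   # uncertain: best probability 0.55 -> flagged
                  [0.02, 0.98]])  # confident
```

The flagged positions are exactly where the later implementations introduce more pinyin information.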
In a third implementation manner of the first aspect, the step of performing weighted fusion of the pinyin probabilities and the Chinese character probabilities specifically includes: obtaining a plurality of position expansion areas from the determined positions according to a preset rule, where the preset rule comprises a plurality of left offsets and a plurality of right offsets, and each position expansion area covers the position itself, left-offset positions on its left, and right-offset positions on its right. The position expansion area therefore covers one or more neighboring positions in addition to the position itself. Since the Chinese character probability at the position is below the threshold, the character there may be erroneous, and the characters at the neighboring positions may be erroneous as well. Position expansion yields an area with wider coverage that includes possibly erroneous characters as far as possible; within this area, pinyin probabilities with higher weight are introduced into the character probabilities, so that pinyin information assists in correcting the characters and a better error correction effect is obtained.
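The expansion of flagged positions into areas can be sketched as below; the offset values are illustrative, not those of the patent:

```python
def expand_positions(positions, seq_len, left_offsets=(1,), right_offsets=(1,)):
    """Grow each flagged position into a position expansion area covering the
    position itself plus the neighbours reached by the given left/right offsets,
    clipped to the valid index range [0, seq_len)."""
    area = set()
    for p in positions:
        area.add(p)
        for d in left_offsets:
            if p - d >= 0:
                area.add(p - d)
        for d in right_offsets:
            if p + d < seq_len:
                area.add(p + d)
    return area
```

Offsets at the sequence boundaries are simply clipped, so a flagged first or last character still gets an expansion area on its valid side.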
In a fourth implementation manner of the first aspect, the step of performing weighted fusion of the pinyin probabilities and the Chinese character probabilities further includes: replacing the Chinese character probabilities located inside the position expansion area with pinyin mixed weighted probabilities, and replacing the Chinese character probabilities located outside the position expansion area with Chinese character mixed weighted probabilities, completing the weighted fusion. The two mixed weighted probabilities use different weights, so that pinyin information is introduced heavily, in a targeted manner, for the more likely erroneous characters inside the position expansion area, while only a small amount of pinyin information is introduced for characters outside it, realizing targeted Chinese error correction and a better error correction effect.
In a fifth implementation manner of the first aspect, the method further has the following features: the pinyin mixed weighted probability is the pinyin probability multiplied by a first weight plus the Chinese character probability multiplied by a second weight; the Chinese character mixed weighted probability is the Chinese character probability multiplied by the first weight plus the pinyin probability multiplied by the second weight; and the first weight is greater than the second weight. In this way, a larger proportion of pinyin information is introduced into the probabilities of the more likely erroneous characters inside the position expansion area, and a smaller proportion into the probabilities of characters outside it, so that pinyin information with different weights assists the text error correction model for different characters, improving its error correction effect.
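The two weighted formulas of this implementation can be sketched as follows; the 0.7/0.3 weight values are assumed, since the patent only requires the first weight to exceed the second:

```python
import numpy as np

def weighted_fuse(pinyin_probs, char_probs, expansion_area, w1=0.7, w2=0.3):
    """Inside the position expansion area use the pinyin mixed weighted
    probability w1*pinyin + w2*char; outside it use the character mixed
    weighted probability w1*char + w2*pinyin (w1 > w2)."""
    mixed = np.empty_like(char_probs)
    for i in range(len(char_probs)):
        if i in expansion_area:
            mixed[i] = w1 * pinyin_probs[i] + w2 * char_probs[i]
        else:
            mixed[i] = w1 * char_probs[i] + w2 * pinyin_probs[i]
    return mixed

# Toy probabilities: position 0 is inside the expansion area, position 1 is not.
pinyin = np.array([[0.9, 0.1], [0.2, 0.8]])
chars = np.array([[0.4, 0.6], [0.5, 0.5]])
mixed = weighted_fuse(pinyin, chars, expansion_area={0})
```

Position 0 ends up pinyin-dominated and position 1 character-dominated, which is the targeted behavior the implementation describes.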
In a sixth implementation manner of the first aspect, the method further has the following features: the pinyin information is a pinyin probability tensor comprising pinyin probability matrices formed from pinyin probability vectors; the first Chinese character information is a Chinese character probability tensor comprising Chinese character probability matrices formed from Chinese character probability vectors; and the mixed information is a mixed tensor comprising pinyin-and-character mixed probability matrices formed from pinyin-and-character mixed probability vectors. That is, in one possible implementation provided by the present invention, a probability vector represents one pinyin or one Chinese character; a matrix formed from several vectors represents a sentence containing several pinyins or characters; several matrices represent several sentences that can be input to the text error correction model together as one batch of data; and a tensor containing several matrices represents mixed information that may contain several sentences.
In a seventh implementation manner of the first aspect, the method further has the following features: the pinyin probability vectors and Chinese character probability vectors are probability vectors based on a shared vocabulary containing a plurality of pinyins and a plurality of Chinese characters. In a pinyin probability vector, the values at the vocabulary's pinyin fields are nonzero and the values at its Chinese character fields are zero; in a Chinese character probability vector, the values at the pinyin fields are zero and the values at the character fields are nonzero; in a pinyin-and-character mixed probability vector, the values at both the pinyin fields and the character fields are nonzero. That is, the pinyin and character probability vectors represent the pinyin and character information of the voice data over the same vocabulary: pinyin probability vectors carry nonzero probability values at the vocabulary's pinyin fields, character probability vectors carry nonzero values at its character fields, and since both field types are nonzero in the mixed probability vectors, these correspond to a weighted fusion of the pinyin and character information. Each pinyin, each Chinese character, and the fused pinyin-and-character information are thus all expressed quantitatively.
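The shared-vocabulary vector layout can be sketched as below; the tokens, field split, and probability values are illustrative assumptions:

```python
import numpy as np

# Toy shared vocabulary: the first three fields are pinyin tokens, the last
# four are Chinese characters. Tokens and probabilities are illustrative.
vocab = ["shi4", "ji4", "shi2", "是", "事", "十", "识"]
N_PINYIN = 3

pinyin_vec = np.zeros(len(vocab))
pinyin_vec[[0, 2]] = [0.8, 0.2]   # nonzero only at pinyin fields

char_vec = np.zeros(len(vocab))
char_vec[[3, 4]] = [0.6, 0.4]     # nonzero only at character fields

# The weighted fusion is nonzero at both field types at once.
mixed_vec = 0.5 * pinyin_vec + 0.5 * char_vec
```

Because all three vectors share one index space, fusion is a plain elementwise weighted sum with no re-alignment step.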
In an eighth implementation manner of the first aspect, the step of fusing the pinyin information and the first Chinese character information to obtain the mixed information specifically includes: performing weighted fusion of the pinyin probability vectors in the pinyin probability tensor and the Chinese character probability vectors in the Chinese character probability tensor, obtaining a mixed tensor containing a plurality of pinyin-and-character mixed probability matrices.
In a ninth implementation manner of the first aspect, based on the eighth implementation manner, before the weighted fusion of the pinyin probability vectors and the Chinese character probability vectors, the method further includes: determining the positions of the Chinese character probability vectors in the Chinese character probability tensor whose maximum Chinese character probability is smaller than a threshold, and performing the weighted fusion according to those positions; the maximum Chinese character probability is the largest probability value in a Chinese character probability vector. Because a Chinese character probability vector covers a plurality of Chinese characters and their probabilities, this implementation uses the maximum probability in the vector to judge whether the character decision corresponding to that vector may be wrong. More pinyin information is then introduced specifically into the possibly erroneous character probability vectors to assist the text error correction model, obtaining a better error correction effect.
In a tenth implementation manner of the first aspect, the step of obtaining the pinyin information and the first Chinese character information of the voice data specifically includes: applying the text error correction model to the pinyin information to obtain the first Chinese character information. Existing methods for obtaining Chinese characters from pinyin generally use a language model. In this implementation, the text error correction model also performs the language model's pinyin-to-character conversion, so the language model becomes an optional submodel of the overall Chinese speech recognition error correction model and can be omitted by multiplexing the text error correction model, with the beneficial effects of fewer model parameters and a smaller model size.
In an eleventh implementation manner of the first aspect, the step of obtaining the pinyin information and the first Chinese character information of the voice data further includes: applying an acoustic model, which may be a neural network model, to the voice data to obtain the pinyin information. The acoustic model converts speech into pinyin by extracting speech features from the voice data.
In a second aspect, the present invention provides an electronic device, configured to execute the methods in the first aspect and the various implementation manners of the first aspect to perform recognition and error correction on a chinese speech.
In a third aspect, the present invention provides a computer-readable storage medium storing computer instructions for executing the method in the first aspect and various implementation manners of the first aspect.
In a fourth aspect, the invention provides a chip apparatus for executing the computer instructions of the third aspect.
Drawings
FIG. 1A is a functional and structural diagram of a Chinese speech recognition model provided in an embodiment of the present application;
FIG. 1B is a functional and structural diagram of a Chinese speech recognition error correction model according to an embodiment of the present application;
FIG. 2A is a schematic diagram of an inference process of the Chinese speech recognition error correction model according to an embodiment of the present application;
FIG. 2B is a diagram illustrating another inference process of the Chinese speech recognition error correction model according to an embodiment of the present application;
FIG. 3A is a diagram illustrating exemplary vocabulary fields provided in accordance with an embodiment of the present application;
FIG. 3B is a diagram of another exemplary vocabulary field provided in an embodiment of the present application;
FIG. 3C is a diagram illustrating a further exemplary vocabulary field provided by an embodiment of the present application;
FIG. 4 is a diagram illustrating another inference process of the Chinese speech recognition error correction model according to an embodiment of the present application;
FIG. 5 is a flowchart of a position expansion weighting process provided by an embodiment of the present application;
FIG. 6 is a schematic view of an expanded position provided by an embodiment of the present application;
FIG. 7 is a schematic weighting diagram provided by an embodiment of the present application;
FIG. 8 is a flowchart of an integration, search, and decoding process provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a beam search provided by an embodiment of the present application;
FIG. 10 is a structural diagram of a text correction model according to an embodiment of the present application;
FIG. 11 is a diagram illustrating a relationship between a training phase and an inference phase of a text correction model according to an embodiment of the present application;
FIG. 12 is a diagram illustrating a training phase of a text correction model according to an embodiment of the present application;
FIG. 13 is a flowchart of a text correction model training phase according to an embodiment of the present application;
FIG. 14 is a flowchart of a second stage of training a text correction model according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an application of the Chinese speech recognition error correction model provided in an embodiment of the present application in an intelligent speech assistant of a terminal device;
FIG. 16 is a schematic diagram of an application of a Chinese speech recognition error correction model in a speech input method of a terminal device according to an embodiment of the present application;
FIG. 17 is a schematic diagram illustrating an application of a Chinese speech recognition error correction model in a speech-to-text function of a terminal device according to an embodiment of the present application;
FIG. 18A is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 18B is a block diagram of a software structure of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
To facilitate an understanding of the embodiments of the present application, a brief introduction to relevant knowledge in the art is provided herein.
Currently, speech recognition systems typically use an acoustic model and a language model to represent the statistical characteristics of speech. The acoustic model establishes the relationship between the speech signal and phonemes or phonetic units; for Chinese, it can convert a speech signal into pinyin. The language model constructs the relationships between characters or words in a language; for Chinese, it can convert the pinyin output by the acoustic model into text. Both models may be obtained by training and/or learning, for example by training a deep model with labeled data. Combined, the acoustic model and the language model recognize a segment of speech signal as a segment of text, realizing speech recognition.
A large number of homophones exist in Chinese, and one pinyin often corresponds to many Chinese characters; there are also many confusable sounds, such as front and back nasal finals confused due to dialect speaking habits. These characteristics make Chinese speech recognition error-prone: a homophone error may recognize one character as another with the same pronunciation, and a confusable-sound error may recognize a word as a similar-sounding one. Therefore, error correction must be performed on Chinese speech recognition results to obtain more accurate recognition.
In view of this, the invention provides a deep learning-based Chinese speech recognition error correction model and method, which can realize Chinese speech recognition error correction with better accuracy. The embodiments provided by the present invention will be described in turn with reference to the accompanying drawings.
FIG. 1A exemplarily illustrates the functional structure of the Chinese speech recognition model provided in an embodiment of the present application. As shown in FIG. 1A, the Chinese speech recognition model may include two submodels, an acoustic model and a language model; the speech data is processed sequentially by the two and converted into a speech recognition result for output. As described above, the acoustic model and the language model may be obtained through training; for example, in the model and method provided in the embodiment of the present application, the language model may be a deep learning model based on a neural network architecture, and once trained on labeled data it has the function described above of converting the pinyin output by the acoustic model into text. Optionally, the Chinese speech recognition model may include other submodels.
Fig. 1B exemplarily shows a functional structure diagram of a chinese speech recognition error correction model according to an embodiment of the present application. As shown in FIG. 1B, the Chinese speech recognition error correction model may include an acoustic model, a language model, and a first text error correction model. The first text error correction model can correct the acoustic model output result and/or the language model output result, and accordingly outputs a corrected voice recognition result. In an alternative provided by the present embodiment, the first text error correction model may use the output results of the acoustic model and the language model as input data at the same time. In another optional mode provided by this embodiment, the first text error correction model further has a function of a language model, and can replace the language model to realize a function of converting pinyin into text, so that the chinese speech recognition error correction model may not include the language model, and the first text error correction model can directly use an output result of the acoustic model as input data. Optionally, other submodels may be included in the Chinese speech recognition error correction model.
It should be understood that, in the embodiments and drawings provided in the present application, when the output result of the previous model is exemplarily described as the input data of the next model, it is not limited that the output result of the previous model is not subjected to other processing before the input of the next model. For example, the output result of the acoustic model in fig. 1B may be subjected to data preprocessing or the like, such as conversion into a data form suitable for input and blending a part of other information into the output result, before being input as input data into the first text correction model. For another example, before the output result of the acoustic model and the output result of the language model are input to the first text error correction model at the same time in fig. 1B, a data fusion process of the two output results may be performed first, and the fused result is used as the input data of the first text error correction model. Similarly, the output result of the first text error correction model may also be subjected to some post-processing before becoming the final output corrected speech recognition result of the chinese speech recognition error correction model, for example, one of the results may be selected from the plurality of error correction results according to a preset policy and output as the corrected speech recognition result.
The Chinese speech recognition error correction model provided by the embodiment of the present application may be an error correction model based on deep learning, implemented on a neural network architecture. The functions of the model are therefore obtained by training with data. Specifically, an initial model composed of a plurality of neural network layers is established, which may have randomly initialized parameters; then, using labeled data as training data, the model parameters (such as the weight coefficients and biases of neurons) are iteratively updated through a loop of back propagation, gradient descent and similar steps according to a preset loss function, until the loss value of the loss function is smaller than a preset threshold and the loop ends, completing the model training; finally, the trained model is obtained after the last loop iteration and parameter update are finished.
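The training control flow described above — random initialization, then repeated gradient-descent updates until the loss falls below a preset threshold — can be sketched as follows. The one-parameter quadratic "model", learning rate, target, and threshold here are illustrative assumptions, not the application's actual network or loss:

```python
import random

def train(target=3.0, lr=0.1, threshold=1e-6, max_steps=10_000):
    w = random.uniform(-1.0, 1.0)      # randomly initialized parameter
    for step in range(max_steps):
        loss = (w - target) ** 2       # preset loss function
        if loss < threshold:           # end the loop once loss < threshold
            return w, step
        grad = 2 * (w - target)        # gradient for this toy loss
        w -= lr * grad                 # gradient-descent parameter update
    return w, max_steps

w, steps = train()
print(round(w, 2))  # → 3.0
```

Each iteration shrinks the error by a constant factor here, so the loop terminates well before `max_steps`; a real model would compute the loss over batches of labeled speech data and update millions of parameters via back propagation.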
The trained model can perform inference on data of a similar type to its training data. For example, if a speech recognition error correction model is trained using Mandarin speech data, the model can perform speech recognition error correction on Mandarin speech: given a segment of Mandarin speech data as input, the model can recognize and correct it well, whereas given a segment of Cantonese speech data, it may not recognize and correct it well. For another example, a speech recognition error correction model trained on everyday conversational speech will not recognize and correct news-broadcast speech well. It should be clear that the models and methods provided by the embodiments of the present application do not limit the type of training data used, and those skilled in the art can train the model they need for the type of speech data they want to correct.
Next, the inference stage of the Chinese speech recognition error correction model provided in the embodiment of the present application, that is, the process of applying the model to recognize and correct unlabeled speech data, is described first.
Fig. 2A is a schematic diagram of an inference process of a chinese speech recognition error correction model according to an embodiment of the present application. The inference process shown in fig. 2A may be an inference process of the chinese speech recognition error correction model including the acoustic model, the language model, and the first text error correction model shown in fig. 1B.
The inference process shown in fig. 2A includes:
First, the first speech data is input into the acoustic model, and after the acoustic model extracts and processes the acoustic features of the first speech data, a first tensor is output.
The first speech data may be a segment of unlabeled speech data to be recognized. Unlabeled means that the model is not given the correct speech recognition result actually corresponding to the first speech data; the model processes the first speech data precisely in order to identify the speech recognition result that may correspond to it. Each model applied in the inference stage (for example, the acoustic model, the language model, and the first text error correction model in fig. 2A) is a trained model whose parameter learning is complete; such a model has inference capability and can take unlabeled data as input and output the result it considers likely to be correct. The processing of the first speech data by the acoustic model may include extracting, recognizing, and/or classifying acoustic features (e.g., Mel-frequency features).
Therefore, in this step, the acoustic model takes the unlabeled first speech data as input data, and the first tensor it outputs can be the pinyin data that the acoustic model considers may correctly correspond to the first speech data. For example, if the first speech data is a segment of speech whose content is "good weather", the acoustic model may output a result like "tian1 qi4 bu2 cuo4", where the digit appended to each pinyin syllable indicates the tone of the Chinese character: 0 denotes the neutral tone and 1-4 denote the first through fourth tones.
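The tone-number convention in "tian1 qi4 bu2 cuo4" can be illustrated with a small helper that separates each syllable from its trailing tone digit (0 for neutral, 1-4 for the first through fourth tones). The helper name and structure are illustrative, not part of the application:

```python
def split_tone(pinyin: str) -> tuple[str, int]:
    # the final character of the pinyin string encodes the tone
    syllable, tone = pinyin[:-1], int(pinyin[-1])
    return syllable, tone

tone_names = {0: "neutral", 1: "first", 2: "second", 3: "third", 4: "fourth"}
for p in "tian1 qi4 bu2 cuo4".split():
    s, t = split_tone(p)
    print(s, tone_names[t])  # e.g. "tian first"
```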
In this embodiment, "tensor" is used as a collective term for the output data of each model, but this does not limit the embodiments of the present application. A tensor in this scheme may refer to a vector, a matrix, or a higher-order tensor, and may contain various types of information, such as character information, numerical information, and ordering information.
For example, if the first speech data is a segment of speech whose content is "feed", and the acoustic model considers that it may be "wei2" with probability 0.95, "wei3" with probability 0.03, and "wei4" with probability 0.02, the acoustic model may output a 3-dimensional vector containing pinyin probability information similar to [(wei2, 0.95), (wei3, 0.03), (wei4, 0.02)]. If "feed" is regarded as a sentence containing one character, the acoustic model may represent the recognition result for the sentence by a matrix of shape 1 × 3: [[(wei2, 0.95), (wei3, 0.03), (wei4, 0.02)]]. If this sentence belongs to a batch of input data containing 1 speech sample, the acoustic model can represent the recognition result for the batch with a tensor of shape 1 × 1 × 3: [[[(wei2, 0.95), (wei3, 0.03), (wei4, 0.02)]]].
The above description of the acoustic model output data is only one possible implementation way provided by the embodiments of the present application, and is not limited. The data output by the acoustic model is not beyond the range covered by the embodiment of the application as long as the data contains pinyin information.
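The vector/matrix/tensor nesting in the "wei" example can be sketched with plain nested lists of (pinyin, probability) pairs. This is one possible representation for illustration only, not the application's actual data format:

```python
# one character: a 3-dimensional vector of (pinyin, probability) pairs
vec = [("wei2", 0.95), ("wei3", 0.03), ("wei4", 0.02)]
mat = [vec]        # shape 1 x 3: a one-character sentence
batch = [mat]      # shape 1 x 1 x 3: a batch holding one speech sample

# the probabilities over the candidate pinyin for a character sum to 1
assert abs(sum(p for _, p in vec) - 1.0) < 1e-9
print(len(batch), len(batch[0]), len(batch[0][0]))  # → 1 1 3
```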
Second, the first tensor is input into the language model, and after the language model processes the first tensor, a second tensor is output.
As described above, the language model has a function of converting phoneme information output from the acoustic model into character information. The language model in the Chinese speech recognition error correction model provided by the embodiment of the application has the function of converting pinyin information output by the acoustic model into character (such as Chinese character) information. The second tensor can therefore be an output result containing textual information.
For example, suppose the first tensor is a vector containing pinyin probability information similar to [(wei2, 0.95), (wei3, 0.03), (wei4, 0.02)], and the language model considers that the corresponding text information may be "feed" with probability 0.98, "yes" with probability 0.01, and "bit" with probability 0.01; the language model may then output a vector containing Chinese character probability information similar to [(feed, 0.98), (yes, 0.01), (bit, 0.01)]. Similarly, if the first tensor output by the acoustic model is in the form of a matrix or tensor, the language model may correspondingly output the second tensor in the form of a corresponding matrix or tensor.
Third, information fusion is performed on the first tensor and the second tensor, converting them into a third tensor.
The third tensor obtained after information fusion contains both the information in the first tensor and the information in the second tensor. The Chinese speech recognition error correction model provided by the embodiment of the present application uses pinyin information and character information jointly, so a good error correction effect can be obtained from both kinds of information. Specifically, the information fusion process may be, for example, a weighted addition of the first tensor and the second tensor according to a preset weighting rule and preset weighting coefficients. For example, the third tensor can be a weighted fusion of 0.9 times the second tensor and 0.1 times the first tensor. For another example, one part of the third tensor may be a weighted fusion of 0.9 times the second tensor and 0.1 times the first tensor, while another part is a weighted fusion of 0.1 times the second tensor and 0.9 times the first tensor. Those skilled in the art can configure suitable weighting rules and weighting coefficients according to specific requirements without departing from the scope of the embodiments of the present application. A feasible information fusion method, namely the position dilation weighting method, is provided in the following embodiments, where its specific steps are described in detail; they are not repeated here.
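The weighted-addition fusion described above can be sketched element-wise over two probability vectors, assuming both tensors are flattened to equal-length vectors; the 0.9/0.1 coefficients follow the example in the text, and the toy 4-element vectors are illustrative:

```python
def fuse(char_vec, pinyin_vec, w_char=0.9, w_pinyin=0.1):
    # element-wise weighted sum of character and pinyin probabilities
    return [w_char * c + w_pinyin * p for c, p in zip(char_vec, pinyin_vec)]

char_vec = [0.98, 0.01, 0.01, 0.0]   # toy Chinese-character field
pinyin_vec = [0.0, 0.0, 0.0, 1.0]    # toy pinyin field
fused = fuse(char_vec, pinyin_vec)
print([round(x, 3) for x in fused])  # → [0.882, 0.009, 0.009, 0.1]
```

Because the weights sum to 1 and each input vector sums to 1, the fused vector also sums to 1, so it remains a valid probability vector.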
Fourth, the third tensor is input into the first text error correction model, and after the first text error correction model processes the third tensor, a fourth tensor is output.
As described above, the first text error correction model used in the inference process is a trained model; it has the function of correcting its input data and can output the result it considers more likely to be correct. For example, based on the input information, the first text error correction model may detect that the recognition result from the language model contains an erroneous character and output the corrected result "good weather today".
Optionally, the fourth tensor output by the first text error correction model can contain a plurality of possible error correction results and a probability corresponding to each result. For example, a segment of first speech data whose content is "good weather today" is processed in the previous steps to obtain a third tensor, and the first text error correction model outputs, based on the third tensor, the results it considers may be correct. For example, the first text error correction model may consider that the Chinese character at the first position is "present" with probability 0.95 and "pass" with probability 0.05, that the Chinese character at the second position is "day" with probability 0.99 and "field" with probability 0.01, and so on, outputting a plurality of possible Chinese characters and corresponding probability judgments for the character at each position. In the subsequent step, post-processing may be performed on these probability judgments, for example searching according to a preset rule and selecting the most suitable result to output as the corrected speech recognition result.
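One simple form of the post-processing step is to keep, at each character position, the candidate the model rates most probable. A real system might instead search over the per-position distributions according to a preset rule (e.g., a beam search); the candidate lists below mirror the "present"/"pass" and "day"/"field" example and are illustrative:

```python
def greedy_decode(position_probs):
    # at each position, pick the (character, probability) pair with max probability
    return [max(cands, key=lambda cp: cp[1])[0] for cands in position_probs]

position_probs = [
    [("present", 0.95), ("pass", 0.05)],  # first character position
    [("day", 0.99), ("field", 0.01)],     # second character position
]
print(greedy_decode(position_probs))  # → ['present', 'day']
```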
Fifth, the fourth tensor is post-processed and converted into the corrected speech recognition result, that is, the final output result of the whole speech recognition error correction model.
Optionally, if the fourth tensor is already a deterministic error correction judgment result, the output of the first text error correction model may be directly used as the corrected speech recognition result without performing post-processing.
FIG. 2B is a diagram illustrating an inference process of another Chinese speech recognition error correction model according to an embodiment of the present application. The inference process shown in fig. 2B may be the inference process of the Chinese speech recognition error correction model shown in fig. 1B that includes the acoustic model and the first text error correction model.
It can be seen that the inference process shown in fig. 2B differs from the inference process shown in fig. 2A in that the output results of the acoustic model are processed not by the language model but by the first text error correction model. That is, the first text error correction model replaces the function of the language model, realizing multiplexing of the first text error correction model, so the whole Chinese speech recognition error correction model may not include a language model; this embodiment can thus achieve the beneficial effects of reducing the model volume and the number of model parameters.
The multiplexing of the first text error correction model in the inference process shown in fig. 2B is possible because, through a specific training method, the trained first text error correction model also has the function of converting pinyin information into text information. The specific training method will be described in detail in the following embodiments.
The reasoning process shown in fig. 2B is otherwise the same as that shown in fig. 2A, and will not be described repeatedly here.
As can be seen from the inference processes of the Chinese speech recognition error correction model shown in fig. 2A and 2B, the functions of the models are as follows. The acoustic model extracts and processes the acoustic features of the first speech data and expresses the first speech data as a first tensor containing pinyin information. The language model processes the first tensor output by the acoustic model and converts it into a second tensor containing Chinese character information. The first text error correction model may have two functions: on the one hand, it has the function of a language model and can convert data containing pinyin information into data containing Chinese character information; on the other hand, it can perform numerical correction, processing and adjusting the values in the third tensor, which fuses pinyin information and Chinese character information, so that they move toward the possibly correct speech recognition result they represent. That is, the first text error correction model processes the third tensor so that values at positions representing possibly correct speech recognition results become larger and values at positions representing possibly incorrect results become smaller, thereby biasing the values in the output fourth tensor toward the recognition results the first text error correction model considers more likely correct. Finally, the fourth tensor can be post-processed to obtain the corrected speech recognition result, so that the whole speech error correction model has an error correction function.
From the above analysis, the first text correction model has two functions. As can be seen in conjunction with fig. 1A and 1B:
on the one hand, adding the first text error correction model to the chinese speech recognition model (fig. 1A) having no error correction function or a weak error correction function can obtain the chinese speech recognition error correction model (fig. 1B) having an error correction function or a stronger error correction function compared to the chinese speech recognition model.
On the other hand, since the first text error correction model also has the function of a language model, it can be used instead of the language model; that is, the language model is no longer used, and the function of the original language model is realized by multiplexing the first text error correction model. Thus, the language model, which was indispensable in the original Chinese speech recognition model, can optionally be removed, reducing both the overall volume of the Chinese speech recognition error correction model and the parameter count of the whole model.
In an optional implementation manner provided by the embodiment of the present application, each of the first to fourth tensors is a three-dimensional tensor. The first tensor and the second tensor have the same shape but nonzero values at different positions, and both have size 1 in the first dimension; the third tensor and the fourth tensor have the same shape, have size at least 1 in the first dimension, and have the same sizes in the second and third dimensions as the first and second tensors. For example, assuming that the first tensor and the second tensor have shape [1, U, V], the third tensor and the fourth tensor have shape [B, U, V], where B ≥ 1.
As an optional implementation manner provided by the embodiment of the present application, the acoustic model may be a model based on a transformer structure, the language model may be a model based on convolutional neural network (CNN) layers, and the first text error correction model may be a model based on the Bidirectional Encoder Representations from Transformers (BERT) structure.
In an optional implementation manner provided by the embodiment of the present application, the first to fourth tensors are probability tensors based on a vocabulary, in which probability values represent the character elements of the vocabulary; that is, the value corresponding to a character element in the tensor is a decimal greater than 0 and smaller than 1. The vocabulary is an ordered list containing V elements, which at least contain character elements and may also contain non-character elements. Each element in the vocabulary corresponds to a positive integer indicating its position in the list, so every element of the vocabulary is arranged in a fixed order.
In an optional implementation manner provided by the embodiment of the present application, in the vocabulary used by the Chinese speech recognition model and the Chinese speech recognition error correction model, the character elements include at least Chinese characters and pinyin and, optionally, other character elements such as foreign-word pinyin (e.g., the foreign-word pinyin "wai" and "fai" for the foreign word "WiFi"); the non-character elements may include, but are not limited to, one or more of the following: zero-padding bits, reserved bits, character bits, spacers, etc.
FIG. 3A is a schematic diagram of vocabulary fields. As shown in fig. 3A, the vocabulary consists of 6538 elements in total. Bits 1 to 5301 of the vocabulary are 5301 common Chinese characters arranged in dictionary order from "o" to "vinegar"; bits 5302 to 6538 are 1237 common Chinese pinyin syllables arranged in alphabetical order from "a0" to "zuo4".
Fig. 3B is a schematic diagram of another vocabulary layout. As shown in fig. 3B, the vocabulary consists of 6896 elements in total. Bits 1 to 5301 of the vocabulary are 5301 common Chinese characters arranged in dictionary order from "o" to "vinegar"; bits 5302 to 6538 are 1237 common Chinese pinyin syllables arranged in alphabetical order from "a0" to "zuo4"; bits 6539 to 6896 are 358 common foreign-word pinyin arranged in alphabetical order from "a" to "zi".
Fig. 3C is a schematic diagram of yet another vocabulary layout. As shown in fig. 3C, the vocabulary consists of a total of 7008 elements. Bit 1 of the vocabulary is 1 zero-padding bit, denoted [PAD], used to mark elements that need zero padding; bits 2 to 100 are 99 reserved bits, reserved for the subsequent addition of new elements; bits 101 to 109 are 9 character bits, including [UNK], [CLS], [SEP], [MASK] and five punctuation marks, where [UNK] is used to represent out-of-vocabulary (OOV) characters, [CLS] is a sentence start symbol used to identify the beginning of a sentence, [SEP] is a sentence separator used to separate two sentences, and [MASK] is used to cover characters during model training; bits 110 to 5410 are 5301 common Chinese characters; bits 5411 and 5412 are spacers; bits 5413 to 6649 are 1237 common Chinese pinyin syllables arranged in alphabetical order from "a0" to "zuo4"; bits 6650 to 7007 are 358 common foreign-word pinyin arranged in alphabetical order from "a" to "zi"; bit 7008 is a spacer.
It should be clear that the vocabulary fields and field arrangement orders shown in fig. 3A, 3B and 3C above are only three possible embodiments of this scheme, and are not limiting. Those skilled in the art can make corresponding adjustments according to actual application requirements; for example, the vocabulary may include fewer or more than 5301 Chinese characters, fewer or more than 1237 Chinese pinyin syllables, or element types other than zero-padding bits, reserved bits, character bits, Chinese characters, spacers, Chinese pinyin, and foreign-word pinyin, and the arrangement order of the character elements and non-character elements may be designed according to actual requirements.
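An ordered vocabulary in the spirit of fig. 3A — the Chinese-character field first, then the pinyin field, each element mapped to a fixed 1-based position — can be sketched as follows. The three-element stand-ins replace the 5301 characters and 1237 pinyin syllables of the real vocabulary and are purely illustrative:

```python
hanzi = ["今", "天", "气"]        # stand-ins for the Chinese-character field
pinyin = ["a0", "jin1", "zuo4"]   # stand-ins for the pinyin field
vocab = hanzi + pinyin            # character field precedes pinyin field

# each element gets a fixed 1-based position, as in the vocabulary diagrams
index_of = {elem: pos for pos, elem in enumerate(vocab, start=1)}
print(index_of["今"], index_of["zuo4"])  # → 1 6
```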
FIG. 4 is a diagram illustrating another inference process of the Chinese speech recognition error correction model according to the embodiment of the present application.
As can be seen from fig. 4, the inference process of the Chinese speech recognition error correction model provided in this embodiment is as follows: first, the acoustic model extracts and processes the acoustic features of the first speech data and converts the first speech data into a first pinyin probability tensor for output; then, the first pinyin probability tensor is processed by the language model or the first text error correction model and converted into a first Chinese character probability tensor for output; then, the first pinyin probability tensor and the first Chinese character probability tensor undergo information fusion and are converted into a first mixed batch input; then, the first mixed batch input is processed by the first text error correction model and converted into a second Chinese character probability tensor for output; and finally, the second Chinese character probability tensor is converted through post-processing into the corrected speech recognition result, that is, the final output of the whole speech recognition error correction model.
In a possible embodiment of the present solution, the first pinyin probability tensor, the first Chinese character probability tensor and the second Chinese character probability tensor can be probability tensors based on a vocabulary shown in fig. 3A, 3B or 3C.
Taking a probability tensor based on the vocabulary shown in fig. 3A as an example, the first pinyin probability tensor is a tensor of shape 1 × U × 6538. The first dimension is 1, meaning the pinyin probability tensor contains 1 sample; the second dimension is U, meaning the segment of first speech data contains U characters (for example, "good weather today" has 6 Chinese characters, so its corresponding U equals 6); the third dimension is the number of elements in the vocabulary on which the tensor is based, 6538 here since this embodiment uses the vocabulary shown in fig. 3A. The first pinyin probability tensor is denoted I = [[I_{1,1}, I_{1,2}, …, I_{1,U}]], where I_1 = [I_{1,1}, I_{1,2}, …, I_{1,U}] is the pinyin probability matrix corresponding to the first sample in the first pinyin probability tensor. The pinyin probability vector corresponding to the U-th character in the pinyin probability matrix I_n of the n-th sample is denoted I_{n,U} = [a_{n,U,1}, a_{n,U,2}, …, a_{n,U,6538}]; for the first pinyin probability tensor, n = 1. In this scheme, the values of a pinyin probability vector in the Chinese character field are all 0, the values in the pinyin field are all decimals greater than 0 and less than 1, and these decimals sum to 1 so as to represent probability values. For the pinyin probability vector I_{n,U} in this embodiment, the conditions to be satisfied can be expressed as: a_{n,U,1} to a_{n,U,5301} are all 0, a_{n,U,5302} to a_{n,U,6538} are decimals greater than 0 and less than 1, and a_{n,U,5302} + a_{n,U,5303} + … + a_{n,U,6538} = 1.
The first Chinese character probability tensor is a tensor of shape 1 × U × 6538, denoted J = [[J_{1,1}, J_{1,2}, …, J_{1,U}]], where J_1 = [J_{1,1}, J_{1,2}, …, J_{1,U}] is the Chinese character probability matrix corresponding to the first sample in the first Chinese character probability tensor. The Chinese character probability vector corresponding to the U-th character in the Chinese character probability matrix J_n of the n-th sample is denoted J_{n,U} = [b_{n,U,1}, b_{n,U,2}, …, b_{n,U,6538}]; for the first Chinese character probability tensor, n = 1. In this scheme, the values of a Chinese character probability vector in the pinyin field are all 0, the values in the Chinese character field are all decimals greater than 0 and less than 1, and these decimals sum to 1 so as to represent probability values. For the Chinese character probability vector J_{n,U} in this embodiment, the conditions to be satisfied can be expressed as: b_{n,U,5302} to b_{n,U,6538} are all 0, b_{n,U,1} to b_{n,U,5301} are decimals greater than 0 and less than 1, and b_{n,U,1} + b_{n,U,2} + … + b_{n,U,5301} = 1.
The second Chinese character probability tensor is a tensor of shape n × U × 6538, denoted K = [[K_{1,1}, K_{1,2}, …, K_{1,U}], [K_{2,1}, K_{2,2}, …, K_{2,U}], …, [K_{n,1}, K_{n,2}, …, K_{n,U}]], where K_n = [K_{n,1}, K_{n,2}, …, K_{n,U}] is the Chinese character probability matrix corresponding to the n-th sample in the second Chinese character probability tensor. The Chinese character probability vector corresponding to the U-th character in K_n is denoted K_{n,U} = [c_{n,U,1}, c_{n,U,2}, c_{n,U,3}, …, c_{n,U,6538}]; for the second Chinese character probability tensor, n ≥ 1. According to the conditions that a Chinese character probability vector satisfies in this scheme, the Chinese character probability vector K_{n,U} in this embodiment satisfies: c_{n,U,5302} to c_{n,U,6538} are all 0, c_{n,U,1} to c_{n,U,5301} are decimals greater than 0 and less than 1, and c_{n,U,1} + c_{n,U,2} + … + c_{n,U,5301} = 1.
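The constraints stated above for a Chinese character probability vector — pinyin field all zeros, Chinese-character field holding probabilities in (0, 1) that sum to 1 — can be sketched as a validity check. The toy vocabulary here has 3 character slots and 2 pinyin slots instead of 5301 and 1237:

```python
def is_valid_char_vector(vec, n_hanzi=3):
    # character field occupies the first n_hanzi slots, pinyin field the rest
    hanzi_part, pinyin_part = vec[:n_hanzi], vec[n_hanzi:]
    return (all(x == 0 for x in pinyin_part)        # pinyin field all zero
            and all(0 < x < 1 for x in hanzi_part)  # character field in (0, 1)
            and abs(sum(hanzi_part) - 1.0) < 1e-9)  # character field sums to 1

print(is_valid_char_vector([0.975, 0.015, 0.01, 0.0, 0.0]))  # → True
print(is_valid_char_vector([0.0, 0.0, 0.0, 0.95, 0.05]))     # → False
```

The second vector fails because its probability mass sits in the pinyin field; by the same logic a pinyin probability vector would swap the two conditions.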
Compared with the first Chinese character probability tensor, the numerical value of the second Chinese character probability tensor is corrected, so that the probability numerical value of the second Chinese character probability tensor at the Chinese character position corresponding to the correct voice recognition result becomes larger, the probability numerical value of the second Chinese character probability tensor at the Chinese character position corresponding to the wrong voice recognition result becomes smaller, and further, the results of the subsequent integration, search and decoding processes are more accurate, and the text error correction model has an error correction function.
In the inference process of the speech recognition error correction model shown in FIG. 4, a segment of first speech data with specific semantic information is processed by the acoustic model, which then outputs a first pinyin probability tensor; the first pinyin probability tensor uses pinyin probability values of different magnitudes to represent the acoustic model's judgment of the semantic information the first speech data may carry. Taking a probability tensor based on the vocabulary shown in fig. 3A as an example, a segment of first speech data whose content is "good weather today" is input into the acoustic model, and after processing, the acoustic model outputs a first pinyin probability tensor I = [[I_{1,1}, I_{1,2}, I_{1,3}, I_{1,4}, I_{1,5}, I_{1,6}]], where I_1 = [I_{1,1}, I_{1,2}, I_{1,3}, I_{1,4}, I_{1,5}, I_{1,6}] is the pinyin probability matrix corresponding to the sample "good weather today", and I_{1,1} to I_{1,6} are, in order, the pinyin probability vectors corresponding to the 6 characters from "present" to "wrong". Suppose the acoustic model's judgment of the pinyin probability vector corresponding to the character "present" in the first speech data is: probability 0.95 of "jin1", probability 0.02 of "jing1", probability 0.008 of "jin4", and so on. In the vocabulary shown in fig. 3A, the pinyin elements "jin1", "jing1" and "jin4" correspond to the 5729th, 5732nd and 5731st elements, respectively; therefore, in the pinyin probability vector corresponding to "present" in the pinyin probability matrix corresponding to the sample "good weather today" in the first pinyin probability tensor output by the acoustic model, a_{1,1,5729} = 0.95, a_{1,1,5732} = 0.02, a_{1,1,5731} = 0.008, while a_{1,1,1} to a_{1,1,5301} are all 0 and a_{1,1,5302} + a_{1,1,5303} + … + a_{1,1,6538} = 1.
For convenience of recording and visual representation, but without limitation, the pinyin probability vector corresponding to "present" in the pinyin probability matrix I_1 of the first pinyin probability tensor corresponding to the segment of first speech data "good weather today" can be written as I_{1,1} = [(jin1, 0.95), (jing1, 0.02), (jin4, 0.008), …]. It should be clear that in this notation the order of the elements does not represent their actual positions in the vector; elements with higher probability are merely placed before elements with lower probability for convenience of recording and visual representation.
Then, the first pinyin probability tensor output by the acoustic model is used as input and, after processing by the language model or the first text error correction model, a first Chinese character probability tensor is output. The first Chinese character probability tensor uses Chinese character probability values of different magnitudes to reflect the judgment of the correspondence between pinyin and Chinese characters made by the language model or the first text error correction model, based on the information provided by the first pinyin probability tensor and through the model's own inference capability. Similarly, taking a probability tensor based on the vocabulary shown in fig. 3A as an example, consider a segment of first speech data whose content is "good weather today". As mentioned above, the pinyin probability vector corresponding to "present" in the pinyin probability matrix of the first pinyin probability tensor output after the acoustic model processes this first speech data is I_{1,1} = [(jin1, 0.95), (jing1, 0.02), (jin4, 0.008), …]. Suppose that, after I_{1,1} is processed by the language model or the first text error correction model, the model, combining the context semantics of the whole segment of first speech data and using its own inference capability based on the probability values in the first pinyin probability tensor, obtains the judgment: probability 0.975 of "present", probability 0.015 of "gold", probability 0.003 of "strong", and so on.
In the vocabulary shown in fig. 3A, "present", "gold" and "strong" correspond to the 1814th, 1822nd and 1830th elements, respectively. Then, in the Chinese character probability vector corresponding to "present" in the Chinese character probability matrix of the first Chinese character probability tensor output by the language model or the first text error correction model for the sample "good weather today", b_{1,1,1814} = 0.975, b_{1,1,1822} = 0.015, b_{1,1,1830} = 0.003, with b_{1,1,1} + b_{1,1,2} + … + b_{1,1,5301} = 1 and b_{1,1,5302} to b_{1,1,6538} all 0. Similarly, for convenience of recording and visual representation, but without limitation, the Chinese character probability vector corresponding to "present" in the Chinese character probability matrix J_1 of the first Chinese character probability tensor corresponding to the segment of first speech data "good weather today" can be written as J_{1,1} = [(present, 0.975), (gold, 0.015), (strong, 0.003), …]. It should be clear that in this notation the order of the elements does not represent their actual positions; elements with higher probability are merely placed before elements with lower probability for convenience of recording and visual representation.
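The compact notation used above, e.g. I_{1,1} = [(jin1, 0.95), (jing1, 0.02), (jin4, 0.008), …], can be expanded back into a full vocabulary-length probability vector. The 6-element toy vocabulary and its positions below are illustrative, not the 6538-element vocabulary of fig. 3A:

```python
def to_dense(pairs, index_of, vocab_size):
    # place each probability at its element's fixed vocabulary position
    vec = [0.0] * vocab_size
    for elem, prob in pairs:
        vec[index_of[elem] - 1] = prob   # positions are 1-based in the vocabulary
    return vec

index_of = {"jin1": 4, "jin4": 5, "jing1": 6}   # toy 1-based positions
dense = to_dense([("jin1", 0.95), ("jing1", 0.02), ("jin4", 0.008)], index_of, 6)
print(dense)  # → [0.0, 0.0, 0.0, 0.95, 0.008, 0.02]
```

As the example shows, the order of the (element, probability) pairs in the compact notation is irrelevant; only the vocabulary positions determine where the values land in the dense vector.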
Fig. 5 exemplarily illustrates a flowchart of an information fusion process provided by an embodiment of the present application, that is, a flowchart of position dilation weighting, which may include steps 501 to 504:
501. Search the Chinese character probability matrices J_n in the first Chinese character probability tensor for the positions of the Chinese character probability vectors J_{n,U} whose maximum probability P_{n,U,max} is smaller than a threshold T, where the maximum probability P_{n,U,max} denotes the largest probability value in the Chinese character probability vector J_{n,U} corresponding to the U-th character in the Chinese character probability matrix J_n.
502. Perform position expansion on the one or more positions found, based on several preset rules. Specifically, taking each Chinese character probability vector J_{n,U} whose maximum probability P_{n,U,max} is smaller than the threshold T as an offset center, perform position expansion according to several preset left offsets and right offsets to obtain several position expansion regions.
503. Replace the Chinese character probability vectors located inside a position expansion region with pinyin mixing weighting vectors, and replace the Chinese character probability vectors located outside the position expansion region with Chinese character mixing weighting vectors. Specifically, the pinyin mixing weighting vector may be the sum of λ times the pinyin probability vector and (1-λ) times the Chinese character probability vector; the Chinese character mixing weighting vector may be the sum of (1-λ) times the pinyin probability vector and λ times the Chinese character probability vector, where λ ∈ (0.5, 1).
504. Obtain a first mixed batch input.
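The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the claimed implementation: the function names are invented for this sketch, and it assumes that one position expansion region is generated per combination of (left_offset, right_offset) pairs over the offset centers, with I and J being U × V pinyin and Chinese character probability matrices for a single sample.

```python
import itertools
import numpy as np

def find_centers(J, T=0.9):
    """Step 501: positions whose maximum Chinese character probability is below T."""
    return [u for u in range(J.shape[0]) if J[u].max() < T]

def expansion_regions(centers, U, offset_pairs):
    """Step 502: one region per combination of (left_offset, right_offset) over the centers."""
    regions = []
    for combo in itertools.product(offset_pairs, repeat=len(centers)):
        region = set()
        for c, (left, right) in zip(centers, combo):
            region.update(range(max(0, c + left), min(U, c + right + 1)))
        regions.append(sorted(region))
    return regions

def mixed_batch_input(I, J, T=0.9,
                      offset_pairs=((0, 0), (0, 1), (-1, 0), (-1, 1)),
                      lam=0.9):
    """Steps 503-504: weight inside/outside each region, stack into a batch."""
    U = J.shape[0]
    batch = []
    for region in expansion_regions(find_centers(J, T), U, offset_pairs):
        M = (1 - lam) * I + lam * J  # Chinese character mixing weighting (outside region)
        M[region] = lam * I[region] + (1 - lam) * J[region]  # pinyin mixing (inside region)
        batch.append(M)
    return np.stack(batch)
```

With a single offset center and the four illustrative offset pairs, the result is a batch of four mixed weighting matrices, one per position expansion region, matching the shape described for the first mixed batch input.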
Next, with reference to fig. 6 and fig. 7, an implementation of the position dilation weighting provided in the embodiment of the present application is described in detail by way of example.
Fig. 6 is a schematic diagram illustrating a method for determining a position expansion region according to an embodiment of the present application. As shown in fig. 6, assume that the vocabulary contains V characters, and take a piece of voice data whose content is "that woman is called Bella" as an example, so that U = 7. After the voice data is processed by the acoustic model, a first pinyin probability tensor of shape 1 × 7 × V is output; it contains a pinyin probability matrix 607 of shape 7 × V, which contains 7 V-dimensional pinyin probability vectors. After the first pinyin probability tensor is processed by the language model or the first text error correction model, a first Chinese character probability tensor of shape 1 × 7 × V is output; it contains a Chinese character probability matrix 608 of shape 7 × V, which contains 7 V-dimensional Chinese character probability vectors. With the pinyin probability matrix 607 and the Chinese character probability matrix 608 as input, the position expansion steps are executed:
First: in the Chinese character probability matrix 608, find the Chinese character probability vectors J_{n,U} whose maximum probability P_{n,U,max} is smaller than the threshold T, where the maximum probability P_{n,U,max} denotes the largest probability value in the Chinese character probability vector corresponding to the U-th character in the Chinese character probability matrix 608. Taking fig. 6 as an example with T set to 0.9: in the Chinese character probability vector corresponding to the 6th character "Bei", [(quilt, 0.49509), ...], the maximum probability P_{n,6,max} = 0.49509 is less than the threshold 0.9, so the Chinese character probability vector 601 found in the Chinese character probability matrix 608 whose maximum probability is smaller than the threshold is J_{n,6} = [(quilt, 0.49509), ...].
Second: taking each Chinese character probability vector J_{n,U} whose maximum probability P_{n,U,max} is smaller than the threshold T as the offset center, expand the position by preset left offsets (left_offset) and right offsets (right_offset). Taking fig. 6 as an example, with J_{n,6} as the offset center, suppose the left offset vector is preset as left_offset = [0, 0, -1, -1] and the right offset vector as right_offset = [0, 1, 0, 1]. Then: (1) when left_offset = 0 and right_offset = 0, the 1st position expansion region 603 is obtained, i.e., with J_{n,6} as the offset center, expand 0 characters to the left and 0 characters to the right, so that the position expansion region contains J_{n,6}; (2) when left_offset = 0 and right_offset = 1, the 2nd position expansion region 604 is obtained, i.e., with J_{n,6} as the offset center, expand 0 characters to the left and 1 character to the right, so that the position expansion region contains J_{n,6} and J_{n,7}; (3) when left_offset = -1 and right_offset = 0, the 3rd position expansion region 605 is obtained, i.e., with J_{n,6} as the offset center, expand 1 character to the left and 0 characters to the right, so that the position expansion region contains J_{n,5} and J_{n,6}; (4) when left_offset = -1 and right_offset = 1, the 4th position expansion region 606 is obtained, i.e., with J_{n,6} as the offset center, expand 1 character to the left and 1 character to the right, so that the position expansion region contains J_{n,5}, J_{n,6} and J_{n,7}.
Fig. 6 also shows a schematic 609 of the 4 position expansion regions in this embodiment; for example, the sample for the 1st position expansion region 603 is represented as "that woman is called bei4". It should be clear that this representation serves only to conveniently distinguish the different processing applied, during the subsequent weighting, to the Chinese character probability vectors inside and outside the position expansion region; it does not depict the actual processing of the position expansion region.
The embodiment shown in fig. 6 covers the case where the Chinese character probability matrix 608 contains a Chinese character probability vector whose maximum probability is smaller than the threshold. If no Chinese character probability vector in the first Chinese character probability matrix has a maximum probability smaller than the threshold, there is correspondingly no offset center on which to perform position expansion, so position expansion according to the left and right offsets cannot be carried out; in this case, all Chinese character probability vectors are directly regarded as lying outside the position expansion region.
In another possible embodiment of the present scheme, the Chinese character probability matrix 608 contains two or more Chinese character probability vectors whose maximum probability P_{n,U,max} is smaller than the threshold T. Corresponding to step 501 and step 502 in fig. 5, in this case the several Chinese character probability vectors whose maximum probability is smaller than the threshold are simultaneously used as offset centers, and position expansion is performed according to several preset combinations of left offsets and right offsets to obtain several position expansion regions. For example, take a sample of 10 characters: suppose the Chinese character probability vectors corresponding to the 2nd character (J_{n,2}) and the 9th character (J_{n,9}) both satisfy "maximum probability smaller than the threshold", with left offsets left_offset = [-1, 0] and right offsets right_offset = [0, 1]. Then: (1) with J_{n,2} as an offset center under left_offset = -1, right_offset = 0, and J_{n,9} as an offset center under left_offset = -1, right_offset = 0, the 1st position expansion region is obtained, i.e., J_{n,2} expands 1 character to the left and 0 to the right, and J_{n,9} expands 1 character to the left and 0 to the right, so that the position expansion region contains J_{n,1}, J_{n,2}, J_{n,8} and J_{n,9}; (2) with J_{n,2} under left_offset = -1, right_offset = 0, and J_{n,9} under left_offset = 0, right_offset = 1, the 2nd position expansion region is obtained, i.e., J_{n,2} expands 1 character to the left and 0 to the right, and J_{n,9} expands 0 characters to the left and 1 to the right, so that the position expansion region contains J_{n,1}, J_{n,2}, J_{n,9} and J_{n,10}; (3) with J_{n,2} under left_offset = 0, right_offset = 1, and J_{n,9} under left_offset = -1, right_offset = 0, the 3rd position expansion region is obtained, i.e., J_{n,2} expands 0 characters to the left and 1 to the right, and J_{n,9} expands 1 character to the left and 0 to the right, so that the position expansion region contains J_{n,2}, J_{n,3}, J_{n,8} and J_{n,9}; (4) with J_{n,2} under left_offset = 0, right_offset = 1, and J_{n,9} under left_offset = 0, right_offset = 1, the 4th position expansion region is obtained, i.e., J_{n,2} expands 0 characters to the left and 1 to the right, and J_{n,9} expands 0 characters to the left and 1 to the right, so that the position expansion region contains J_{n,2}, J_{n,3}, J_{n,9} and J_{n,10}.
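The two-center example above can be checked with a short script. This is a hypothetical sketch of region generation only (0-based indices internally; the assertions compare against the 1-based regions listed above):

```python
import itertools

centers = [1, 8]                  # 0-based indices of the 2nd and 9th characters
U = 10                            # 10-character sample
offset_pairs = [(-1, 0), (0, 1)]  # preset (left_offset, right_offset) pairs

regions = []
for combo in itertools.product(offset_pairs, repeat=len(centers)):
    region = set()
    for c, (left, right) in zip(centers, combo):
        # expand each offset center by its assigned left/right offsets
        region.update(range(max(0, c + left), min(U, c + right + 1)))
    regions.append(sorted(region))

# 0-based regions; in 1-based terms these are {1,2,8,9}, {1,2,9,10},
# {2,3,8,9} and {2,3,9,10}, matching the four regions described above.
assert regions == [[0, 1, 7, 8], [0, 1, 8, 9], [1, 2, 7, 8], [1, 2, 8, 9]]
```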
Fig. 7 is a schematic diagram of weighting according to the position expansion regions. As shown in fig. 7, following the position expansion performed in the example of fig. 6, with the left offset vector left_offset = [0, 0, -1, -1] and the right offset vector right_offset = [0, 1, 0, 1], 4 position expansion regions are generated, namely (J_{n,6}), (J_{n,6}, J_{n,7}), (J_{n,5}, J_{n,6}) and (J_{n,5}, J_{n,6}, J_{n,7}). Based on these 4 position expansion regions, the weighting steps are executed:
First: for the 4 position expansion regions, replace the Chinese character probability vectors located inside each position expansion region with pinyin mixing weighting vectors, and replace the Chinese character probability vectors located outside the region with Chinese character mixing weighting vectors. The pinyin mixing weighting vector is the sum of the pinyin probability vector multiplied by λ and the Chinese character probability vector multiplied by (1-λ); the Chinese character mixing weighting vector is the sum of the pinyin probability vector multiplied by (1-λ) and the Chinese character probability vector multiplied by λ, where λ ∈ (0.5, 1). Expressed as formulas: the pinyin mixing weighting vector is λ·I_{n,U} + (1-λ)·J_{n,U}, and the Chinese character mixing weighting vector is (1-λ)·I_{n,U'} + λ·J_{n,U'}, where a subscript containing U denotes a vector inside the position expansion region and a subscript containing U' denotes a vector outside the position expansion region; likewise, λ ∈ (0.5, 1). As shown in fig. 7, when λ = 0.9: (1) for the 1st position expansion region, i.e. (J_{n,6}), the Chinese character probability vector J_{n,6} is replaced by the pinyin mixing weighting vector 0.9·I_{n,U} + 0.1·J_{n,U}, and the Chinese character probability vectors J_{n,1} to J_{n,5} and J_{n,7} are replaced by the Chinese character mixing weighting vectors 0.1·I_{n,U'} + 0.9·J_{n,U'}, where U = 6 and U' = 1, 2, 3, 4, 5, 7; (2) for the 2nd position expansion region, i.e. (J_{n,6}, J_{n,7}), J_{n,6} and J_{n,7} are each replaced by pinyin mixing weighting vectors 0.9·I_{n,U} + 0.1·J_{n,U}, and J_{n,1} to J_{n,5} are replaced by Chinese character mixing weighting vectors 0.1·I_{n,U'} + 0.9·J_{n,U'}, where U = 6, 7 and U' = 1, 2, 3, 4, 5; (3) for the 3rd position expansion region, i.e. (J_{n,5}, J_{n,6}), J_{n,5} and J_{n,6} are each replaced by pinyin mixing weighting vectors 0.9·I_{n,U} + 0.1·J_{n,U}, and J_{n,1} to J_{n,4} and J_{n,7} are replaced by Chinese character mixing weighting vectors 0.1·I_{n,U'} + 0.9·J_{n,U'}, where U = 5, 6 and U' = 1, 2, 3, 4, 7; (4) for the 4th position expansion region, i.e. (J_{n,5}, J_{n,6}, J_{n,7}), J_{n,5}, J_{n,6} and J_{n,7} are each replaced by pinyin mixing weighting vectors 0.9·I_{n,U} + 0.1·J_{n,U}, and J_{n,1} to J_{n,4} are replaced by Chinese character mixing weighting vectors 0.1·I_{n,U'} + 0.9·J_{n,U'}, where U = 5, 6, 7 and U' = 1, 2, 3, 4. It should be clear that in the pinyin mixing weighting vector and Chinese character mixing weighting vector representations shown in fig. 7, the order of elements in a vector does not represent their real positions; elements with higher probability are merely placed before elements with lower probability for convenience of notation and intuitive representation.
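As a numeric illustration of the two mixing formulas (using toy 3-dimensional vectors rather than the real vocabulary dimension; the values are illustrative, not taken from the embodiment):

```python
import numpy as np

lam = 0.9
I_u = np.array([0.74515, 0.20000, 0.05485])  # toy pinyin probability vector
J_u = np.array([0.49509, 0.40000, 0.10491])  # toy Chinese character probability vector

pinyin_mix = lam * I_u + (1 - lam) * J_u  # λ·I + (1-λ)·J, used inside the region
hanzi_mix = (1 - lam) * I_u + lam * J_u   # (1-λ)·I + λ·J, used outside the region

# Both mixes remain valid probability vectors (sum to 1) because I_u and J_u do.
assert abs(pinyin_mix.sum() - 1.0) < 1e-9
assert abs(hanzi_mix.sum() - 1.0) < 1e-9
```

Note that the pinyin mixing weighting vector leans toward the pinyin information (0.9·0.74515 + 0.1·0.49509 ≈ 0.72014 for the first element), while the Chinese character mixing weighting vector leans toward the Chinese character information, which is the intended asymmetry of λ ∈ (0.5, 1).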
Second: the several mixed weighting matrices generated under the several position expansion regions form a first mixed batch input. As shown in fig. 7, the 4 position expansion regions generate 4 mixed weighting matrices, so these 4 mixed weighting matrices constitute the first mixed batch input. Each mixed weighting matrix exemplified in fig. 7 is a matrix of shape 7 × V, and the first mixed batch input is a tensor of shape 4 × 7 × V, where V is the number of elements in the adopted vocabulary, i.e., the dimension of the vocabulary.
The advantageous effects of the position expansion weighting process are described taking the embodiment shown in fig. 6 and fig. 7 as an example. Before position expansion, for voice data whose actual content is "that woman is called Bella", the pinyin probability matrix 607 in the first pinyin probability tensor output by the acoustic model may contain biased pinyin recognitions due to nonstandard pronunciation, limitations of the acoustic model's recognition capability, and similar problems. For example, in fig. 6 the 5th pronunciation, actually "jiao4", is recognized by the acoustic model as the pinyin probability vector [(jiu4, 0.74515), ...]; that is, the acoustic model makes the biased judgment that the pronunciation actually being "jiao4" is "jiu4" with maximum probability 0.74515. Next, based on this biased first pinyin probability tensor, the language model or the first text error correction model outputs its judgment of the Chinese characters corresponding to the pinyin, relying on its own semantic comprehension capability, i.e., outputs the first Chinese character probability tensor. Because of the deviation in the first pinyin probability tensor, the Chinese character probability matrix 608 in the output first Chinese character probability tensor may also contain biased Chinese characters; for example, in fig. 6, the 5th and 6th Chinese characters, actually "called" and "Bei", are recognized by the language model or the first text error correction model as the Chinese character probability vectors [(just, 0.98870), ...] and [(quilt, 0.49509), ...]. Position expansion is then performed: in step 501, the Chinese character probability vector whose maximum probability is smaller than the threshold is found in the Chinese character probability matrix 608 and used as the offset center, and in step 502 the position expansion regions are generated from the offset center and the set left and right offsets. For example, the 6th Chinese character probability vector [(quilt, 0.49509), ...] illustrated in fig. 6 has maximum probability 0.49509, smaller than the set threshold 0.9, so [(quilt, 0.49509), ...] is used as the offset center. Finally, in the mixing weighting of steps 503 and 504, mixed weighting matrices are generated from the set left and right offsets with [(quilt, 0.49509), ...] as the offset center, and the first mixed batch input is formed.
If position expansion were not adopted, that is, if step 502 were skipped and the Chinese character probability vectors found in step 501 (maximum probability smaller than the set threshold) were directly replaced by pinyin mixing weighting vectors, the following would occur: [(quilt, 0.49509), ...] would be replaced by a pinyin mixing weighting vector, but [(just, 0.98870), ...], which is also a biased recognition in the Chinese character probability matrix, would not be. In the subsequent process, pinyin probability information could then not be introduced at the biased position [(just, 0.98870), ...], so no pinyin probability information would be available to assist the first text error correction model in correcting [(just, 0.98870), ...]. Moreover, based on the relatively high maximum probability 0.98870 in [(just, 0.98870), ...], the first text error correction model might assume that the judgment of the model in the preceding step on [(just, 0.98870), ...] is unbiased, and the error at [(just, 0.98870), ...] might be missed.
With position expansion, i.e., taking the Chinese character probability vectors whose maximum probability is smaller than the set threshold as offset centers and shifting left and right, more kinds of position expansion regions are generated. This reduces the impact of an incomplete or inaccurate judgment based solely on the "maximum probability smaller than the set threshold" condition, and increases the possibility that the position expansion regions actually cover the Chinese character probability vectors with recognition deviation. As shown in the example of fig. 6, with position expansion the positions of [(just, 0.98870), ...] and [(quilt, 0.49509), ...] are covered by the 3rd and 4th position expansion regions, so pinyin probability information corresponding to the correct results "called" and "Bei" can be introduced into the first mixed batch input. Although the acoustic model's recognition of [(jiu4, 0.74515), ...] in the pinyin probability matrix 607 is biased, it only considers the 5th character to correspond to the pinyin "jiu4" with maximum probability 0.74515; among the probabilities other than (jiu4, 0.74515) in [(jiu4, 0.74515), ...], the correct pinyin may still hold a relatively high probability, and 0.74515 is lower than the maximum probability 0.98870 in the Chinese character probability matrix 608.
If position expansion without weighting were adopted, i.e., the Chinese character probability vectors inside the position expansion region were simply replaced by pinyin probability vectors while the Chinese character probability vectors outside the region were left unprocessed (not weighted), the mixed probability matrix formed by merely splicing pinyin probability vectors and Chinese character probability vectors from the first Chinese character probability tensor would be equivalent to setting the weighting parameter λ = 1. The Chinese character probability information of the characters inside the position expansion region would then be lost, and the information for characters outside the region would not be fused to any extent, which is not conducive to improving the error correction effect. With position expansion weighting, the probability vector at every character position contains both pinyin probability information and Chinese character probability information, thereby providing more potentially valuable information to the first text error correction model and helping improve its error correction effect.
Fig. 8 exemplarily shows a flowchart of a post-processing procedure provided in the embodiment of the present application, which includes steps 801 to 803:
801. Integrate, at each character position, the Chinese character probability vectors of the Chinese character probability matrices corresponding to the samples in the second Chinese character probability tensor.
802. Search among the characters in the integrated Chinese character probability vectors to find candidate characters satisfying the conditions.
803. Take the candidate characters with the maximum probability sum as the final result, query the vocabulary for decoding, and output the corrected speech recognition result.
In a possible embodiment of the present scheme, taking the voice data with content "that woman is called Bella" shown in fig. 6 and fig. 7 as input, and the 4 position expansion regions generated by the 4 kinds of left and right offsets as an example, the integration process of step 801 is as follows. The 4 mixed weighting matrices shown in fig. 7 form a first mixed batch input, which is a tensor of shape 4 × 7 × V; after the first mixed batch input is processed by the first text error correction model, the model outputs a second Chinese character probability tensor of shape 4 × 7 × V, denoted K = [[K_{1,1}, K_{1,2}, ..., K_{1,7}], [K_{2,1}, K_{2,2}, ..., K_{2,7}], ..., [K_{4,1}, K_{4,2}, ..., K_{4,7}]]. The integration process then splices the Chinese character probability vectors K_{1,U}, K_{2,U}, K_{3,U} and K_{4,U} together. For the integrated Chinese character probability vectors, assume that for U = 1 there are K_{1,1} = [(that, 0.95), (which, 0.02), ...], K_{2,1} = [(that, 0.93), (which, 0.03), ...], K_{3,1} = [(that, 0.98), (which, 0.001), ...] and K_{4,1} = [(that, 0.99), (which, 0.002), ...]; then the integrated vector K_{c,1} = [(that, 0.95), (which, 0.02), ..., (that, 0.93), (which, 0.03), ..., (that, 0.98), (which, 0.001), ..., (that, 0.99), (which, 0.002), ...] is formed. Similarly, for U = 2, 3, ..., 7, the integrated vectors K_{c,2} to K_{c,7} are obtained in the same way.
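The splicing in step 801 amounts to concatenating, at each character position, the probability vectors of all samples in the batch. A minimal NumPy sketch with toy sizes (the real V would be the vocabulary dimension; all names here are illustrative):

```python
import numpy as np

B, U, V = 4, 7, 5  # toy sizes: 4 samples, 7 characters, vocabulary of 5
rng = np.random.default_rng(0)
K = rng.random((B, U, V))
K = K / K.sum(axis=-1, keepdims=True)  # toy second Chinese character probability tensor

# Step 801: at each character position u, splice the B Chinese character
# probability vectors K[0,u], ..., K[B-1,u] into one integrated vector of length B*V.
K_c = np.transpose(K, (1, 0, 2)).reshape(U, B * V)

assert K_c.shape == (7, 20)
# The first V entries of K_c[0] are sample 1's vector for the 1st character.
assert np.allclose(K_c[0, :V], K[0, 0])
```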
In step 802, a search algorithm such as Exhaustive Search, Greedy Search or Beam Search is used to search, among the Chinese character probability values in the integrated vector corresponding to each character, for the character sequence corresponding to the sample. It should be clear that the search algorithms exemplified here are only a few alternative implementations of the search process in the present scheme and are not limiting; those skilled in the art may also select other suitable search algorithms according to the specific application requirements. In step 803, the decoding process is to look up, according to the character sequence determined by the search, the decoding results corresponding to the numbers in the character sequence by querying the vocabulary; assuming the vocabulary shown in fig. 3A is adopted, the decoding result is a Chinese character sequence.
Fig. 9 exemplarily illustrates the beam search process employed in a possible embodiment provided by the present application. The beam search algorithm includes a parameter, the beam width (beam size), which defines the number of sequences retained from the candidate sequences at each step; the beam search employed in fig. 9 sets the beam width to 2, and the sequences selected at each step are shown in bold in the figure. As shown in the figure, the integrated vector of the first character is [(present, 0.6), (via, 0.4), ...]; since the beam width is 2, when the search reaches the first character, "present" and "via" are the two Chinese characters with the highest probability in the integrated vector, and are therefore selected as the two candidate sequences of the first step. The integrated vector of the second character is [(day, 0.8), (field, 0.1), ..., (past, 0.92), (day, 0.05), ...]; the sequence probability sums of "present day", "present field", ..., "via past", "via day", etc. are calculated, and since the beam width is 2, when the search reaches the second character, "present day" and "via past" are the two sequences with the largest probability sums and are selected as the two candidate sequences of the second step. The integrated vector of the third character is [(day, 0.65), (field, 0.3), ..., (day, 0.34), (go, 0.2), ...]; the sequence probability sums of "present day day", "present day field", ..., "via past day", "via past go", etc. are calculated, and since the beam width is 2, when the search reaches the third character, "present day day" and "present day field" are the two sequences with the largest probability sums and are selected as the two candidate sequences of the third step. The subsequent beam search over the fourth character, the fifth character and so on proceeds in the same way until all characters of the sample have been searched, and finally the sequence with the maximum probability sum is output as the search result.
Exhaustive search is equivalent to beam search with the beam width equal to the dimension of the integrated vector: the search covers all possible results for each character, and all of them are kept as candidate sequences before the next character is searched. Greedy search is equivalent to beam search with a beam width of 1: for each character, only the sequence with the highest probability is kept as the candidate sequence. Consequently, exhaustive search is guaranteed to find the globally optimal sequence, while greedy search and beam search are not. Comparing the three search algorithms, exhaustive search has the largest computational cost and takes the longest time, greedy search has the smallest computational cost and takes the shortest time, and the cost and time of beam search lie between the two, determined by the chosen beam width hyperparameter. In practical applications, a person skilled in the art can select a suitable search algorithm according to actual requirements, including but not limited to the three search algorithms described above.
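The beam search described above can be sketched as follows. This is a simplified illustration, not the claimed implementation: it assumes the per-character candidate probabilities are independent of the prefix (a list of dictionaries stands in for the integrated vectors), and it accumulates log-probabilities, which is equivalent to comparing probability products:

```python
import math

def beam_search(prob_steps, beam_width=2):
    """Toy beam search over per-character candidate probabilities.

    prob_steps: one dict per character position, mapping candidate -> probability.
    Returns the highest-scoring sequence found under the given beam width.
    """
    beams = [("", 0.0)]  # (sequence, accumulated log-probability)
    for step in prob_steps:
        candidates = [
            (seq + ch, score + math.log(p))
            for seq, score in beams
            for ch, p in step.items()
        ]
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]  # keep only beam_width sequences per step
    return beams[0][0]

# Illustrative candidates (pinyin labels are placeholders, not the real vocabulary).
steps = [{"jin": 0.6, "jing": 0.4},
         {"tian": 0.8, "guo": 0.2},
         {"qi": 0.65, "hui": 0.35}]
best = beam_search(steps, beam_width=2)  # "jintianqi"
```

Setting `beam_width=1` reproduces greedy search, and setting it to the number of candidates per step reproduces exhaustive search, matching the equivalences described above.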
Fig. 10 is a schematic structural diagram of the text error correction model in a possible embodiment of the present scheme. As shown in fig. 10, the text error correction model may comprise two parts: an embedding layer 1002 and a neural network model 1004. In one possible embodiment of the present scheme, the input 1001 and the output 1005 may be three-dimensional tensors; the embedding layer 1002 and the neural network model 1004 may be implemented based on neural network structures; the function implemented by the embedding layer 1002 may differ according to the type of the neural network model 1004 used, but the embedding layer 1002 has at least a token embedding function, and the embedding layer output 1003 may be a matrix or tensor processed at least by token embedding. The neural network model 1004 may be a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network, a Bidirectional Long Short-Term Memory (BiLSTM) network, a Bidirectional Encoder Representations from Transformers (BERT) network, or a neural network having another structure. It should be understood that the above examples are only illustrative of possible embodiments and are not limiting; those skilled in the art can select a neural network with a suitable structure to implement the function of the neural network model 1004 in the text error correction model according to the actual application requirements.
In one possible implementation of the present scheme, the neural network model 1004 may be a Bidirectional Encoder Representations from Transformers network (hereinafter "BERT"). In this case the input 1001 and the output 1005 are tensors of shape B × U × V, where B is the number of samples in the batch, U is the maximum number of characters over the samples in the batch, and V is the number of elements in the vocabulary, i.e., the dimension of the vocabulary. The embedding layer 1002 is a neural network layer with token embedding and position embedding functions, and the embedding layer output 1003 is a tensor of shape B × U × D after token embedding and position embedding, where D denotes the dimension of the vector representing each character of the input sample after the embedding layer processing. The token embedding process is: a token embedding neural network layer with a V × D parameter matrix converts the V-dimensional probability vector representing a character in the input sample into a D-dimensional vector. Token embedding can be understood as changing the data dimension by matrix multiplication, i.e., a character represented by a V-dimensional probability vector, multiplied by a V × D matrix, becomes a character represented by a D-dimensional vector. The position embedding process is: a position embedding neural network with an L × D parameter matrix converts the position information of a character in the input sample, i.e., which word of the sentence it is, into a D-dimensional vector. Position embedding can be understood as a table lookup, where L is the maximum number of character positions the table can represent; characters at the same position in different sample sentences have the same position embedding vector regardless of whether the characters themselves are the same. For example, the first word of the sentence "how is the weather today" has the same position embedding vector as the first word of the sentence "what to eat at noon". After completing token embedding and position embedding on the input 1001, the embedding layer 1002 adds the D-dimensional vector obtained by token embedding and the D-dimensional vector obtained by position embedding, thereby converting the input 1001 of shape B × U × V into a tensor of shape B × U × D, which is the embedding layer output 1003. When the BERT used is a pre-trained model, typically D = 768 and L = 512; it should be understood that these values of D and L are merely exemplary and not limiting, and those skilled in the art can use a BERT pre-trained model with specific parameter values or train a BERT model with a custom parameter configuration according to the actual application requirements.
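The embedding layer arithmetic described above can be demonstrated with NumPy. This is a sketch with toy sizes and random parameter matrices, not a real BERT embedding layer (a real one might use D = 768 and L = 512):

```python
import numpy as np

B, U, V, D, L = 2, 7, 10, 8, 16  # toy batch, characters, vocabulary, embedding dim, max positions
rng = np.random.default_rng(0)

token_W = rng.standard_normal((V, D))  # token embedding parameter matrix (V x D)
pos_W = rng.standard_normal((L, D))    # position embedding table (L x D)

x = rng.random((B, U, V))
x = x / x.sum(-1, keepdims=True)       # input 1001: B x U x V probability tensor

token_emb = x @ token_W                # matrix multiplication: B x U x D
pos_emb = pos_W[:U]                    # table lookup by position: U x D
out = token_emb + pos_emb              # embedding layer output 1003: B x U x D

assert out.shape == (B, U, D)
# Characters at the same position share the same position embedding across samples.
assert np.allclose(out[0] - token_emb[0], out[1] - token_emb[1])
```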
In a possible embodiment of the present disclosure, the aforementioned models based on neural network structures may be implemented by software programming on general-purpose hardware, or by specific hardware devices having the corresponding neural network structure; the former is referred to as a software implementation and the latter as a hardware implementation. In a software implementation, the neural network programming framework adopted may include, but is not limited to, TensorFlow, PyTorch, Keras, Caffe, and the like; in a hardware implementation, the hardware device adopted includes, but is not limited to, a Field Programmable Gate Array (FPGA).
Fig. 11 exemplarily illustrates a relationship diagram of the training phase and the inference phase of the text correction model provided by an embodiment of the present application. As shown in fig. 11, the training phase is used to obtain the first text error correction model, and the inference phase applies the first text error correction model.
In a possible implementation manner of the present disclosure, in the training phase, a second text error correction model with randomly initialized parameters uses training data with real labels, that is, second speech data with corresponding correct text content, and continuously updates its parameters by means of Gradient Descent and Back Propagation based on configurations such as the set Learning Rate and Loss Function, so that the model continuously learns the data distribution of the training data; when the error calculated based on the set loss function is smaller than the set threshold, training is completed, updating of the model parameters stops, and the trained text error correction model, that is, the first text error correction model, is obtained. When the model employed is a neural-network-based model, the parameters may include the weights and bias values of the individual neurons in the neural network. Specifically, as shown in fig. 11, the training phase used in the present scheme may include two stages: a first training stage and a second training stage. Both training stages undergo steps such as gradient descent, back propagation, and loss function error calculation, so the text error correction model is trained twice: a third text error correction model is obtained after the first training stage is completed, and the third text error correction model undergoes the second training stage to update the parameters again, finally yielding the first text error correction model.
In the inference stage, the parameters of the first text error correction model are no longer updated, so the above steps of gradient descent, back propagation, and loss function error calculation are no longer performed; instead, the trained first text error correction model with fixed parameters is used, and for the first voice data without labels, the model's judgment result on the text content of the first voice data is output through forward calculation. The specific process of the inference stage is as shown in fig. 1B, fig. 2, fig. 4, and fig. 5, and will not be described again here.
Fig. 12 exemplarily illustrates a schematic diagram of the training phase of the text correction model provided in an embodiment of the present application. As shown in fig. 12, before entering the first training stage, training data is first prepared: the labeled second voice data is processed by the acoustic model, and a second pinyin probability tensor is output. The second voice data being labeled means that the correct text content corresponding to the voice data is known, so that the loss function error can be calculated from the known correct text content during training, and the model parameters can then be updated through processes such as gradient descent and back propagation. In an alternative embodiment of the present disclosure, if a probability matrix based on the vocabulary shown in fig. 3A is used, the pinyin probability vector in the pinyin probability matrix in the second pinyin probability tensor has a value of 0 in the Chinese character field and values in the interval (0, 1) in the pinyin field.
As shown in fig. 12, before the training phase starts 1201, the parameters of the second text error correction model and the Chinese character probability tensor at the output end of the second text error correction model are first initialized; in the process of the first training stage, the second pinyin probability tensor is used as input and propagated forward through the second text error correction model, the loss function error between the Chinese character probability tensor output by the second text error correction model and the correct text content corresponding to the second voice data is calculated, and if the loss function error is larger than the set threshold, gradient descent and back propagation are performed to update the parameters of the second text error correction model and the Chinese character probability tensor at its output end; this process is repeated until the loss function error is smaller than the set threshold, and the first training stage is completed. After the first training stage is finished 1202, that is, after the last model parameter update of the first training stage is completed, a third text error correction model and a third Chinese character probability tensor at the output end of the third text error correction model are obtained. In an alternative embodiment of the present disclosure, if a probability matrix based on the vocabulary shown in fig. 3A is used, the Chinese character probability vector in the Chinese character probability matrix in the third Chinese character probability tensor has values in the interval (0, 1) in the Chinese character field and a value of 0 in the pinyin field.
In an optional implementation manner of this solution, batch input processed by Batch Normalization may be used as the input of the text error correction model in the training stage. In this case, the second pinyin probability tensor may be a normalized tensor with a shape of B × U × V, where B represents the number of pieces of data input into the second text error correction model and propagated forward in the same batch (optionally, B may be set to 128); this batch of data may include second speech data whose corresponding sentences have different lengths, U is the number of characters of the longest sentence, the matrices corresponding to samples shorter than this length are filled with 0 at the positions of the vacant characters, and V is the number of elements in the vocabulary, that is, the dimension of the vocabulary. When the batch input mode is adopted, after a batch of data is input and propagated forward, the loss function error is calculated uniformly, gradient descent and back propagation are performed, and the parameters of the model are updated. Training the model in the batch input mode makes training more stable, effectively avoids gradient vanishing and gradient explosion, and improves tolerance to parameter initialization.
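The zero-filling of shorter samples described above can be sketched as follows; the function name and toy shapes are hypothetical:

```python
def pad_batch(samples, V):
    """Stack variable-length probability matrices (lists of V-dim rows) into a
    B x U x V structure, where U is the longest sentence in the batch and the
    vacant character positions of shorter samples are filled with zeros."""
    U = max(len(s) for s in samples)
    return [s + [[0.0] * V] * (U - len(s)) for s in samples]

V = 4  # toy vocabulary dimension
# Two samples: one of 2 characters, one of 3 characters -> U = 3 after padding.
batch = pad_batch([[[1.0] * V] * 2, [[1.0] * V] * 3], V)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 3 4
print(sum(batch[0][2]))  # 0.0 (padded position of the shorter sample)
```

Normalization itself is orthogonal to the padding and is omitted here; the point is only how a B × U × V batch is assembled from sentences of unequal length.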
Before the second training stage begins 1203, mixed weighting processing is first performed on the second pinyin probability tensor and the third Chinese character probability tensor obtained after the first training stage is completed, to form a second mixed batch input, and the third text error correction model and the third Chinese character probability tensor obtained after the first training stage are used as the initialization models; in the process of the second training stage, the second mixed batch input is used as input and propagated forward through the third text error correction model, the loss function error between the Chinese character probability tensor output by the third text error correction model and the correct text content corresponding to the second voice data is calculated, and if the loss function error is larger than the set threshold, gradient descent and back propagation are performed to update the parameters of the third text error correction model and the third Chinese character probability tensor at its output end; this process is repeated until the loss function error is smaller than the set threshold, and the second training stage is completed. After the second training stage is completed 1204, that is, after the last model parameter update of the second training stage is completed, the first text error correction model and the fourth Chinese character probability tensor at the output end of the first text error correction model are obtained.
The mixed weighting process performed before the second training stage is similar to the mixed weighting process performed in the inference stage. Specifically, the mixed weighting process is as follows: in the Chinese character probability matrix in the third Chinese character probability tensor, the Chinese character probability vector at each character position where the maximum probability does not fall on the correct Chinese character is replaced with a pinyin mixed weighting vector formed from the pinyin probability vector of the pinyin probability matrix in the second pinyin probability tensor and the Chinese character probability vector of the Chinese character probability matrix in the third Chinese character probability tensor; the Chinese character probability vector at each character position where the maximum probability does fall on the correct Chinese character is replaced with a Chinese character mixed weighting vector formed from the same two vectors. The pinyin mixed weighting vector is the sum of the pinyin probability vector of the pinyin probability matrix in the second pinyin probability tensor multiplied by λ and the Chinese character probability vector of the Chinese character probability matrix in the third Chinese character probability tensor multiplied by (1 − λ); the Chinese character mixed weighting vector is the sum of the pinyin probability vector of the pinyin probability matrix in the second pinyin probability tensor multiplied by (1 − λ) and the Chinese character probability vector of the Chinese character probability matrix in the third Chinese character probability tensor multiplied by λ, where λ ∈ (0.5, 1). The mixed weighting process performed before the second training stage differs from the position expansion weighting process in the inference stage in how it determines which character positions have their Chinese character probability vectors replaced by pinyin mixed weighting vectors and which by Chinese character mixed weighting vectors. Specifically, the correct characters corresponding to the first speech data input in the inference stage are unknown, so the positions to be replaced by pinyin mixed weighting vectors and those to be replaced by Chinese character mixed weighting vectors are determined not according to exact labels but according to the position expansion region generated by position expansion. In the training stage, labeled second speech data is adopted and the correct characters corresponding to the second speech data are known in advance, so it can be determined exactly at which positions in the Chinese character probability matrix in the third Chinese character probability tensor the character corresponding to the maximum probability of the Chinese character probability vector is judged incorrectly and at which positions it is judged correctly. Therefore, at positions where the character corresponding to the maximum probability is judged incorrectly, the pinyin probability vector is mixed in with the higher weight to form the pinyin mixed weighting vector, and at positions where it is judged correctly, the Chinese character probability vector is mixed in with the higher weight to form the Chinese character mixed weighting vector.
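The λ-weighted mixing described above can be sketched as follows; the helper name is hypothetical, and toy two-element vectors stand in for V-dimensional probability vectors:

```python
def mix_weight(pinyin_vec, hanzi_vec, judged_correct, lam=0.8):
    """Mixed weighting of a pinyin probability vector and a Chinese character
    probability vector, with lambda in (0.5, 1) as in the text. In training,
    judged_correct comes from the labels: whether the maximum-probability
    character in hanzi_vec matches the known correct character."""
    if judged_correct:
        # Chinese character mixed weighting vector: hanzi gets the higher weight
        return [(1 - lam) * p + lam * h for p, h in zip(pinyin_vec, hanzi_vec)]
    # pinyin mixed weighting vector: pinyin gets the higher weight
    return [lam * p + (1 - lam) * h for p, h in zip(pinyin_vec, hanzi_vec)]

mixed = mix_weight([0.6, 0.4], [0.2, 0.8], judged_correct=False)
print([round(v, 4) for v in mixed])  # [0.52, 0.48]
```

At a position judged incorrectly the pinyin information dominates (0.8 × 0.6 + 0.2 × 0.2 = 0.52), while at a position judged correctly the same call with `judged_correct=True` would let the Chinese character information dominate instead.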
The mixed weighting process introduces pinyin information into the Chinese character probability tensor, so that when the Chinese character probability judgment result deviates, a relatively correct result can still be inferred from the information provided by the pinyin probabilities; the text error correction model thus makes full use of both pinyin and Chinese character information and has better error correction capability than prior-art text error correction models.
Fig. 13 exemplarily illustrates a flowchart of the first training stage of the text correction model according to an embodiment of the present application. As shown in FIG. 13, this training stage of the text correction model comprises steps 1301 to 1307:
1301. Parameter initialization: initialize the parameters of the text error correction model and the Chinese character probability tensor at the output end of the text error correction model.
1302. Forward propagation: input the pinyin probability tensor output by the acoustic model into the text error correction model, and propagate the calculation result forward.
1303. Error calculation: calculate the total error between the Chinese character probabilities output by the text error correction model and the real text content corresponding to the second voice data.
1304. Judging: judge whether the total error is smaller than the set threshold; if not, execute steps 1305 and 1306; if so, proceed to step 1307 and complete the first training stage.
1305. Model parameter updating: if the total error is larger than the set threshold, propagate the error back to the text error correction model and update the parameters of the text error correction model according to the error.
1306. New error calculation: calculate the total error between the Chinese character probabilities output by the text error correction model with updated parameters and the real text content corresponding to the voice data; then return to step 1304 to judge again whether the total error is smaller than the set threshold.
1307. Completion of the first training stage: if the total error is less than or equal to the set threshold, the first training stage is completed, and the text error correction model and Chinese character probability tensor after the first training stage are obtained.
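Steps 1301-1307 amount to a loop that alternates error calculation and parameter updates until the error falls below the threshold. The following is a minimal sketch of that loop with a toy one-parameter model; every name and the toy data are hypothetical stand-ins, not the patent's actual model:

```python
def train_until_threshold(forward, loss_fn, params, data, update, threshold,
                          max_iters=10_000):
    """Sketch of the training loop in steps 1301-1307."""
    for _ in range(max_iters):
        # 1302-1303: forward propagation and total-error calculation
        total_error = sum(loss_fn(forward(params, x), y) for x, y in data)
        # 1304 / 1307: stop once the total error reaches the set threshold
        if total_error <= threshold:
            break
        # 1305: back-propagate the error and update the parameters
        # (1306, the new error, is computed at the top of the next iteration)
        params = update(params, data)
    return params

# Toy example: fit y = 2x with squared error and a hand-derived gradient step.
data = [(1.0, 2.0), (2.0, 4.0)]
forward = lambda w, x: w * x
loss_fn = lambda pred, y: (pred - y) ** 2

def update(w, data, lr=0.05):
    grad = sum(2 * (w * x - y) * x for x, y in data)   # dLoss/dw
    return w - lr * grad                               # gradient descent step

w = train_until_threshold(forward, loss_fn, 0.0, data, update, threshold=1e-6)
print(round(w, 3))  # 2.0
```

The same loop structure covers the second training stage (steps 1401-1407); only the initialization and the input (the mixed batch input) differ.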
In a possible implementation manner of this embodiment, the total error in step 1303 may be calculated by using a cross entropy loss function, with the following formula: Loss = (1/N) × Σ_{i=1}^{N} [−log p(u_i)], where N represents the number of characters in the sample sentence, u_i represents the i-th character of the real text, p(u_i) represents the probability value of the i-th real character in the Chinese character probability tensor, and log denotes the logarithm with base 2. Taking speech data whose corresponding real text is "the weather is good" (天气不错) as an example, if in the Chinese character probability matrix output by the text error correction model for this sample the probabilities corresponding to the four real characters are [(天, 0.25), (气, 0.125), (不, 0.5), (错, 0.25)], the loss value calculated with the cross entropy loss function is: Loss = (1/4) × [−log 0.25 − log 0.125 − log 0.5 − log 0.25] = 2.
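The worked example above can be checked numerically; the helper below implements the stated formula with base-2 logarithms (the function name is a hypothetical illustration):

```python
import math

def cross_entropy_loss(probs):
    """Loss = (1/N) * sum_i [-log2 p(u_i)], the cross entropy formula in the
    text; probs holds the probabilities assigned to the N correct characters."""
    return -sum(math.log2(p) for p in probs) / len(probs)

# The "天气不错" example: probabilities assigned to the 4 correct characters.
print(cross_entropy_loss([0.25, 0.125, 0.5, 0.25]))  # 2.0
```

The four terms are −log2 of 0.25, 0.125, 0.5, and 0.25, i.e. 2 + 3 + 1 + 2 = 8, and 8 / 4 = 2, matching the value in the text.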
In a possible implementation manner of the present invention, step 1305, that is, "propagate the error back to the text error correction model and update the parameters of the text error correction model according to the error", may be implemented by gradient descent and back propagation: the gradient of the loss function with respect to the model parameters is calculated, the gradient is multiplied by the set learning rate to obtain the update amount, and the update amount is subtracted from the original parameters to obtain the updated model parameters. According to the number of samples participating in each calculation, the gradient descent method adopted may include full-batch (global) gradient descent, stochastic gradient descent, and mini-batch gradient descent: full-batch gradient descent computes the error over all samples, stochastic gradient descent randomly selects one sample each time, and mini-batch gradient descent computes the error over a batch of sample data each time.
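The parameter update rule and the three gradient descent variants described above (all samples, one random sample, or a batch) can be sketched as follows; the function and the toy objective are hypothetical illustrations:

```python
import random

def descent_step(params, grad_fn, samples, lr, mode="mini-batch", batch_size=32):
    """One parameter update for the three variants described in the text.
    grad_fn(params, subset) returns the gradient of the loss over the subset."""
    if mode == "global":
        subset = samples                          # all samples
    elif mode == "stochastic":
        subset = [random.choice(samples)]         # one random sample
    else:
        subset = random.sample(samples, min(batch_size, len(samples)))
    grad = grad_fn(params, subset)
    # updated parameter = old parameter - learning rate * gradient
    return params - lr * grad

# Toy objective: minimise (w - 3)^2; its gradient is 2 * (w - 3).
grad_fn = lambda w, subset: sum(2 * (w - 3) for _ in subset) / len(subset)
w = 0.0
for _ in range(100):
    w = descent_step(w, grad_fn, samples=[0] * 10, lr=0.1, mode="global")
print(round(w, 3))  # 3.0
```

In practice the choice among the three modes trades gradient accuracy against per-step cost; the batch mode matches the batch input described earlier in the training stage.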
Fig. 14 exemplarily illustrates a flowchart of the second training stage of the text correction model provided by the embodiment of the present application. As shown in FIG. 14, this training stage of the text correction model comprises steps 1401 to 1407:
1401. Parameter initialization: initialize the model with the text error correction model and the Chinese character probability tensor obtained after the first training stage is completed.
1402. Forward propagation: after mixed weighting of the pinyin probability tensor output by the acoustic model and the Chinese character probability tensor obtained after the first training stage, propagate the result forward through the text error correction model.
1403. Error calculation: calculate the total error between the Chinese character probabilities output by the text error correction model and the real text content corresponding to the second voice data.
1404. Judging: judge whether the total error is smaller than the set threshold; if not, execute steps 1405 and 1406; if so, proceed to step 1407 and complete the second training stage.
1405. Model parameter updating: if the total error is larger than the set threshold, propagate the error back to the text error correction model and update the parameters of the text error correction model according to the error.
1406. New error calculation: calculate the total error between the Chinese character probabilities output by the text error correction model with updated parameters and the real text content corresponding to the voice data; then return to step 1404 to judge again whether the total error is smaller than the set threshold.
1407. Completion of the second training stage: if the total error is less than or equal to the set threshold, the second training stage is completed, and the text error correction model and Chinese character probability tensor after the second training stage are obtained.
The threshold values in step 1304 and step 1404 may be the same or different, and this scheme is not limited to this.
In a possible implementation manner of this solution, the total error in step 1403 may be calculated by using a cross entropy loss function; the step 1405 of returning the error to the text error correction model and updating the parameters of the text error correction model according to the error can be realized by adopting a gradient descent and a back propagation mode. The specific processes of the cross entropy loss function, gradient descent and back propagation are as described above, and are not repeated here.
In a possible implementation manner of the scheme, hundreds of millions of pieces of voice data were used as training data to train the text error correction model, and the effect of the text error correction model of the scheme was verified on test data containing more than 10,000 pieces of voice data; the experimental results obtained are shown in Table 1:
table 1: error correction effect experimental result of text error correction model
Input device Accuracy of writing Accuracy rate of sentences
Pinyin probability tensor 94.92% 71.72%
Probability tensor of Chinese characters 95.06% 72.86%
Pinyin and Chinese character mixed probability tensor 95.74% 78.41%
Pinyin and Chinese character mixed weighted probability tensor 95.90% 78.62%
Here, the character accuracy rate is 1 minus the Word Error Rate (WER), and the sentence accuracy rate is the ratio of the number of correctly recognized sentences to the total number of sentences. The word error rate is the total number of characters that are substituted, deleted, or inserted, divided by the total number of characters in the correct result, expressed as a percentage; the calculation formula is: WER = 100% × (S + D + I)/N, where S denotes the number of substituted (Substitution) characters, D denotes the number of deleted (Deletion) characters, I denotes the number of inserted (Insertion) characters, and N denotes the total number of characters.
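The WER formula above needs the minimum numbers of substitutions, deletions, and insertions, which are conventionally obtained via an edit-distance computation; the following is a minimal sketch with a hypothetical helper name:

```python
def word_error_rate(ref, hyp):
    """WER = (S + D + I) / N, with the minimum edit counts obtained via the
    Levenshtein distance between the reference and the recognized text;
    the character accuracy rate is then 1 - WER."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                    # i deletions
    for j in range(m + 1):
        d[0][j] = j                                    # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,      # substitution / match
                          d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1)             # insertion
    return d[n][m] / n

# One substituted character out of six -> WER = 1/6, accuracy ~ 83.33%.
wer = word_error_rate("今天天气不错", "今天天启不错")
print(round(1 - wer, 4))  # 0.8333
```

For Chinese the computation is naturally character-level, since each string index is one Chinese character.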
The experiment performed in this embodiment compares the character accuracy rate and sentence accuracy rate of the first text error correction model with input data in 4 forms: a pinyin probability tensor, a Chinese character probability tensor, a pinyin and Chinese character mixed probability tensor, and a pinyin and Chinese character mixed weighted probability tensor. As can be seen from Table 1, the pinyin and Chinese character mixed weighted probability tensor as input obtains the best effect, followed in order by the pinyin and Chinese character mixed probability tensor, the Chinese character probability tensor, and the pinyin probability tensor. The pinyin and Chinese character mixed weighted probability tensor is the first mixed batch input and/or the second mixed batch input, and the pinyin and Chinese character mixed probability tensor is the mixed probability tensor obtained when the weighting parameter λ is 1. This demonstrates that mixing pinyin probability information with Chinese character probability information achieves a better error correction effect than using pinyin probability information or Chinese character probability information alone, and that mixing them with weighting further improves the error correction effect.
Fig. 15, fig. 16 and fig. 17 are schematic diagrams of three application scenarios of the chinese speech recognition error correction model according to the present embodiment.
FIG. 15 shows an application of the Chinese speech recognition error correction model in an intelligent voice assistant of a terminal device. As shown in fig. 15(a), the intelligent voice assistant of the terminal device is started, the intelligent voice assistant software runs, and the microphone enters a listening state; as shown in fig. 15(b), when the user speaks the content "please play an English song for me", the microphone receives the voice signal and inputs it into the speech recognition error correction model in the intelligent voice assistant; before the speech recognition error correction model completes the error correction inference process, content containing misjudged characters may be displayed on the display interface of the terminal device, for example, "English" is recognized as "because" owing to factors such as the user's nonstandard pronunciation; as shown in fig. 15(c), the speech recognition error correction model in the intelligent voice assistant completes the error correction inference process, successfully corrects the previously misjudged "because" to "English", and makes a corresponding response based on semantic understanding of the correct sentence, i.e., plays an English song for the user. An intelligent voice assistant equipped with a speech recognition error correction model having the error correction function can make the speech recognition result more accurate and provide more accurate text information for the subsequent semantic understanding of the intelligent voice assistant, so that more accurate responses are made and the intelligent voice assistant is more intelligent.
Fig. 16 shows an application of the Chinese speech recognition error correction model in a voice input method of a terminal device. As shown in fig. 16(a), the terminal device receives a short message, and a pinyin keyboard input method interface is displayed in the input box by default; as shown in fig. 16(b), the user clicks "voice input" to switch to the voice input method with the error correction function, the microphone is started, and the received voice signal is input into the speech recognition error correction model in the voice input method; before the speech recognition error correction model completes the error correction inference process, the display interface of the terminal device may display content containing misjudged characters, for example, "good" in "sorry" is recognized as "stay" owing to factors such as the user's nonstandard pronunciation; as shown in fig. 16(c), the speech recognition error correction model in the voice input method completes the error correction inference process and successfully corrects the previously recognized "stay" to "good"; as shown in fig. 16(d), the user clicks "send" to complete sending of the short message. A voice input method equipped with a speech recognition error correction model having the error correction function can reduce the user's manual correction of automatic recognition results, so that the user can still complete accurate text input when typing is inconvenient, for example, in the driving state shown in fig. 16, thereby improving user experience.
FIG. 17 illustrates the application of the Chinese speech recognition error correction model in the speech-to-text function of a terminal device. As shown in fig. 17(a), the terminal device receives a piece of voice information, and the user starts the speech-to-text function by long-pressing the icon of the voice message; as shown in fig. 17(b), before the speech recognition error correction model in the speech-to-text function completes the error correction inference process, the display interface of the terminal device may display content containing misjudged characters, for example, "Wi-Fi" is recognized as "falsified" owing to factors such as the nonstandard pronunciation of the sender of the voice message or the limited vocabulary of the Chinese speech recognition model; as shown in fig. 17(c), the speech recognition error correction model in the speech-to-text function completes the error correction inference process and successfully corrects the previously recognized "falsified" to "Wi-Fi". A speech-to-text function equipped with a speech recognition error correction model having the error correction function can recognize received voice information with higher accuracy, so that the process of playing the voice message can be omitted, and the user can conveniently understand the message content and reply in time when it is inconvenient to play and listen to the voice message.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance. It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements in some embodiments of the application, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first table may be named a second table, and similarly, a second table may be named a first table, without departing from the scope of various described embodiments. The first table and the second table are both tables, but they are not the same table.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather mean "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise. Meanwhile, the term "a plurality" in the embodiments of the present application means two or more.
The voice recognition method provided by the embodiment of the application can be applied to electronic devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific types of the electronic devices at all.
For example, the electronic device may be a Station (ST) in a WLAN, which may be a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Personal Digital Assistant (PDA) device, a handheld device with wireless communication capability, a computing device or other processing device connected to a wireless modem, an in-vehicle device, an Internet-of-Vehicles terminal, a computer, a laptop, a handheld communication device, a handheld computing device, a satellite wireless device, a wireless modem card, a Set Top Box (STB), Customer Premises Equipment (CPE), and/or other devices for communicating over a wireless system, as well as an electronic device in a next generation communication system such as a 5G network, or in a future evolved Public Land Mobile Network (PLMN), etc.
By way of example and not limitation, when the electronic device is a wearable device, the wearable device may also be a generic term for devices developed by applying wearable technology to the intelligent design of everyday wear, such as glasses, gloves, watches, clothing, and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. A wearable device is not merely a hardware device; it also implements powerful functions through software support, data interaction, and cloud interaction. In a broad sense, wearable smart devices include full-featured, large-sized devices that can realize complete or partial functions without relying on a smartphone, such as smart watches or smart glasses, as well as devices that focus on only one type of application function and need to be used in cooperation with another device such as a smartphone, for example, various smart bracelets and smart jewelry for vital-sign monitoring.
Fig. 18A shows a schematic structural diagram of the electronic device.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display 194, a SIM card interface 195, and the like. Wherein the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
In addition, for the description of each component in the electronic device 100, reference may be made to the related descriptions in paragraphs [0054] to [0104] of the specification of the patent application with publication number CN110519451A, entitled "Shutdown management and control method and apparatus for an electronic device"; details are omitted here for brevity.
The supplementary description of the mobile communication module 150 and the audio module 170 is as follows:
In the embodiment of the present application, the mobile communication module 150 may also be configured to exchange information with other electronic devices, that is, to send voice-related data to other electronic devices; alternatively, the mobile communication module 150 may be configured to receive a voice recognition and/or error correction request and encapsulate the received request into a message in a specified format.
In addition, the electronic device 100 may implement an audio function through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playing, recording, etc. In speech recognition and/or error correction, the picking up of the user's speech may be accomplished by the microphone 170C.
It should be understood that in practical applications, the electronic device 100 may include more or fewer components than those shown in fig. 18A, and the embodiment of the present application is not limited thereto. The illustrated electronic device 100 is merely an example; it may combine two or more components or have a different arrangement of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The software system of the electronic device may employ a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the invention takes an Android system with a layered architecture as an example, and exemplarily illustrates a software structure of an electronic device.
Fig. 18B is a block diagram of a software structure of the electronic device according to the embodiment of the present invention.
For the description of the software system of the electronic device, reference may be made to the related descriptions in paragraphs [0107] to [0128] of the specification of the patent application with publication number CN110519451A, entitled "Shutdown management and control method and apparatus for an electronic device"; details are omitted here for brevity.
The following describes exemplary work flows of software and hardware of the electronic device 100 in connection with a scenario in which the electronic device 100 performs real-time speech recognition and/or error correction.
When the microphone 170C picks up the user's voice data, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the voice data into raw input events, which are stored in the kernel layer. The application framework layer obtains the original input event from the kernel layer, and performs voice recognition and/or error correction on the voice data by calling a resource manager in the application framework layer.
It should be understood that the software structure of the electronic device according to the embodiment of the present invention is only for illustration and is not to be construed as a specific limitation of the electronic device.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of a software functional unit.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product, which when executed on an electronic device, enables the electronic device to implement the steps in the above method embodiments.
An embodiment of the present application further provides a chip system, where the chip system includes a processor, the processor is coupled with a memory, and the processor executes a computer program stored in the memory to implement the steps in the above method embodiments.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the above methods. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. A Chinese speech recognition error correction method is characterized by comprising the following steps:
obtaining pinyin information and first Chinese character information of voice data;
fusing the pinyin information and the first Chinese character information to obtain mixed information;
processing the mixed information by applying a text error correction model to obtain second Chinese character information;
outputting the corrected voice recognition result;
wherein, the second Chinese character information comprises the corrected voice recognition result; the text error correction model is a neural network model.
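As a hedged, non-authoritative sketch of the pipeline in claim 1, the steps might be organized as follows. The model objects, function names, and the fixed fusion weight `alpha` are hypothetical placeholders, not the patent's actual implementation:

```python
import numpy as np

def correct_speech(speech_data, acoustic_model, text_correction_model, alpha=0.7):
    """Recognize speech and correct it by fusing pinyin and hanzi evidence.

    All models here are stand-ins: callables returning (T, vocab) matrices
    of per-position probabilities over a shared pinyin+hanzi vocabulary.
    """
    # Step 1: obtain pinyin information (per-position pinyin probabilities).
    pinyin_probs = acoustic_model(speech_data)          # shape: (T, vocab)
    # Step 2: obtain first hanzi information from the pinyin information.
    hanzi_probs = text_correction_model(pinyin_probs)   # shape: (T, vocab)
    # Step 3: fuse the two probability streams into mixed information.
    mixed = alpha * hanzi_probs + (1.0 - alpha) * pinyin_probs
    # Step 4: run the text-correction model on the mixture to obtain
    # the second (corrected) hanzi information.
    corrected = text_correction_model(mixed)
    # Output the corrected recognition result (token index per position).
    return corrected.argmax(axis=-1)
```

Note that in the dependent claims the fusion weight is not a single scalar: claims 3 to 6 vary the weighting per position depending on recognition confidence.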
2. The method of claim 1, wherein the pinyin information includes pinyin probabilities and the first hanzi information includes hanzi probabilities;
the fusing the pinyin information and the first Chinese character information to obtain the mixed information specifically includes:
and performing weighted fusion on the pinyin probability in the pinyin information and the Chinese character probability in the first Chinese character information to obtain the mixed information containing a plurality of sub-mixed information.
3. The method of claim 2, wherein prior to the weighted fusion of the pinyin probability in the pinyin information and the hanzi probability in the first hanzi information, the method further comprises:
and determining the positions of the Chinese characters with the Chinese character probability smaller than a threshold value in the Chinese character probability in the first Chinese character information, and performing the weighted fusion according to the positions.
4. The method as claimed in claim 3, wherein the performing the weighted fusion of the pinyin probability in the pinyin information and the hanzi probability in the first hanzi information specifically comprises:
obtaining a plurality of position expansion areas based on the positions according to a preset rule; the preset rule comprises a plurality of left offset amounts and a plurality of right offset amounts; the position dilation region covers the position, the left offset number of positions located to the left of the position, and the right offset number of positions located to the right of the position.
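A minimal sketch of the "position expansion area" of claim 4, assuming positions are 0-based indices into the recognized sequence; the function name and parameter values are illustrative, not taken from the patent:

```python
def expand_positions(positions, left, right, seq_len):
    """Widen each low-confidence position by `left` slots to the left and
    `right` slots to the right, clipped to the sequence bounds.

    Returns the set of sequence indices covered by the expansion areas.
    """
    covered = set()
    for p in positions:
        lo = max(0, p - left)               # clip at the sequence start
        hi = min(seq_len - 1, p + right)    # clip at the sequence end
        covered.update(range(lo, hi + 1))
    return covered
```

For example, with a left offset of 1 and a right offset of 2, a suspect position 4 in a length-10 sequence expands to cover positions 3 through 6.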
5. The method as claimed in claim 4, wherein the performing the weighted fusion of the pinyin probability in the pinyin information and the hanzi probability in the first hanzi information further comprises:
and replacing the Chinese character probability in the first Chinese character information, which is positioned in the position expansion area, with a pinyin mixed weighting probability, and replacing the Chinese character probability in the first Chinese character information, which is positioned outside the position expansion area, with a Chinese character mixed weighting probability, so as to complete weighted fusion.
6. The method of claim 5, wherein:
the pinyin mixed weighted probability is obtained by weighted addition of the pinyin probability in the pinyin information multiplied by the first weight and the Chinese character probability in the first Chinese character information multiplied by the second weight;
the Chinese character mixed weighted probability is obtained by weighted addition of the Chinese character probability in the first Chinese character information multiplied by the first weight and the pinyin probability in the pinyin information multiplied by the second weight;
the first weight is greater than the second weight.
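Claims 5 and 6 together can be sketched as follows, assuming row-per-position probability matrices. The weight values 0.8/0.2 are illustrative assumptions; the claims only require the first weight to exceed the second:

```python
import numpy as np

def weighted_fuse(pinyin_probs, hanzi_probs, region, w1=0.8, w2=0.2):
    """Fuse (T, vocab) pinyin and hanzi probability matrices.

    Inside the position expansion `region`, the pinyin evidence dominates
    (first weight w1 on pinyin); outside it, the hanzi evidence dominates
    (first weight w1 on hanzi).
    """
    assert w1 > w2  # claim 6: the first weight is greater than the second
    # Default (outside the region): Chinese character mixed weighted probability.
    mixed = w1 * hanzi_probs + w2 * pinyin_probs
    # Inside the region: pinyin mixed weighted probability.
    for t in region:
        mixed[t] = w1 * pinyin_probs[t] + w2 * hanzi_probs[t]
    return mixed
```

The intent is that where recognition confidence is low, the corrector leans on the acoustic (pinyin) evidence, and elsewhere it trusts the already-decoded characters.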
7. The method according to any one of claims 1-6, wherein:
the pinyin information is a pinyin probability tensor which comprises a pinyin probability matrix formed by pinyin probability vectors;
the first Chinese character information is a Chinese character probability tensor which comprises a Chinese character probability matrix formed by Chinese character probability vectors;
the mixing information is a mixing tensor which comprises a pinyin and Chinese character mixing probability matrix formed by pinyin and Chinese character mixing probability vectors.
8. The method of claim 7, wherein the pinyin probability vectors and the hanzi probability vectors are vocabulary-based probability vectors; the word list comprises a plurality of pinyin and a plurality of Chinese characters;
the value of the pinyin probability vector corresponding to the word list pinyin is nonzero, and the value of the pinyin probability vector corresponding to the word list Chinese character is zero;
the value of the Chinese character probability vector corresponding to the word list pinyin is zero, and the value corresponding to the word list Chinese characters is nonzero;
the numerical values of the pinyin and Chinese characters mixed probability vector corresponding to the word list pinyin position and Chinese characters position are nonzero.
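A toy illustration of the vocabulary-based vectors of claim 8, using a hypothetical four-entry word list; the vocabulary contents, weights, and probability values are invented for illustration only:

```python
import numpy as np

# Hypothetical shared word list: two pinyin slots followed by two hanzi slots.
vocab = ["zhong", "guo", "\u4e2d", "\u56fd"]

# Pinyin probability vector: nonzero only at pinyin slots.
pinyin_vec = np.array([0.6, 0.4, 0.0, 0.0])
# Hanzi probability vector: nonzero only at hanzi slots.
hanzi_vec = np.array([0.0, 0.0, 0.7, 0.3])

# Pinyin-hanzi mixed probability vector: nonzero at both kinds of slot.
mixed_vec = 0.5 * pinyin_vec + 0.5 * hanzi_vec
```

Because the two vector types occupy disjoint slots of the same vocabulary, their weighted sum carries both acoustic and character evidence in a single vector, which is what the text error correction model consumes.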
9. The method according to claim 7 or 8, wherein the fusing the pinyin information and the first chinese character information to obtain the mixing information specifically comprises:
and carrying out weighted fusion on the pinyin probability vectors in the pinyin probability tensor and the Chinese character probability vectors in the Chinese character probability tensor to obtain the mixed tensor containing a plurality of pinyin and Chinese character mixed probability matrixes.
10. The method of claim 9, wherein prior to said weighted fusing of said pinyin probability vectors in said pinyin probability tensor and said hanzi probability vectors in said hanzi probability tensor, said method further comprises:
determining the position of the Chinese character probability vector of which the maximum Chinese character probability is smaller than the threshold value in the Chinese character probability vectors in the Chinese character probability tensor, and performing the weighted fusion according to the position; and the maximum Chinese character probability is the maximum probability numerical value in the Chinese character probability vector.
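The position-determination step of claim 10 might be sketched as follows; the threshold value is an illustrative assumption:

```python
import numpy as np

def low_confidence_positions(hanzi_probs, threshold=0.5):
    """Return the positions whose maximum hanzi probability is below `threshold`.

    hanzi_probs: (T, vocab) matrix of per-position Chinese character
    probability vectors. The maximum over the vocabulary axis is the
    "maximum Chinese character probability" of claim 10.
    """
    max_probs = hanzi_probs.max(axis=-1)
    return np.flatnonzero(max_probs < threshold).tolist()
```

Only these low-confidence positions (after expansion per claim 4) are re-weighted toward the pinyin evidence.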
11. The method of claim 1, wherein the obtaining the pinyin information and the first chinese character information of the voice data specifically comprises:
and processing the pinyin information by applying the text error correction model to obtain the first Chinese character information.
12. The method of claim 11, wherein the obtaining the pinyin information and the first chinese character information of the speech data further comprises:
processing the voice data by applying an acoustic model to obtain the pinyin information;
wherein the acoustic model is a neural network model.
13. An electronic device, configured to perform the Chinese speech recognition error correction method according to any one of claims 1 to 12.
14. A computer-readable storage medium storing computer instructions which, when executed, perform the method according to any one of claims 1 to 12.
15. A chip apparatus, configured to execute the computer instructions according to claim 14.
CN202110058472.2A 2021-01-16 2021-01-16 Chinese speech recognition error correction method and device and electronic equipment Pending CN114822519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110058472.2A CN114822519A (en) 2021-01-16 2021-01-16 Chinese speech recognition error correction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110058472.2A CN114822519A (en) 2021-01-16 2021-01-16 Chinese speech recognition error correction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114822519A true CN114822519A (en) 2022-07-29

Family

ID=82523943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110058472.2A Pending CN114822519A (en) 2021-01-16 2021-01-16 Chinese speech recognition error correction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114822519A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI815658B (en) * 2022-09-14 2023-09-11 仁寶電腦工業股份有限公司 Speech recognition device, speech recognition method and cloud recognition system
CN117456999A (en) * 2023-12-25 2024-01-26 广州小鹏汽车科技有限公司 Audio identification method, audio identification device, vehicle, computer device, and medium
CN117456999B (en) * 2023-12-25 2024-04-30 广州小鹏汽车科技有限公司 Audio identification method, audio identification device, vehicle, computer device, and medium

Similar Documents

Publication Publication Date Title
CN109918680B (en) Entity identification method and device and computer equipment
CN111191016B (en) Multi-round dialogue processing method and device and computing equipment
CN111667814B (en) Multilingual speech synthesis method and device
CN107301865B (en) Method and device for determining interactive text in voice input
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN111967224A (en) Method and device for processing dialog text, electronic equipment and storage medium
WO2020155619A1 (en) Method and apparatus for chatting with machine with sentiment, computer device and storage medium
CN111261144A (en) Voice recognition method, device, terminal and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN111428010A (en) Man-machine intelligent question and answer method and device
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
US11861318B2 (en) Method for providing sentences on basis of persona, and electronic device supporting same
CN112837669B (en) Speech synthesis method, device and server
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN112417855A (en) Text intention recognition method and device and related equipment
JP2022502758A (en) Coding methods, equipment, equipment and programs
CN112906381B (en) Dialog attribution identification method and device, readable medium and electronic equipment
CN114822519A (en) Chinese speech recognition error correction method and device and electronic equipment
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN108538292B (en) Voice recognition method, device, equipment and readable storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN117352132A (en) Psychological coaching method, device, equipment and storage medium
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN111522937A (en) Method and device for recommending dialect and electronic equipment
CN115132170A (en) Language classification method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination