CN113807080A - Text correction method, text correction device and storage medium - Google Patents


Info

Publication number
CN113807080A
CN113807080A
Authority
CN
China
Prior art keywords
text
pinyin
corrected
corpus
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010544358.6A
Other languages
Chinese (zh)
Inventor
顾鹏程
沈冀
谢韬
穆瑞斌
邵长东
高倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ecovacs Commercial Robotics Co Ltd
Original Assignee
Ecovacs Commercial Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ecovacs Commercial Robotics Co Ltd filed Critical Ecovacs Commercial Robotics Co Ltd
Priority to CN202010544358.6A priority Critical patent/CN113807080A/en
Publication of CN113807080A publication Critical patent/CN113807080A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

Embodiments of the present application provide a text correction method, a text correction device, and a storage medium. Targeted training on a training corpus carrying specified feature identifiers, in the form of pinyin-to-Chinese-character conversion, yields a pinyin-text prediction model with error-correction capability. Text prediction is then performed, through this model, on the pinyin sequence carrying the specified feature identifiers that corresponds to an input text to be corrected. This addresses problems in speech recognition such as text errors, omission of the first character, and vocabulary conflicts across different domains, so that the corrected text is obtained accurately and the accuracy of speech recognition is greatly improved.

Description

Text correction method, text correction device and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text correction method, device, and storage medium.
Background
With the rapid development of artificial intelligence, more and more intelligent machines are used in daily life, and most are equipped with a speech recognition function to make them easier to use. A user can issue a voice instruction to the intelligent machine; the machine converts the received instruction into text, identifies the user's intention by analyzing the text, and then performs the corresponding task.
However, during human-machine voice interaction, complex interaction scenes and environments and the varying accents of different users often cause the intelligent machine to recognize speech inaccurately: it may fail to convert the speech into the corresponding text, or the converted text may be wrong, so the machine cannot carry on the human-machine dialogue or perform the corresponding task.
Disclosure of Invention
Aspects of the present application provide a text correction method, device, and storage medium to improve the accuracy of converting voice information into text during human-computer interaction.
An embodiment of the present application provides a text correction method, which includes: acquiring a text to be corrected, where the text to be corrected is obtained by performing speech recognition on a speech signal; generating an initial pinyin sequence corresponding to the text to be corrected, and adding a specified feature identifier to the initial pinyin sequence to obtain a target pinyin sequence; inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set, where the pinyin-text prediction model is trained on a training corpus carrying the specified feature identifier; and selecting a corrected text corresponding to the text to be corrected from the candidate text set.
An embodiment of the present application further provides a text correction device, including a processor and a memory storing a computer program, the processor being configured to execute the computer program to: acquire a text to be corrected, where the text to be corrected is obtained by performing speech recognition on a speech signal; generate an initial pinyin sequence corresponding to the text to be corrected, and add a specified feature identifier to the initial pinyin sequence to obtain a target pinyin sequence; input the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set, where the pinyin-text prediction model is trained on a training corpus carrying the specified feature identifier; and select a corrected text corresponding to the text to be corrected from the candidate text set.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor at least to: acquire a text to be corrected, where the text to be corrected is obtained by performing speech recognition on a speech signal; generate an initial pinyin sequence corresponding to the text to be corrected, and add a specified feature identifier to the initial pinyin sequence to obtain a target pinyin sequence; input the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set, where the pinyin-text prediction model is trained on a training corpus carrying the specified feature identifier; and select a corrected text corresponding to the text to be corrected from the candidate text set.
In the embodiments of the present application, targeted training on a training corpus carrying specified feature identifiers, in the form of pinyin-to-Chinese-character conversion, yields a pinyin-text prediction model with error-correction capability. Text prediction is then performed, through this model, on the pinyin sequence carrying the specified feature identifiers that corresponds to an input text to be corrected. This addresses problems in speech recognition such as text errors, omission of the first character, and vocabulary conflicts across different domains, so that the corrected text is obtained accurately and the accuracy of speech recognition is greatly improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a flowchart of a text correction method according to an embodiment of the present application;
FIG. 1b is a schematic structural diagram of a Pinyin-text prediction model according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a text correction apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In existing human-computer interaction technology, intelligent robots such as service robots and floor-sweeping robots, smart devices such as smart speakers, televisions, and handheld terminals, and intelligent machines such as driverless cars and autonomous service machines all support a human-computer interaction function. When such a device is used, a voice instruction issued by the user can be converted into text through speech recognition; the text is analyzed to identify the user's intention, and a human-machine dialogue is then conducted or the corresponding action is performed. However, because of complex scenes and environments and the varying accents of different users, vocabulary omissions and errors often occur during speech recognition; in addition, words with the same or similar pronunciation in different domains often conflict with each other. Therefore, in practical applications it is necessary to correct problems such as text omissions, text errors, and cross-domain vocabulary conflicts in the speech-to-text process, so as to improve the intelligent machine's subsequent intention recognition and business processing for the user.
In order to solve the above problem, embodiments of the present application provide a text correction method, which may be used in a text correction apparatus. Fig. 1a is a flowchart of a text correction method provided in an embodiment of the present application, and as shown in fig. 1a, the method includes:
s1a, obtaining a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal.
S2a, generating an initial pinyin sequence corresponding to the text to be corrected, and adding a designated characteristic identifier in the initial pinyin sequence to obtain a target pinyin sequence.
S3a, inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set; the pinyin-text prediction model is trained based on a corpus with assigned feature identifiers.
S4a, selecting a corrected text corresponding to the text to be corrected from the candidate text set.
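The four steps S1a–S4a can be sketched as a small pipeline. Every helper passed in here is a hypothetical placeholder standing in for a component described in the sections below, not an API from the patent.

```python
# Hypothetical end-to-end sketch of steps S1a-S4a. All helper functions are
# placeholders for the components the embodiments describe.
def correct_text(text_to_correct, to_pinyin, add_identifiers, predict, select):
    initial_seq = to_pinyin(text_to_correct)    # S2a: generate the initial pinyin sequence
    target_seq = add_identifiers(initial_seq)   # S2a: add the specified feature identifiers
    candidates = predict(target_seq)            # S3a: pinyin-text prediction model
    return select(candidates)                   # S4a: pick the corrected text

# Usage with trivial stand-ins for each stage:
result = correct_text(
    "wo yao qu kuan",
    to_pinyin=lambda t: t.split(),
    add_identifiers=lambda seq: ["[BANK]"] + seq,
    predict=lambda seq: [" ".join(seq)],
    select=lambda cands: cands[0],
)
```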
In this embodiment, the text information obtained by performing speech recognition on a speech signal is referred to as the text to be corrected. The embodiments of the present application do not limit the application scenario of speech recognition; it may be any scenario in which a user's speech signal is converted into text. For example, in a bank, mall, or supermarket, when a user obtains self-service voice service from a service robot, the robot can recognize the user's spoken question as text, so as to understand the user's intention and provide the corresponding service. As another example, a smart terminal such as a mobile phone or tablet computer can recognize a voice command input by the user as text to understand the user's intention. As yet another example, in a smart-home scene, devices such as a television, a smart refrigerator, or a sweeping robot can recognize a voice command input by the user as text to understand the user's intention. In this embodiment, the text to be corrected is taken as the object of text correction, so that an intelligent machine with a speech recognition function (which may be, but is not limited to, a service robot, a smart terminal, a smart-home device, and the like) can understand the user's intention from the corrected text, improving both the accuracy of intention understanding and the accuracy of the actions performed based on that intention.
In this embodiment, the text to be corrected is corrected in the form of pinyin-to-Chinese-character conversion. First, the text to be corrected is analyzed to generate a corresponding initial pinyin sequence, and a specified feature identifier is added to the initial pinyin sequence to obtain a target pinyin sequence containing that identifier. In the embodiments of the present application, the specified feature identifier indicates the feature information needed to solve a problem that may exist in text obtained by speech recognition; there may be one identifier or several. For example, if the text obtained by speech recognition may lack its first character, a first-character identifier may be included to indicate that a first character may need to be added; first-character omission refers to the first character being absent or omitted from the text produced by speech recognition. As another example, if the text obtained by speech recognition may suffer from domain vocabulary conflicts, a scene identifier may be included to indicate the domain scene of the text, so as to resolve vocabulary conflicts across domains; a domain vocabulary conflict refers to text from one domain being erroneously recognized as text from another domain during speech recognition.
Further, in the embodiments of the present application, a pinyin-text prediction model is obtained in advance by training on a training corpus carrying the specified feature identifier. Text prediction is performed on the target pinyin sequence corresponding to the text to be corrected through this model to obtain a candidate text set containing at least one candidate text, and a corrected text corresponding to the text to be corrected can then be selected from the candidate text set. For the problems that may exist in speech recognition, the feature identifiers needed to solve them are added to the training corpus; a pinyin-text prediction model (that is, a model that converts pinyin into text) trained on such a corpus has error-correction capability, can handle those problems, and outputs a candidate text set in which the corresponding problems are overcome. In actual use, a text to be corrected obtained by speech recognition is converted into a target pinyin sequence carrying the corresponding feature identifiers, the target pinyin sequence is input into the pinyin-text prediction model for text prediction, and candidate texts are obtained that, to some extent, overcome the speech recognition problems (such as first-character omission and domain vocabulary conflicts) corresponding to the feature identifiers in the target pinyin sequence. This achieves the purpose of text correction, overcomes the problems in speech recognition, improves the accuracy of text correction, and makes subsequent intention recognition and action execution based on the corrected text more accurate.
In the embodiments of the present application, the initial pinyin sequence refers to the pinyin sequence generated directly from the text to be corrected, that is, the pinyin sequence before the specified feature identifier is added. The manner of generating the initial pinyin sequence is not limited. In an optional embodiment, at least one pinyin mode may be provided; different pinyin modes correspond to different pinyin characteristics and produce different pinyin effects, and in actual use a mode adapted to the pinyin characteristics of the input text may be selected. On this basis, one way to generate the initial pinyin sequence corresponding to the text to be corrected includes: selecting a target pinyin mode from the at least one pinyin mode according to the pinyin characteristics of the text to be corrected, and generating the initial pinyin sequence corresponding to the text to be corrected according to the target pinyin mode.
Optionally, the at least one pinyin mode mentioned in this embodiment may include the following: a tone-removed pinyin mode, a tone-bearing pinyin mode, a tone-removed mode with initials and finals separated, a tone-bearing mode with initials and finals separated, a mode using only the first initial or final, and a final-subdivision pinyin mode. The tone-removed pinyin mode contains only pinyin, without tones; the tone-bearing pinyin mode contains both pinyin and tones; the tone-removed mode with initials and finals separated contains only pinyin, without tones, with adjacent initials and finals spaced apart; the tone-bearing mode with initials and finals separated contains tones, with adjacent initials and finals spaced apart; the mode using only the first initial or final uses only the first pinyin letter of each character; and the final-subdivision pinyin mode spaces apart all adjacent letters in the pinyin. For example, suppose a user withdrawing money at a bank issues the voice instruction "I want to withdraw money" to the bank service robot. The robot can recognize the instruction to obtain the corresponding text to be corrected, select a target pinyin mode from the modes listed in Table 1 below according to the pinyin characteristics of that text, and then generate the corresponding initial pinyin sequence according to the target pinyin mode. In Table 1, the numerals denote tones: 3 represents the third tone and 4 represents the fourth tone.
TABLE 1

Pinyin mode                                      | Example: I want to withdraw money
Tone-removed pinyin mode                         | wo yao qu kuan
Tone-bearing pinyin mode                         | wo3 yao4 qu3 kuan3
Tone-removed, initials and finals separated      | /w o/y ao/q u/k uan
Tone-bearing, initials and finals separated      | /w o 3/y ao 4/q u 3/k uan 3
Only the first initial or final                  | w y q k
Final-subdivision pinyin mode                    | w o y a o q u k u a n
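As an illustration, the six modes in Table 1 can be produced from per-character (initial, final, tone) triples. The triples for "I want to withdraw money" are hard-coded here; a real system would obtain them from a pinyin lexicon, and the helper names are made up for the sketch.

```python
# (initial, final, tone) triples for the four characters of "I want to withdraw money".
SYLLABLES = [("w", "o", 3), ("y", "ao", 4), ("q", "u", 3), ("k", "uan", 3)]

def toneless(sylls):            # tone-removed pinyin mode
    return " ".join(i + f for i, f, _ in sylls)

def toned(sylls):               # tone-bearing pinyin mode
    return " ".join(f"{i}{f}{t}" for i, f, t in sylls)

def toneless_split(sylls):      # tone removed, initials and finals separated
    return "".join(f"/{i} {f}" for i, f, _ in sylls)

def toned_split(sylls):         # tone-bearing, initials and finals separated
    return "".join(f"/{i} {f} {t}" for i, f, t in sylls)

def first_letter(sylls):        # only the first initial or final letter
    return " ".join(i[0] if i else f[0] for i, f, _ in sylls)

def letter_split(sylls):        # final-subdivision: every adjacent letter spaced apart
    return " ".join(" ".join(i + f) for i, f, _ in sylls)
```

Each helper reproduces the corresponding row of Table 1, e.g. `toned_split(SYLLABLES)` yields `/w o 3/y ao 4/q u 3/k uan 3`.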
In this embodiment, at least one pinyin mode is provided, so that a suitable mode can be flexibly selected according to the pinyin characteristics of the text to be corrected, targeting error types in text correction such as same sound with same tone, same sound with different tones, and same initial with different finals. For example, the tone-bearing mode with initials and finals separated can be selected as the target pinyin mode according to the pinyin characteristics of the text to be corrected, and the initial pinyin sequence is then generated accordingly. For instance, assume the text to be corrected is "which financial product is common"; the initial pinyin sequence generated in the tone-bearing, initials-and-finals-separated mode is: /c ai 2/ch an 3/p in 3/n a 3/g e 4/ch ang 2/j ian 4/.
In the embodiments of the present application, after the initial pinyin sequence is obtained, text correction is not performed directly on it; instead, the specified feature identifier is added to the initial pinyin sequence to obtain a target pinyin sequence carrying that identifier, and text correction is performed on the basis of the target pinyin sequence. The position at which the specified feature identifier is added is not limited. For example, it may be added before the first letter of the initial pinyin sequence, that is, at the front; after the last letter, that is, at the end; or at an intermediate position within the sequence. Taking the initial pinyin sequence c ai 2 ch an 3 p in 3 n a 3 g e 4 ch ang 2 j ian 4 as an example: adding the specified feature identifier at the front gives [identifier] c ai 2 ch an 3 p in 3 n a 3 g e 4 ch ang 2 j ian 4; adding it at the end gives c ai 2 ch an 3 p in 3 n a 3 g e 4 ch ang 2 j ian 4 [identifier]; and adding it in the middle gives, for example, c ai 2 ch an 3 [identifier] p in 3 n a 3 g e 4 ch ang 2 j ian 4.
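A minimal sketch of the three insertion positions, treating the pinyin sequence as space-separated tokens. The function name and the middle-position rule (halfway by token count) are assumptions for illustration only.

```python
def add_identifier(pinyin_seq, identifier, position="front"):
    """Insert a feature identifier at the front, end, or middle of a pinyin sequence."""
    tokens = pinyin_seq.split()
    if position == "front":
        tokens.insert(0, identifier)      # before the first letter
    elif position == "end":
        tokens.append(identifier)         # after the last letter
    else:
        tokens.insert(len(tokens) // 2, identifier)  # an intermediate position
    return " ".join(tokens)
```

For example, `add_identifier("c ai 2 ch an 3", "[F]", "front")` returns `[F] c ai 2 ch an 3`.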
In the embodiments of the present application, the specified feature identifier is added to the initial pinyin sequence in order to mark the feature information needed to correct the initial pinyin sequence, or the text it expresses. Depending on the application scene and the speech recognition capability of the intelligent machine, the specified feature identifiers required by text obtained from speech recognition can differ, and can be determined according to actual needs. For example, for a speech recognition scenario in which vocabulary conflicts may arise between domains, the specified feature identifiers may include a scene identifier representing the domain scene to which the text to be corrected belongs, so as to correct cross-domain vocabulary conflicts. As another example, for a scenario in which the first character may be omitted, the specified feature identifiers may include a first-character identifier indicating that a first character may need to be added, so as to correct the first-character-omission problem. As yet another example, in scenarios where both cross-domain vocabulary conflicts and first-character omission may occur, the specified feature identifiers may include both the scene identifier and the first-character identifier.
The first-character identifier only indicates that the text to be corrected may need a first character added; it does not assert that the first character was certainly omitted. For a text to be corrected that did omit its first character, the first character can be added in the candidate texts output by the pinyin-text prediction model; for a text with no missing first character, the candidate texts output by the model may carry identification information indicating that no first character is added, or may carry no prompt information at all.
Based on the above analysis, when the specified feature identifier is added to the initial pinyin sequence, at least one of the scene identifier and the first-character identifier may be added to obtain the target pinyin sequence. For example, assume the text to be corrected is "which financial product is common", whose initial pinyin sequence is c ai 2 ch an 3 p in 3 n a 3 g e 4 ch ang 2 j ian 4; the target pinyin sequence obtained after adding the scene identifier and the first-character identifier is: [BANK] [F] c ai 2 ch an 3 p in 3 n a 3 g e 4 ch ang 2 j ian 4. Here [BANK] is the scene identifier, indicating in this example that the domain scene to which the text to be corrected belongs is a bank scene, and [F] is the first-character identifier, indicating in this example that the text to be corrected lacks its first character, which needs to be added during correction. The implementations of the scene identifier and the first-character identifier here are merely examples and are not limiting. A target pinyin sequence containing the scene identifier carries feature information about the domain scene to which the text to be corrected belongs, which can be used to correct conflicts between the text and vocabularies of other domains; a target pinyin sequence containing the first-character identifier carries feature information about whether the text lacks its first character, which can be used to correct the first-character-omission problem.
It should be noted that the embodiments of the present application do not limit how the specified feature identifier is represented; any notation that distinguishes the identifier from the pinyin sequence is applicable. For example, the identifier may be delimited by square brackets [ ] or by double quotation marks " ".
In the embodiments of the present application, after the target pinyin sequence is obtained, it can be input into a pretrained pinyin-text prediction model for text prediction to obtain a candidate text set containing at least one candidate text. As shown in fig. 1b, the pinyin-text prediction model includes an encoding (Encoder) network and a decoding (Decoder) network; each can be implemented with structures such as a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network, or a Transformer. As shown in fig. 1b, the input to the encoding network is the target pinyin sequence (the scene identifier and/or first-character identifier, followed by X1, X2, …, Xn), where X1, X2, …, Xn each represent the pinyin of one character in the text to be corrected; the encoding network encodes the target pinyin sequence into a first feature vector of fixed size and outputs it to the decoding network. Optionally, as shown in fig. 1b, the encoding network may include multiple layers: the first layer vectorizes each character's pinyin in the input target pinyin sequence, character by character, to obtain a feature vector per character pinyin and passes these to the other layers, which vectorize the pinyin sequence corresponding to the text to be corrected from the perspective of syntax and semantics to obtain the first feature vector.
After obtaining the first feature vector provided by the encoding network, the decoding network decodes it and outputs a second feature vector (Y1, Y2, …, Yn), where Y1, Y2, …, Yn each represent the feature vector corresponding to one character in the text to be corrected, and each such vector represents the output probability of that character. As shown in fig. 1b, the input of the decoding network includes not only the first feature vector output by the encoding network but also the feature vectors of the characters it has already output. After the second feature vector is obtained, at least one candidate text can be derived from it to form the candidate text set. In summary, after the target pinyin sequence is input into the pinyin-text prediction model, the model encodes it into a first feature vector of fixed size using the encoding network, decodes the first feature vector into a second feature vector using the decoding network, and obtains a candidate text set containing at least one candidate text from the second feature vector.
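The encoder/decoder flow of fig. 1b can be illustrated with a deliberately tiny numeric toy: the encoder pools the embeddings of the identifier and pinyin tokens into one fixed-size first feature vector, and the decoder greedily emits a probability vector over a small character vocabulary at each step, feeding its previous output back in. All dimensions, the vocabulary, and the random weights are invented for the sketch; a real model would use a trained CNN/RNN/LSTM/Transformer as the patent describes.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["cai", "chan", "pin", "na", "ge", "chang", "jian", "li"]  # toy character vocabulary
EMB_DIM, HID = 8, 8
TOKENS = ["[BANK]", "[F]", "cai2", "chan3", "pin3", "na3", "ge4", "chang2", "jian4"]
embed = {tok: rng.normal(size=EMB_DIM) for tok in TOKENS}  # toy token embeddings
W_enc = rng.normal(size=(EMB_DIM, HID))
W_dec = rng.normal(size=(HID, len(VOCAB)))

def encode(seq):
    """Encode the identifier + pinyin tokens into one fixed-size feature vector."""
    return np.tanh(np.mean([embed[t] for t in seq], axis=0) @ W_enc)

def decode(h, n_steps):
    """Greedy decoding: each step outputs a probability vector over VOCAB,
    conditioned on the encoder vector and the previously emitted character."""
    out, prev = [], np.zeros(HID)
    for _ in range(n_steps):
        logits = (h + prev) @ W_dec
        probs = np.exp(logits) / np.exp(logits).sum()
        idx = int(np.argmax(probs))
        out.append(VOCAB[idx])
        prev = np.tanh(h + 0.1 * idx)  # feed back the previous output
    return out

seq = ["[BANK]", "[F]", "cai2", "chan3", "pin3", "na3", "ge4", "chang2", "jian4"]
h = encode(seq)                   # first feature vector, fixed size
chars = decode(h, len(seq) - 2)   # one output step per character pinyin
```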
After the candidate text set is obtained, one candidate text can be selected from it as the corrected text corresponding to the text to be corrected. For example, assume the text to be corrected is "which financial product is common"; the candidate text set obtained after the above processing includes candidates such as "[BANK] which property product is common", "[BANK] which financing product is common", and "[BANK] which material product is common", and one of them can be selected as the corrected text, for example "[BANK] which financing product is common". The embodiments of the present application do not limit the manner of selecting the corrected text from the candidate text set. For example, in an optional embodiment, a score is first calculated for each candidate text in the candidate text set using a language model; the scores of candidate texts containing known domain keywords are then adjusted; and the candidate text with the highest adjusted score is selected as the corrected text corresponding to the text to be corrected.
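The selection step just described (language-model score plus an adjustment for known domain keywords) might look like the following sketch. The multiplicative boost factor and the substring keyword test are assumptions for illustration; the patent only says the scores of keyword-bearing candidates are adjusted.

```python
def select_corrected(candidates, lm_score, domain_keywords, boost=2.0):
    """Score candidates with a language model, boost those containing known
    domain keywords, and return the highest-scoring candidate."""
    best, best_score = None, float("-inf")
    for text in candidates:
        s = lm_score(text)
        if any(kw in text for kw in domain_keywords):
            s *= boost  # adjust scores of candidates containing domain keywords
        if s > best_score:
            best, best_score = text, s
    return best

# Toy usage: equal base scores, so the domain keyword decides the winner.
picked = select_corrected(
    ["which property product is common", "which financing product is common"],
    lm_score=lambda t: 1.0,
    domain_keywords=["financing"],
)
```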
The language model is a model for calculating the probability of a sentence: a probability value can be calculated for each word in a candidate text, a probability value for the whole candidate text can then be derived from those word probabilities, and that probability value can be used as the score of the candidate text. In practical applications, the way of calculating the score is not limited; for each word in a candidate text, the probability of the word occurring can be calculated conditioned on all the words before it. For example, suppose S is a candidate text composed of a sequence of words W1, W2, …, Wn in a particular order, and P(W1, W2, …, Wn) is the probability of S occurring in the text corpus. By the conditional probability (chain) rule, the probability of S is the product of the conditional probabilities of its words:

P(S) = P(W1, W2, …, Wn) = P(W1) P(W2|W1) P(W3|W1, W2) … P(Wn|W1, W2, …, Wn-1)

where P(W1) is the probability of the word W1 occurring, and P(W2|W1) is the probability of the word W2 occurring given that W1 has occurred. By analogy, the probability of each candidate text, i.e. its score, can be calculated. Alternatively, the score of the candidate text may be calculated based on an N-gram algorithm. An N-gram is a statistical language model that predicts the Nth word from the preceding (N-1) words using statistical methods. Taking the bigram as an example, it is assumed that the occurrence of a word depends only on the word immediately before it; that is, for each word in a candidate text, its probability is calculated from the probability of the word preceding it:

P(S) = P(W1) P(W2|W1) P(W3|W2) … P(Wn|Wn-1), with P(Wi|Wi-1) = C(Wi-1, Wi) / C(Wi-1)

where C(Wi-1, Wi) is the number of times the word pair (Wi-1, Wi) appears in the vocabulary library used by the language model, and C(Wi-1) is the number of times Wi-1 appears. From these counts, the conditional probabilities, and hence the probability of each candidate text, i.e. its score, can be calculated.
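As a minimal sketch, the bigram scoring described above can be implemented as follows. Add-one smoothing is added here to avoid zero counts; the smoothing, the toy corpus, and all names are assumptions for illustration, not part of the patent:

```python
from collections import Counter

def train_bigram_counts(corpus):
    """Count unigrams C(Wi) and bigrams C(Wi-1, Wi) over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def bigram_score(sentence, unigrams, bigrams, total):
    """P(S) = P(W1) * prod P(Wi|Wi-1), with add-one smoothing."""
    vocab = len(unigrams)
    score = (unigrams[sentence[0]] + 1) / (total + vocab)      # P(W1)
    for prev, cur in zip(sentence, sentence[1:]):
        # P(Wi|Wi-1) = (C(Wi-1, Wi) + 1) / (C(Wi-1) + |V|)
        score *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return score

corpus = [["i", "am", "fujianese"], ["i", "withdraw", "money"]]
uni, bi = train_bigram_counts(corpus)
total = sum(uni.values())
s1 = bigram_score(["i", "am", "fujianese"], uni, bi, total)
s2 = bigram_score(["i", "am", "money"], uni, bi, total)
assert s1 > s2  # the sequence seen in the corpus scores higher
```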
After the score of each candidate text is obtained, the candidate text with the highest score may optionally be selected directly as the corrected text corresponding to the text to be corrected, but the embodiment is not limited to this. In this embodiment, after the score of each candidate text is obtained, the candidate texts containing known domain keywords are determined; the scores of those candidate texts are adjusted, mainly by increasing them; and the candidate text with the highest adjusted score is selected as the corrected text corresponding to the text to be corrected. In practical applications, the number of domain keywords contained in each candidate text may be counted, and the score of each candidate text containing domain keywords may be adjusted according to that count. For example, a weight factor for each such candidate text may be calculated from the number of domain keywords it contains, the text's score is multiplied by the corresponding weight factor to obtain a new score, all candidate texts are re-ranked according to the updated scores, and the text with the highest score is used as the corrected text corresponding to the text to be corrected.
Alternatively, the score of a candidate text containing domain keywords may be adjusted with the formula score' = score × (1 + 0.5 × Ncount), where Ncount is the number of domain keywords the candidate text contains and (1 + 0.5 × Ncount) is the corresponding weight factor. Of course, the adjustment is not limited to this formula in practical applications. It should be noted that, once the candidate texts containing domain keywords have been determined, their scores need not be adjusted according to the keyword count; instead, the highest-scoring candidate text among those containing domain keywords may be selected directly as the corrected text corresponding to the text to be corrected. Alternatively, the embodiment of the present application may use any method that can screen out the text best suited to the current domain scene, such as comparing the candidate texts containing domain keywords and taking the one that best matches the current context as the corrected text corresponding to the text to be corrected. In addition, known domain keywords may be stored in a keyword dictionary or keyword lexicon, which is not limited here.
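The re-ranking step above can be sketched as follows. Only the (1 + 0.5 × Ncount) weight factor comes from the text; the candidate strings, scores, and keyword set are made up for illustration:

```python
def adjust_scores(candidates, scores, domain_keywords):
    """Multiply each candidate's language-model score by (1 + 0.5 * Ncount),
    where Ncount is the number of domain keywords it contains, then re-rank."""
    adjusted = []
    for text, score in zip(candidates, scores):
        ncount = sum(1 for kw in domain_keywords if kw in text)
        adjusted.append((text, score * (1 + 0.5 * ncount)))
    adjusted.sort(key=lambda pair: pair[1], reverse=True)  # best first
    return adjusted

candidates = ["common financing product", "common financial product",
              "common null product"]
scores = [0.30, 0.28, 0.35]
ranked = adjust_scores(candidates, scores, {"financing", "financial"})
# The keyword-free candidate led on raw score (0.35) but is overtaken
# by the keyword-bearing ones after weighting.
assert ranked[0][0] == "common financing product"
```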
In the embodiment of the application, before the pinyin-text prediction model is used, it can be obtained by training on a training corpus carrying the designated feature identifier. The process of training the pinyin-text prediction model comprises: acquiring a training corpus, wherein the training corpus comprises a text corpus and a pinyin corpus corresponding to the text corpus; adding the designated feature identifier to the pinyin corpus in the training corpus; and performing model training on the training corpus with the designated feature identifier to obtain the pinyin-text prediction model. In this embodiment, the designated feature identifier represents the feature information needed to address problems that may exist in text obtained by speech recognition. Model training on a corpus carrying the designated feature identifier yields a pinyin-text prediction model with error-correction capability; text prediction with that model can then resolve the problems arising in the speech recognition process and output a candidate text set that overcomes them.
In the present embodiment, the form of the designated feature identifier added to the corpus is not limited. For example, for a speech recognition scenario in which the first word may be omitted, a first-character identifier may be added to the corpus; that is, the designated feature identifier includes the first-character identifier. In this case, the process of acquiring the training corpus includes: acquiring a text corpus; combining the text corpus with its correct pinyin sequence to generate a standard training corpus; replacing correct pinyin in the standard training corpus with known fuzzy sounds to obtain a fuzzy-sound training corpus; and removing the pinyin of the first character from the standard training corpus and the fuzzy-sound training corpus to obtain the first-character-missing training corpus.
Regarding obtaining text corpora:
in a speech recognition scenario where the first word may be missing, the domain scene of the text corpus is not limited: text from a general-domain scene may be acquired as the text corpus, text from a vertical-domain scene may be acquired, or both may be acquired at the same time. It should be noted that, for a speech recognition scenario in which domain vocabulary conflicts may exist, texts in the general-domain and vertical-domain scenes may be obtained simultaneously as text corpora, and accordingly a scene identifier needs to be added to the training corpus. The general-domain text corpus is relatively large in scale and covers most general words; the vertical-domain text corpus is relatively small in scale, generally covers only the words of a specific field, and carries domain characteristics. After the text corpus is obtained, it is analyzed to obtain the corresponding pinyin sequence and text, and a domain identifier is added to the pinyin sequence to serve as the input of the training corpus, so as to solve the vocabulary-conflict problem caused by different domains in the speech recognition process. Examples of the obtained text corpora include "I am Fujianese", "common financial products", "I withdraw money", and the like.
In the embodiment of the present application, the training corpora used for model training are in sentence-pair format, i.e. (pinyin sequence (input), text corpus (output)). In each corpus, the pinyin sequence serves as the input of model training, and the text corpus as the output. In view of the various problems that may need correction in a speech recognition scenario, such as first-word omission and confusable sounds, some optional embodiments of the application construct the training corpus from three kinds of sentence pairs: the standard training corpus, the fuzzy-sound training corpus, and the first-character-missing training corpus. The generation of these three corpora is described in detail below.
Regarding the generation of the standard training corpus: the standard training corpus is formed from the text corpus and its corresponding correct pinyin sequence, with the sentence-pair structure (correct pinyin sequence (input), text corpus (output)). Optionally, after the text corpus is collected, word segmentation may be performed on it, and the segmentation result converted into the corresponding pinyin to obtain the correct pinyin sequence; combining the two yields the standard training corpus. Taking the text corpus "I am Fujianese" in the above embodiment as an example, the corresponding standard training corpus is (w o3 sh i4 f u2 j ian4 r en2, I am Fujianese).
Regarding the generation of the fuzzy-sound training corpus: the fuzzy-sound training corpus is obtained by replacing correct pinyin in the standard training corpus with known fuzzy sounds; its sentence-pair structure is (wrong pinyin sequence after fuzzy-sound replacement (input), text corpus (output)). The known fuzzy sounds are a pinyin corpus obtained by analyzing and sorting pronunciations and text corpora that are commonly confused in specific scenes; common fuzzy-sound replacements are listed in Table 2 below. Specifically, known fuzzy sounds can be used to replace the relevant pinyin in the correct pinyin sequence of the standard training corpus, yielding a wrong pinyin sequence; combining the wrong pinyin sequence with the text corpus gives the fuzzy-sound training corpus. For example, for the text corpus "I am Fujianese" in the above embodiment, the standard training corpus is (w o3 sh i4 f u2 j ian4 r en2, I am Fujianese), and the fuzzy-sound training corpus obtained after fuzzy-sound replacement is (w o3 sh i4 h u2 j ian4 r en2, I am Fujianese).
TABLE 2
(Table 2, listing common fuzzy-sound replacements, appears as image BDA0002540019920000141 in the original publication and is not reproduced here.)
Replacing the correct pinyin in the standard training corpus with known fuzzy sounds allows words with identical or similar pronunciations to be corrected and resolves the conflicts between such words. Training on the corpus after fuzzy-sound replacement gives the resulting pinyin-text prediction model higher prediction accuracy; trained on the replaced fuzzy-sound corpus, the model can still predict the true text corresponding to the text corpus: "I am Fujianese".
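A minimal sketch of the fuzzy-sound replacement step follows. The replacement table here is illustrative (a few initials commonly listed as confusable), not the patent's Table 2, and syllable-level pinyin tokens are used instead of the split initial/final form for simplicity:

```python
# Illustrative fuzzy-sound map over syllable initials (an assumption,
# not the patent's replacement table).
FUZZY_MAP = {"f": "h", "n": "l", "zh": "z", "ch": "c", "sh": "s"}

def fuzzify(pinyin_tokens):
    """Replace the initial of each syllable with its known fuzzy sound."""
    out = []
    for syl in pinyin_tokens:
        # try longer initials (zh/ch/sh) before single-letter ones
        for initial, fuzzy in sorted(FUZZY_MAP.items(), key=lambda kv: -len(kv[0])):
            if syl.startswith(initial):
                out.append(fuzzy + syl[len(initial):])
                break
        else:
            out.append(syl)  # no confusable initial: keep as-is
    return out

standard = ("wo3 shi4 fu2 jian4 ren2".split(), "I am Fujianese")
fuzzy = (fuzzify(standard[0]), standard[1])  # the fuzzy-sound sentence pair
assert fuzzy[0] == ["wo3", "si4", "hu2", "jian4", "ren2"]
```

In practice only a subset of syllables would be replaced per sentence, so one standard pair can yield several fuzzy-sound pairs.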
Regarding the generation of the first-character-missing training corpus: the first-character-missing training corpus is obtained by removing the pinyin of the first character from the standard training corpus and the fuzzy-sound training corpus; its sentence-pair structure is (wrong pinyin sequence after removing the first character's pinyin (input), text corpus (output)). Because the first-word-omission phenomenon is caused by complex environments in speech recognition scenes, the first pinyin (including initial and final) in the pinyin sequence corresponding to the text corpus is removed, and the remaining pinyin sequence is used as a training corpus for model training. For example, after removing the first character's pinyin from the correct pinyin sequence in the standard training corpus for "I am Fujianese" in the above embodiment, the resulting first-character-missing training corpus is (sh i4 f u2 j ian4 r en2, I am Fujianese); similarly, removing the first character's pinyin from the wrong pinyin sequence in the corresponding fuzzy-sound training corpus gives (sh i4 h u2 j ian4 r en2, I am Fujianese). Training on the first-character-missing corpora formed in this way enables the resulting pinyin-text prediction model to correct the first-character-missing problem during prediction and improves its prediction accuracy; trained on the first-character-missing corpus, the model can still predict the true text corresponding to the text corpus: "I am Fujianese".
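The three kinds of sentence pairs can be generated together in one pass, sketched below. Syllable-level tokens and the single f→h swap are illustrative assumptions; the real pipeline would apply a full fuzzy-sound table:

```python
def build_corpora(text, correct_pinyin, fuzzify):
    """Build the three sentence-pair corpora described above:
    standard, fuzzy-sound, and first-character-missing."""
    standard = (correct_pinyin, text)
    fuzzy = (fuzzify(correct_pinyin), text)
    # drop the first syllable (initial + final) to simulate a lost first word
    missing = [(standard[0][1:], text), (fuzzy[0][1:], text)]
    return [standard, fuzzy] + missing

pairs = build_corpora(
    "I am Fujianese",
    ["wo3", "shi4", "fu2", "jian4", "ren2"],
    # hypothetical single f -> h fuzzy swap for this example
    lambda syls: ["hu2" if s == "fu2" else s for s in syls],
)
assert pairs[2][0] == ["shi4", "fu2", "jian4", "ren2"]   # first syllable removed
assert pairs[3][0] == ["shi4", "hu2", "jian4", "ren2"]
```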
Regarding the addition of the designated feature identifier: in the embodiment of the application, a first-character identifier and/or a domain identifier can be added to the training corpus to address the two situations of first-word omission and/or domain vocabulary conflict. The first-character identifier and the domain identifier can each be added separately to the beginning of the corpus, or applied in combination. Taking toned pinyin as the form of the pinyin corpus, the addition of the designated feature identifier is illustrated with the three training corpora above: Table 3 uses the domain identifier alone, Table 4 uses the first-character identifier alone, and Table 5 uses both the first-character identifier and the domain identifier.
The domain identifier represents the domain scene of the corpus, such as [ BANK ] for the banking domain, [ MEDICAL ] for the medical domain, and [ COM ] for the general-domain scene. The first-character identifier is the same in all training corpora (i.e. in the model-training input) and is represented by [ F ], indicating that the first-word-omission problem may exist, i.e. that a first character may need to be added. In the model output, null indicates that no first character needs to be added; otherwise, the default character is added directly at the initial position.
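Prefixing the identifiers can be sketched as follows. The bracketed identifier strings follow the text; the function and its parameters are illustrative assumptions:

```python
def add_identifiers(pinyin_tokens, domain=None, first_char=False):
    """Prefix a pinyin sequence with the optional domain identifier
    (e.g. [BANK], [MEDICAL], [COM]) and/or the first-character
    identifier [F], as in Tables 3-5."""
    prefix = []
    if domain:
        prefix.append(f"[{domain}]")
    if first_char:
        prefix.append("[F]")
    return prefix + pinyin_tokens

seq = "shi4 fu2 jian4 ren2".split()
assert add_identifiers(seq, domain="BANK") == \
    ["[BANK]", "shi4", "fu2", "jian4", "ren2"]          # Table 3 style
assert add_identifiers(seq, first_char=True)[0] == "[F]"  # Table 4 style
assert add_identifiers(seq, domain="COM", first_char=True)[:2] == \
    ["[COM]", "[F]"]                                     # Table 5 style
```

Because the identifiers are ordinary tokens at the head of the input sequence, the model can learn to condition its output on them without any architectural change.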
TABLE 3
(Table 3, showing training corpora with the domain identifier alone, appears as image BDA0002540019920000151 in the original publication and is not reproduced here.)
TABLE 4
(Table 4, showing training corpora with the first-character identifier alone, appears as image BDA0002540019920000161 in the original publication and is not reproduced here.)
TABLE 5
(Table 5, showing training corpora with both the first-character identifier and the domain identifier, appears as image BDA0002540019920000162 in the original publication and is not reproduced here.)
In the embodiment of the application, after the training corpus with the designated feature identifier is obtained, model training can be performed on it to obtain the pinyin-text prediction model. Optionally, the model-training process comprises: pre-training a preset network structure model with the training corpus corresponding to the general-domain scene to obtain an initialization model; and fine-tuning the initialization model with the training corpus corresponding to the vertical-domain scene to obtain the pinyin-text prediction model. The training corpus corresponding to the general-domain scene is generated by analyzing and processing text corpora in the general-domain scene; the training corpus corresponding to the vertical-domain scene is generated by analyzing and processing text corpora in the vertical-domain scene. Because the general-domain text corpus is relatively large in scale and covers most general words, a model trained on it yields broader predictions and can predict most general text content; the vertical-domain text corpus is relatively small in scale and generally covers only the words of a specific field, so a model trained on it yields predictions for that field that are more targeted and more accurate. Through this two-round training on the corpus with the designated feature identifier, the obtained pinyin-text prediction model can solve the problems of text errors, text omission, and cross-domain vocabulary conflicts in pinyin-to-text conversion, and its prediction results are more accurate.
In the embodiment of the present application, the implementation of the preset network structure model is not limited. In this embodiment, a MAsked Sequence to Sequence pre-training (MASS) model with the structure shown in fig. 1b may be adopted, implemented with the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), and the like mentioned in the above embodiments, to perform pre-training and fine-tuning on the training corpus and obtain the pinyin-text prediction model. Pre-training refers to encoding and decoding the training corpus over the large-scale general-domain corpus to obtain the initialization model; fine-tuning refers to encoding and decoding the training corpus over the small-scale vertical-domain corpus to obtain the pinyin-text prediction model.
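The two-round schedule can be sketched as follows. The learning rates and the `train_step` interface are assumptions; a real implementation would use a MASS-style seq2seq model rather than the toy stand-in below:

```python
def two_stage_training(model, general_corpus, vertical_corpus, train_step):
    """Two-round training: pre-train on the large general-domain corpus to
    get the initialization model, then fine-tune on the small vertical-domain
    corpus (a smaller learning rate here is a common but assumed choice)."""
    for pair in general_corpus:       # pre-training phase
        train_step(model, pair, lr=1e-3)
    for pair in vertical_corpus:      # fine-tuning phase
        train_step(model, pair, lr=1e-4)
    return model

# Toy train_step that just records the schedule, to show the call order.
calls = []
two_stage_training({}, ["general-1", "general-2"], ["vertical-1"],
                   lambda model, pair, lr: calls.append((pair, lr)))
assert calls == [("general-1", 1e-3), ("general-2", 1e-3), ("vertical-1", 1e-4)]
```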
In the above embodiments, the implementation of the overall text correction process is not limited and may be determined by the specific implementation form of the intelligent machine. For example, for an intelligent machine supporting human-computer interaction, if its processing capability is strong enough and a text correction module and a pinyin-text prediction model are built in, the whole text correction process can be completed on the intelligent machine; if its processing capability is limited and it can rely on a remote server, the text correction process can be completed on the remote server corresponding to the intelligent machine. Taking pinyin-to-Chinese conversion as an example, the whole text correction process of the embodiment of the present application is described in detail below with reference to specific scenario embodiments:
scenario example 1:
taking a service robot as an example of an autonomous service machine with a human-computer interaction function, the service robot has a built-in speech recognition module and a text correction module, and the text correction module uses the pinyin-text prediction model for text correction. When the user needs the service robot to perform a service task, a voice instruction can be issued to it. For example, if the service robot is a bank service robot and the user needs to register withdrawal information when handling a withdrawal, the user can issue the voice instruction "Register withdrawal information!". For another example, if the service robot serves a shopping mall or supermarket and the user wants to locate the fruit and vegetable area while shopping, the user can issue the voice instruction "Where is the fruit and vegetable area?". After receiving the voice instruction, the service robot recognizes it as text through the built-in speech recognition module, and then corrects the recognized text, as the text to be corrected, through the built-in text correction module. Specifically, the text correction module may generate an initial pinyin sequence for the text to be corrected, add a first-character identifier and/or a scene identifier to it to obtain a target pinyin sequence, and input the target pinyin sequence into the pinyin-text prediction model for text prediction, obtaining the corrected text. The service robot can then understand the user's intention from the corrected text and perform the corresponding action.
For example, if the user's intention is understood to be registering withdrawal information, a withdrawal-information registration window may be presented on the electronic screen for the user to fill in, or the location of the manual withdrawal window may be announced to the user by voice broadcast. If the user's intention is to find the fruit and vegetable area, an electronic map of the mall or supermarket can be displayed on the screen with a route from the current position to the fruit and vegetable area marked on it, or the route can be announced to the user by voice broadcast.
Scenario example 2:
in the embodiment of the application, if the processing capability of the autonomous service machine is not strong enough, the human-computer interaction process can rely on a remote server. The autonomous service machine has a built-in speech recognition module, while the text correction module is deployed on the remote server and uses the pinyin-text prediction model for text correction. For example, taking a sweeping robot as the autonomous service machine with a human-computer interaction function, when the user needs it to perform a cleaning task, the voice instruction "Please clean the kitchen and the living room" can be issued. Or, when the sweeping robot issues a low-battery alarm during a cleaning task, the user, on hearing the alarm, can issue the voice instruction "Return to the charging dock to charge". After receiving the voice instruction, the sweeping robot recognizes it as text through the built-in speech recognition module and uploads the text to the text correction module on the remote server, which corrects it as the text to be corrected. Specifically, the text correction module may generate an initial pinyin sequence for the text to be corrected, add a first-character identifier and/or a scene identifier to it to obtain a target pinyin sequence, and input the target pinyin sequence into the pinyin-text prediction model for text prediction, obtaining the corrected text.
The server then returns the corrected text to the sweeping robot, which can understand the user's intention from it and perform the corresponding action. For example, if the intention is to clean the kitchen and the living room, the robot may move to the kitchen to perform the cleaning task there first, and then move to the living room. If the intention is to return to the charging dock, the robot can record its current cleaning position, start the recharge action, move toward the charging dock, and begin charging once successfully docked.
Alternatively, after receiving the user's voice instruction, the sweeping robot directly uploads it to the server; the server performs speech recognition on the instruction and converts it into text. The text correction module on the server then corrects the text as the text to be corrected and returns the corrected text to the sweeping robot.
In the embodiment of the application, by training on text corpora from general-domain and vertical-domain scenes and incorporating into the training corpus the feature information corresponding to problems such as first-word omission and domain vocabulary conflict, a pinyin-text prediction model with error-correction capability can be obtained. Performing text prediction with this model on the feature-identifier-bearing pinyin sequence of the text to be corrected can solve problems such as text errors, first-word omission, and cross-domain vocabulary conflicts in the speech recognition process, yielding an accurate corrected text and greatly improving the accuracy of speech recognition.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps S1a to S4a may be device a; for another example, the execution subject of steps S1a and S2a may be device a, and the execution subject of steps S3a and S4a may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations occurring in a specific order are included, but it should be clearly understood that the operations may be executed out of the order they appear herein or in parallel, and the sequence numbers of the operations, such as S1a, S2a, etc., are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 2 is a schematic structural diagram of a text correction apparatus according to an embodiment of the present application. The text correction device 100 provided in the embodiment of the present application may be an intelligent machine with a human-computer interaction function, for example, may be an intelligent robot such as a service robot or a floor sweeping robot, may also be an intelligent device such as an intelligent sound box, a television, a handheld terminal, or may also be an unmanned automobile; or a server device cooperating with the intelligent machine.
As shown in fig. 2, the text correction apparatus 100 includes: a processor 10 and a memory 20 storing computer instructions. There may be one or more of each of the processor 10 and the memory 20.
The memory 20 is mainly used for storing computer programs, which can be executed by the processor 10, so that the processor 10 controls the text correction apparatus 100 to implement corresponding functions and complete corresponding actions or tasks. In addition to storing computer programs, the memory 20 may also be configured to store other various data to support operations on the text correction apparatus 100. Examples of such data include instructions for any application or method operating on the text correction device 100.
The memory 20, which may be implemented by any type of volatile or non-volatile memory device or combination thereof, may include, for example, a Static Random Access Memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
In the embodiment of the present application, the implementation form of the processor 10 is not limited, and may be, for example, but not limited to, a CPU, a GPU, an MCU, or the like. The processor 10, which may be considered a control system of the text correction apparatus 100, may be used to execute computer programs stored in the memory 20 to control the text correction apparatus 100 to perform corresponding functions, perform corresponding actions or tasks. It should be noted that, depending on the implementation form and the scene where the text correction apparatus 100 is located, the functions, actions or tasks required to be implemented may be different; accordingly, the computer programs stored in memory 20 may vary, and execution of different computer programs by processor 10 may control text correction apparatus 100 to perform different functions, perform different actions or tasks.
In some optional embodiments, as shown in fig. 2, the text correction apparatus 100 may further include: a communication component 40, a power component 50, and a drive component 60. Only some components are shown schematically in fig. 2; this does not mean that the text correction apparatus 100 includes only the components shown there. The drive component 60 may include a driving wheel, a driving motor, a universal wheel, and the like. Further optionally, for different application requirements, the text correction apparatus 100 may include other components such as a display 70 and an audio component 80, shown with dashed boxes in fig. 2; the components within the dashed boxes are optional rather than mandatory, depending on the product form of the text correction apparatus 100. If the text correction apparatus 100 is a sweeping robot with a human-computer interaction function, it may further include a dust collection bin, a floor brush assembly, and the like, which are not described here.
In the embodiment of the present application, when executing the computer program in the memory 20, the processor 10 is configured to: acquire a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal; generate an initial pinyin sequence corresponding to the text to be corrected, and add a designated characteristic identifier to the initial pinyin sequence to obtain a target pinyin sequence; input the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set, wherein the pinyin-text prediction model is trained on a training corpus carrying the designated characteristic identifier; and select a corrected text corresponding to the text to be corrected from the candidate text set.
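The processing flow configured above can be pictured as a minimal sketch. This is illustrative only: the pinyin lookup table, the identifier token strings, and the function names are invented for the example and are not part of this disclosure (a real system would use a full grapheme-to-pinyin converter).

```python
# Toy grapheme-to-pinyin table standing in for a real converter (hypothetical).
PINYIN = {"今": "jin", "天": "tian", "气": "qi", "起": "qi"}

def to_pinyin_sequence(text):
    # Generate the initial pinyin sequence for the text to be corrected.
    return [PINYIN[ch] for ch in text]

def add_identifiers(pinyin_seq, scene=None, first_char_missing=False):
    # Prepend designated characteristic identifiers to obtain the target
    # pinyin sequence; the token spellings below are illustrative.
    seq = list(pinyin_seq)
    if first_char_missing:
        seq.insert(0, "<FIRST>")   # marks a possibly missing first character
    if scene:
        seq.insert(0, f"<SCENE:{scene}>")  # marks the domain scene
    return seq

seq = add_identifiers(to_pinyin_sequence("今天"), scene="navigation")
```

The target sequence `seq` would then be fed to the pinyin-text prediction model; here it is simply `['<SCENE:navigation>', 'jin', 'tian']`.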
In an alternative embodiment, when obtaining the target pinyin sequence, the processor 10 is configured to: add at least one of a scene identifier and a first-character identifier to the initial pinyin sequence to obtain the target pinyin sequence; wherein the scene identifier represents the domain scene to which the text to be corrected belongs and is used for resolving vocabulary conflicts between different domains, and the first-character identifier indicates that a first character may need to be added to the text to be corrected and is used for correcting a first-character-missing problem.
In an alternative embodiment, when selecting the corrected text corresponding to the text to be corrected, the processor 10 is configured to: calculate a score for each candidate text in the candidate text set according to a language model; adjust the scores of the candidate texts containing known domain keywords; and select the candidate text with the highest adjusted score as the corrected text corresponding to the text to be corrected.
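The selection step described above can be sketched as follows. This is a hedged illustration: the boost value, the stand-in language model, and the example candidate strings are all invented here; the embodiment itself does not specify how the adjustment is computed, only that scores of candidates containing domain keywords are adjusted.

```python
def select_corrected_text(candidates, lm_score, domain_keywords, boost=1.5):
    # Score each candidate with a language model, then raise the score of
    # candidates containing known domain keywords, proportionally to the
    # number of keywords they contain (boost factor is illustrative).
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = lm_score(cand)
        hits = sum(1 for kw in domain_keywords if kw in cand)
        score += boost * hits
        if score > best_score:
            best, best_score = cand, score
    return best

# Stand-in language model: longer strings score slightly lower.
lm = lambda s: -0.1 * len(s)
picked = select_corrected_text(["打扫打厅", "打扫大厅"], lm, ["大厅"])
```

With the keyword "大厅" known for the current domain, the homophone candidate containing it wins the adjusted ranking, so `picked` is `"打扫大厅"`.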
In an alternative embodiment, the processor 10 is further configured to: acquire a training corpus, wherein the training corpus comprises a text corpus and a pinyin corpus corresponding to the text corpus; add a designated characteristic identifier to the pinyin corpus in the training corpus; and perform model training according to the training corpus carrying the designated characteristic identifier to obtain the pinyin-text prediction model.
In an alternative embodiment, if the designated characteristic identifier comprises a first-character identifier, when acquiring the training corpus, the processor 10 is configured to: acquire a text corpus; generate a standard training corpus from the text corpus and its corresponding correct pinyin sequence; replace correct pinyin in the standard training corpus with known fuzzy sounds to obtain a fuzzy-sound training corpus; and remove the pinyin of the first character from the standard training corpus and the fuzzy-sound training corpus to obtain a first-character-missing training corpus.
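The corpus construction described above can be sketched roughly as below. The fuzzy-sound table, the example entry, and the toy pinyin sequence are invented for illustration; the embodiment does not disclose a specific fuzzy-sound inventory or data format.

```python
# Illustrative fuzzy-sound initial pairs (e.g. zh/z confusion); hypothetical.
FUZZY = {"zh": "z", "ch": "c", "sh": "s"}

def make_corpora(text, pinyin_seq):
    # Standard corpus: the text paired with its correct pinyin sequence.
    standard = (list(pinyin_seq), text)
    # Fuzzy-sound corpus: replace correct initials with known fuzzy sounds.
    fuzzy_seq = []
    for syl in pinyin_seq:
        for initial, rep in FUZZY.items():
            if syl.startswith(initial):
                syl = rep + syl[len(initial):]
                break
        fuzzy_seq.append(syl)
    fuzzy = (fuzzy_seq, text)
    # First-character-missing corpus: drop the first character's pinyin
    # from both the standard and the fuzzy-sound sequences.
    missing = [(standard[0][1:], text), (fuzzy_seq[1:], text)]
    return standard, fuzzy, missing

std, fz, miss = make_corpora("充电", ["chong", "dian"])
```

All three corpora still map to the same correct text, so the model learns to recover "充电" from the correct pinyin, from the fuzzy-sound variant `["cong", "dian"]`, and from the truncated sequence `["dian"]`.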
In an alternative embodiment, if the designated characteristic identifier further includes a scene identifier, when acquiring the text corpus, the processor 10 is configured to: acquire text corpora in a general domain scene and a vertical domain scene.
Accordingly, embodiments of the present application also provide a computer readable storage medium storing a computer program, which when executed by a processor causes the processor to at least: acquiring a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal; generating an initial pinyin sequence corresponding to a text to be corrected, and adding a specified characteristic identifier in the initial pinyin sequence to obtain a target pinyin sequence; inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set; the pinyin-text prediction model is obtained by training according to a training corpus with a specified characteristic identifier; and selecting corrected texts corresponding to the texts to be corrected from the candidate text set.
The communication component of fig. 2 described above is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may further be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and the like.
The display of fig. 2 described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gesture actions on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply assembly of fig. 2 described above provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio components of fig. 2 described above may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or a non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (16)

1. A text correction method, comprising:
acquiring a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal;
generating an initial pinyin sequence corresponding to the text to be corrected, and adding a designated characteristic identifier in the initial pinyin sequence to obtain a target pinyin sequence;
inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set; the pinyin-text prediction model is obtained by training according to the training corpus with the specified characteristic identifier;
and selecting corrected texts corresponding to the texts to be corrected from the candidate text set.
2. The method of claim 1, wherein generating an initial pinyin sequence corresponding to the text to be corrected comprises:
selecting a target pinyin mode from at least one pinyin mode according to the pinyin characteristics of the text to be corrected;
generating an initial pinyin sequence corresponding to the text to be corrected according to the target pinyin mode;
the at least one pinyin mode comprises one or more of a tone-removed pinyin mode, a tone-contained pinyin mode, a tone-removed pinyin mode with separated initials and finals, a tone-contained pinyin mode with separated initials and finals, a pinyin mode using only the first initial or final, and a pinyin mode subdivided by final pinyin.
3. The method of claim 1, wherein adding a specified characteristic identifier to the initial pinyin sequence to obtain a target pinyin sequence, includes:
adding at least one of a scene identifier and a first-character identifier to the initial pinyin sequence to obtain the target pinyin sequence;
wherein the scene identifier represents the domain scene to which the text to be corrected belongs and is used for resolving vocabulary conflicts between different domains; and the first-character identifier indicates that a first character may need to be added to the text to be corrected, and is used for correcting a first-character-missing problem.
4. The method of claim 1, wherein inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set, comprises:
in the pinyin-text prediction model, encoding the target pinyin sequence into a first feature vector with a fixed size by using an encoding network;
decoding the first feature vector by using a decoding network to obtain a second feature vector;
and obtaining a candidate text set containing at least one candidate text according to the second feature vector.
5. The method of claim 1, wherein selecting the corrected text corresponding to the text to be corrected from the candidate text set comprises:
calculating the score of each candidate text in the candidate text set according to a language model;
according to the known domain keywords, adjusting the scores of the candidate texts containing the domain keywords;
and selecting the candidate text with the highest score after adjustment as the corrected text corresponding to the text to be corrected.
6. The method of claim 5, wherein adjusting the score of the candidate text containing the domain keyword according to the known domain keyword comprises:
according to the known domain keywords, counting the number of candidate texts containing the domain keywords and the number of the domain keywords contained in the candidate texts;
and adjusting the score of the candidate text containing the domain keywords according to the number of the domain keywords contained in the candidate text containing the domain keywords.
7. The method of any one of claims 1-6, further comprising:
acquiring a training corpus, wherein the training corpus comprises a text corpus and a pinyin corpus corresponding to the text corpus;
adding a designated characteristic identifier in the pinyin corpus in the training corpus;
and performing model training according to the training corpus with the specified characteristic identifier to obtain the pinyin-text prediction model.
8. The method of claim 7, wherein if the specified feature identifier comprises a first character identifier, obtaining the corpus comprises:
acquiring a text corpus;
generating a standard training corpus by the text corpus and the corresponding correct pinyin sequence;
replacing the correct pinyin in the standard training corpus by using known fuzzy tones to obtain a fuzzy tone training corpus;
and removing the pinyin of the first character in the standard training corpus and the fuzzy sound training corpus to obtain the missing training corpus of the first character.
9. The method according to claim 8, wherein if the specified characteristic identifier further includes a scene identifier, the obtaining the text corpus comprises: acquiring text corpora in a general domain scene and a vertical domain scene.
10. The method of claim 9, wherein performing model training based on the corpus with the assigned feature identifier to obtain the pinyin-text prediction model, comprises:
pre-training a preset network structure model by using a training corpus corresponding to a general field scene to obtain an initialization model;
and performing fine tuning training on the initialization model by using a training corpus corresponding to a vertical field scene to obtain the pinyin-text prediction model.
11. The method of claim 10, wherein the network structure model is a masked sequence-to-sequence pre-training (MASS) model.
12. A text correction apparatus characterized by comprising: a processor and a memory storing a computer program;
the processor to execute the computer program to:
acquiring a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal;
generating an initial pinyin sequence corresponding to the text to be corrected, and adding a designated characteristic identifier in the initial pinyin sequence to obtain a target pinyin sequence;
inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set; the pinyin-text prediction model is obtained by training according to the training corpus with the specified characteristic identifier;
and selecting corrected texts corresponding to the texts to be corrected from the candidate text set.
13. The text correction apparatus of claim 12 wherein the processor, in obtaining the target pinyin sequence, is configured to:
adding at least one of a scene identifier and a first-character identifier to the initial pinyin sequence to obtain the target pinyin sequence;
wherein the scene identifier represents the domain scene to which the text to be corrected belongs and is used for resolving vocabulary conflicts between different domains; and the first-character identifier indicates that a first character may need to be added to the text to be corrected, and is used for correcting a first-character-missing problem.
14. The text correction apparatus of claim 12, wherein the processor, when selecting the corrected text corresponding to the text to be corrected, is configured to:
calculating the score of each candidate text in the candidate text set according to a language model;
according to the known domain keywords, adjusting the scores of the candidate texts containing the domain keywords;
and selecting the candidate text with the highest score after adjustment as the corrected text corresponding to the text to be corrected.
15. The text correction apparatus of any of claims 12-14, wherein the processor is further configured to:
acquiring a training corpus, wherein the training corpus comprises a text corpus and a pinyin corpus corresponding to the text corpus;
adding a designated characteristic identifier in the pinyin corpus in the training corpus;
and performing model training according to the training corpus with the specified characteristic identifier to obtain the pinyin-text prediction model.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to at least:
acquiring a text to be corrected, wherein the text to be corrected is obtained by performing voice recognition on a voice signal;
generating an initial pinyin sequence corresponding to the text to be corrected, and adding a designated characteristic identifier in the initial pinyin sequence to obtain a target pinyin sequence;
inputting the target pinyin sequence into a pinyin-text prediction model for text prediction to obtain a candidate text set; the pinyin-text prediction model is obtained by training according to the training corpus with the specified characteristic identifier;
and selecting corrected texts corresponding to the texts to be corrected from the candidate text set.
CN202010544358.6A 2020-06-15 2020-06-15 Text correction method, text correction device and storage medium Pending CN113807080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010544358.6A CN113807080A (en) 2020-06-15 2020-06-15 Text correction method, text correction device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010544358.6A CN113807080A (en) 2020-06-15 2020-06-15 Text correction method, text correction device and storage medium

Publications (1)

Publication Number Publication Date
CN113807080A true CN113807080A (en) 2021-12-17

Family

ID=78944169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010544358.6A Pending CN113807080A (en) 2020-06-15 2020-06-15 Text correction method, text correction device and storage medium

Country Status (1)

Country Link
CN (1) CN113807080A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023226767A1 (en) * 2022-05-23 2023-11-30 支付宝(杭州)信息技术有限公司 Model training method and apparatus, and speech meaning understanding method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1391209A (en) * 2001-06-11 2003-01-15 株式会社日立制作所 Phonetics synthesizing method and synthesizer thereof
US20070219776A1 (en) * 2006-03-14 2007-09-20 Microsoft Corporation Language usage classifier
CN110110041A (en) * 2019-03-15 2019-08-09 平安科技(深圳)有限公司 Wrong word correcting method, device, computer installation and storage medium
CN110428822A (en) * 2019-08-05 2019-11-08 重庆电子工程职业学院 A kind of speech recognition error correction method and interactive system
CN110765763A (en) * 2019-09-24 2020-02-07 金蝶软件(中国)有限公司 Error correction method and device for speech recognition text, computer equipment and storage medium



Similar Documents

Publication Publication Date Title
CN111191016B (en) Multi-round dialogue processing method and device and computing equipment
CN109616108B (en) Multi-turn dialogue interaction processing method and device, electronic equipment and storage medium
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
US20210142794A1 (en) Speech processing dialog management
CN101313276B (en) Combining speech and alternate input modality to a mobile device
JP5706384B2 (en) Speech recognition apparatus, speech recognition system, speech recognition method, and speech recognition program
US20190221208A1 (en) Method, user interface, and device for audio-based emoji input
CN110415679B (en) Voice error correction method, device, equipment and storage medium
KR20170034227A (en) Apparatus and method for speech recognition, apparatus and method for learning transformation parameter
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN113035231B (en) Keyword detection method and device
CN109767758B (en) Vehicle-mounted voice analysis method, system, storage medium and device
US11200885B1 (en) Goal-oriented dialog system
CN108417222B (en) Weighted finite state transducer decoding system and speech recognition system
KR102408308B1 (en) Sensor transformation attention network(stan) model
CN110827803A (en) Method, device and equipment for constructing dialect pronunciation dictionary and readable storage medium
CN110021293A (en) Audio recognition method and device, readable storage medium storing program for executing
US11216497B2 (en) Method for processing language information and electronic device therefor
CN113807080A (en) Text correction method, text correction device and storage medium
CN110991155B (en) Text correction method, device and medium
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
KR102642617B1 (en) Voice synthesizer using artificial intelligence, operating method of voice synthesizer and computer readable recording medium
CN116150324A (en) Training method, device, equipment and medium of dialogue model
JP2015143866A (en) Voice recognition apparatus, voice recognition system, voice recognition method, and voice recognition program
CN117474084B (en) Bidirectional iteration method, equipment and medium for pre-training model and downstream sequence task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination