WO2022141855A1

WO2022141855A1 - Text regularization method and apparatus, and electronic device and storage medium

Info

Publication number: WO2022141855A1
Application number: PCT/CN2021/083493
Authority: WO
Inventors: 李俊杰; 蒋伟伟; 马骏; 王少军
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-31
Filing date: 2021-03-29
Publication date: 2022-07-07
Also published as: CN112765937A

Abstract

The present application relates to the technical field of artificial intelligence, and in particular to a text regularization method and apparatus, and an electronic device and a storage medium. The method comprises: acquiring text to be regularized; performing character segmentation on the text to be regularized, so as to obtain a plurality of characters; encoding each of the plurality of characters, so as to obtain a first feature vector of each of the plurality of characters, wherein the first feature vector of each of the plurality of characters is used for representing context information of each of the plurality of characters; and performing, according to the first feature vector of each of the plurality of characters and a language type of the text to be regularized, regularization processing on the text to be regularized, so as to obtain regularized text of the text to be regularized. The present application is beneficial for improving the efficiency and accuracy of regularizing text.

Description

Text regularization method, device, electronic device and storage medium

This application claims priority to Chinese patent application No. 202011644545.8, filed on December 31, 2020 and entitled "Text Regularization Method, Apparatus, Electronic Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular, to a text regularization method, apparatus, electronic device and storage medium.

Background technique

The establishment of a traditional text regularization system requires a strong linguistic background, and often requires experts in specific fields to manually construct a large number of complex and tedious text regularization rules according to linguistic characteristics. At the same time, there are obvious differences in linguistic knowledge between different languages, which cannot be effectively transferred. If text regularization is performed on a new language, a set of text regularization rules needs to be rebuilt.

The inventors found that, in recent years, with the rapid development of artificial intelligence, text regularization systems based on neural networks with encoder and decoder models have begun to appear. However, the inventors realized that, owing to the soft classification characteristics of pure encoder and decoder models, such models cannot achieve satisfactory text regularization accuracy. Therefore, current mainstream text regularization systems still require a set of specific, complex and cumbersome text regularization rules to be constructed manually, and different rules must be constructed for different languages, which consumes considerable manpower and material resources. There may also be code redundancy between these rule sets.

Therefore, existing text regularization requires a text regularization system to be constructed manually, so the labor cost is relatively high and the text regularization efficiency is relatively low.

Technical problem

The embodiments of the present application provide a text regularization method, apparatus, electronic device and storage medium, which can perform text regularization according to the language type of the text to be regularized and the feature vector of each character, so as to improve the text regularization efficiency and reduce labor costs.

Technical solutions

In a first aspect, an embodiment of the present application provides a text regularization method, including: obtaining text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.

In a second aspect, an embodiment of the present application provides a text regularization device, including: an acquisition unit, configured to acquire the text to be regularized; and a processing unit, configured to: perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and perform regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor connected to a memory, wherein the memory is used for storing a computer program and the processor is used for executing the computer program stored in the memory, so that the electronic device executes a text regularization method. The text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute a text regularization method. The text regularization method includes: obtaining the text to be regularized; performing character segmentation on the text to be regularized to obtain a plurality of characters; encoding each character in the plurality of characters to obtain a first feature vector of each character, wherein the first feature vector of each character is used to represent the context information of that character; and performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.

Beneficial effects

Implementing the embodiments of the present application allows text to be regularized without manually writing regularization rules, which improves the efficiency of text regularization and saves labor costs. In addition, the language type of the text to be regularized is taken into account during regularization, so that texts in various languages can be regularized and the text regularization method of the present application applies to more usage scenarios.

Description of drawings

FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application.

FIG. 2 is a schematic flowchart of encoding and decoding processing of non-standard characters according to an embodiment of the present application.

FIG. 3 is a schematic diagram of encoding and decoding of non-standard characters by an encoder and a decoder according to an embodiment of the present application.

FIG. 4 is a block diagram of functional units of a text regularization apparatus provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a text regularization apparatus provided by an embodiment of the present application.

Embodiments of the present invention

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

The technical solution of the present application relates to the field of artificial intelligence technology, can be used to regularize text, and helps to promote the construction of smart cities. Optionally, the data involved in this application, such as the text to be regularized and/or the regularized text, may be stored in a database or in a blockchain, which is not limited in this application.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a text regularization method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.

101: The text regularization device obtains the text to be regularized.

Exemplarily, the text to be regularized may be manually entered by the user in an information input field of the text regularization device, or it may be automatically read by the text regularization device from a text library. For example, if there is a document to be regularized, the text regularization device can sequentially read the text to be regularized from the document. This application does not limit how the text to be regularized is acquired.

102: The text regularization device performs character segmentation on the text to be regularized to obtain multiple characters.

Exemplarily, the text to be regularized may be segmented into multiple characters by a tokenizer, for example, a word2vec tokenizer. The characters may be English words, Chinese words, French words or special symbols, such as "$" and "/".

103: The text regularization device encodes each of the multiple characters to obtain a first feature vector of each of the multiple characters, wherein the first feature vector of each character is used to represent the context information of that character among the multiple characters.
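The segmentation in step 102 can be illustrated with a minimal sketch. The splitting rule below (runs of letters, runs of digits, or single symbols) and the function name `segment` are assumptions made for illustration; the application does not specify the tokenizer's rules.

```python
import re

# Hypothetical segmentation rule (an assumption, not the application's tokenizer):
# a run of letters, a run of digits, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")

def segment(text):
    """Split text into the "characters" (tokens) used by the later steps."""
    return TOKEN_PATTERN.findall(text)

tokens = segment("Achieve record net income of about $1 billion during the year")
print(tokens)
```

Under this rule a symbol such as "$" becomes a character of its own, matching the way the description treats "$", "1" and "billion" as three separate characters.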

Exemplarily, each character in the plurality of characters is first encoded to obtain a character vector corresponding to that character. Specifically, each character is split into its letter string; each letter in the letter string is encoded to obtain a letter vector corresponding to that letter; finally, the letter vectors of the character are encoded to obtain the character vector of the character. For example, the character "Achieve" is split into the letter string "A", "c", "h", "i", "e", "v", "e", and the letter vectors of these letters are used as the input to the encoder, resulting in a character vector for the character "Achieve". Then, a first text corresponding to a character A is constructed with the character A as the center, wherein the character A is any one of the multiple characters, and the first text includes the X characters located before the character A in the text to be regularized, the character A itself, and the Y characters located after the character A in the text to be regularized, where X and Y are both integers greater than or equal to 1. The character vectors of the characters in the first text are then spliced (i.e., concatenated horizontally) to obtain the first feature vector corresponding to the character A, wherein the first feature vector corresponding to the character A is used to represent the contextual information of the character A in the first text.

It should be understood that if there are fewer than X characters before the character A or fewer than Y characters after it, for example, when the character A is the first or the last character in the text to be regularized, the first text for the character A can be constructed by filling in a preset character (for example, the start character S).

104: The text regularization device performs regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regularized text of the text to be regularized.

Exemplarily, an attribute of each character in the plurality of characters is determined according to the first feature vector of that character, wherein the attribute of a character is either standard character or non-standard character. Then, each standard character in the text to be regularized is used as its own regular character, and each non-standard character is encoded and decoded according to the language type of the text to be regularized and the first feature vector corresponding to that non-standard character, to obtain the regular character of the non-standard character. Finally, the regular characters of the standard characters and the regular characters of the non-standard characters are combined to obtain the regularized text of the text to be regularized.
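The construction of the first text and the splicing of character vectors can be sketched as follows. The function names, the padding token `<S>` and the use of the same padding on both sides are assumptions made for illustration; the character vectors are supplied as a plain dictionary rather than produced by an encoder.

```python
START = "<S>"  # stand-in for the preset start character S (an assumed spelling)

def context_window(tokens, index, x, y, pad=START):
    """Return the X tokens before tokens[index], the token itself, and the Y tokens after it.

    Missing positions at either end are filled with the preset padding token.
    """
    before = tokens[max(0, index - x):index]
    after = tokens[index + 1:index + 1 + y]
    before = [pad] * (x - len(before)) + before
    after = after + [pad] * (y - len(after))
    return before + [tokens[index]] + after

def first_feature_vector(window, char_vectors):
    """Splice (horizontally concatenate) the character vectors of the window."""
    spliced = []
    for token in window:
        spliced.extend(char_vectors[token])
    return spliced

tokens = ["about", "$", "1", "billion"]
window = context_window(tokens, 0, x=1, y=1)          # ["<S>", "about", "$"]
toy_vectors = {"<S>": [0.0, 0.0], "about": [1.0, 2.0], "$": [3.0, 4.0]}
print(first_feature_vector(window, toy_vectors))
```

Because every window has exactly X + 1 + Y entries after padding, the spliced vector has the same length for every character, which is what a downstream classifier would normally require.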
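The overall flow of step 104 (keep standard characters, convert non-standard ones, recombine) can be sketched as below. The classifier and the converter are deliberately trivial stand-ins, labeled as assumptions: in the application the attribute is determined from the first feature vector and the conversion is done by an encoder and a decoder, not by a fixed set and a lookup table.

```python
from itertools import groupby

def regularize(tokens, is_standard, convert_run):
    """Keep standard tokens, convert each run of non-standard tokens, then recombine."""
    output = []
    for standard, run in groupby(tokens, key=is_standard):
        run = list(run)
        output.extend(run if standard else convert_run(run))
    return " ".join(output)

# Toy stand-ins (assumptions for illustration only).
NON_STANDARD = {"$", "1", "billion"}
LOOKUP = {("$", "1", "billion"): ["one", "billion", "dollars"]}

def toy_is_standard(token):
    return token not in NON_STANDARD

def toy_convert_run(run):
    # Unknown runs are passed through unchanged in this sketch.
    return LOOKUP.get(tuple(run), run)

print(regularize(["about", "$", "1", "billion", "during"], toy_is_standard, toy_convert_run))
```

Converting a whole run at once, rather than token by token, leaves room for the output order to differ from the input order, as with "$1 billion" and "one billion dollars".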

Exemplarily, a standard character refers to a character with the same pronunciation and writing. For example, for the character "year", its pronunciation and writing are the same, that is, both are "year", and the regular character of the standard character is itself. Exemplarily, the non-standard characters involved in this application include but are not limited to the following:

Dates, currencies, addresses, letters, cardinal numbers, ordinal numbers, web addresses, units of measurement, fractional forms, decimal forms, phone numbers, time, digits, punctuation, and foreign words.

Further, with a character B as the center, a second text corresponding to the character B is constructed. The second text includes the M characters located before the character B in the text to be regularized, the character B itself, and the N characters located after the character B in the text to be regularized, where the character B is any non-standard character in the text to be regularized, and M and N are integers greater than or equal to 1. Then, each character in the second text is encoded through byte-pair encoding (BPE) to obtain the second feature vector of each character in the second text.

Specifically, each character in the second text is split into a letter string, and the letter strings of each character are merged according to how frequently letter strings occur across all characters in the second text, yielding a new letter string for each character. The new letter string of each character is then input into the encoder for encoding to obtain the second feature vector of each character in the second text. Byte-pair encoding addresses the problem of out-of-vocabulary words in the second text. Next, the second feature vectors of the characters in the second text are input into a Transformer-XL network for feature extraction to obtain a third feature vector corresponding to character B, where the third feature vector of character B represents the context information of character B in the second text. Finally, character B is encoded and decoded according to the first feature vector of character B, the third feature vector of character B and the language type of the text to be regularized, to obtain the regular character corresponding to character B. The process of encoding and decoding character B is described in detail later.

It can be seen that, in the embodiment of the present application, character segmentation is first performed on the text to be regularized, and each character is then encoded to obtain its first feature vector; finally, the text to be regularized is regularized according to the first feature vector of each character and the language type of the text to be regularized. That is, regularization can be completed without manually writing regularization rules, which improves the efficiency of text regularization and saves labor costs. In addition, the language type of the text to be regularized is taken into account during regularization, so that texts in various languages can be regularized and the text regularization method of the present application applies to more usage scenarios.
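The frequency-based merging of letter strings can be illustrated with a minimal byte-pair-encoding sketch. It follows the commonly used BPE merge procedure (repeatedly merge the most frequent adjacent pair); the function names and the number of merges are assumptions, and the application does not state its exact merge schedule.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair across all words, or None."""
    counts = Counter()
    for symbols in words:
        counts.update(zip(symbols, symbols[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair in one word by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def bpe(tokens, num_merges):
    """Split each token into letters, then apply up to num_merges frequency-based merges."""
    words = [list(token) for token in tokens]
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = [merge_pair(word, pair) for word in words]
    return words

print(bpe(["ab", "ab", "ac"], num_merges=1))
```

Since any token can still be expressed as a sequence of single letters when no merge applies, a word never seen before can always be represented, which is how this scheme handles out-of-vocabulary words.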

Referring to FIG. 2, FIG. 2 is a schematic flowchart of an encoding and decoding method provided by an embodiment of the present application. The method is applied to a text regularization device. The method includes the following steps.

201: Perform word embedding processing on character B to obtain a fourth feature vector of character B.

Exemplarily, performing word embedding processing on character B is actually performing mapping processing on character B to obtain the fourth feature vector of character B. For example, the ASCII code of character B can be used as the fourth feature vector of character B.

202: Encode the attributes of the character B to obtain the part-of-speech vector of the character B.

Exemplarily, encoding the attribute of character B means mapping the part of speech to which character B belongs to obtain the part-of-speech vector of character B. For example, if the part of speech of character B is "currency", the GB2312 code of "currency" is used as the part-of-speech vector of character B.

It should be understood that although the attribute of character B has already been classified using the first feature vector, that classification only determines whether a character is a standard character or a non-standard character; it does not subdivide the non-standard characters. Therefore, the first feature vector of each character can only be used to distinguish standard characters from non-standard characters, and cannot distinguish further among non-standard characters. The part of speech is the more detailed classification of character B, and mapping it to a part-of-speech vector gives each non-standard character a more detailed category.

203: Encode the language type of the text to be regularized to obtain the language vector of the text to be regularized, and use the language vector as both the encoding parameter of the encoder and the decoding parameter of the decoder.

Similarly, the language type of the text to be regularized is mapped to obtain the language vector of the text to be regularized. For example, the GB2312 code of the Chinese representation of the language type (for example, the language type is "English", "Chinese", "French", etc.) can be used as the language vector of the language type.

204: Input the fourth feature vector of character B into the encoder to encode character B and obtain the fifth feature vector of character B.
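The three mappings of steps 201 to 203 can be sketched with Python's built-in codecs. The function names are hypothetical, and the Chinese labels "货币" (currency) and "英语" (English) are assumed spellings of the categories; the application only states that ASCII and GB2312 codes may serve as the vectors.

```python
def ascii_vector(token):
    """Step 201 sketch: map a character to the ASCII codes of its symbols."""
    return list(token.encode("ascii"))

def gb2312_vector(chinese_label):
    """Steps 202 and 203 sketch: map a Chinese label to its GB2312 byte values."""
    return list(chinese_label.encode("gb2312"))

fourth_vector = ascii_vector("$")          # word-embedding stand-in for the character "$"
pos_vector = gb2312_vector("货币")          # part of speech "currency" (assumed label)
language_vector = gb2312_vector("英语")     # language type "English" (assumed label)
print(fourth_vector, pos_vector, language_vector)
```

Each GB2312 Chinese character occupies two bytes, so a two-character label yields a four-element vector. A production system would more likely use fixed-size learned embeddings, but a code-based mapping of this kind is what the description gives as its example.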

Exemplarily, the encoder may be a neural network constructed based on a long short term memory network, a bidirectional long short term memory network or a recurrent network. This application does not limit the type of encoder.

Exemplarily, character B is encoded according to the hidden layer vector output by the encoder in its previous encoding, the fourth feature vector of character B, and the encoding parameter of the encoder (that is, the language vector), to obtain the fifth feature vector corresponding to character B and the hidden layer vector corresponding to character B.

It should be understood that when character B is the first non-standard character to be encoded, the hidden layer vector output by the previous encoding is a preset hidden layer vector, for example, a zero vector. In addition, if only character B is encoded in this pass, the hidden layer vector finally output by the encoder is the one generated while encoding character B. If other non-standard characters also need to be encoded, the hidden layer vector corresponding to character B is used as the input hidden layer vector for the next non-standard character to be encoded.

It should be understood that if character B is one of multiple consecutive non-standard characters in the text to be regularized that have the same attribute (that is, exactly the same part of speech, for example, all are dates), then, to improve the encoding efficiency and encoding accuracy for these non-standard characters, they can be encoded together rather than each being encoded separately.

Exemplarily, as shown in FIG. 3, the text to be regularized contains multiple consecutive non-standard characters [X₁, X₂, …, Xₙ] with the same attribute. Word embedding is performed on each non-standard character in [X₁, X₂, …, Xₙ] to obtain the fourth feature vector of each non-standard character. Then, based on the preset hidden layer vector e₀ and the encoding parameter of the encoder, the first non-standard character X₁ is encoded for the first time to obtain the fifth feature vector Y₁ of X₁ and the hidden layer vector e₁ corresponding to the first encoding. Further, based on the hidden layer vector e₁ output by the first encoding and the encoding parameter of the encoder, the second non-standard character X₂ is encoded for the second time to obtain the fifth feature vector Y₂ corresponding to X₂ and the hidden layer vector e₂ corresponding to the second encoding. These steps are repeated until the fifth feature vector Yₙ of the n-th non-standard character Xₙ and the hidden layer vector eₙ output by the last encoding are obtained. The hidden layer vector output by the last encoding contains the contextual semantic information of these non-standard characters. In this way, the non-standard characters [X₁, X₂, …, Xₙ] are encoded one after another, and the fifth feature vectors [Y₁, Y₂, …, Yₙ] corresponding to them are output.

For example, if the text to be regularized is "Achieve record net income of about $1 billion during the year", the non-standard characters are identified as "$", "1" and "billion". These three non-standard characters have the same attribute and are consecutive, so they can be encoded in succession, outputting the fifth feature vector of each of the three non-standard characters and the hidden layer vector obtained by the encoder in its last encoding. Specifically, word embedding is first performed on the characters "$", "1" and "billion" to obtain the fourth feature vector of each non-standard character, and the fourth feature vectors of these three characters are used as the input of the encoder. The encoder first encodes the character "$" based on the initial hidden layer vector (i.e., the zero vector) and the fourth feature vector of "$", obtaining the fifth feature vector of "$" and the hidden layer vector of the first encoding. The encoder then encodes the character "1" based on that hidden layer vector and the fourth feature vector of "1", obtaining the fifth feature vector of "1" and the hidden layer vector of the second encoding. Finally, the encoder encodes the character "billion" based on the hidden layer vector of the second encoding and the fourth feature vector of "billion", obtaining the fifth feature vector of "billion" and the last hidden layer vector. This last hidden layer vector contains the full semantic information of the three non-standard characters.
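The step-by-step encoding above, in which each encoding consumes the hidden layer vector of the previous one, can be sketched with a plain recurrent cell. This is an illustration under stated assumptions: a simple tanh cell stands in for the LSTM or bidirectional LSTM named in the description, the language vector is fed in as an extra input (the application does not say how the encoding parameter is applied), the fifth feature vector is taken to be the hidden vector itself, and all weights are toy values.

```python
import math

def rnn_step(x, h_prev, lang, w_x, w_h, w_l):
    """One encoding step: new hidden vector from input, previous hidden vector and language vector."""
    size = len(h_prev)
    h = []
    for i in range(size):
        total = sum(w_x[i][j] * x[j] for j in range(len(x)))
        total += sum(w_h[i][j] * h_prev[j] for j in range(size))
        total += sum(w_l[i][j] * lang[j] for j in range(len(lang)))
        h.append(math.tanh(total))
    return h

def encode_run(inputs, lang, w_x, w_h, w_l, hidden_size):
    """Encode X1..Xn in order, carrying the hidden vector from one step to the next."""
    h = [0.0] * hidden_size            # preset hidden vector e0 (zero vector)
    outputs = []
    for x in inputs:
        h = rnn_step(x, h, lang, w_x, w_h, w_l)
        outputs.append(h)              # simplification: Y_t is taken equal to e_t
    return outputs, h                  # [Y1..Yn] and the last hidden vector e_n

# Toy fourth feature vectors for "$", "1", "billion" and toy weights (all assumed).
inputs = [[1.0], [0.0], [-1.0]]
w_x = [[0.5], [-0.5]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
w_l = [[0.0], [0.0]]
fifth_vectors, last_hidden = encode_run(inputs, [1.0], w_x, w_h, w_l, hidden_size=2)
print(fifth_vectors, last_hidden)
```

The returned last hidden vector depends on every input in the run, which is the sense in which it carries the semantic information of all three non-standard characters.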

205: Input the part-of-speech vector of character B and the fifth feature vector of character B into the decoder to decode character B and obtain the regular character of character B.

Exemplarily, the decoder may be a neural network constructed based on a long short term memory network, a bidirectional long short term memory network or a recurrent network. This application does not limit the type of decoder.

Exemplarily, an attention mechanism operation is performed on the hidden layer vector output by the decoder in its previous decoding and the fifth feature vector corresponding to character B, to obtain the sixth feature vector corresponding to character B. The attention mechanism may be a general attention mechanism operation. For example, the fifth feature vector corresponding to character B can be used as the key-value pair, that is, as both the key vector and the value vector; the hidden layer vector output by the decoder in its previous decoding is then used as the query vector to perform the attention mechanism operation and obtain the sixth feature vector corresponding to character B. The attention mechanism operations mentioned below are similar and are not described again.

It should be understood that if character B is the first character to be decoded, the hidden layer vector output by the previous decoding is the hidden layer vector output by the encoder in its last encoding; if character B is not the first character to be decoded, the hidden layer vector output by the previous decoding is the hidden layer vector generated when the decoder decoded the previous character. Since this hidden layer vector (for example, the hidden layer vector output by the encoder in its last encoding) contains the contextual semantic information of character B, the attention mechanism operation retains the key information for the current decoding and thereby improves the decoding accuracy.

Further, the part-of-speech vector of character B, the third feature vector of character B, the sixth feature vector of character B and the decoding result of the decoder's previous decoding are spliced to obtain the target feature vector of character B. Character B is then decoded according to the decoding parameter of the decoder (the language vector) and the target feature vector of character B, to obtain the regular character corresponding to character B. That is, the decoding parameter of the decoder is applied to the target feature vector to obtain the probability of each standard character in the standard dictionary, and the standard character with the maximum probability is used as the regular character of character B.

Here, the decoding result of the previous decoding is the result generated when the decoder decoded the previous character (that is, the feature vector of the regular character of the previous character). It should be understood that if character B is the first character to be decoded, the decoding result of the previous decoding is the feature vector of a preset character. For example, the preset character is the start symbol S, and the feature vector of the start symbol S is spliced in to indicate the start of this decoding.

Similarly, if character B is one of multiple consecutive non-standard characters in the text to be regularized that have the same attribute (that is, exactly the same part of speech, for example, all are dates), then, to improve the decoding efficiency and decoding accuracy for these non-standard characters, they are decoded in sequence according to their fifth feature vectors, rather than each non-standard character being decoded in isolation.
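The attention mechanism operation can be sketched as follows, with the previous hidden layer vector as the query and the fifth feature vectors as both keys and values. Dot-product scoring followed by a softmax is an assumption made for this sketch; the application only refers to a general attention mechanism and does not fix the scoring function.

```python
import math

def attention(query, keys, values):
    """Return the attention-weighted sum of values (the sixth feature vector) and the weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

# Toy data: two fifth feature vectors serve as keys and values; the query aligns with the first.
fifth = [[1.0, 0.0], [0.0, 1.0]]
sixth_vector, weights = attention([1.0, 0.0], fifth, fifth)
print(sixth_vector, weights)
```

The weights sum to one, and the key most similar to the query receives the largest weight, which is how the operation concentrates on the character currently being decoded.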
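The splicing of the target feature vector and the selection of the most probable standard character can be sketched as below. The linear scoring layer, its toy weights and the three-word dictionary are assumptions; in particular, the weight rows merely stand in for whatever the decoder computes from its decoding parameter, which the application does not spell out.

```python
import math

def build_target_vector(pos_vec, third_vec, sixth_vec, prev_output_vec):
    """Splice (concatenate) the four inputs of one decoding step."""
    return list(pos_vec) + list(third_vec) + list(sixth_vec) + list(prev_output_vec)

def pick_regular_character(target_vec, weight_rows, dictionary):
    """Score every dictionary entry and return the most probable one with all probabilities."""
    logits = [sum(w * t for w, t in zip(row, target_vec)) for row in weight_rows]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return dictionary[best], probs

# Toy one-dimensional vectors and weights (assumed values for illustration).
target = build_target_vector([1.0], [0.0], [2.0], [0.0])
rows = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
word, probs = pick_regular_character(target, rows, ["one", "billion", "dollars"])
print(word, probs)
```

In a full decoder the feature vector of the chosen character would be fed back as `prev_output_vec` for the next step, with the start symbol's vector used at the first step.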

Exemplarily, as shown in FIG. 3, the attention mechanism operation is performed on the hidden layer vector eₙ output by the last encoding of the encoder and the fifth feature vectors [Y₁, Y₂, …, Yₙ] of the non-standard characters [X₁, X₂, …, Xₙ] to obtain a sixth feature vector. It should be understood that, since the hidden layer vector output by the last encoding contains the full semantic information of the non-standard characters [X₁, X₂, …, Xₙ], the attention mechanism operation places the decoding attention on the first character that needs to be decoded, thereby improving the decoding accuracy.

Then, the sixth feature vector, the part-of-speech vector L of the non-standard characters, the third feature vector H of the non-standard characters and the feature vector of the preset symbol (not shown in FIG. 3) are spliced to obtain the target feature vector of the first non-standard character to be decoded. Since the attributes of the non-standard characters are the same, the part-of-speech vector L can be the part-of-speech vector of any one of them, and the third feature vector H is the average of the third feature vectors of the individual non-standard characters.

Finally, based on this target feature vector, the first non-standard character to be decoded is decoded to obtain the first decoding result Z₁ (that is, the regular character of the first non-standard character to be decoded) and the hidden layer vector d₁ of the first decoding. Then, the second decoding is performed using the hidden layer vector d₁, the first decoding result Z₁, the part-of-speech vector L, the third feature vector H and the fifth feature vectors [Y₁, Y₂, …, Yₙ], to obtain the second decoding result Z₂ (that is, the regular character of the second non-standard character to be decoded) and the hidden layer vector of the second decoding. These steps are repeated until the regular characters [Z₁, Z₂, …, Zₙ] of all the non-standard characters in [X₁, X₂, …, Xₙ] have been obtained, at which point decoding stops.

For example, the decoding process is illustrated with the non-standard characters "$1 billion". In the first decoding, the attention mechanism operation is performed on the hidden layer vector output by the encoder in its last encoding and the fifth feature vectors of the three non-standard characters to obtain a sixth feature vector (because the first decoding produces the regular character of "1", this sixth feature vector focuses on the character "1"). The sixth feature vector, the part-of-speech vector (the part-of-speech vectors of the three non-standard characters are the same), the third feature vector (obtained by averaging the third feature vectors of the individual non-standard characters) and the feature vector of the start symbol S are spliced to obtain a target feature vector. The first decoding is performed according to this target feature vector to obtain the vector of the character "1" (after mapping this vector, the regular character of "1" is obtained as "one") and the hidden layer vector corresponding to the character "1".

In the second decoding, the attention mechanism operation is performed on the hidden layer vector output by the first decoding and the fifth feature vectors of the three characters to obtain a sixth feature vector. The sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "1" output by the first decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "billion" (after mapping this vector, the regular character of "billion" is obtained as "billion") and the hidden layer vector of the second decoding.

In the third decoding, the attention mechanism operation is performed on the hidden layer vector of the second decoding and the fifth feature vectors of the three characters to obtain a sixth feature vector. The sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "billion" output by the second decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding to obtain the vector of the character "$" (after mapping this vector, the regular character of "$" is obtained as "dollars") and the hidden layer vector of the third decoding.

Finally, the attention mechanism operation is performed on the hidden layer vector output by the third decoding and the fifth feature vectors of the three characters to obtain a sixth feature vector. The sixth feature vector, the part-of-speech vector, the third feature vector and the vector of the character "$" output by the third decoding are spliced to obtain a target feature vector, which is input into the decoder for decoding; the end symbol "end" is decoded, indicating that decoding stops.

Therefore, through the above encoding and decoding process, the three consecutive non-standard characters "$1 billion" can be regularized into "one billion dollars" in a single pass, so that the above text to be regularized is regularized as "Achieve record net income of about one billion dollars during the year".
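The step-by-step decoding in the "$1 billion" example can be sketched as a simple loop. This is only an illustrative sketch: `decode_until_end`, `toy_step` and the `max_steps` bound are hypothetical names introduced here, and the fixed transition table merely stands in for a trained decoder, which would additionally consume the sixth feature vector, the part-of-speech vector and the third feature vector at every step.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical step signature: (previous output, previous hidden state)
# -> (next output, next hidden state).
Step = Callable[[str, Optional[int]], Tuple[str, int]]


def decode_until_end(step: Step, start: str = "S", end: str = "end",
                     max_steps: int = 16) -> List[str]:
    """Run the decoder one step at a time until it emits the end symbol."""
    outputs: List[str] = []
    prev, hidden = start, None
    for _ in range(max_steps):  # max_steps is a safety bound (assumption)
        token, hidden = step(prev, hidden)
        if token == end:
            break
        outputs.append(token)
        prev = token  # the next step is conditioned on this output
    return outputs


def toy_step(prev: str, hidden: Optional[int]) -> Tuple[str, int]:
    """Toy stand-in for a trained decoder: a fixed transition table."""
    table = {"S": "one", "one": "billion", "billion": "dollars", "dollars": "end"}
    return table[prev], (hidden or 0) + 1


print(" ".join(decode_until_end(toy_step)))  # one billion dollars
```

The loop starts from the start symbol S, feeds each decoding result back into the next step, and stops as soon as the end symbol is produced, mirroring the four decoding steps described above.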

It can be seen that, in the embodiments of the present application, an attention mechanism is adopted in the process of encoding and decoding non-standard characters, which improves the accuracy of each encoding and decoding. In addition, multiple consecutive non-standard characters with the same attributes can be encoded and decoded synchronously, and can learn information from each other during the encoding and decoding process, which improves the efficiency and accuracy of encoding and decoding.
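One way to picture how consecutive non-standard characters with the same attribute are handled together is to group them into runs before encoding and decoding. The following is a minimal sketch under stated assumptions: `group_runs` is a hypothetical helper, and the attribute labels "standard" and "currency" are illustrative only and are not taken from the application.

```python
from itertools import groupby
from typing import List, Tuple


def group_runs(chars: List[str], attributes: List[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive characters that share the same attribute label."""
    runs = []
    for attr, items in groupby(zip(chars, attributes), key=lambda pair: pair[1]):
        runs.append((attr, [ch for ch, _ in items]))
    return runs


chars = ["about", "$", "1", "billion", "during"]
attrs = ["standard", "currency", "currency", "currency", "standard"]  # illustrative labels
for attr, run in group_runs(chars, attrs):
    print(attr, run)
```

Each run with a non-standard attribute could then be passed to the encoder and decoder as a single sequence, so that "$", "1" and "billion" are regularized together.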

Referring to FIG. 4, FIG. 4 is a block diagram of functional units of a text regularization device provided by an embodiment of the present application. The text regularization device 400 includes an obtaining unit 401 and a processing unit 402, wherein: the obtaining unit 401 is used to obtain the text to be regularized; the processing unit 402 is used to perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character in the plurality of characters is used to represent the context information of each character in the plurality of characters; and perform regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.

In some possible implementations, in terms of encoding each of the plurality of characters to obtain the first feature vector of each of the plurality of characters, the processing unit 402 is specifically configured to: encode each character in the plurality of characters to obtain a character vector corresponding to each character in the plurality of characters; construct, with character A as the center, a first text corresponding to the character A, where the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and splice the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, where the first feature vector of the character A is used to represent the context information of the character A in the first text.
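The construction of the first text and the splicing of character vectors can be illustrated with a small sketch. The function name `first_feature_vector`, the two-dimensional toy character vectors and the zero padding at the text boundaries are all assumptions made for illustration; the application does not state how positions outside the text are handled.

```python
from typing import Dict, List


def first_feature_vector(chars: List[str], index: int,
                         char_vectors: Dict[str, List[float]],
                         x: int = 1, y: int = 1) -> List[float]:
    """Splice the vectors of the X characters before, the character itself,
    and the Y characters after position `index`."""
    dim = len(next(iter(char_vectors.values())))
    pad = [0.0] * dim  # assumption: zero padding outside the text
    spliced: List[float] = []
    for pos in range(index - x, index + y + 1):
        if 0 <= pos < len(chars):
            spliced.extend(char_vectors[chars[pos]])
        else:
            spliced.extend(pad)
    return spliced


chars = ["a", "b", "1"]
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "1": [1.0, 1.0]}  # toy character vectors
print(first_feature_vector(chars, 1, vectors))  # [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

With X = Y = 1, the first feature vector of "b" is the concatenation of the vectors of "a", "b" and "1", so it carries the context of "b" within its window.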

In some possible implementations, in terms of performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized to obtain the regular text of the text to be regularized, the processing unit 402 is specifically configured to: determine the attribute of each character in the plurality of characters according to the first feature vector of each character in the plurality of characters, where the attribute of each character in the plurality of characters includes standard character or non-standard character; use the standard characters in the text to be regularized as the regular characters of the standard characters; encode and decode the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters; and combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
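The overall flow described here (classify each character, pass standard characters through, convert non-standard characters, then combine) can be sketched as follows. The names `regularize`, `is_non_standard`, `convert` and `TOY_TABLE` are hypothetical; in the application the classification is based on the first feature vector and the conversion is performed by the encoder and decoder, both of which are replaced here by a toy lookup table. Joining with spaces is likewise an assumption suited to English text.

```python
from typing import Callable, List


def regularize(chars: List[str],
               is_non_standard: Callable[[str], bool],
               convert: Callable[[str], str]) -> str:
    """Keep standard characters as-is, convert non-standard ones, and combine."""
    regular: List[str] = []
    for ch in chars:
        if is_non_standard(ch):
            regular.append(convert(ch))  # stand-in for encoding and decoding
        else:
            regular.append(ch)  # a standard character is its own regular character
    return " ".join(regular)


TOY_TABLE = {"3": "three", "%": "percent"}  # toy stand-in for a trained model
print(regularize(["up", "3", "%"],
                 lambda ch: ch in TOY_TABLE,
                 lambda ch: TOY_TABLE[ch]))  # up three percent
```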

In some possible implementations, in terms of encoding and decoding the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized to obtain the regular characters of the non-standard characters, the processing unit 402 is specifically configured to: construct, with character B as the center, a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text through double-byte encoding to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into the Transformer-XL network to obtain the third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text; and encode and decode the character B according to the attribute of the character B, the third feature vector of the character B and the language type, to obtain the regular character corresponding to the character B.

In some possible implementations, in terms of encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B and the language type to obtain the regular character corresponding to the character B, the processing unit 402 is specifically configured to: perform word embedding processing on the character B to obtain the fourth feature vector of the character B; encode the attribute of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder respectively; input the fourth feature vector of the character B into the encoder and encode the character B to obtain the fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decode the character B to obtain the regular text corresponding to the character B.

In some possible implementations, in terms of inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B, the processing unit 402 is specifically configured to: encode the character B according to the hidden layer vector output by the last encoding of the encoder, the fourth feature vector of the character B and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.

In some possible implementations, in terms of inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decoding the character B to obtain the regular text corresponding to the character B, the processing unit 402 is specifically configured to: perform an attention mechanism operation on the hidden layer vector output by the last decoding of the decoder and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B; splice the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B and the decoding result of the last decoding of the decoder to obtain the target feature vector of the character B; and decode the character B according to the encoding parameters of the encoder and the target feature vector of the character B, to obtain the regular character corresponding to the character B.
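The attention operation and the splicing step can be illustrated numerically. The sketch below assumes dot-product scoring followed by a softmax, which is one common form of attention; the application does not specify the scoring function, so this choice, the function names `attention` and `target_feature_vector`, and the toy vectors are all assumptions. The splicing order follows the order in which the vectors are listed in the paragraph above.

```python
import math
from typing import List, Tuple


def attention(hidden: List[float],
              fifth_vectors: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Dot-product attention (assumed form): returns the weighted sum of the
    fifth feature vectors (the sixth feature vector) and the attention weights."""
    scores = [sum(h * e for h, e in zip(hidden, vec)) for vec in fifth_vectors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(fifth_vectors[0])
    sixth = [sum(w * vec[i] for w, vec in zip(weights, fifth_vectors))
             for i in range(dim)]
    return sixth, weights


def target_feature_vector(pos_vec: List[float], third_vec: List[float],
                          sixth_vec: List[float], prev_output: List[float]) -> List[float]:
    """Splice (concatenate) the four inputs of one decoding step."""
    return pos_vec + third_vec + sixth_vec + prev_output


sixth, weights = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
target = target_feature_vector([0.5], [0.1, 0.2], sixth, [1.0])
print(len(target))  # 6
```

In this toy case the hidden layer vector is most similar to the first of the two fifth feature vectors, so that vector receives the larger attention weight, which corresponds to the statement that the sixth feature vector "focuses on" the character currently being decoded.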

Referring to FIG. 5 , FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device includes a processor and memory. Optionally, the electronic device may further include a transceiver. For example, as shown in FIG. 5 , the electronic device 500 includes a transceiver 501 , a processor 502 and a memory 503 . They are connected by bus 504 . The memory 503 is used to store computer programs and data, and can transmit the data stored in the memory 503 to the processor 502 .

The processor 502 is configured to read the computer program in the memory 503 and perform the following operations: control the transceiver 501 to obtain the text to be regularized; perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character in the plurality of characters is used to represent the context information of each character in the plurality of characters; and perform regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.

In some possible implementations, in terms of encoding each character in the plurality of characters to obtain the first feature vector of each character in the plurality of characters, the processor 502 is specifically configured to perform the following operations: encode each character in the plurality of characters to obtain a character vector corresponding to each character in the plurality of characters; construct, with character A as the center, a first text corresponding to the character A, where the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and splice the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, where the first feature vector of the character A is used to represent the context information of the character A in the first text.

In some possible implementations, in terms of performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized to obtain the regular text of the text to be regularized, the processor 502 is specifically configured to perform the following operations: determine the attribute of each character in the plurality of characters according to the first feature vector of each character in the plurality of characters, where the attribute of each character in the plurality of characters includes standard character or non-standard character; use the standard characters in the text to be regularized as the regular characters of the standard characters; encode and decode the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters; and combine the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.

In some possible implementations, in terms of encoding and decoding the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized to obtain the regular characters of the non-standard characters, the processor 502 is specifically configured to perform the following operations: construct, with character B as the center, a second text corresponding to the character B, where the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1; encode each character in the second text through double-byte encoding to obtain a second feature vector of each character in the second text; input the second feature vector of each character in the second text into the Transformer-XL network to obtain the third feature vector corresponding to the character B, where the third feature vector of the character B is used to represent the context information of the character B in the second text; and encode and decode the character B according to the attribute of the character B, the third feature vector of the character B and the language type, to obtain the regular character corresponding to the character B.

In some possible implementations, in terms of encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B and the language type to obtain the regular character corresponding to the character B, the processor 502 is specifically configured to perform the following operations: perform word embedding processing on the character B to obtain the fourth feature vector of the character B; encode the attribute of the character B to obtain the part-of-speech vector corresponding to the character B; encode the language type to obtain a language vector, and use the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder respectively; input the fourth feature vector of the character B into the encoder and encode the character B to obtain the fifth feature vector of the character B; and input the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decode the character B to obtain the regular text corresponding to the character B.

In some possible implementations, in terms of inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B, the processor 502 is specifically configured to perform the following operations: encode the character B according to the hidden layer vector output by the last encoding of the encoder, the fourth feature vector of the character B and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.

In some possible implementations, in terms of inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decoding the character B to obtain the regular text corresponding to the character B, the processor 502 is specifically configured to perform the following operations: perform an attention mechanism operation on the hidden layer vector output by the last decoding of the decoder and the fifth feature vector corresponding to the character B to obtain the sixth feature vector corresponding to the character B; splice the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B and the decoding result of the last decoding of the decoder to obtain the target feature vector of the character B; and decode the character B according to the encoding parameters of the encoder and the target feature vector of the character B, to obtain the regular character corresponding to the character B.

Specifically, the transceiver 501 may be the acquisition unit 401 of the text regularization apparatus 400 of the embodiment shown in FIG. 4 , and the processor 502 may be the processing unit 402 of the text regularization apparatus 400 of the embodiment shown in FIG. 4 .

It should be understood that the text regularization device in this application may include smart phones (such as Android phones, iOS phones, Windows Phone phones, etc.), tablet computers, PDAs, notebook computers, mobile Internet devices (MID), wearable devices, and the like. The above devices are only examples and are not exhaustive; the text regularization device includes but is not limited to them. In practical applications, the text regularization device may further include an intelligent vehicle-mounted terminal, a computer device, and the like.

Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.

Optionally, the storage medium involved in this application, such as a computer-readable storage medium, may be non-volatile or volatile.

Embodiments of the present application further provide a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any one of the text regularization methods described in the foregoing method embodiments.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present application is not limited by the described action sequence, because in accordance with the present application certain steps may be performed in other orders or concurrently. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative: the division of the units is only a logical function division, and there may be other division methods in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.

The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store program code.

Those skilled in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable memory, and the memory can include: a flash disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, etc.

The embodiments of the present application have been introduced in detail above, and specific examples are used herein to describe the principles and implementations of the present application. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application; at the same time, for persons of ordinary skill in the art, there will be changes in the specific implementations and application scope based on the idea of the present application. In summary, the contents of this specification should not be construed as limitations on the present application.

Claims

A text regularization method, comprising:

acquiring text to be regularized;

performing character segmentation on the text to be regularized to obtain a plurality of characters;

encoding each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character in the plurality of characters is used to represent context information of each character in the plurality of characters; and

performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and a language type of the text to be regularized, to obtain regular text of the text to be regularized.
The method according to claim 1, wherein the encoding each character in the plurality of characters to obtain the first feature vector of each character in the plurality of characters comprises:

encoding each character in the plurality of characters to obtain a character vector corresponding to each character in the plurality of characters;

constructing, with character A as the center, a first text corresponding to the character A, the first text including X characters located before the character A in the text to be regularized, the character A, and Y characters located after the character A in the text to be regularized, wherein the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1; and

splicing the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A is used to represent context information of the character A in the first text.
The method according to claim 1 or 2, wherein the performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized, comprises:

determining an attribute of each character in the plurality of characters according to the first feature vector of each character in the plurality of characters, wherein the attribute of each character in the plurality of characters includes standard character or non-standard character;

using the standard characters in the text to be regularized as the regular characters of the standard characters;

encoding and decoding the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters; and

combining the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
The method according to claim 3, wherein the encoding and decoding the non-standard characters according to the language type and the first feature vector corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters, comprises:

constructing, with character B as the center, a second text corresponding to the character B, the second text including M characters located before the character B in the text to be regularized, the character B, and N characters located after the character B in the text to be regularized, wherein the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;

encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;

inputting the second feature vector of each character in the second text into the Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B is used to represent context information of the character B in the second text; and

encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B and the language type, to obtain the regular character corresponding to the character B.
The method according to claim 4, wherein the encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B and the language type, to obtain the regular character corresponding to the character B, comprises:

performing word embedding processing on the character B to obtain a fourth feature vector of the character B;

encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;

encoding the language type to obtain a language vector, and using the language vector as the encoding parameter of the encoder and the decoding parameter of the decoder respectively;

inputting the fourth feature vector of the character B into the encoder, and encoding the character B to obtain a fifth feature vector of the character B; and

inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder, and decoding the character B to obtain the regular text corresponding to the character B.
The method according to claim 5, wherein the inputting the fourth feature vector of the character B into the encoder for encoding to obtain the fifth feature vector of the character B comprises:

encoding the character B according to the hidden layer vector output by the last encoding of the encoder, the fourth feature vector of the character B and the encoding parameters of the encoder, to obtain the fifth feature vector of the character B.
The method according to claim 5 or 6, wherein the inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder, and decoding the character B to obtain the regular text corresponding to the character B, comprises:

performing an attention mechanism operation on the hidden layer vector output by the last decoding of the decoder and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;

splicing the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B and the decoding result of the last decoding of the decoder to obtain a target feature vector of the character B; and

decoding the character B according to the encoding parameters of the encoder and the target feature vector of the character B, to obtain the regular character corresponding to the character B.
A text regularization device, comprising:

an acquisition unit, configured to acquire text to be regularized; and

a processing unit, configured to: perform character segmentation on the text to be regularized to obtain a plurality of characters; encode each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character in the plurality of characters is used to represent context information of each character in the plurality of characters; and perform regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and a language type of the text to be regularized, to obtain regular text of the text to be regularized.
An electronic device, comprising: a processor and a memory, wherein the processor is connected to the memory, the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the electronic device performs the following method:

acquiring text to be regularized;

performing character segmentation on the text to be regularized to obtain a plurality of characters;

encoding each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character in the plurality of characters is used to represent context information of each character in the plurality of characters; and

performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and a language type of the text to be regularized, to obtain regular text of the text to be regularized.
The electronic device according to claim 9, wherein the encoding of each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters comprises:

encoding each character in the plurality of characters to obtain a character vector corresponding to each character in the plurality of characters;

constructing, with a character A as the center, a first text corresponding to the character A, wherein the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1;

splicing the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A is used to represent the context information of the character A in the first text.
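The window construction and splicing recited above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `first_feature_vector`, the padding symbol `PAD` used for window positions that fall outside the text, and the toy two-dimensional character vectors are all hypothetical, and the claim itself does not specify how text boundaries are handled.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding symbol for window positions outside the text


def first_feature_vector(
    chars: List[str],
    index: int,
    char_vectors: Dict[str, List[float]],
    x: int = 1,
    y: int = 1,
) -> List[float]:
    """Splice the vectors of the X characters before chars[index],
    the character itself, and the Y characters after it."""
    if x < 1 or y < 1:
        raise ValueError("X and Y must both be integers >= 1")
    spliced: List[float] = []
    for pos in range(index - x, index + y + 1):
        # Assumption: positions before the start or past the end use PAD.
        ch = chars[pos] if 0 <= pos < len(chars) else PAD
        spliced.extend(char_vectors[ch])
    return spliced


# Toy character vectors (illustrative values only).
vectors = {PAD: [0.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
text = list("abc")
features = [first_feature_vector(text, i, vectors) for i in range(len(text))]
```

With X = Y = 1 and two-dimensional character vectors, each first feature vector in this sketch has 3 × 2 = 6 components, one block per character of the first text.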
The electronic device according to claim 9 or 10, wherein the performing of regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized, comprises:

determining an attribute of each character in the plurality of characters according to the first feature vector of each character in the plurality of characters, wherein the attribute of each character in the plurality of characters includes standard character or non-standard character;

using each standard character in the text to be regularized as the regular character of that standard character;

encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters;

combining the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
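The control flow of these steps (classify each character, pass standard characters through unchanged, convert non-standard characters, then recombine in order) might be sketched as follows. The classifier `toy_is_standard` and the converter `toy_convert` are hypothetical stand-ins for the trained models; in particular, treating digits as the only non-standard characters and using a lookup table per language are assumptions made purely for illustration.

```python
from typing import Callable, List


def regularize(
    chars: List[str],
    is_standard: Callable[[str], bool],
    convert: Callable[[str, str], str],
    language: str,
) -> str:
    """Keep standard characters, convert non-standard ones,
    and join the regular characters in their original order."""
    regular_chars = []
    for ch in chars:
        if is_standard(ch):
            # A standard character serves as its own regular character.
            regular_chars.append(ch)
        else:
            regular_chars.append(convert(ch, language))
    return "".join(regular_chars)


# Hypothetical lookup tables standing in for the encoder-decoder model.
DIGITS = {
    "en": {"1": "one", "2": "two", "3": "three"},
    "zh": {"1": "一", "2": "二", "3": "三"},
}


def toy_is_standard(ch: str) -> bool:
    # Assumption: only digits count as non-standard in this toy example.
    return not ch.isdigit()


def toy_convert(ch: str, language: str) -> str:
    return DIGITS[language].get(ch, ch)


english = regularize(list("room 2"), toy_is_standard, toy_convert, "en")
chinese = regularize(list("第3页"), toy_is_standard, toy_convert, "zh")
```

The same non-standard character maps to different regular characters depending on the language type, which is why the language type is passed to the conversion step.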
The electronic device according to claim 11, wherein the encoding and decoding of the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters, comprises:

constructing, with a character B as the center, a second text corresponding to the character B, wherein the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;

encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;

inputting the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B is used to represent the context information of the character B in the second text;

encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B.
The electronic device according to claim 12, wherein the encoding and decoding of the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B, comprises:

performing word embedding processing on the character B to obtain a fourth feature vector of the character B;

encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;

encoding the language type to obtain a language vector, and using the language vector as the encoding parameter of an encoder and as the decoding parameter of a decoder;

inputting the fourth feature vector of the character B into the encoder and encoding the character B to obtain a fifth feature vector of the character B;

inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decoding the character B to obtain the regular text corresponding to the character B.
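As one possible reading of the attribute and language encoding steps, the sketch below uses one-hot encoding. This is an assumption: the claim does not state which encoding is used, and the vocabularies `LANGUAGES` and `ATTRIBUTES` and the helper `one_hot` are hypothetical names introduced only for illustration.

```python
from typing import List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if item == value else 0.0 for item in vocabulary]


# Hypothetical vocabularies; the real sets of languages and attributes
# are not specified in the claim.
LANGUAGES = ("zh", "en")
ATTRIBUTES = ("standard", "non-standard")

language_vector = one_hot("en", LANGUAGES)
part_of_speech_vector = one_hot("non-standard", ATTRIBUTES)
```

A learned embedding table would serve the same purpose; one-hot vectors are used here only because they need no trained parameters.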
The electronic device according to claim 13, wherein the inputting of the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and the decoding of the character B to obtain the regular text corresponding to the character B comprises:

performing an attention mechanism operation on the hidden-layer vector output by the decoder's previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;

splicing the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the result of the decoder's previous decoding step to obtain a target feature vector of the character B;

decoding the character B according to the encoding parameters of the encoder and the target feature vector of the character B to obtain the regular character corresponding to the character B.
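The attention and splicing steps can be sketched in plain Python under two assumptions that the claim does not confirm: that the fifth feature vector is available as a list of encoder output vectors, and that the attention mechanism is dot-product attention with a softmax. All function names and numeric values below are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def attention_context(prev_hidden: Vector, encoder_outputs: List[Vector]) -> Vector:
    """Dot-product attention (an assumption): weight each encoder output by
    its softmax-normalized similarity to the previous decoder hidden vector."""
    scores = [dot(prev_hidden, h) for h in encoder_outputs]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_outputs)) for i in range(dim)]


def target_feature_vector(
    pos_vec: Vector, third_vec: Vector, sixth_vec: Vector, prev_output: Vector
) -> Vector:
    """Splice the four inputs in the order recited in the claim."""
    return pos_vec + third_vec + sixth_vec + prev_output


# Toy values for illustration only.
prev_hidden = [0.0, 0.0]
fifth = [[1.0, 0.0], [0.0, 1.0]]
sixth = attention_context(prev_hidden, fifth)
target = target_feature_vector([0.0, 1.0], [0.2], sixth, [1.0])
```

Because the previous hidden vector in this toy example is all zeros, both encoder outputs receive equal attention weight and the sixth feature vector is their average.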
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the following method:

acquiring text to be regularized;

performing character segmentation on the text to be regularized to obtain a plurality of characters;

encoding each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters, wherein the first feature vector of each character is used to represent context information of that character;

performing regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized.
The computer-readable storage medium according to claim 15, wherein the encoding of each character in the plurality of characters to obtain a first feature vector of each character in the plurality of characters comprises:

encoding each character in the plurality of characters to obtain a character vector corresponding to each character in the plurality of characters;

constructing, with a character A as the center, a first text corresponding to the character A, wherein the first text includes the X characters located before the character A in the text to be regularized, the character A, and the Y characters located after the character A in the text to be regularized, the character A is any one of the plurality of characters, and X and Y are both integers greater than or equal to 1;

splicing the character vectors corresponding to the characters in the first text to obtain the first feature vector of the character A, wherein the first feature vector of the character A is used to represent the context information of the character A in the first text.
The computer-readable storage medium according to claim 15 or 16, wherein the performing of regularization processing on the text to be regularized according to the first feature vector of each character in the plurality of characters and the language type of the text to be regularized, to obtain the regular text of the text to be regularized, comprises:

determining an attribute of each character in the plurality of characters according to the first feature vector of each character in the plurality of characters, wherein the attribute of each character in the plurality of characters includes standard character or non-standard character;

using each standard character in the text to be regularized as the regular character of that standard character;

encoding and decoding the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters;

combining the regular characters of the standard characters and the regular characters of the non-standard characters in the text to be regularized to obtain the regular text of the text to be regularized.
The computer-readable storage medium according to claim 17, wherein the encoding and decoding of the non-standard characters according to the language type and the first feature vectors corresponding to the non-standard characters in the text to be regularized, to obtain the regular characters of the non-standard characters, comprises:

constructing, with a character B as the center, a second text corresponding to the character B, wherein the second text includes the M characters located before the character B in the text to be regularized, the character B, and the N characters located after the character B in the text to be regularized, the character B is any non-standard character in the text to be regularized, and M and N are both integers greater than or equal to 1;

encoding each character in the second text by double-byte encoding to obtain a second feature vector of each character in the second text;

inputting the second feature vector of each character in the second text into a Transformer-XL network to obtain a third feature vector corresponding to the character B, wherein the third feature vector of the character B is used to represent the context information of the character B in the second text;

encoding and decoding the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B.
The computer-readable storage medium according to claim 18, wherein the encoding and decoding of the character B according to the attribute of the character B, the third feature vector of the character B, and the language type, to obtain the regular character corresponding to the character B, comprises:

performing word embedding processing on the character B to obtain a fourth feature vector of the character B;

encoding the attribute of the character B to obtain a part-of-speech vector corresponding to the character B;

encoding the language type to obtain a language vector, and using the language vector as the encoding parameter of an encoder and as the decoding parameter of a decoder;

inputting the fourth feature vector of the character B into the encoder and encoding the character B to obtain a fifth feature vector of the character B;

inputting the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and decoding the character B to obtain the regular text corresponding to the character B.
The computer-readable storage medium according to claim 19, wherein the inputting of the part-of-speech vector of the character B and the fifth feature vector of the character B into the decoder and the decoding of the character B to obtain the regular text corresponding to the character B comprises:

performing an attention mechanism operation on the hidden-layer vector output by the decoder's previous decoding step and the fifth feature vector corresponding to the character B, to obtain a sixth feature vector corresponding to the character B;

splicing the part-of-speech vector of the character B, the third feature vector of the character B, the sixth feature vector of the character B, and the result of the decoder's previous decoding step to obtain a target feature vector of the character B;

decoding the character B according to the encoding parameters of the encoder and the target feature vector of the character B to obtain the regular character corresponding to the character B.