CN117131842A - WFST-based method for realizing multi-language mixed text regularization and anti-regularization - Google Patents


Info

Publication number
CN117131842A
CN117131842A (application CN202311406988.7A)
Authority
CN
China
Prior art keywords
text
regularization
wfst
rule
fst
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311406988.7A
Other languages
Chinese (zh)
Other versions
CN117131842B (en)
Inventor
孟博华
张句
王宇光
王龙标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huiyan Technology Tianjin Co ltd
Original Assignee
Huiyan Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huiyan Technology Tianjin Co ltd filed Critical Huiyan Technology Tianjin Co ltd
Priority to CN202311406988.7A priority Critical patent/CN117131842B/en
Publication of CN117131842A publication Critical patent/CN117131842A/en
Application granted granted Critical
Publication of CN117131842B publication Critical patent/CN117131842B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/151 Transformation
    • G06F 40/16 Automatic learning of transformation rules, e.g. from examples
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a WFST-based method for realizing regularization and anti-regularization of multi-language mixed text. The input text, optionally annotated with SSML to control different languages or the requirements of different transcription environments, is tagged for the specified language or mixture of languages, transcribed according to rules designed for the target text, and then decoded and output according to a user-defined ordering rule and decoding logic, thereby realizing the regularization and anti-regularization of the text. The method greatly reduces the cost of manual judgment and improves the efficiency and accuracy of speech synthesis and speech recognition. The invention offers high coverage of the specified text, an extremely low mistranscription rate, quick and convenient modification, and high robustness and applicability across different scenes and different languages.

Description

WFST-based method for realizing multi-language mixed text regularization and anti-regularization
Technical Field
The invention relates to the technical field of speech recognition, in particular to a WFST-based method for realizing regularization and anti-regularization of multi-language mixed text: a WFST framework capable of recognizing multiple languages with a single set of tags.
Background
Text regularization (Text Normalization, TN) and anti-regularization (Inverse Text Normalization, ITN) are indispensable parts of a complete speech interaction system. The former is widely used in the front-end processing of speech synthesis systems, while the latter affects the look and feel of subtitles when the recognized text of a speech recognition system is displayed on screen. With the rapid development of information technology and the continuous progress of artificial intelligence, a mixed-text regularization and anti-regularization processing framework has become an important technical means for processing text data.
The TN/ITN systems currently studied in academia fall mainly into three types. The first is the grammar-based WFST, a framework composed of a large number of grammar rules; its advantages are that it is accurate and controllable and that bugs can be fixed promptly, while its disadvantage is that it is not robust enough for ambiguous text. The second is the end-to-end neural-network model, which replaces basic grammar rules with labeled text and training sets made as large as possible to cover wider data; the cost of modifying a trained network of this kind is very high, and although the grammar of the system's transcription may be reasonable, the semantics are difficult to control. The third mixes rules with a network: when no suitable grammar rule matches, the neural network is used instead. This approach balances the problems of the previous two frameworks, but the training of the network is difficult to control and consumes considerable resources.
Among open-source TN/ITN projects worldwide, the C++ framework Sparrowhawk offered by Google has the widest adoption. Its disadvantage is that it is only a rule execution engine; Google has not open-sourced grammar rules for the related languages. In addition, the Sparrowhawk implementation relies on many third-party open-source libraries (including OpenFst, Thrax, re2, and protobuf), so the overall framework is not simple and lightweight enough.
Another relatively mature project is nemo_text_processing from NVIDIA, which still uses Sparrowhawk as the deployment tool in production environments. Unlike Google, this project also open-sources regular grammars for multiple languages such as English, German, and Russian. In the field of Chinese TN/ITN rules, third-party individual developers such as Jiayu have open-sourced the customized Chinese TN/ITN rule base chinese_text_normalization.
Disclosure of Invention
In view of the defects of the prior art, the invention develops a TN/ITN tool spanning multiple languages (such as Chinese, English, Tibetan, mixed Chinese-English, and so on). The tool not only supports transcription for a single language and a single requirement, but can also handle the transcription requirements of mixed languages. It covers the functionality of Sparrowhawk while optimizing the dependencies to be more lightweight, provides an interface for Python batch transcription, and also provides a C++ rule-processing engine that is lighter than Sparrowhawk.
The technical scheme of the invention is that the method for realizing multi-language mixed text regularization and anti-regularization based on WFST comprises the following steps:
step 1) researching the target languages, summarizing the use conditions of the numbers processed in the target languages, counting and classifying the numbers, and determining the encoding and decoding methods used by the target languages;
step 2) rule design is carried out on each classified category, meanwhile, priority ordering is carried out on different classifications according to the quantity counted in the step 1), and standard processing is carried out on the condition that ambiguity occurs in the same text;
step 3) generating a weighted fst file by using a WFST method, and optimizing the file to obtain an optimized fst file;
step 4) testing the optimized fst file;
step 5) using the optimized fst file to identify, match, and classify the input text, and tagging it according to the rules of the different classes to obtain a text containing labels;
step 6) transcribing the text containing the labels;
step 7) sequencing the transcribed text obtained in the step 6) according to Reorder logic;
step 8) decoding according to the decoding rule to obtain the decoded text, and outputting the decoded text.
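The eight steps above boil down at run time to a tag → reorder → verbalize pipeline. The following stdlib-only Python sketch illustrates that control flow under simplifying assumptions (the real system compiles WFST rule files; the regex patterns, category names, and verbalizations here are invented stand-ins):

```python
import re

# Tagger: label substrings that match a category rule; everything else is "char".
# (The patterns below are illustrative stand-ins for compiled WFST rules.)
RULES = [("time", re.compile(r"\d+:\d+")), ("cardinal", re.compile(r"\d+"))]

def tag(text):
    tokens, i = [], 0
    while i < len(text):
        for name, pat in RULES:
            m = pat.match(text, i)
            if m:
                tokens.append((name, m.group()))
                i = m.end()
                break
        else:
            tokens.append(("char", text[i]))
            i += 1
    return tokens

# Verbalizer: rewrite each tagged token; "char" tokens pass through unchanged.
DIGITS = "zero one two three four five six seven eight nine".split()

def verbalize(tokens):
    out = []
    for name, value in tokens:
        if name == "cardinal":        # read digits out one by one
            out.append(" ".join(DIGITS[int(d)] for d in value))
        elif name == "time":          # simplified hour/minute reading
            h, m = value.split(":")
            out.append(f"{int(h)} hour {int(m)} minute")
        else:
            out.append(value)
    return "".join(out)

print(verbalize(tag("call 10086")))   # -> call one zero zero eight six
```

The Reorder stage is a no-op in this sketch because each token carries a single field; step 7) becomes meaningful once a token has multiple fields (e.g. a date's year, month, and day).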
Further, because the same number may appear in different contexts, the rule design of step 2) assigns different weights to different categories according to the context and the situation in which the number appears when constructing the rules.
Further, the test content of the pressure test in step 4) mainly includes the encoding and decoding speed and the size of the generated graph.
Further, step 4) adds a corresponding whitelist to handle cases of severe ambiguity or special, specific vocabulary.
Further, after the test, step 4) adds a preprocessing step, which includes traditional-to-simplified conversion of Chinese characters, half-width/full-width conversion of punctuation marks, deletion of specified Chinese words, English case conversion, and removal of redundant spaces.
Further, step 5) uniformly classifies characters that fall into no category as the "char" category and performs no transcription on them.
Further, step 8) cleans up redundant characters and redundant spaces.
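Most of the preprocessing named in the claims above can be sketched with the Python standard library alone (the traditional-to-simplified Chinese conversion is omitted because it needs an external mapping table; NFKC normalization is used here as a stand-in for the half-width/full-width conversion):

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # NFKC folds full-width letters/punctuation (and the ideographic space)
    # into their half-width compatibility equivalents.
    text = unicodedata.normalize("NFKC", text)
    # English case conversion: fold to lower case so rules handle one case.
    text = text.lower()
    # Delete redundant spaces: collapse whitespace runs and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(preprocess("Ｈｅｌｌｏ，  ＷＯＲＬＤ！"))  # -> hello, world!
```

Whether to fold case up or down (and whether punctuation is kept) is a policy choice; the patent exposes such choices as parameters.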
Further, the rules in step 2) and their grammatical features are as follows:
StringFile: this grammar loads a list of strings or string pairs and compiles the characters therein into an fst expressed as the union of these strings, which is equivalent to ("string1|string2|string3|…"); note that the "|" symbol is the union operation described below;
cdrewrite: used for rewriting certain strings within a given context;
closure: represents the closure operation of regular expressions; the usual operators are: "*" means matching the string 0-N times; "+" means matching the string 1-N times; "?" means matching 0-1 times; "{a,b}" means matching a-b times; ".ques" and ".plus" can also be used to represent matching a string 0-1 times and 1-∞ times, respectively;
concat: when concat is used on strings, it splices multiple strings together, linked by spaces; when used on FSTs, it computes the concatenation of two FSTs; the return value is also an FST, and its symbol is abbreviated as "+", used between two FSTs;
difference: when used on strings, diff means matching the first string but not the second; the symbol is abbreviated as "-", e.g. "fst1-fst2", and the return value is also an fst;
composition: represents the composition of two fsts; in terms of fst operations, the output of fst1 becomes the input of fst2; in terms of strings, the requirement of fst2 must also be met on the premise that fst1 is satisfied, or some paths are modified on the basis of fst1, with the modification rule specified by fst2;
union: represents the OR of two fsts, where fst1|fst2 means that the string satisfies the rule of fst1 or the rule of fst2;
weight: means assigning a weight to a certain arc or sub-graph of the target fst;
optimize: represents automatic optimization of a specified fst, including epsilon-removal (deleting epsilon arcs), determinization (making the fst deterministic), and minimization (minimizing the fst, i.e., deleting redundant arcs);
basic operations include cross, delete, insert, and accep;
cross: forces a state in one fst to add an arc transitioning to another state, mapping the given input to a different output;
delete: adds to the fst an arc whose input is the given string and whose output is empty;
insert: adds the specified content to the output as the arc is traversed;
accep: the input and output of the arc are the same designated string, in which case the fst is equivalent to an fsa.
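The closure operators listed above have the same matching semantics as their regular-expression counterparts, which makes them easy to sanity-check outside a WFST toolchain; a small stdlib sketch:

```python
import re

# Each WFST closure operator paired with a regex of identical matching semantics.
OPERATORS = {
    "*": r"(ab)*",          # 0-N repetitions of "ab"
    "+": r"(ab)+",          # 1-N repetitions
    "?": r"(ab)?",          # 0-1 repetitions
    "{2,3}": r"(ab){2,3}",  # 2-3 repetitions
}

def accepts(op: str, s: str) -> bool:
    # True if the whole string s is accepted under the given closure operator.
    return re.fullmatch(OPERATORS[op], s) is not None

print(accepts("*", ""), accepts("+", ""))  # -> True False
```

The ".ques" and ".plus" properties mentioned above are just alternative spellings of "?" and "+" applied to an existing fst.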
Advantageous effects
1. Support for multiple languages is integrated, realizing the processing of mixed-text transcription requirements; at the same time the design of the rule engine is optimized, making it lighter in processing speed and resource consumption. This enables users to perform cross-lingual text transcription more conveniently and to handle complex transcription tasks more efficiently.
2. In traditional TN and ITN, each language has its own separate pipeline code, with its own encoder part (Normalizer, InverseNormalizer), a sorter that orders the tagged text, and a decoder part (verbalizer) for the tagged text. The TN and ITN designed by the invention can directly transcribe mixed text by simplifying the rule part and modifying the encoding and sorting parts, and at the same time make bug fixing and rule authoring convenient.
3. When constructing the rules, different weights are assigned to different categories according to the context and the situation in which a number appears; in addition, on the basis of basic TN and ITN, a controllable parameter is added. By default the invention transcribes with the designed weights, but when a user has a specific requirement, a single category can be selected for transcription, preventing the designed weights from conflicting with the user's target. Before performing TN and ITN, the text needs to be cleaned accordingly to ensure the accuracy of transcription.
4. After the fst file is generated, the invention can be combined with other secondary development to realize both streaming services (real-time recognition, i.e., recognition begins before a sentence is finished) and non-streaming services (non-real-time recognition, i.e., recognition is performed only after a complete sentence has been input).
5. Some preprocessing is performed before encoding, such as case conversion, converting punctuation into blanks, deleting blacklisted words, and so on (these operations can of course be adjusted through parameters to meet customization requirements), yielding better transcription robustness; meanwhile, to improve accuracy, a corresponding whitelist is added to handle cases of particularly severe ambiguity or special vocabulary.
6. To ensure the stability of transcription, the invention also adds a char character rule ahead of every other rule, so that text matching no designated rule can be output unchanged (identity) without special transcription.
Drawings
Fig. 1 is a flow chart.
Detailed Description
The invention is further described below with reference to the accompanying drawings. FIG. 1 is a flow chart of the present invention.
The invention discloses a WFST-based method for realizing regularization and anti-regularization of multi-language mixed text. The input text, optionally annotated with SSML to control different languages or the requirements of different transcription environments, is tagged for the specified language or mixture of languages, transcribed according to rules designed for the target text, and decoded and output according to a user-defined ordering rule and decoding logic, thereby performing the regularization and anti-regularization of the text. The method specifically comprises the following steps:
step 1) researching the target languages, summarizing the use conditions of the numbers processed in the target languages, counting and classifying the numbers, and determining the encoding and decoding methods used by the target languages;
step 2) rule design is carried out on each classified category, meanwhile, priority ordering is carried out on different classifications according to the quantity counted in the step 1), and standard processing is carried out on the condition that ambiguity occurs in the same text;
when the rules are designed, different weights are given to different categories according to the context and the situation in which the number appears, since the same number may have different readings in different contexts;
step 3) generating a weighted fst file by using a WFST method, and optimizing the file to obtain an optimized fst file;
step 4) testing the optimized fst file;
the test content of the pressure test mainly comprises the encoding and decoding speed and the size of a generated diagram.
The corresponding whitelist is added to deal with the case of a particularly severe ambiguity or a special specific vocabulary.
Adding preprocessing steps, including complex-simplified conversion in Chinese characters, half-angle full-angle conversion of punctuation marks and deletion of Chinese words; performing English case conversion and deleting redundant space;
step 5) identifying, matching and classifying the optimized fst file of the target language, and marking according to rules of different classes to obtain a text containing labels; for the characters which are not in the category, the characters are uniformly classified into a char category, and no transcription is performed;
step 6) transferring the text containing the label;
step 7) sequencing the transcribed text obtained in the step 6) according to Reorder logic;
step 8) decoding according to the decoding rule to obtain a decoded text, and outputting the decoded text; and optimizing the redundant characters and redundant spaces.
Rule modeling: the WFST model is created by writing rules to classify and transcribe the input content. This allows rules to be written according to specific requirements and translated into the corresponding grammar-rule strings.
Text matching and transcription: the input text is matched against the pre-constructed model to achieve the required transcription effect. When writing the rules, different rules can be written for different data and scenes, and transcription tasks under specific conditions can be realized by assigning weights.
Multilingual hybrid modeling: the mixing of different languages is taken into account, so mixed Chinese-English modeling is added during model construction. This improves the robustness of the framework, allowing it to better adapt to the transcription requirements of mixed text in different languages.
Support for multiple languages is integrated, realizing the processing of mixed-text transcription requirements; at the same time the design of the rule engine is optimized, making it lighter in processing speed and resource consumption. This enables users to perform cross-lingual text transcription more conveniently and to handle complex transcription tasks more efficiently.
First, the logic of TN and ITN is identical; both comprise three parts: the Tagger, the Reorder, and the Verbalizer. The Tagger is responsible for analyzing the input text to obtain dictionary-like structured information; the Reorder is responsible for adjusting the order of the structured information; finally, the Verbalizer is responsible for extracting, splicing, and displaying the tagged data.
Further, rule modeling is mainly implemented by writing rules; by changing the weights of different routes, an fst graph is generated. The rules used later and some grammatical features are described in detail here:
StringFile: this grammar loads a list of strings or string pairs and compiles the characters therein into an fst represented as the union of these strings, which is equivalent to ("string1|string2|string3|…"); note that the "|" symbol is the union operation described below.
Cdrewrite: used for rewriting certain strings within a given context. For example, cdrewrite(cross("s", "z"), "", "d[EOS]", sigma) changes an "s" that precedes a "d" at the end of the string into "z". Note that '[EOS]' in a rule matches the end of the string, and '[BOS]' matches the beginning of the string.
Closure: represents the closure operation of regular expressions. The usual operators are: "*" means matching the string 0-N times; "+" means matching the string 1-N times; "?" means matching 0-1 times; "{a,b}" means matching a-b times; at the same time, ".ques" and ".plus" can also be used to represent matching a string 0-1 times and 1-∞ times, respectively.
Concat: when used on strings, concat splices multiple strings together, linked by spaces; when used on FSTs, it computes the concatenation of two FSTs. The return value is also an FST, and the symbol is abbreviated as "+", used between two fsts.
Difference: when used on strings, diff means matching the first string but not the second; the symbol is abbreviated as "-", e.g. "fst1-fst2", and the return value is also an fst.
Composition: composes two fsts; in terms of fst operations, the output of fst1 is taken as the input of fst2. In terms of strings, it means that the requirement of fst2 must be met on the premise that fst1 is satisfied, or that some paths are modified on the basis of fst1, with the modification rule specified by fst2.
Union: simply performs an OR operation on two fsts; fst1|fst2 means that the string satisfies the rule of fst1 or the rule of fst2.
Weight: indicates that a weight is assigned to a certain arc or sub-graph of the target FST, e.g. "fst |= add_weight(accep('one'), 100)" indicates that this operation is performed on the FST graph and the weight is increased by 100 when the portion of the string equal to 'one' is recognized.
Optimize: the method for automatically optimizing a specified fst, which specifically comprises:
epsilon-removal: deleting epsilon arcs;
determinization: making the fst deterministic;
minimization: minimizing the fst, i.e., deleting redundant arcs.
Basic operations include cross, delete, insert, accep, and the like.
cross: forces a state in one fst to add an arc transitioning to another state while mapping the input to a different output; for example, cross('ten', '10') converts the string "ten" into "10";
delete: adds to the fst an arc whose input is the given string and whose output is empty; for example, delete('ten') deletes 'ten';
insert: adds the specified content to the output as the arc is traversed; for example, insert('ten') adds the string 'ten' when the arc is traversed. accep: the input and output of the arc are the same designated string, in which case the FST is equivalent to an FSA; for example, accep('mega') represents an arc that accepts 'mega' unchanged.
Further, after the rules are designed, the input text needs to be processed, which is mainly divided into three parts:
first, the different parts of the text are marked, i.e., tagged;
second, the Reorder operation is performed on each tagged text;
finally, the labels are removed from each tagged text, yielding the transcribed text.
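The final label-removal stage can be illustrated with a small parser for tokens written in the name { field: "value" } notation used in the examples later in this description (joining the values with spaces is a simplification; the real verbalizer splices values according to per-category rules):

```python
import re

# One regex for whole tokens, one for the quoted fields inside a token body.
TOKEN = re.compile(r'(\w+)\s*\{([^}]*)\}')
FIELD = re.compile(r'(\w+):\s*"([^"]*)"')

def strip_labels(tagged: str) -> str:
    # Keep only the quoted field values, dropping token and field names.
    pieces = []
    for _name, body in TOKEN.findall(tagged):
        pieces.extend(value for _field, value in FIELD.findall(body))
    return " ".join(pieces)

print(strip_labels('time { hour: "5" minute: "28" } char { value: "ok" }'))
```

This recovers "5 28 ok" from the tagged form; in the full system the Reorder stage runs before this step so the fields come out in the canonical order.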
In traditional TN and ITN, each language has its own separate pipeline code, with its own encoder part (Normalizer, InverseNormalizer), a sorter that orders the tagged text, and a decoder part (verbalizer) for the tagged text. The TN and ITN designed by the invention can directly transcribe mixed text by simplifying the rule part and modifying the encoding and sorting parts, while making bug fixing and the addition of new rules convenient.
Further, as the current input text passes through the encoder, corresponding tokens are generated to identify it, so that each word-level input has a corresponding tag. The resulting tokenized text is fed into the sorter, which sorts the current text according to the specified order and surfaces anomalies (e.g., redundant spaces or irregular inputs, as described further below).
Because the same number may appear in different contexts with different readings, different weights are given to different categories according to the context and situation of the number when the rules are constructed; in addition, on the basis of basic TN and ITN, a controllable parameter is added. By default the invention transcribes with the designed weights, but when the user has a specific need, a single category can be selected for transcription, preventing the designed weights from conflicting with the user's target. Therefore, the text needs to be cleaned accordingly before TN and ITN are performed, ensuring the accuracy of transcription.
Because languages now circulate and develop together, transcribing only a single language obviously cannot meet market demand, so digital transcription of Chinese, English, Tibetan, and mixed Chinese-English has been realized to match the requirements of different tasks. At the same time, considering the transcription of large amounts of text, the Python service is somewhat slow, so C++ versions of the TN and ITN frameworks were developed in parallel.
It should be noted that after the fst file is generated, this design can be combined with other secondary development to realize both streaming services (real-time recognition, i.e., recognition begins before a sentence is finished) and non-streaming services (non-real-time recognition, i.e., recognition is performed only after a complete sentence has been input).
Further, taking English as an example: when developing English, it is first necessary to study the places where numbers appear in English and then classify them, for example into the categories "cardinal, date, digits, fraction, domain, measure, money, time, telephone, address, id"; a classification rule is designed for each category, and the rules are finally combined into one fst file, which is the encoder part of English TN. Meanwhile, for the robustness of transcription, some preprocessing is performed before encoding, such as case conversion, converting punctuation into blanks, deleting blacklisted words, and so on (these operations can of course be adjusted through parameters to meet customization requirements); for accuracy, a corresponding whitelist is also added to handle cases of severe ambiguity or special vocabulary. Meanwhile, for the stability of transcription, a char rule is added ahead of every other rule so that text matching no rule can be output unchanged without special transcription.
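The whitelist and the char fallback described above can be sketched as an exact-match lookup consulted before the ordinary rules; the entries, the function name, and the digit rule here are invented for illustration:

```python
# Hypothetical whitelist: exact strings that must bypass the numeric rules.
WHITELIST = {"3M": "three M", "7-Eleven": "seven eleven"}

DIGITS = "zero one two three four five six seven eight nine".split()

def transcribe_token(token: str) -> str:
    if token in WHITELIST:          # whitelist wins over every other rule
        return WHITELIST[token]
    if token.isdigit():             # stand-in for the digits/cardinal rule
        return " ".join(DIGITS[int(d)] for d in token)
    return token                    # "char" fallback: identity output

print(transcribe_token("3M"))       # -> three M
print(transcribe_token("42"))       # -> four two
```

Ordering the whitelist first is what prevents a numeric rule from mangling an exceptional spelling, exactly the role the patent assigns it.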
The rule-design part takes Mandarin Chinese as the target language; the transcription method is text regularization, as follows. Through investigation and summarization, Chinese numeric usage mainly comprises the cases "cardinal, date, fraction, measure, telephone, time, and sport score", each given a certain weight according to its frequency of occurrence.
date = add_weight(Date().tagger, 1.02)
whitelist = add_weight(Whitelist().tagger, 1.03)
measure = add_weight(Measure().tagger, 1.06)
sport = add_weight(Sport().tagger, 1.04)
fraction = add_weight(Fraction().tagger, 1.05)
money = add_weight(Money().tagger, 1.05)
time = add_weight(Time().tagger, 1.05)
cardinal = add_weight(Cardinal().tagger, 1.07)
math = add_weight(Math().tagger, 90)
telephone = add_weight(Telephone().tagger, 1.07)
char = add_weight(Char().tagger, 100)
cardi = add_weight(Cardi().tagger, 10)
Rules are then designed for each category. For example, in the cardinal category, different lookup tables (number table, symbol table, operator table) are designed, while the logic for ±99999999 (tens of millions) is designed separately according to the usage rules. Meanwhile, considering that in some contexts the digit 2 changes reading in Chinese, different weights are added under different conditions to achieve the intended transcription. At the same step, different SSML request responses are designed according to different requirements, and a separate fst is generated for each so it can be called on demand. The encoder for cardinals is thus complete; different orders are then designed according to everyday logic. For example, in TN an order dictionary can be designed as: TN_ORDERS = {'date': ['year', 'month', 'day'], 'fraction': ['denominator', 'numerator'], 'measure': ['denominator', 'numerator', 'value'], 'money': ['value', 'currency'], 'time': ['noon', 'hour', 'minute', 'second']}; the other categories carry only a single value tag and therefore need no reordering. The decoder is then designed based on this ordering result; decoding uses the UTF-8 format, and the text inside the tags is parsed according to the tags previously applied by the encoder part. When the input text matches no rule, the input content is output unchanged and the backend receives the return value 'invalid'.
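The Reorder step driven by a per-category field order, as in the TN_ORDERS dictionary described above, can be sketched as follows (only two categories are shown, and the field names mirror those in the text):

```python
# Field order per token type; mirrors the TN_ORDERS dictionary in the text.
TN_ORDERS = {
    "date": ["year", "month", "day"],
    "money": ["value", "currency"],
}

def reorder(token_type, fields):
    # Emit field values in the canonical order for the category;
    # categories with a single value tag keep their input order.
    order = TN_ORDERS.get(token_type)
    if order is None:
        return list(fields.values())
    return [fields[k] for k in order if k in fields]

print(reorder("date", {"day": "5", "year": "2023", "month": "10"}))
```

The verbalizer then concatenates the reordered values, which is why a wrongly ordered date such as day-year-month can never reach the output.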
Next follows an operation example of the transcription part; the selected language is Chinese. The following test examples are used for the TN already designed:
"The time a student tested for 800 meters was 5 minutes 28 seconds"
"If anything comes up, please dial 022-12345678, or dial 10086 to inquire about the phone-bill balance"
"Dark vegetables should account for 1/2 of the total vegetable weight"
"8GB+128GB capacity, 2399 yuan".
These test samples were used for ITN:
"one thousand two hundred thirty four elements, one thousand two hundred thirty four dollars, one thousand two hundred thirty four yen"
"July seventeenth, two zero two three"
"score is one thousand hundred and nineteen, thirty-five degrees celsius, one hundred and thirty-five degrees celsius"
"the result of calculation is one point zero four". Whether the rule is identical to the predetermined rule is checked by checking the result of the trigger encoder.
In TN, Example 1: the ordinary characters of the sentence are each tagged as char { value: ... } tokens, while "800 meters" is tagged as a measure token and "5 minutes 28 seconds" as a time token; after the decoder removes the labels, the output result is: "the performance of a student tested eight hundred meters for a certain time is five minutes twenty-eight seconds".
For Example 2, the labeling result tags each ordinary character as char { value: ... } and tags the two telephone numbers { value: "022-12345678" } and { value: "10086" } as number strings to be read digit by digit. The final transcription result is: "If there is something, please dial zero two two one two three four five six seven eight, or dial one zero zero eight six to inquire about the telephone charge balance".
For Example 3, the intermediate values tag each ordinary character as char { value: ... } ("deep", "color", "vegetable", and so on), followed by fraction { denominator: "2" numerator: "1" }. The end result: "Dark vegetables should make up one half of the total vegetables".
For Example 4: measure { value: "8GB+128GB" } char { value: "capacity" } measure { value: "2399 yuan" }. Result: "eight gigabytes plus one hundred twenty-eight gigabytes capacity two thousand three hundred ninety-nine yuan".
In ITN, Example 5: when the control input-key is "default", the default transcription is used and the items are classified as measure words. The labeled text is: measure { value: "one thousand two hundred thirty-four yuan" } char { value: "," } measure { value: "one thousand two hundred thirty-four dollars" } char { value: "," } measure { value: "one thousand two hundred thirty-four yen" }, and the final transcription result is: "1234 yuan, 1234 dollars, 1234 yen". When the control-key is "money", the "money" fst is used instead, and the labels are: money { currency: "yuan" value: "one thousand two hundred thirty-four" } char { value: "," } money { currency: "dollar" value: "one thousand two hundred thirty-four" } char { value: "," } money { currency: "yen" value: "one thousand two hundred thirty-four" }; the result is then transcribed as: "¥1234, $1234, JPY1234".
For Example 6, the label is: date { year: "two zero two three" month: "seven" day: "seventeen" }, and its final transcription result is: "July 17, 2023", with the year, month, and day written in digits.
For Example 7, the first string of digits, having no suffix, is identified as a cardinal, and the second string, having a suffix, is identified as a measure: the ordinary characters are tagged as char { value: ... } tokens, with cardinal { value: "one thousand two hundred fifty-nine" } and measure { value: "thirty-five degrees Celsius" }. The final transcription result is: "The score is 1259, and the temperature now is 35°C".
For Example 8, "the result of the calculation is ..." is a declarative sentence that the rules judge to be a cardinal context, preventing "one point zero four" from being read ambiguously as a time. The intermediate label is cardinal { value: "one point zero four" }, and the final transcription result is: "The result of the calculation is 1.04".
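The decoder behaviour illustrated by Examples 1 through 8 — parsing the encoder's tagged text and removing the labels — can be sketched with a small regex-based parser. This is a simplified stand-in for the FST decoder: the token syntax follows the labels shown above, but the per-category verbalization rules are omitted.

```python
import re

# Illustrative decoder sketch: parse the encoder's tagged text, e.g.
#   fraction { denominator: "2" numerator: "1" }
# into (category, fields) tokens, then emit the field values in order.
TOKEN_RE = re.compile(r'(\w+)\s*\{([^}]*)\}')
FIELD_RE = re.compile(r'(\w+):\s*"([^"]*)"')

def parse_tagged(text):
    # Each match is one token: category name plus its quoted fields.
    return [(m.group(1), dict(FIELD_RE.findall(m.group(2))))
            for m in TOKEN_RE.finditer(text)]

def strip_tags(text):
    # Concatenate every field value; a real decoder would verbalize each
    # category with its own rules before concatenation.
    return "".join(v for _, fields in parse_tagged(text) for v in fields.values())

tagged = 'char { value: "capacity" } measure { value: "2399 yuan" }'
print(strip_tags(tagged))  # → capacity2399 yuan
```

Applied to the label of Example 3, the same parser recovers the fraction token with its denominator and numerator fields in one pass.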

Claims (7)

1. A method for realizing multi-language mixed text regularization and anti-regularization based on WFST, characterized by comprising the following steps:
step 1) researching the target languages, summarizing the use conditions of the numbers processed in the target languages, counting and classifying the numbers, and determining the encoding and decoding methods used by the target languages;
step 2) rule design is carried out on each classified category, meanwhile, priority ordering is carried out on different classifications according to the quantity counted in the step 1), and standard processing is carried out on the condition that ambiguity occurs in the same text;
step 3) generating a weighted fst file by using a WFST method, and optimizing the file to obtain an optimized fst file;
step 4) testing the optimized fst file;
step 5) identifying, matching and classifying the optimized fst file, and marking according to rules of different classes to obtain a text containing labels;
step 6) transferring the text containing the label;
step 7) sequencing the transcribed text obtained in the step 6);
and 8) decoding according to the decoding rule to obtain a decoded text, and outputting the decoded text.
2. The method for realizing multi-language mixed text regularization and anti-regularization based on WFST of claim 1, wherein in step 2), when designing the rules, it is considered that the same number may have different readings in different contexts, so when constructing the rules different weights are given to the different categories according to the context and the conditions under which the number occurs.
3. The method for realizing multi-language mixed text regularization and anti-regularization based on WFST of claim 1, wherein the pressure test in step 4) mainly tests the encoding and decoding speed and the size of the generated graph.
4. The method for realizing multi-language mixed text regularization and anti-regularization based on WFST of claim 1, wherein in step 4) a corresponding whitelist is added to deal with cases of extremely serious ambiguity or special vocabulary.
5. The method for realizing multi-language mixed text regularization and anti-regularization based on WFST of claim 1, wherein preprocessing is performed after the testing in step 4), including traditional-to-simplified conversion of Chinese characters, half-width/full-width conversion of punctuation marks, and deletion of spaces within Chinese text; and performing English case conversion and deleting redundant spaces.
6. The method for realizing multi-language mixed text regularization and anti-regularization according to claim 1, wherein in step 5), words that do not fall into any category are uniformly assigned to the "char" category and are not transcribed.
7. The method for realizing multi-language mixed text regularization and anti-regularization based on WFST according to claim 1, wherein step 8) cleans up redundant characters and redundant spaces.
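The claimed pipeline of steps 5) through 8) — classify and tag, transcribe per category, then order and decode — can be illustrated end to end with a minimal sketch. Plain regexes stand in here for the compiled WFST rules, and the category names and digit-by-digit verbalization are illustrative choices, not the patented implementation.

```python
import re

# Minimal end-to-end sketch of steps 5-8: digit runs become the
# "cardinal" category; everything else falls back to "char".
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def tag(text):
    # Step 5: identify, match, and classify the input into tagged tokens.
    tokens = []
    for m in re.finditer(r'\d+|\D+', text):
        cat = "cardinal" if m.group().isdigit() else "char"
        tokens.append((cat, m.group()))
    return tokens

def transcribe(tokens):
    # Step 6: transcribe each tagged token according to its category.
    out = []
    for cat, value in tokens:
        if cat == "cardinal":  # read digit by digit, as for phone numbers
            out.append(" ".join(DIGITS[d] for d in value))
        else:
            out.append(value)
    return out

def decode(parts):
    # Steps 7-8: merge the ordered pieces and remove redundant spaces.
    return re.sub(r'\s+', ' ', "".join(parts)).strip()

print(decode(transcribe(tag("dial 10086 now"))))
# → dial one zero zero eight six now
```

A full system would dispatch each category to its own weighted transducer instead of the single digit table used here.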
CN202311406988.7A 2023-10-27 2023-10-27 WFST-based method for realizing multi-language mixed text regularization and anti-regularization Active CN117131842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311406988.7A CN117131842B (en) 2023-10-27 2023-10-27 WFST-based method for realizing multi-language mixed text regularization and anti-regularization

Publications (2)

Publication Number Publication Date
CN117131842A true CN117131842A (en) 2023-11-28
CN117131842B CN117131842B (en) 2024-01-26

Family

ID=88860443

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110098999A1 (en) * 2009-10-22 2011-04-28 National Research Council Of Canada Text categorization based on co-classification learning from multilingual corpora
CN104102630A (en) * 2014-07-16 2014-10-15 复旦大学 Method for standardizing Chinese and English hybrid texts in Chinese social networks
US20200285706A1 (en) * 2019-03-04 2020-09-10 Salesforce.Com, Inc. Cross-Lingual Regularization for Multilingual Generalization
CN115547290A (en) * 2022-09-28 2022-12-30 慧言科技(天津)有限公司 Mixed reading voice synthesis method based on mixed text representation and speaker confrontation
CN116303966A (en) * 2023-03-27 2023-06-23 天津大学 Dialogue behavior recognition system based on prompt learning
CN116312540A (en) * 2023-03-21 2023-06-23 上海元梦智能科技有限公司 TTS text regularization method and device based on regular expression and WFST
CN116662484A (en) * 2023-05-29 2023-08-29 中国第一汽车股份有限公司 Text regularization method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ryohei Tanaka et al.: "Fast Distributional Smoothing for Regularization in CTC Applied to Text Recognition", 2019 International Conference on Document Analysis and Recognition *
孙修松 (Sun Xiusong): "Research and Application of Radio Speech Recognition Technology for Civil Aviation Air-Ground Communication", China Master's Theses Full-text Database, Engineering Science and Technology II *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant