CN114330286A - Text regularization method and related device, electronic equipment and storage medium


Info

Publication number
CN114330286A
Authority
CN
China
Prior art keywords
text, sub-text, attribute, target, target sub-text
Prior art date
Legal status
Pending
Application number
CN202111486205.1A
Other languages
Chinese (zh)
Inventor
储银雪
高丽
江源
Current Assignee
Xi'an Xunfei Super Brain Information Technology Co ltd
Original Assignee
Xi'an Xunfei Super Brain Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Xunfei Super Brain Information Technology Co ltd
Priority to CN202111486205.1A
Publication of CN114330286A

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text regularization method, a related device, an electronic device, and a storage medium. The text regularization method includes the following steps: analyzing a text to be regularized to obtain a target sub-text, where the target sub-text needs to be transcribed into a target language whose grammar involves gender, number, and case; identifying attribute categories of the target sub-text with respect to several attributes, where the several attributes include a gender-number-case attribute and the attribute categories of the gender-number-case attribute include the gender-number-case category of the target sub-text in the target language; transcribing the target sub-text into the target language based on the attribute categories of the target sub-text to obtain a regularized sub-text with the same semantics as the target sub-text; and obtaining a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text. With this scheme, the accuracy and convenience of text regularization for gender-number-case languages can be improved.

Description

Text regularization method and related device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text regularization method, and a related apparatus, an electronic device, and a storage medium.
Background
In the field of natural language processing, there are many application scenarios in which non-standard-form text needs to be converted into standard-form text in a target language. Taking speech synthesis as an example, in speech synthesis front-end processing, irregularly written texts such as numbers, times and dates, currency units, and special symbols need to be accurately converted into words of the target language, so that accurate front-end information is input to the speech synthesis process and an accurate speech synthesis result is obtained. For example, if the target language is Chinese, '123' needs to be transcribed into the Chinese for 'one hundred and twenty-three', '1/10' into the Chinese for 'one tenth', '8:00 am' into the Chinese for 'eight o'clock in the morning', and so on.
For text regularization in general languages, one approach performs text transcription with rules: a certain number of transcription rules are preset, and when a text matches a transcription rule, the text is regularized according to that rule. The other approach performs text transcription with an end-to-end model: after the text is input into the end-to-end model, the model directly outputs the regularized text in a machine-translation manner.
However, in languages whose grammar involves gender, number, and case, texts with the same semantics have different written forms and pronunciations under different gender-number-case forms. Simply performing text transcription with an end-to-end model or with rules cannot guarantee accuracy on the one hand, and on the other hand, owing to the diversity of written forms in such languages, the requirements on the quality and quantity of training data are more severe. How to improve the accuracy and convenience of text regularization for gender-number-case languages has therefore become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the application is to provide a text regularization method, a related device, an electronic device, and a storage medium, which can improve the accuracy and convenience of text regularization for gender-number-case languages.
In order to solve the above technical problem, a first aspect of the present application provides a text regularization method, including: analyzing a text to be regularized to obtain a target sub-text, wherein the text to be regularized consists of a plurality of sub-texts, the target sub-text is a sub-text that needs regularization processing, the target sub-text needs to be transcribed into a target language, and the grammar of the target language involves gender, number, and case; identifying attribute categories of the target sub-text with respect to several attributes, wherein the several attributes include a gender-number-case attribute, and the attribute categories of the gender-number-case attribute include the gender-number-case category of the target sub-text in the target language; transcribing the target sub-text into the target language based on the attribute categories of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text; and obtaining a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
In order to solve the above technical problem, a second aspect of the present application provides a text regularization apparatus, including a parsing module, an identification module, a transcription module, and an acquisition module. The parsing module is used for analyzing a text to be regularized to obtain a target sub-text, wherein the text to be regularized consists of a plurality of sub-texts, the target sub-text is a sub-text that needs regularization processing, the target sub-text needs to be transcribed into a target language, and the grammar of the target language involves gender, number, and case. The identification module is used for identifying attribute categories of the target sub-text with respect to several attributes, wherein the several attributes include a gender-number-case attribute, and the attribute categories of the gender-number-case attribute include the gender-number-case category of the target sub-text in the target language. The transcription module is used for transcribing the target sub-text into the target language based on the attribute categories of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text. The acquisition module is used for obtaining a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor, which are coupled to each other, wherein the memory stores program instructions, and the processor is configured to execute the program instructions to implement the text regularization method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being configured to implement the text regularization method in the first aspect.
According to the above scheme, the text to be regularized is analyzed to obtain the target sub-text that needs regularization processing, attribute categories of the target sub-text with respect to several attributes are identified, the target sub-text is then transcribed into the target language based on those attribute categories to obtain a regularized sub-text having the same semantics as the target sub-text, and finally the regularized text corresponding to the text to be regularized is obtained based on the regularized sub-text corresponding to the target sub-text. Since the transcription is guided by the identified attribute categories, including the gender-number-case category, the accuracy and convenience of text regularization for gender-number-case languages can be improved.
Drawings
FIG. 1 is a schematic flow chart diagram of an embodiment of a text regularization method of the present application;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a method for regularizing text of the present application;
FIG. 3 is a block diagram of an embodiment of a text regularization apparatus of the present application;
FIG. 4 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 5 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a text regularization method according to the present application.
Specifically, the method may include the steps of:
step S11: and analyzing the text to be normalized to obtain a target sub-text.
The text regularization method is used to convert a text to be regularized into text that conforms to the standard form of a target language. The target language may be a language whose grammar involves gender, number, and case, such as Arabic, Russian, or Polish. In such a target language, the written forms and pronunciations of nouns, pronouns, or adjectives may change according to their semantics or their position in the text; for example, the written form of a word differs when its suffix differs, so the same text may have different written forms and pronunciations under different gender-number-case forms. The irregularly written text to be regularized therefore needs to be accurately converted into text of the target language.
The text to be regularized is the text that needs regularization and consists of a plurality of sub-texts. A sub-text may be a text component consisting of numbers, symbols, words, or any combination thereof, such as numbers only, symbols only, or a combination of numbers and symbols. If a sub-text is in the standard form of the target language, it does not need regularization processing; if it is in a non-standard form of the target language, it needs regularization processing. The method for analyzing the text to be regularized may be any existing text analysis method, such as word segmentation or a neural network model, and is not specifically limited herein.
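As an illustration of this parsing step, the following is a minimal Python sketch (not the patented implementation): it splits a text into sub-texts and flags as targets those that are not already written in the target language. The tokenization pattern and the needs_regularization heuristic are assumptions introduced here for illustration only.

```python
import re

# Hypothetical splitter: numbers, letter runs, and single symbols become separate sub-texts.
TOKEN_RE = re.compile(r"\d+(?:[.,:/]\d+)*|[^\W\d_]+|[^\w\s]", re.UNICODE)

def parse_sub_texts(text: str) -> list[str]:
    """Split the text to be regularized into sub-texts."""
    return TOKEN_RE.findall(text)

def needs_regularization(sub_text: str, target_alphabet: str = "arabic") -> bool:
    """Rough heuristic: digits, Latin letters, and symbols are not standard-form
    text for an Arabic target language and thus need regularization."""
    if any(ch.isdigit() for ch in sub_text):
        return True
    if target_alphabet == "arabic":
        # Arabic letters occupy U+0600..U+06FF; anything outside is treated as non-standard here.
        return not all('\u0600' <= ch <= '\u06FF' for ch in sub_text)
    return False

def find_target_sub_texts(text: str) -> list[str]:
    return [s for s in parse_sub_texts(text) if needs_regularization(s)]

# Using the sub-texts mentioned in the description:
print(find_target_sub_texts("D 3200 1080 x 1920"))  # ['D', '3200', '1080', 'x', '1920']
```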
For example, the target language is Arabic, and there exists a text to be regularized (an Arabic sentence rendered as images in the original document). Its sub-texts "D", "3200", "1080", "x" and "1920" are not standard Arabic forms and are the target sub-texts that need regularization, while the remaining sub-texts are already Arabic words and do not need regularization.
Step S12: attribute categories of the target sub-text with respect to a number of attributes are identified.
Different sub-texts may have the same or different attributes, and the attribute categories of different sub-texts under the same attribute may also be the same or different. The several attributes include a gender-number-case attribute, a number-word attribute, a symbol attribute, and the like, and are not specifically limited herein.
In the case that the grammar of the target language involves gender, number, and case, the several attributes include a gender-number-case attribute, and the attribute categories of the gender-number-case attribute include the gender-number-case category of the target sub-text in the target language. Gender means that nouns, pronouns, and adjectives express, through changes in grammatical form, how the text and its referent are categorized; the gender categories may include feminine, masculine, and neuter. Number means that nouns, pronouns, and the like express, through changes in grammatical form, how many entities are referred to; the number categories may include singular, plural, and dual. Case is the grammatical position: nouns, pronouns, and the like express their relationship to other text components through changes in grammatical form. It will be appreciated that the gender-number-case representations of different languages are not consistent. For example, the number '1' in Arabic may correspond to 6 gender-number-case categories in total (the Arabic forms are rendered as images in the original document): feminine nominative, feminine accusative, feminine genitive, masculine nominative, masculine accusative, and masculine genitive.
In the case where the target sub-text is a number, the several attributes further include a number-word attribute, and the attribute category of the number-word attribute includes the number-word category of the target sub-text. In a disclosed embodiment, the number-word categories may include cardinal, ordinal, and string. A number therefore carries both a number-word category and a gender-number-case category, e.g. the 3 number-word categories (cardinal, ordinal, string) combined with the 6 gender-number-case categories (feminine nominative, feminine accusative, feminine genitive, masculine nominative, masculine accusative, masculine genitive; the Arabic forms are rendered as images in the original document), 18 categories in total, such as cardinal-feminine-nominative. The main difficulty of text regularization in a gender-number-case language lies in determining the number-word category of a number, and a numeric sub-text such as a time or a date may be classified as a combination of numbers or as a combination of the number-word categories of individual numbers. The specific transcription form of a number depends on a large number of rules, and because the gender-number-case often has to be determined from context, accuracy cannot be guaranteed by relying only on preset text rules or on an end-to-end model. In addition, owing to the diversity of numeric conversion forms in gender-number-case languages, an end-to-end model has higher requirements on data quantity and data quality in order to achieve the same effect as for a common language; in particular, when one number has multiple transcription forms, an end-to-end model faces more pressure in prediction. Existing text regularization research does not consider that numbers, when used as sub-texts, have multiple number-word categories, and does not treat numbers separately when realizing text regularization; by contrast, the present method can identify the number-word category of a number and then transcribe the text based on that category, so it can handle text regularization in which the same numeric sub-text has multiple transcription forms.
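To make the category space concrete, the following sketch enumerates the 3 number-word categories and the 6 gender-number-case categories described above and combines them into the 18 candidate labels for numeric sub-texts; the identifiers and label strings are assumptions for illustration, not the patent's notation.

```python
from enum import Enum
from itertools import product

class NumberWord(Enum):
    CARDINAL = "cardinal"
    ORDINAL = "ordinal"
    STRING = "string"

class GenderNumberCase(Enum):
    FEM_NOM = "feminine-nominative"
    FEM_ACC = "feminine-accusative"
    FEM_GEN = "feminine-genitive"
    MASC_NOM = "masculine-nominative"
    MASC_ACC = "masculine-accusative"
    MASC_GEN = "masculine-genitive"

# 3 number-word categories x 6 gender-number-case categories = 18 labels,
# e.g. "cardinal/feminine-nominative".
NUMBER_LABELS = [f"{n.value}/{g.value}" for n, g in product(NumberWord, GenderNumberCase)]
assert len(NUMBER_LABELS) == 18
```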
For example, in text regularization of the front-end text in speech synthesis, which written form should be used for pronunciation needs to be determined from the preceding and following content and the overall semantics, so the accuracy of the ideographic category of the target sub-text is particularly important. In the case where the target sub-text is a symbol, the several attributes further include a symbol attribute, and the attribute categories of the symbol attribute include the ideographic category of the target sub-text. For example, the ideographic category of the symbol "-" may be "minus" or the range "… to …"; the ideographic category of the symbol ":" may be "colon", "time point", or "score"; there are also currency units and other special symbols whose ideographic categories need to be determined from the specific context.
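A minimal sketch of the ambiguity described here: each symbol maps to several candidate ideographic (meaning) categories, and the correct one must be chosen from context, which is what the classification step decides. The category names in the table are illustrative assumptions.

```python
# Candidate ideographic categories per symbol; the final choice depends on context.
SYMBOL_CANDIDATES = {
    "-": ["minus", "range"],                 # "-5" vs "3-5"
    ":": ["colon", "time_point", "score"],   # "note:" vs "8:00" vs "2:1"
}

def candidate_ideographic_categories(symbol: str) -> list[str]:
    return SYMBOL_CANDIDATES.get(symbol, ["unknown"])

print(candidate_ideographic_categories(":"))  # ['colon', 'time_point', 'score']
```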
Step S13: and transferring the target sub-text into a target language based on the attribute category of the target sub-text to obtain a regularized sub-text with the same semantics as the target sub-text.
The attribute classification result is then transcribed into text of the target language through specific rules. For example, after the attribute categories of the target sub-text are determined, based on a mapping relationship between attribute categories and texts expressed in the target language, the text expressed in the target language that corresponds to the attribute categories is used as the regularized sub-text; in this way the target sub-text is transcribed into the target language based on its attribute categories, and a regularized sub-text having the same semantics as the target sub-text is obtained. Compared with an end-to-end model, whose essence in text regularization is translation, the present scheme classifies first and then transcribes, treating text regularization as a classification task, so it uses less data than an end-to-end model and at the same time avoids the unrecoverable errors that end-to-end models commonly make. The mapping relationship is the correspondence among sub-texts, attribute categories, and regularized sub-texts expressed in the target language.
In a disclosed embodiment, transcribing the target sub-text into the target language based on the attribute categories of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text specifically includes the following steps: querying a transcription rule set to obtain a first sub-text that satisfies a matching condition with the target sub-text, wherein the transcription rule set includes several sub-text pairs, each sub-text pair includes a first sub-text and a second sub-text with the same semantics, and the second sub-text is expressed in the target language according to the attribute categories of the first sub-text; and taking the second sub-text in the sub-text pair to which the first sub-text belongs as the regularized sub-text of the target sub-text. The matching condition may be that the first sub-text and the target sub-text have the same semantics and exactly the same attribute categories, or that the first sub-text and the target sub-text have the same attribute categories.
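The query of the transcription rule set can be pictured as a keyed lookup, sketched below under the assumption that a sub-text pair is stored as ((first sub-text, attribute-category label), second sub-text); the table contents are placeholders, not real Arabic forms.

```python
from typing import Optional

# Each entry: (first sub-text, attribute-category label) -> second sub-text
# expressed in the target language. The values here are placeholders.
TRANSCRIPTION_RULES = {
    ("1", "cardinal/feminine-nominative"): "<Arabic form of 'one', fem. nom.>",
    ("1", "cardinal/masculine-accusative"): "<Arabic form of 'one', masc. acc.>",
}

def transcribe(target_sub_text: str, category: str) -> Optional[str]:
    """Return the regularized sub-text whose first sub-text matches the target
    sub-text (same semantics) and whose attribute category is identical."""
    return TRANSCRIPTION_RULES.get((target_sub_text, category))

print(transcribe("1", "cardinal/feminine-nominative"))
```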
For text regularization in a general language, the prior art performs text transcription with preset text rules: a certain number of transcription rules are preset, and a text is regularized according to a transcription rule when it matches that rule. When text transcription is performed this way, the transcription rules must be preset and each sub-text has only one regularized form; for example, the regularized sub-text corresponding to the sub-text '1' is only 'one', and the regularized sub-text corresponding to the sub-text '2' is only 'two'. In a language whose grammar involves gender, number, and case, however, one sub-text often corresponds to multiple forms and needs to be transcribed into multiple regularized sub-texts. For example, when the Arabic sub-text is the number '1', there may be 6 different regularized sub-texts (the Arabic forms are rendered as images in the original document): feminine nominative, feminine accusative, feminine genitive, masculine nominative, masculine accusative, and masculine genitive, with each attribute category corresponding to one regularized sub-text. Text transcription with preset text rules cannot handle this, whereas determining the regularized sub-text based on the attribute categories of the target sub-text suits the scenario in which the same sub-text corresponds to multiple regularized sub-texts. Text regularization in a gender-number-case language that relies only on preset text rules has the following limitations: first, no universal rule can be found for the number-word category, the ideographic category, and the like of a sub-text, so rules must be customized individually and a large number of rules is required; second, when an unseen text is encountered or no corresponding rule is included, errors often occur, and the case of a text is often determined by its specific semantics, for which no applicable rule can be found, so the generalization and accuracy of the rules are poor; third, when the same text has multiple transcription forms, rules alone cannot transcribe it. The prior art also performs text transcription with an end-to-end model: after the text is input into the end-to-end model, the model directly outputs the regularized text in a machine-translation manner, but owing to the limitations of the deep learning network and of the amount of training data, unrecoverable errors often occur, for example the numeric sub-text '123' being transcribed as 'one hundred thirty-two thousand'. The present application determines the attribute categories of the sub-text and then transcribes it into the target language using those attribute categories, classifying first and transcribing second, which solves the problems of performing text transcription only by rules or only by an end-to-end model.
Therefore, by identifying the attribute categories of the target sub-text with respect to several attributes and then transcribing the target sub-text into the target language based on those attribute categories to obtain a regularized sub-text with the same semantics as the target sub-text, the problem that preset text rules and end-to-end models in the prior art cannot perform regularized transcription according to the context of the text can be solved.
Step S14: and obtaining a regular text corresponding to the text to be regular based on the regular sub-text corresponding to the target sub-text.
The target sub-text is a sub-text in the text to be regularized that needs regularization processing, and the regularized text corresponding to the text to be regularized can be obtained based on the regularized sub-text corresponding to the target sub-text. For example, the target sub-text in the text to be regularized is replaced with its corresponding regularized sub-text, and the regularized text is thus obtained.
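Step S14 amounts to a substitution pass over the text to be regularized; a minimal sketch follows, in which the data structures (positional indexing, space joining) are assumptions for illustration.

```python
def assemble_regularized_text(sub_texts: list[str],
                              regularized: dict[int, str]) -> str:
    """Replace each target sub-text (keyed by its position) with its
    regularized sub-text; other sub-texts are kept as written."""
    out = [regularized.get(i, s) for i, s in enumerate(sub_texts)]
    return " ".join(out)

# '1080' (index 2) was transcribed; the rest stays unchanged.
print(assemble_regularized_text(["D", "3200", "1080"], {2: "<Arabic form>"}))
```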
According to the above scheme, the text to be regularized is analyzed to obtain the target sub-text that needs regularization processing, attribute categories of the target sub-text with respect to several attributes are identified, the target sub-text is then transcribed into the target language based on those attribute categories to obtain a regularized sub-text having the same semantics as the target sub-text, and finally the regularized text corresponding to the text to be regularized is obtained based on the regularized sub-text corresponding to the target sub-text. Since the transcription is guided by the identified attribute categories, including the gender-number-case category, the accuracy and convenience of text regularization for gender-number-case languages can be improved.
Text regularization for a gender-number-case language mainly determines, in context, the attribute categories of a non-standard-form target sub-text, namely its gender (feminine, masculine, neuter), number (singular, plural, dual), case, number-word category, and ideographic category, then determines the transcription form according to these attribute categories to obtain a regularized sub-text with the same semantics as the target sub-text, and thereby obtains a regularized text expressed in the target language by classifying first and transcribing second. In order to further improve the accuracy and convenience of text regularization for gender-number-case languages, in the embodiment of the present disclosure an attribute classification network may be used to determine the attribute categories; refer specifically to fig. 2, which is a schematic flowchart of another embodiment of the text regularization method of the present application. Specifically, the method may include the following steps:
step S21: and analyzing the text to be normalized to obtain a plurality of sub-texts.
The method for analyzing the text to be regularized may be any existing implementation and is not specifically limited herein. The text to be regularized is analyzed to obtain a plurality of sub-texts.
Step S22: and respectively carrying out rule matching on the plurality of sub texts by utilizing the regular rule set of the target language to obtain the target sub text.
The regular rule set includes at least one of a first subset and a second subset: the first subset includes several text rules for sub-texts that do not need regularization processing, and the second subset includes several text rules for sub-texts that do need regularization processing. For example, in a disclosed embodiment, the regular rule set includes both the first subset and the second subset; rule matching is performed between the several sub-texts and the first and second subsets, the sub-texts that successfully match the second subset are taken as target sub-texts, and the sub-texts that successfully match the first subset are discarded and not taken as target sub-texts. Where the regular rule set includes only the first subset, rule matching is performed between the several sub-texts and the first subset, and the sub-texts that fail to match are taken as target sub-texts. Where the regular rule set includes only the second subset, rule matching is performed between the several sub-texts and the second subset, and the sub-texts that successfully match are taken as target sub-texts. The regular rule set is a set of basic regularization rules established for the target language; a text rule is included in the set when it alone can uniquely determine whether a sub-text needs regularization processing. The text rules in the first subset and the second subset may be set by the user and are not specifically limited herein.
Therefore, when analyzing the text to be regularized to obtain the target sub-text, the text to be regularized can first be parsed into a plurality of sub-texts, and rule matching can then be performed on those sub-texts with the regular rule set of the target language to obtain the target sub-text. Sub-texts corresponding to the text rules that do not need regularization processing are filtered out in advance and do not go through the subsequent steps S23 to S24, which reduces the amount of data to be processed and improves the efficiency of text regularization.
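A sketch of this pre-filtering with the regular rule set follows, assuming the first and second subsets are given as lists of regular expressions; the patterns shown are illustrative, not the patent's rules.

```python
import re

# First subset: patterns for sub-texts that need no regularization
# (e.g. text already written in Arabic letters).
FIRST_SUBSET = [re.compile(r"^[\u0600-\u06FF]+$")]
# Second subset: patterns for sub-texts that do need regularization
# (e.g. digits, Latin letters, stand-alone symbols).
SECOND_SUBSET = [
    re.compile(r"^\d[\d.,:/]*$"),
    re.compile(r"^[A-Za-z]+$"),
    re.compile(r"^[^\w\s]$"),
]

def select_target_sub_texts(sub_texts: list[str]) -> list[str]:
    targets = []
    for s in sub_texts:
        if any(p.match(s) for p in FIRST_SUBSET):
            continue                      # matched: no regularization needed
        if any(p.match(s) for p in SECOND_SUBSET):
            targets.append(s)             # matched: needs regularization
    return targets
```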
Step S23: and identifying attribute categories of the target sub-text about a plurality of attributes, wherein the attribute categories of the plurality of attributes are obtained by processing the standard text by using an attribute classification network and predicting.
In the embodiment of the present disclosure, the attribute categories of the several attributes are predicted by processing the text to be regularized with an attribute classification network. Text regularization assisted by an attribute classification network does not, as in the prior art, have an end-to-end model output the regularized text in a machine-translation manner; instead, the gender-number-case attribute, the number-word attribute, the symbol attribute, and the like are treated as categories for the network to predict, and text transcription is then performed with specific transcription rules in a subsequent step.
The attribute classification network is trained with sample texts, and the sample sub-texts in a sample text are annotated with sample labels. When a sample sub-text needs regularization processing, its sample label includes the sample attribute categories of the sample sub-text with respect to the several attributes; when a sample sub-text does not need regularization processing, its sample label is a preset label. The preset label can be defined as desired, for example <self>. The attributes of a sample sub-text include the gender-number-case attribute, or any combination of the gender-number-case attribute, the number-word attribute, and the symbol attribute. Where the attributes of the sample sub-text include the gender-number-case attribute, the sample attribute categories of the sample label include the gender-number-case category of the sample sub-text in the target language; for example, one Arabic form of the number '1' (rendered as an image in the original document) corresponds to the feminine-nominative category. Where the attributes of the sample sub-text include the number-word attribute, the sample attribute categories include the number-word category of the sample sub-text in the target language. Where the attributes of the sample sub-text include the symbol attribute, the sample attribute categories include the ideographic category of the sample sub-text in the target language; for example, the ideographic category of the symbol "-" is "minus". When obtaining sample texts, the corresponding sample sub-texts can be crawled from a large corpus and then labeled manually: sample sub-texts with specific, determined semantics such as time points and dates only need to be labeled as a whole, other numbers need to be labeled with the number-word category and the gender-number-case category, and symbols need to be labeled with the ideographic category and the gender-number-case category.
The attribute classification network can be a Recurrent Neural Network (RNN) or another network used as a single-task classification model; several parallel networks can also be used as a multi-task model, i.e., multiple networks classify different attributes separately so that attribute classification is performed in parallel. The main structure of the network can be a two-layer bidirectional Long Short-Term Memory (LSTM) network: the target sub-text is used as the input of the attribute classification network, passes through an embedding layer and the two-layer bidirectional LSTM, and the attribute categories corresponding to the target sub-text are finally output. The composition of the attribute classification network can be customized as needed, as long as the attribute classification task can be realized.
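A minimal PyTorch sketch of such a single-task classifier (embedding layer followed by a two-layer bidirectional LSTM and a linear output) is given below; the layer sizes, the mean-pooling, and the character-level input encoding are assumptions introduced for illustration, not parameters from the patent.

```python
import torch
import torch.nn as nn

class AttributeClassifier(nn.Module):
    def __init__(self, vocab_size: int, num_classes: int,
                 emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Two-layer bidirectional LSTM, as described for the main structure.
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)          # (batch, seq, emb_dim)
        h, _ = self.lstm(x)                # (batch, seq, 2*hidden)
        pooled = h.mean(dim=1)             # pool over the sequence
        return self.out(pooled)            # (batch, num_classes)

# e.g. 18 number labels + symbol labels + a <self> class (count is illustrative)
model = AttributeClassifier(vocab_size=10000, num_classes=32)
logits = model(torch.randint(0, 10000, (4, 12)))   # batch of 4 encoded sub-texts
```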
In order to enhance the performance of the attribute classification network, appropriate auxiliary training information may be added to the network input used to train it, for example N-gram character information of the sample sub-text, or semantic information of the sample sub-text from a pre-trained model. In a disclosed embodiment, the training step of the attribute classification network includes: obtaining a vectorized first embedded representation of the sample sub-text, obtaining the N-gram character information of the sample sub-text, and obtaining a vectorized second embedded representation of the N-gram character information; fusing the first embedded representation and the second embedded representation to obtain a fused embedded representation of the sample sub-text; performing classification prediction on the fused embedded representation with the attribute classification network to obtain a predicted label of the sample sub-text, where the predicted label is the preset label or the predicted attribute categories of the sample sub-text with respect to the several attributes; and adjusting the network parameters of the attribute classification network based on the difference between the sample label and the predicted label. As another example, in a disclosed embodiment, the sample sub-text is first input into a pre-trained model to obtain its semantic information on the pre-trained model, a vectorized embedded representation of that semantic information is then obtained, classification prediction is performed on the embedded representation with the attribute classification network to obtain the predicted label of the sample sub-text, and similarly the network parameters of the attribute classification network are adjusted based on the difference between the sample label and the predicted label. The preset label can be defined as desired and is not specifically limited herein. In order to represent the sample sub-texts that do not need regularization processing in the sample text, the preset label of such sample sub-texts is marked as <self>.
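A sketch of one training step with the N-gram auxiliary information fused into the input follows; the fusion by concatenation, all sizes, and the aligned N-gram sequence length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusedAttributeClassifier(nn.Module):
    """Fuses the sub-text embedding with an embedding of its N-gram
    character information before classification."""
    def __init__(self, vocab_size: int, ngram_vocab_size: int,
                 num_classes: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, emb_dim)
        self.ngram_embed = nn.Embedding(ngram_vocab_size, emb_dim)
        self.lstm = nn.LSTM(2 * emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, char_ids, ngram_ids):
        fused = torch.cat([self.char_embed(char_ids),
                           self.ngram_embed(ngram_ids)], dim=-1)  # fused embedded representation
        h, _ = self.lstm(fused)
        return self.out(h.mean(dim=1))

model = FusedAttributeClassifier(10000, 50000, num_classes=32)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

char_ids = torch.randint(0, 10000, (4, 12))
ngram_ids = torch.randint(0, 50000, (4, 12))
labels = torch.randint(0, 32, (4,))      # sample labels, including a <self> class

optimizer.zero_grad()
loss = criterion(model(char_ids, ngram_ids), labels)  # difference between sample and predicted labels
loss.backward()
optimizer.step()                                      # adjust network parameters
```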
In a disclosed embodiment, after the regular rule set of the target language is used to perform rule matching on the several sub-texts to obtain the target sub-texts, and before the attribute categories of the various attributes are detected to obtain detection results, the preset text rules of the target language can be matched against the several target sub-texts, so that a target sub-text that successfully matches a preset text rule is transcribed into a regularized sub-text expressed in the target language by that preset text rule. For target sub-texts whose transcription can be realized with the preset text rules, attribute categories therefore do not need to be identified and the subsequent steps S23 to S24 do not need to be executed; the preset text rules are used directly for regularization, while the target sub-texts that fail to match any preset text rule are input into the attribute classification network. The preset text rules may be any of the transcription rules preset in the prior art for rule-based text transcription. For example, if the standardized conversion form corresponding to "xx:xx" in the preset text rules is a time point and the sub-text is "08:00", the text regularization result of this sub-text may be "eight o'clock" according to the preset text rule. In addition, in order to increase extensibility, rule methods based on finite-state transducers and on weighted finite-state transducers have also been developed, for which reference can be made to the prior art; they are not described in detail herein. Because regularization with preset text rules requires formulating rules according to the grammar of the target language, it can solve most text regularization problems for languages without gender-number-case, but it cannot handle texts with more than one possible transcription, and too many rules bring maintenance and flexibility drawbacks; therefore, text regularization can be performed on the sub-texts that need it by combining the two approaches of preset text rules and the attribute classification network. Specifically, for transcriptions that can be determined directly from the preset text rules, the attribute classification network is not needed, which avoids the unrecoverable errors caused by the limitations of deep networks and of the amount of training data described above and reduces the influence of classification errors; for sub-texts that cannot be regularized with the preset text rules, the attribute classification network processes the text to be regularized to predict the attribute categories, and the target sub-text is then transcribed into the target language based on its attribute categories to obtain a regularized sub-text with the same semantics as the target sub-text.
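The combination of preset text rules with the attribute classification network described above can be sketched as a simple dispatch; the rule table and the classifier and transcription interfaces are illustrative assumptions, not the patent's implementation.

```python
import re

# Preset text rules: pattern -> transcription function (placeholders).
PRESET_RULES = [
    (re.compile(r"^\d{1,2}:\d{2}$"), lambda s: f"<time-point reading of {s}>"),
    (re.compile(r"^[A-Za-z]$"), lambda s: f"<letter name of {s}>"),
]

def regularize_sub_text(sub_text: str, classifier, transcribe) -> str:
    # 1) Try the preset text rules first.
    for pattern, rule in PRESET_RULES:
        if pattern.match(sub_text):
            return rule(sub_text)
    # 2) Otherwise predict the attribute category and transcribe by category.
    category = classifier(sub_text)      # e.g. "cardinal/feminine-nominative"
    return transcribe(sub_text, category)
```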
In an application scenario, the target language is Arabic, and there is a text to be regularized (an Arabic sentence rendered as images in the original document). The sub-texts "D", "3200", "1080", "x" and "1920" are not standard Arabic forms and are the target sub-texts that need regularization, while the remaining sub-texts are already Arabic and need no regularization. For "D", the preset text rules can determine that it is pronounced as an English letter, so there is no need to use the attribute classification network for attribute category judgment and subsequent transcription. For the 3 numbers "3200", "1080" and "1920", the attribute classification network judges whether the number-word category is a string, a cardinal, or an ordinal, and determines the specific gender-number-case category; the multiplication sign "x" also needs the attribute classification network to judge its specific gender-number-case category. When training the attribute classification network with the target language being Arabic and this text to be regularized, the label of "D" is the preset label <self>, the labels of "3200", "1080" and "1920" are the labels corresponding to their number-word categories and gender-number-case categories, and the label of "x" is the label corresponding to its gender-number-case category, so that a sub-text corresponding to the <self> label keeps its original written form without conversion, while for the labels of numbers and symbols the regularized sub-text with the same semantics as the target sub-text replaces the original written form.
Step S24: and respectively detecting the attribute categories of various attributes to obtain detection results, judging whether the attribute categories including the attributes are correctly identified or not according to the detection results, and replacing the attribute categories corresponding to the attributes with correction categories in response to the fact that the attribute categories including the attributes are incorrectly identified according to the detection results.
In order to avoid interference between the classification results of target sub-texts with different attributes (for example, the "x" multiplication sign should only carry a gender-number-case category such as a feminine case, but may be given a combination of a number-word category and a gender-number-case category such as cardinal plus feminine), the classification results are examined after the attribute categories of the target sub-text with respect to the several attributes have been identified, and unreasonable classification results are filtered out, which improves the accuracy of the attribute categories. Specifically, after identifying the attribute categories of the target sub-text with respect to the several attributes, and before transcribing the target sub-text into the target language based on those attribute categories to obtain a regularized sub-text with the same semantics as the target sub-text, the attribute categories of the various attributes are detected respectively to obtain detection results, where a detection result includes whether the attribute category of an attribute is correctly identified; in response to a detection result including that the attribute category of an attribute is incorrectly identified, the attribute category of the corresponding attribute is replaced with a corrected category. Taking as an example the "x" sign that should only carry a gender-number-case category but is given a cardinal plus feminine combination, the attribute categories of the various attributes are detected to obtain detection results, and after the attribute category identification is found to be incorrect, the combination is replaced with the corrected category, i.e. the gender-number-case category alone.
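The category check of step S24 can be sketched as a validation table of admissible attribute kinds per sub-text kind; the table contents and the correction policy below are assumptions for illustration, not the patent's rules.

```python
# Which attribute kinds a sub-text kind may legitimately carry.
ALLOWED = {
    "symbol": {"gender_number_case", "ideographic"},   # no number-word category
    "number": {"gender_number_case", "number_word"},
}

def detect_and_correct(sub_text_kind: str, predicted: dict) -> dict:
    """Drop attribute categories that are unreasonable for this kind of
    sub-text and keep the rest as the corrected categories."""
    allowed = ALLOWED.get(sub_text_kind, set(predicted))
    return {k: v for k, v in predicted.items() if k in allowed}

# The 'x' sign was wrongly given a number-word category as well:
pred = {"number_word": "cardinal", "gender_number_case": "feminine-nominative"}
print(detect_and_correct("symbol", pred))  # only the gender-number-case category remains
```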
Step S24 may be performed as needed, that is, step S24 may not be performed in other disclosed embodiments.
Step S25: and transferring the target sub-text into a target language based on the attribute category of the target sub-text to obtain a regularized sub-text with the same semantics as the target sub-text.
The description of step S25 can refer to step S13 in the embodiment of fig. 1, and is not repeated here.
Step S26: and obtaining a regular text corresponding to the text to be regular based on the regular sub-text corresponding to the target sub-text.
The description of step S26 can refer to step S14 in the embodiment of fig. 1 and is not repeated here. For the above-mentioned embodiment, after the regular rule set of the target language is used to perform rule matching on the several sub-texts to obtain the target sub-texts, and before the attribute categories of the various attributes are detected to obtain detection results, the preset text rules of the target language can be matched against the several target sub-texts, and a target sub-text that successfully matches a preset text rule is transcribed into a regularized sub-text expressed in the target language by that preset text rule. In summary, the target sub-texts are transcribed into the target language either by the preset text rules or based on the attribute categories of the target sub-texts, regularized sub-texts having the same semantics as the target sub-texts are obtained, and the regularized text corresponding to the text to be regularized is then obtained.
In the above scheme, rule matching is performed on the several sub-texts of the text to be regularized with the regular rule set of the target language to obtain the target sub-text, and the sub-texts that do not need regularization processing are filtered out in advance, which reduces the amount of data to be processed and improves the efficiency of text regularization; when identifying the attribute categories of the target sub-text with respect to the several attributes, the attribute categories are predicted by processing the text to be regularized with the attribute classification network, which can improve the efficiency and accuracy of determining the attribute categories; and the attribute categories of the various attributes are detected to obtain detection results indicating whether each attribute category is correctly identified, and an incorrectly identified attribute category is replaced with a corrected category in response, which avoids interference between the classification results of target sub-texts with different attributes.
Referring to fig. 3, fig. 3 is a schematic diagram of a framework of an embodiment of a text regularization device 30 of the present application. The text regularization device 30 includes a parsing module 31, an identification module 32, a transcription module 33, and an acquisition module 34. The parsing module 31 is used for analyzing the text to be regularized to obtain a target sub-text, wherein the text to be regularized consists of a plurality of sub-texts, the target sub-text is a sub-text that needs regularization processing, the target sub-text needs to be transcribed into a target language, and the grammar of the target language involves gender, number, and case. The identification module 32 is used for identifying attribute categories of the target sub-text with respect to several attributes, wherein the several attributes include a gender-number-case attribute, and the attribute categories of the gender-number-case attribute include the gender-number-case category of the target sub-text in the target language. The transcription module 33 is configured to transcribe the target sub-text into the target language based on the attribute categories of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text. The acquisition module 34 is configured to obtain a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
According to the above scheme, the parsing module 31 analyzes the text to be regularized to obtain the target sub-text that needs regularization processing, the identification module 32 identifies the attribute categories of the target sub-text with respect to several attributes, the transcription module 33 transcribes the target sub-text into the target language based on those attribute categories to obtain a regularized sub-text having the same semantics as the target sub-text, and the acquisition module 34 obtains the regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
In some disclosed embodiments, in the case where the target sub-text is a number, the several attributes further include a number-word attribute, and the attribute category of the number-word attribute includes the number-word category of the target sub-text.
Therefore, in the case where the target sub-text is a number, the several attributes further include the number-word attribute and the attribute category of the number-word attribute includes the number-word category of the target sub-text, which further enriches the attribute information of the target sub-text and improves the accuracy of the subsequent transcription into the target language.
In some disclosed embodiments, in a case where the target sub-text is a symbol, the plurality of attributes further include a symbol attribute, and the attribute class of the symbol attribute includes an ideographic class of the target sub-text.
Therefore, under the condition that the target sub-text is a symbol, the plurality of attributes further comprise symbol attributes, and the attribute categories of the symbol attributes comprise ideographic categories of the target sub-text, so that the attribute information of the target sub-text can be further enriched, and the accuracy of the subsequent transcription into the target language is improved.
In some disclosed embodiments, the attribute categories of the several attributes are predicted by processing the text to be regularized with an attribute classification network, the attribute classification network is obtained by training on sample texts, and the sample sub-texts in a sample text are annotated with sample labels; when a sample sub-text needs regularization processing, its sample label includes the sample attribute categories of the sample sub-text with respect to the attributes, and when a sample sub-text does not need regularization processing, its sample label is a preset label.
Therefore, the attribute classification network is used to assist text regularization: attributes such as the gender-number-case attribute, the number-word attribute, and the symbol attribute are treated as categories for the network to predict, which skillfully converts the transcription problem into a classification problem and can improve the accuracy and convenience of text regularization for gender-number-case languages. In addition, the sample labels used to train the attribute classification network include both the preset label and the sample attribute categories with respect to the several attributes, so the labels can not only distinguish whether regularization processing is needed but also handle both general languages and gender-number-case languages.
In some disclosed embodiments, the training of the attribute classification network includes: acquiring a vectorized first embedded representation of the sample sub-text, acquiring N-gram character information of the sample sub-text, and acquiring a vectorized second embedded representation of the N-gram character information; fusing the first embedded representation and the second embedded representation to obtain a fused embedded representation of the sample sub-text; performing classification prediction on the fused embedded representation with the attribute classification network to obtain a predicted label of the sample sub-text, wherein the predicted label is the preset label or the predicted attribute categories of the sample sub-text with respect to the plurality of attributes; and adjusting a network parameter of the attribute classification network based on a difference between the sample label and the predicted label.
Therefore, the first embedded representation obtained by vectorizing the sample sub-text and the second embedded representation obtained by vectorizing the N-gram character information are fused, and classification prediction is performed on the fused embedded representation with the attribute classification network to realize the training of the attribute classification network; since appropriate auxiliary training information is added to the network input used to train the attribute classification network, the performance of the attribute classification network can be enhanced.
In some disclosed embodiments, the parsing module 31, when parsing the text to be regularized to obtain the target sub-text, is further configured to analyze the text to be regularized to obtain the plurality of sub-texts, and to perform rule matching on the plurality of sub-texts respectively with the regular rule set of the target language to obtain the target sub-text; the regular rule set includes at least one of a first subset and a second subset, the first subset includes a plurality of text rules for sub-texts that do not need regularization processing, and the second subset includes a plurality of text rules for sub-texts that do need regularization processing.
Therefore, when analyzing the text to be regularized to obtain the target sub-text, the text to be regularized can be analyzed to obtain a plurality of sub-texts, rule matching can then be performed on the plurality of sub-texts with the regular rule set of the target language to obtain the target sub-text, and the sub-texts corresponding to the text rules that do not need regularization processing are filtered out in advance without going through the subsequent classification and transcription, which reduces the amount of data to be processed and improves the efficiency of text regularization.
In some disclosed embodiments, the transcription module 33 is configured to, based on the attribute category of the target sub-text, transcribe the target sub-text into the target language, and when obtaining a regularized sub-text having the same semantic as the target sub-text, perform query in a transcription rule set to obtain a first sub-text that meets a matching condition with the target sub-text; the transcription rule set comprises a plurality of sub-text pairs, the sub-text pairs comprise a first sub-text and a second sub-text with the same semantics, and the second sub-text is represented in the target language according to the attribute category of the first sub-text; and taking the second sub-text in the sub-text pair to which the first sub-text belongs as the regularized sub-text of the target sub-text.
Therefore, given that the second sub-text is expressed in the target language according to the attribute categories of the first sub-text, querying the transcription rule set for the first sub-text that satisfies the matching condition with the target sub-text means that the second sub-text in the sub-text pair to which that first sub-text belongs can be taken as the regularized sub-text of the target sub-text. When the regularized sub-text is determined based on the attribute categories of the target sub-text in this way, the transcription problem is skillfully converted into a classification problem, with classification performed first and transcription second, so the method suits scenarios in which the same sub-text corresponds to multiple regularized sub-texts and solves the problem that preset text rules and end-to-end models in the prior art cannot perform text transcription according to context.
In some disclosed embodiments, the matching condition comprises: the first sub-text and the target sub-text are identical in semantic meaning and identical in attribute category.
Therefore, the first sub-text which has the same semantics and the same attribute type as the target sub-text can be found through the matching conditions, and then the first sub-text is used for subsequently determining the regularized sub-text, and the unique corresponding relation is determined by utilizing the matching conditions, so that the transcription accuracy is improved.
In some disclosed embodiments, after the identifying the attribute categories of the target sub-text with respect to the plurality of attributes, and before the converting the target sub-text into the target language based on the attribute categories of the target sub-text to obtain the regularized sub-text having the same semantics as the target sub-text, the text regularizing device 30 is further configured to detect the attribute categories of the various attributes respectively to obtain detection results; wherein the detection result comprises whether the attribute category of the attribute is correctly identified; in response to the detection result including that the attribute class identification of the attribute is incorrect, replacing the attribute class corresponding to the attribute with a correction class.
Therefore, by detecting the attribute categories of the various attributes and, when the detection result indicates that an attribute category is incorrectly identified, replacing the attribute category of the corresponding attribute with a corrected category, the interference between the classification results of target sub-texts with different attributes can be avoided.
In some disclosed embodiments, the obtaining module 34 is configured to, when obtaining the regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text, replace the target sub-text in the text to be regularized with the regularized text corresponding to the target sub-text to obtain the regularized text.
In this way, since the target sub-text is the sub-text in the text to be regularized that needs regularization processing, once the regularized sub-text corresponding to the target sub-text is obtained, replacing the target sub-text in the text to be regularized with that regularized sub-text yields the regularized text quickly.
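As a small illustrative sketch (span-based replacement is an assumption; the embodiment only states that the target sub-text is replaced by its regularized sub-text), the assembly step could be written as:

```python
def assemble_regularized_text(text: str, replacements) -> str:
    """replacements: (start, end, regularized_sub_text) spans of the target
    sub-texts found during parsing; substituting from the end keeps the
    earlier character offsets valid."""
    for start, end, regularized in sorted(replacements, reverse=True):
        text = text[:start] + regularized + text[end:]
    return text

print(assemble_regularized_text("Встреча в 8:00", [(10, 14, "восемь часов")]))
# -> Встреча в восемь часов
```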
Referring to fig. 4, fig. 4 is a schematic block diagram of an embodiment of an electronic device 40 according to the present application. The electronic device 40 includes a memory 41 and a processor 42 coupled to each other, the memory 41 stores program instructions, and the processor 42 executes the program instructions to implement the steps in any of the above text regularization method embodiments. Specifically, the electronic device 40 may include, but is not limited to, a desktop computer, a notebook computer, a server, a mobile phone, a tablet computer, and the like.
In particular, the processor 42 is configured to control itself and the memory 41 to implement the steps in any of the above text regularization method embodiments. The processor 42 may also be referred to as a CPU (Central Processing Unit). The processor 42 may be an integrated circuit chip with signal processing capability. The processor 42 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 42 may be implemented jointly by integrated circuit chips.
According to the above scheme, the text to be regularized is parsed to obtain the target sub-text that needs regularization processing, the attribute categories of the target sub-text with respect to a plurality of attributes are identified, the target sub-text is then transcribed into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text, and finally the regularized text corresponding to the text to be regularized is obtained based on the regularized sub-text corresponding to the target sub-text.
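Putting the pieces together, the overall flow can be pictured with the toy end-to-end sketch below; the regex-based parsing, the keyword-based classification, and the tiny rule table are stand-ins for the regularization rule set and the attribute classification network, not their actual implementations.

```python
import re

def parse_targets(text):
    # Stand-in for parsing: every digit run is treated as a target sub-text.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+", text)]

def classify_attributes(text, sub_text):
    # Stand-in for the attribute classification network: a crude context cue.
    return ("genitive",) if "из" in text else ("nominative",)

RULES = {("2", ("nominative",)): "два", ("2", ("genitive",)): "двух"}

def regularize(text):
    replacements = []
    for start, end, target in parse_targets(text):                 # 1. parse
        categories = classify_attributes(text, target)             # 2. classify
        regularized = RULES.get((target, categories), target)      # 3. transcribe
        replacements.append((start, end, regularized))
    for start, end, regularized in sorted(replacements, reverse=True):
        text = text[:start] + regularized + text[end:]             # 4. assemble
    return text

print(regularize("один из 2"))  # -> один из двух
```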
Referring to fig. 5, fig. 5 is a block diagram illustrating an embodiment of a computer-readable storage medium 50 according to the present application. The computer readable storage medium 50 stores program instructions 51 executable by the processor, the program instructions 51 for implementing the steps in any of the above-described text regularization method embodiments.
According to the above scheme, the text to be regularized is parsed to obtain the target sub-text that needs regularization processing, the attribute categories of the target sub-text with respect to a plurality of attributes are identified, the target sub-text is then transcribed into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text, and finally the regularized text corresponding to the text to be regularized is obtained based on the regularized sub-text corresponding to the target sub-text.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical functional division, and an actual implementation may use another division; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical, or in another form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (13)

1. A method of text regularization, comprising:
parsing a text to be regularized to obtain a target sub-text; wherein the text to be regularized consists of a plurality of sub-texts, the target sub-text is a sub-text that needs regularization processing, the target sub-text needs to be transcribed into a target language, and the grammar of the target language involves grammatical number and case;
identifying attribute categories of the target sub-text with respect to a plurality of attributes; wherein the plurality of attributes include a case attribute, and the attribute category of the case attribute includes a case category of the target sub-text in the target language;
transcribing the target sub-text into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text; and
obtaining a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
2. The method according to claim 1, wherein, in the case that the target sub-text is a number, the plurality of attributes further include a number attribute, and the attribute category of the number attribute includes a number category of the target sub-text.
3. The method according to claim 1 or 2, wherein, in the case that the target sub-text is a symbol, the plurality of attributes further include a symbol attribute, and the attribute category of the symbol attribute includes an ideographic category of the target sub-text.
4. The method according to claim 1, wherein the attribute categories of the plurality of attributes are predicted by processing the text to be regularized with an attribute classification network, the attribute classification network is trained using a sample text, and sample sub-texts in the sample text are annotated with sample labels;
wherein, when a sample sub-text needs regularization processing, its sample label comprises sample attribute categories of the sample sub-text with respect to the plurality of attributes, and when a sample sub-text does not need regularization processing, its sample label is a preset label.
5. The method of claim 4, wherein the step of training the attribute classification network comprises:
acquiring a vectorized first embedded representation of the sample sub-text, acquiring N-gram character information of the sample sub-text, and acquiring a vectorized second embedded representation of the N-gram character information;
fusing the first embedded representation and the second embedded representation to obtain a fused embedded representation of the sample sub-text;
performing classification prediction on the fused embedded representation by using the attribute classification network to obtain a predicted label of the sample sub-text; wherein the predicted label is the preset label or predicted attribute categories of the sample sub-text with respect to the plurality of attributes;
adjusting network parameters of the attribute classification network based on a difference between the sample label and the predicted label.
6. The method according to claim 1, wherein the parsing the text to be regularized to obtain a target sub-text comprises:
parsing the text to be regularized to obtain a plurality of sub-texts; and
performing rule matching on the plurality of sub-texts respectively by using a regularization rule set of the target language to obtain the target sub-text;
wherein the regularization rule set comprises at least one of a first subset and a second subset, the first subset comprises a plurality of text rules that do not need regularization processing, and the second subset comprises a plurality of text rules that need regularization processing.
7. The method according to claim 1, wherein the transcribing the target sub-text into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text comprises:
inquiring in a transcription rule set to obtain a first sub-text meeting a matching condition with the target sub-text; the transcription rule set comprises a plurality of sub-text pairs, the sub-text pairs comprise a first sub-text and a second sub-text with the same semantics, and the second sub-text is represented in the target language according to the attribute category of the first sub-text;
and taking the second sub-text in the sub-text pair to which the first sub-text belongs as the regularized sub-text of the target sub-text.
8. The method of claim 7, wherein the matching condition comprises: the first sub-text and the target sub-text are identical in semantic meaning and identical in attribute category.
9. The method according to claim 1, wherein after the identifying attribute categories of the target sub-text with respect to a plurality of attributes, and before the transcribing the target sub-text into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text, the method further comprises:
detecting the attribute category of each of the attributes separately to obtain a detection result; wherein the detection result comprises whether the attribute category of the attribute is correctly identified; and
in response to the detection result comprising that the attribute category of the attribute is incorrectly identified, replacing the attribute category corresponding to the attribute with a correction category.
10. The method according to claim 1, wherein obtaining the regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text comprises:
replacing the target sub-text in the text to be regularized with the regularized sub-text corresponding to the target sub-text to obtain the regularized text.
11. A text regularizing apparatus, comprising:
an analysis module, configured to parse a text to be regularized to obtain a target sub-text; wherein the text to be regularized consists of a plurality of sub-texts, the target sub-text is a sub-text that needs regularization processing, the target sub-text needs to be transcribed into a target language, and the grammar of the target language involves grammatical number and case;
an identification module, configured to identify attribute categories of the target sub-text with respect to a plurality of attributes; wherein the plurality of attributes include a case attribute, and the attribute category of the case attribute includes a case category of the target sub-text in the target language;
a transcription module, configured to transcribe the target sub-text into the target language based on the attribute category of the target sub-text to obtain a regularized sub-text having the same semantics as the target sub-text; and
an obtaining module, configured to obtain a regularized text corresponding to the text to be regularized based on the regularized sub-text corresponding to the target sub-text.
12. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions, the processor being configured to execute the program instructions to implement the text regularization method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that program instructions are stored which can be executed by a processor for implementing the text regularization method according to any one of claims 1 to 10.
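For readers who want to picture the training step recited in claim 5, the following PyTorch sketch fuses a vectorized representation of the sample sub-text with a vectorized representation of its N-gram character information and adjusts the network from the label difference; all dimensions, the mean pooling, the concatenation-based fusion, and the toy tensors are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class AttributeClassifier(nn.Module):
    """Minimal sketch of the training step in claim 5 (sizes and fusion are assumptions)."""

    def __init__(self, vocab_size=1000, ngram_vocab_size=5000, dim=64, num_labels=10):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)        # first embedded representation
        self.ngram_emb = nn.Embedding(ngram_vocab_size, dim)  # second embedded representation
        self.head = nn.Linear(2 * dim, num_labels)            # classification over labels

    def forward(self, token_ids, ngram_ids):
        first = self.token_emb(token_ids).mean(dim=1)
        second = self.ngram_emb(ngram_ids).mean(dim=1)
        fused = torch.cat([first, second], dim=-1)            # fusion of the two representations
        return self.head(fused)

model = AttributeClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

token_ids = torch.randint(0, 1000, (8, 5))   # toy sample sub-texts
ngram_ids = torch.randint(0, 5000, (8, 12))  # toy N-gram character information
labels = torch.randint(0, 10, (8,))          # sample labels (preset label or categories)

optimizer.zero_grad()
logits = model(token_ids, ngram_ids)
loss = loss_fn(logits, labels)   # difference between sample labels and predicted labels
loss.backward()
optimizer.step()                 # adjust network parameters
```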
CN202111486205.1A 2021-12-07 2021-12-07 Text regularization method and related device, electronic equipment and storage medium Pending CN114330286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111486205.1A CN114330286A (en) 2021-12-07 2021-12-07 Text regularization method and related device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111486205.1A CN114330286A (en) 2021-12-07 2021-12-07 Text regularization method and related device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114330286A true CN114330286A (en) 2022-04-12

Family

ID=81048884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111486205.1A Pending CN114330286A (en) 2021-12-07 2021-12-07 Text regularization method and related device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114330286A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination