CN110765733A - Text normalization method, device, equipment and storage medium


Info

Publication number: CN110765733A
Authority: CN (China)
Prior art keywords: text, training, word, vector, labeled
Legal status: Pending
Application number: CN201911017291.4A
Other languages: Chinese (zh)
Inventor: 张强
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd

Abstract

The application provides a text normalization method, a text normalization device, equipment and a storage medium. The text normalization method comprises the following steps: acquiring a text to be normalized; extracting text normalization features from the text to be normalized, wherein the text normalization features at least comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized; and determining a normalized text corresponding to the text to be normalized by utilizing the text normalization features and a pre-established text normalization model. With this method, the text normalization features of the text to be normalized and the pre-established text normalization model can be used to normalize the text to be normalized into a text with clear sentence meaning and strong readability and logicality.

Description

Text normalization method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text normalization method, apparatus, device, and storage medium.
Background
In some application scenarios, a text may be obtained and need to be provided to a target user to read. However, for various reasons, the obtained text may suffer from poor readability, unclear meaning and similar problems, which make it difficult for the target user to read the text.
Taking a speech recognition scenario as an example: voice input is the most natural and convenient way of human-computer interaction. During voice input, for various reasons (for example, speech from people around the speaker is picked up, the speaker utters meaningless or repeated words out of nervousness or unclear thinking, or the speaker uses internet slang and personalized expressions that ordinary people cannot understand because of language habits), the text obtained when the speech recognition system recognizes the input speech may suffer from poor readability, unclear meaning and similar problems, so that a reader cannot understand from the text the meaning the speaker intended to express.
Disclosure of Invention
In view of this, the present application provides a text normalization method, apparatus, device and storage medium for normalizing a text that suffers from poor readability, unclear meaning and similar problems, so that a reader can read and understand it. The technical solution is as follows:
a text normalization method, comprising:
acquiring a text to be normalized;
extracting text normalization features from the text to be normalized, wherein the text normalization features comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and determining a normalized text corresponding to the text to be normalized by utilizing the text normalization features and a pre-established text normalization model.
Optionally, the extracting a text normalization feature from the text to be normalized includes:
for any sentence in the text to be normalized:
acquiring semantic features and generalization features of the sentence, splicing the semantic features and the generalization features of the sentence, and taking the spliced features as the text normalization features of the sentence;
so as to obtain the text normalization features of each sentence in the text to be normalized.
Optionally, the obtaining semantic features of the sentence includes:
for any word in the sentence, obtaining a word vector and a part-of-speech vector of the word, splicing the word vector and the part-of-speech vector of the word, and taking the spliced vector as a feature vector of the word to obtain a feature vector of each word in the sentence, wherein the part-of-speech vector of one word is a vector representing the part-of-speech of the word;
and splicing the feature vectors of all words in the sentence, wherein the spliced vector is used as the semantic feature of the sentence.
Optionally, the determining a normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model includes:
inputting the text normalization features of each sentence in the text to be normalized into the text normalization model to obtain normalized sentences corresponding to each sentence in the text to be normalized;
and composing the normalized text corresponding to the text to be normalized from the normalized sentences corresponding to the sentences in the text to be normalized.
Optionally, the process of constructing the text normalization model in advance includes:
acquiring a training text from a pre-constructed training text set, wherein the training text set comprises a plurality of training texts, each training text corresponds to a labeled text, and the labeled text corresponding to a training text is the real normalized text corresponding to the training text;
and training a text normalization model by using the obtained training text and the corresponding labeled text.
Optionally, the training the text normalization model by using the obtained training text and the corresponding labeled text includes:
extracting a text normalization feature from the training text to serve as the training text normalization feature;
determining a mask vector of a labeled text corresponding to the training text, wherein the mask vector can represent words needing to be replaced and words not needing to be replaced in the labeled text corresponding to the training text;
and training a text normalization model by using the training text normalization feature, the labeled text corresponding to the training text and the mask vector of the labeled text corresponding to the training text.
Optionally, the determining a mask vector of the labeled text corresponding to the training text includes:
determining a probability vector of a labeled text corresponding to the training text, wherein the probability vector consists of the probability of a prefix sequence of each word in the labeled text corresponding to the training text, and the prefix sequence of one word is a word sequence consisting of all words before the word;
and determining a mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text.
Optionally, the determining, according to the probability vector of the labeled text corresponding to the training text, the mask vector of the labeled text corresponding to the training text includes:
normalizing the probability vector of the labeled text corresponding to the training text to obtain a normalized probability vector;
performing a first-order difference on the normalized probability vector to obtain a first-order difference result;
and determining a mask vector of the labeled text corresponding to the training text according to the first-order difference result.
Optionally, the training a text normalization model by using the training text normalization feature, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text includes:
predicting a normalized text corresponding to the training text by using the training text normalization feature, the mask vector of the labeled text corresponding to the training text and a text normalization model;
determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text and the mask vector of the labeled text corresponding to the training text;
and updating the parameters of the text normalization model according to the prediction loss of the text normalization model.
Optionally, the predicting the normalized text corresponding to the training text by using the training text normalization feature, the mask vector of the labeled text corresponding to the training text, and the text normalization model includes:
predicting word by word by utilizing the training text normalization feature, the mask vector of the labeled text and a text normalization model:
when predicting the word at each target time, if the labeled word at the previous time is determined, according to the mask vector of the labeled text corresponding to the training text, to be a word that does not need to be replaced, predicting the word at the target time according to the labeled word at the previous time, and if the labeled word at the previous time is determined, according to the mask vector of the labeled text corresponding to the training text, to be a word that needs to be replaced, predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time, wherein the labeled words are words in the labeled text corresponding to the training text;
and composing the normalized text corresponding to the training text from all the predicted words.
Optionally, the predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time includes:
calculating the cosine distance between the labeled word at the previous time and the predicted word at the previous time;
calculating a fusion gate vector of the predicted word at the previous time, wherein the fusion gate vector is used for controlling the degree to which the labeled word at the previous time and the predicted word at the previous time are fused;
determining a fusion vector of the labeled word at the previous time and the predicted word at the previous time according to the cosine distance and the fusion gate vector;
and determining the predicted word at the target time according to the fusion vector.
Optionally, the determining a prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text, and the mask vector of the labeled text corresponding to the training text includes:
determining a prediction error rate according to the words that do not need to be replaced in the labeled text corresponding to the training text and the predicted words corresponding to those words;
determining the entropy of the probability distribution of the predicted words corresponding to the words that need to be replaced in the labeled text corresponding to the training text;
determining the prediction loss of the text normalization model according to the prediction error rate and the entropy of the probability distribution of the predicted words corresponding to the words that need to be replaced in the labeled text;
and determining the words that do not need to be replaced and the words that need to be replaced in the labeled text according to the mask vector of the labeled text corresponding to the training text.
A text normalization apparatus, comprising: the system comprises a text acquisition module, a feature extraction module and a text normalization module;
the text acquisition module is used for acquiring a text to be normalized;
the feature extraction module is configured to extract a text normalization feature from the text to be normalized, where the text normalization feature includes a semantic feature capable of representing semantics of the text to be normalized and a generalization feature capable of representing a repeated portion in the text to be normalized;
and the text normalization module is used for determining a normalized text corresponding to the text to be normalized by utilizing the text normalization features and a pre-established text normalization model.
Optionally, the feature extraction module is specifically configured to obtain, for any sentence in the text to be normalized, the semantic features and generalization features of the sentence, splice them, and use the spliced features as the text normalization features of the sentence, so as to obtain the text normalization features of each sentence in the text to be normalized.
Optionally, when obtaining the semantic features of the sentence, the feature extraction module is specifically configured to, for any word in the sentence, obtain a word vector and a part-of-speech vector of the word, splice the word vector and the part-of-speech vector of the word, and use the spliced vector as the feature vector of the word to obtain the feature vector of each word in the sentence, where the part-of-speech vector of a word is a vector representing the part-of-speech of the word; and splicing the feature vectors of all words in the sentence, wherein the spliced vector is used as the semantic feature of the sentence.
Optionally, the text normalization module is specifically configured to input the text normalization features of each sentence in the text to be normalized into the text normalization model to obtain the normalized sentence corresponding to each sentence in the text to be normalized, and to compose the normalized text corresponding to the text to be normalized from the normalized sentences corresponding to the sentences in the text to be normalized.
The text normalization device further comprises a text normalization model building module, which comprises a training text acquisition module and a text normalization model training module;
the training text acquisition module is used for acquiring a training text from a pre-constructed training text set, wherein the training text set comprises a plurality of training texts, each training text corresponds to a labeled text, and the labeled text corresponding to a training text is the real normalized text corresponding to that training text;
and the text normalization model training module is used for training a text normalization model by using the obtained training text and the corresponding labeled text.
Optionally, the text normalization model training module includes: a feature extraction submodule, a mask vector determination submodule and a model training submodule;
the feature extraction submodule is used for extracting a text normalization feature from the training text to serve as the training text normalization feature;
the mask vector determining submodule is used for determining a mask vector of a labeled text corresponding to the training text, wherein the mask vector can represent words needing to be replaced and words not needing to be replaced in the labeled text corresponding to the training text;
and the model training submodule is used for training the text normalization model by utilizing the training text normalization features, the labeled text corresponding to the training text and the mask vector of the labeled text corresponding to the training text.
Optionally, the mask vector determining submodule is specifically configured to determine a probability vector of a labeled text corresponding to the training text, and determine a mask vector of a labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text; the probability vector is composed of the probability of the prefix sequence of each word in the label text corresponding to the training text, and the prefix sequence of one word is a word sequence composed of all words before the word.
Optionally, when determining the mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text, the mask vector determination submodule is specifically configured to normalize the probability vector of the labeled text corresponding to the training text to obtain a normalized probability vector, perform first-order difference on the normalized probability vector to obtain a first-order difference result, and determine the mask vector of the labeled text corresponding to the training text according to the first-order difference result.
Optionally, the model training sub-module includes a text prediction sub-module, a prediction loss determination sub-module, and a parameter updating sub-module;
the text prediction submodule is used for predicting the normalized text corresponding to the training text by using the training text normalization feature, the mask vector of the labeled text corresponding to the training text and a text normalization model;
the prediction loss determination submodule is used for determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text and the mask vector of the labeled text corresponding to the training text;
and the parameter updating submodule is used for updating the parameters of the text normalization model according to the prediction loss of the text normalization model.
Optionally, the text prediction submodule is specifically configured to perform word-by-word prediction by using the training text normalization feature, the mask vector of the labeled text, and the text normalization model: when predicting the word at each target time, if the labeled word at the previous time is determined, according to the mask vector of the labeled text corresponding to the training text, to be a word that does not need to be replaced, predicting the word at the target time according to the labeled word at the previous time, and if the labeled word at the previous time is determined, according to the mask vector of the labeled text corresponding to the training text, to be a word that needs to be replaced, predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time, wherein the labeled words are words in the labeled text corresponding to the training text; and composing the normalized text corresponding to the training text from all the predicted words.
Optionally, when predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time, the text prediction submodule is specifically configured to calculate the cosine distance between the labeled word at the previous time and the predicted word at the previous time, calculate the fusion gate vector of the predicted word at the previous time, determine the fusion vector of the labeled word at the previous time and the predicted word at the previous time according to the cosine distance and the fusion gate vector, and determine the predicted word at the target time according to the fusion vector; the fusion gate vector is used for controlling the degree to which the labeled word at the previous time and the predicted word at the previous time are fused.
Optionally, the prediction loss determination submodule is specifically configured to determine a prediction error rate according to the words that do not need to be replaced in the labeled text corresponding to the training text and the predicted words corresponding to those words, determine the entropy of the probability distribution of the predicted words corresponding to the words that need to be replaced in the labeled text, and determine the prediction loss of the text normalization model according to the prediction error rate and that entropy; the words that do not need to be replaced and the words that need to be replaced in the labeled text are determined according to the mask vector of the labeled text corresponding to the training text.
A text normalization device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement each step of the text normalization method.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the text normalization method of any of the preceding claims.
According to the above scheme, in the text normalization method, device, equipment and storage medium provided by the present application, a text to be normalized is first acquired, text normalization features are then extracted from it, and finally the normalized text corresponding to the text to be normalized is determined by utilizing the text normalization features and the pre-established text normalization model. With this method, the semantic features capable of representing the semantics of the text to be normalized, the generalization features capable of representing the repeated parts in the text to be normalized, and the pre-established text normalization model can be used to normalize the text into a text with clear sentence meaning and strong readability and logicality, so that a reader can read it easily and the user experience is good.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flow chart of a text normalization method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of training a text normalization model by using a training text and the labeled text corresponding to the training text according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a process of determining a mask vector of the labeled text corresponding to a training text according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a process of training a text normalization model by using training text normalization features, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text, according to an embodiment of the present application;
fig. 5 is a schematic flow chart of a process of predicting the word at a target time according to the labeled word at the previous time and the predicted word at the previous time, according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a text normalization apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text normalization device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to realize text normalization, the inventor of the present invention conducted research; the original idea was as follows:
Convert text normalization into a binary classification problem, i.e., perform a "delete" or "retain" operation on each word in the text to be normalized. The basic process may include: first, performing word segmentation on the text to be normalized to obtain a word sequence; then, inputting the word sequence into a pre-established text normalization model for binary classification to obtain a classification result for each word in the text to be normalized, where the classification result of each word can be represented by 0 or 1: if the classification result of a word is 0, the word needs to be deleted, and if it is 1, the word needs to be retained. After the classification result of each word is obtained, the words needing deletion are deleted accordingly, all retained words are spliced in order, and the spliced text is the normalized text.
However, the above text normalization method has a single objective: it only deletes words from the text to be normalized, so the normalized text is a subset of the original text. A text obtained merely by deleting words is only preliminarily readable; that is, the above method has a poor normalization effect.
The inventor found that, to obtain a text with strong readability and clear semantics, besides words requiring a deletion operation, one also needs to consider words requiring a replacement operation (such as internet slang and personalized words that ordinary people cannot understand), words requiring an addition operation (such as pronouns implied by context), words requiring a reordering operation (such as out-of-order words within a sentence), and so on; it is precisely these words that play a key role in semantic accuracy and continuity.
Therefore, the inventor conducted further research and finally provided a text normalization method with a better effect, which normalizes a text suffering from poor readability, unclear meaning and similar problems into a text with strong readability and clear semantics. The method can serve as post-processing for speech recognition, i.e., performing text normalization on the recognition output of a speech recognition system. The text normalization method provided in the present application is described through the following embodiments.
Referring to fig. 1, a schematic flow chart of a text normalization method provided in an embodiment of the present application is shown, where the method may include:
and S101, acquiring a text to be structured.
Optionally, the text to be structured may be, but is not limited to, a text obtained by performing speech recognition on spoken speech data, and the text to be structured may include one sentence or a plurality of sentences.
And S102, extracting text warping characteristics of the text to be warped.
The text normalization features may include semantic features and generalization features, where the semantic features are features capable of representing semantics of the text to be normalized, and the generalization features are discrete features capable of representing repeated portions in the text to be normalized.
Specifically, the process of extracting the text normalization features from the text to be normalized includes: for any sentence in the text to be normalized, acquiring the semantic features and the generalization features of the sentence, splicing them, and taking the spliced features as the text normalization features of the sentence, so as to obtain the text normalization features of each sentence in the text to be normalized.
The process of obtaining the semantic features of a sentence comprises the following steps: for any word in the sentence, obtaining a word vector and a part-of-speech vector of the word, splicing the word vector and the part-of-speech vector of the word, and taking the spliced vector as a feature vector of the word to obtain a feature vector of each word in the sentence; and splicing the feature vectors of all words in the sentence, wherein the spliced vector is used as the semantic feature of the sentence.
It should be noted that a word vector of a word is a vector capable of characterizing the word, and a part-of-speech vector of a word is a vector capable of characterizing the part-of-speech of the word.
Specifically, the process of obtaining a word vector and a part-of-speech vector of each word in a sentence includes: the method comprises the steps of performing word segmentation processing on a sentence to obtain each word in the sentence, performing part-of-speech tagging on each word after each word in the sentence is obtained, then determining a word vector of each word, and determining the part-of-speech vector of each word according to the part-of-speech tagging of each word. Alternatively, word vectors for each word may be determined using the word2vec method or other methods, and part-of-speech vectors for each word may be determined using conditional random fields or other methods.
For example, suppose a sentence in the text to be normalized is "这真是个一颗赛艇的想法" ("this really is a racing-boat idea", where "racing boat" renders the internet slang 一颗赛艇, a transliteration of "exciting"). The sentence is first segmented to obtain the word sequence "this / really is / a / racing boat / de / idea" ("de" glossing the particle 的), and each word in the word sequence is then part-of-speech tagged, giving: "this/pronoun, really is/adverb, a/measure word, racing boat/adjective, de/particle, idea/noun". Finally, the word vector and the part-of-speech vector of each word in the text to be normalized are determined according to the word segmentation result and the part-of-speech tagging result.
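A hedged sketch of this feature construction (the segmenter/tagger and the embedding tables are assumed components, e.g. word2vec word embeddings and a learned part-of-speech embedding matrix; none of these names come from the patent):

```python
import numpy as np

def sentence_semantic_feature(sentence, segment_and_tag, word_vec, pos_vec):
    # `segment_and_tag` yields (word, part-of-speech) pairs; `word_vec`
    # and `pos_vec` are lookup tables for word and part-of-speech vectors
    feats = []
    for word, pos in segment_and_tag(sentence):
        # feature vector of a word = [word vector ; part-of-speech vector]
        feats.append(np.concatenate([word_vec[word], pos_vec[pos]]))
    # splicing the feature vectors of all words yields the sentence's
    # semantic feature
    return np.concatenate(feats)
```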
Step S103, determining the normalized text corresponding to the text to be normalized by utilizing the text normalization features and the pre-established text normalization model.
It should be noted that the normalized text corresponding to the text to be normalized is the text obtained after the text to be normalized is normalized.
The text normalization features of each sentence in the text to be normalized can be obtained through step S102. After they are obtained, the text normalization features of each sentence can be input into the pre-established text normalization model to obtain the normalized sentence corresponding to each sentence in the text to be normalized, and the normalized sentences corresponding to the sentences are combined into the normalized text corresponding to the text to be normalized.
It should be noted that the input of the text normalization model is the text normalization features of a sentence and the output is a word sequence; the sentence formed by the word sequence is the normalized sentence.
The text normalization method provided by the embodiment of the application first obtains a text to be normalized, then extracts text normalization features from it, and finally determines the normalized text corresponding to the text to be normalized by using the text normalization features and the pre-established text normalization model. With this method, the text normalization features and the text normalization model can be used to effectively normalize the text to be normalized, so as to obtain a normalized text with better readability, accuracy and logicality; the normalized text helps a reader accurately understand the meaning the text to be normalized originally intended to express, and the user experience is better.
As described in the above embodiment, the normalized text corresponding to the text to be normalized is determined by using the pre-established text normalization model. Next, the process of building the text normalization model in advance is described.
The process of pre-building the text normalization model may include: acquiring a training text from a pre-constructed training text set; and training the text normalization model by using the training text and the labeled text corresponding to the training text.
The training text set comprises a plurality of training texts, and each training text corresponds to a labeled text, which is the real normalized text corresponding to that training text. For example, if a training text is "this really is a racing-boat idea", its labeled text is the real normalized version of that sentence, such as "this really is an exciting idea".
The text normalization model in this embodiment may be an end-to-end neural network model. The model may include an encoder and a decoder, where the encoder may be a feature extractor formed by network layers such as CNNs and LSTMs; similarly, the decoder may also be formed by network layers such as LSTMs and CNNs, and the decoder side may introduce an attention mechanism to make effective use of the encoded features.
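As a hedged sketch of such an architecture (layer sizes, layer choices and the class name are illustrative assumptions, not values from the patent), an LSTM encoder with an attention-equipped LSTM decoder might look like this:

```python
import torch
import torch.nn as nn

class TextNormalizer(nn.Module):
    def __init__(self, feat_dim, hidden, vocab_size):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(hidden, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, max_len):
        # feats: (batch, seq_len, feat_dim) text normalization features
        enc, _ = self.encoder(feats)
        h = c = feats.new_zeros(feats.size(0), enc.size(-1))
        logits = []
        for _ in range(max_len):  # decode the normalized sentence word by word
            # attend over the encoder outputs with the current decoder state
            ctx, _ = self.attn(h.unsqueeze(1), enc, enc)
            h, c = self.decoder(ctx.squeeze(1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, max_len, vocab_size)
```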
Referring to fig. 2, a schematic flow chart of training the text normalization model by using a training text and the labeled text corresponding to the training text is shown, which may include:
Step S201, extracting a text normalization feature from the training text to serve as the training text normalization feature.
The process of extracting the text normalization feature from the training text is similar to the implementation of "extracting text normalization features from the text to be normalized" provided in the foregoing embodiment, and details are not repeated here.
Step S202, determining a mask vector of the labeled text corresponding to the training text.
The mask vector of the labeled text corresponding to the training text is composed of the mask values corresponding to the words in that labeled text, and the mask value corresponding to any word indicates whether the word needs to be replaced; that is, the mask vector can represent the words needing replacement and the words not needing replacement in the labeled text corresponding to the training text.
Illustratively, the labeled text corresponding to a training sample is "this really is a racing-boat idea". If the word "racing boat" in the labeled text needs to be replaced and the other words do not, then the mask values corresponding to "this", "really is", "a", "de" and "idea" are all 1 and the mask value corresponding to "racing boat" is 0, so the mask vector [1 1 1 0 1 1] of the labeled text "this really is a racing-boat idea" is obtained.
Step S203, training a text normalization model by using the training text normalization features, the labeled text corresponding to the training text and the mask vector of the labeled text corresponding to the training text.
The specific processes of step S202 and step S203 described above will be described below.
Referring to fig. 3, a schematic flow chart of the implementation of the foregoing step S202 (determining the mask vector of the labeled text corresponding to the training text) is shown, which may include:
Step S301, determining a probability vector of the labeled text corresponding to the training text.
The probability vector of the labeled text corresponding to the training text is composed of the probability of the prefix sequence of each word in the labeled text corresponding to the training text.
It should be noted that the prefix sequence of a word refers to the word sequence formed by all words before that word. Illustratively, in the labeled text "this really is a racing-boat idea" corresponding to a training sample, the prefix sequence of the word "a" is "this really is", and the prefix sequence of the word "idea" is "this really is a racing-boat de".
In one possible implementation, an RNN language model may be utilized to determine the probability of the prefix sequence of each word in the labeled text. It should be noted that the RNN language model here is a general neural network language model; its training data may be the training texts of the text normalization model or other texts, i.e., any text.
The input of the RNN language model is the word sequence formed by all the words in the labeled text, and the output is a string of probability values. Illustratively, for the word sequence "this / really is / a / racing boat / de / idea", inputting it into the RNN language model yields 7 probability values: p(this | <s>), p(really is | <s>, this), p(a | <s>, this, really is), p(racing boat | <s>, this, really is, a), p(de | <s>, this, really is, a, racing boat), p(idea | <s>, this, really is, a, racing boat, de), and p(</s> | <s>, this, really is, a, racing boat, de, idea). These 7 probability values constitute a 7-dimensional probability vector.
Step S302, determining a mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text.
Each value in the probability vector represents the probability of a word given its prefix sequence. It can be understood that the probability is large when a common word follows a common prefix sequence and small otherwise. Illustratively, among the probabilities in the above example, p(this | <s>), p(really is | <s>, this) and p(a | <s>, this, really is) are relatively large, because "<s> this", "<s> this really is" and "<s> this really is a" are all common sequences. Although the prefix sequence "<s> this really is a" is common, the word "racing boat" is a highly personalized word, so p(racing boat | <s>, this, really is, a) is small. The prefix sequences of "de", "idea" and "</s>" all contain "racing boat", so p(de | ...), p(idea | ...) and p(</s> | ...) are also relatively small.
Specifically, the process of determining the mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text may include:
and S3021, normalizing the probability vector of the labeled text corresponding to the training text to obtain a normalized probability vector.
Optionally, the normalization method of the probability vector of the labeled text corresponding to the training text may be: each probability in the probability vector is divided by the smallest probability in the probability vector.
Step S3022, performing a first-order difference on the normalized probability vector to obtain a first-order difference result.
Illustratively, if the normalized probability vector is v = [50, 34, 45, 2, 3, 1, 4], applying a first-order difference to it yields v' = [-16, 11, -43, 1, -2, 3].
Step S3023, determining the mask vector of the labeled text corresponding to the training text according to the first-order difference result.
Specifically, the difference minimum is determined from the first-order difference result, the word needing replacement in the labeled text corresponding to the training text is determined according to the difference minimum, the remaining words are words not needing replacement, and the mask vector of the labeled text is generated according to the words needing replacement and the words not needing replacement.
Illustratively, the labeled text corresponding to the training text is "this really is a racing-boat idea". Suppose that normalizing and first-order differencing its probability vector yields v' = [-16, 11, -43, 1, -2, 3], where -43 is the difference minimum. This minimum is computed from probability 45 and probability 2; the word corresponding to probability 45 is "a" and the word corresponding to probability 2 is "racing boat", so the differential minimum occurs at the transition from "a" to "racing boat". Therefore, "racing boat" is determined as the word needing replacement and the other words as words not needing replacement; words needing replacement are labeled 0 and words not needing replacement are labeled 1, giving the mask vector [1 1 1 0 1 1] for the labeled text "this really is a racing-boat idea".
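As a hedged illustration of steps S301-S302 (not code from the patent), the mask vector can be derived as follows; `lm_prefix_probs` stands in for an RNN language model that returns one probability per word plus one for the end token, as in the example above:

```python
import numpy as np

def mask_vector(words, lm_prefix_probs):
    p = np.asarray(lm_prefix_probs(words), dtype=float)  # e.g. 7 values for 6 words
    v = p / p.min()                 # normalize: divide by the smallest probability
    d = np.diff(v)                  # first-order difference
    mask = np.ones(len(words), dtype=int)
    idx = int(np.argmin(d)) + 1     # word where the sharpest probability drop lands
    if idx < len(words):            # guard: a drop at the end token marks no word
        mask[idx] = 0               # 0 = word needing replacement
    return mask

# Example: v = [50, 34, 45, 2, 3, 1, 4] gives d = [-16, 11, -43, 1, -2, 3];
# argmin(d) = 2, so the word at index 3 ("racing boat") gets mask value 0,
# yielding mask = [1, 1, 1, 0, 1, 1].
```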
The following describes step S203, i.e., training the text normalization model by using the training text normalization feature, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text.
It should be noted that the principle of training the text normalization model in the present application is that common words in the training text are not replaced (such words are retained to maintain the original meaning), while special words in the training text are replaced with common near-synonyms. Based on this, please refer to fig. 4, which shows a schematic flow chart of training the text normalization model by using the training text normalization feature, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text; the flow may include:
step S401, a regular text corresponding to the training text is predicted by using the training text regular feature, the mask vector of the labeled text corresponding to the training text and the text regular model.
Specifically, the word-by-word prediction is carried out by utilizing the training text regular features, the mask vector of the labeled text corresponding to the training text and a text regular model: when predicting the word at each target moment, if determining that the tagged word at the previous moment is a word which does not need to be replaced according to the mask vector of the tagged text corresponding to the training text, predicting the word at the target moment according to the tagged word at the previous moment, and if determining that the tagged word at the previous moment is a word which needs to be replaced according to the mask vector of the tagged text corresponding to the training text, predicting the word at the target moment according to the tagged word at the previous moment and the predicted word at the previous moment; and finally, forming a regular text corresponding to the training text by all the predicted words. And the labeled words are words in the labeled text corresponding to the training text.
Illustratively, the labeled text corresponding to the training text is 'this is really the idea of a racing boat', the mask vector of the label text is [ 111011 ], the value of the 2 nd position of the mask vector is 1, which indicates that the word at the position does not need to be replaced when the word is normalized, therefore, the 2 nd word in the annotation text may be used as the prediction input at time 3, the value at the 4 th position of the mask vector is 0, assuming that at time 4, the word predicted by the text warping model is "exciting", then, when predicting the output word at time 5, fusing the predicted word at the 4 th moment and the labeled word at the 4 th moment, predicting the output word at the 5 th moment according to the fused result, that is, the word vectors of "exciting" and "racing yacht at one moment" are fused, and the output word at the 5 th moment is predicted from the fusion result.
In the above implementation process, "the words at the target time are predicted according to the annotation words at the previous time and the prediction words at the previous time" may be referred to in the description of the following embodiments.
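Before that, a hedged sketch (not code from the patent) of this input-selection rule during training, assuming per-time-step word vectors and a `fuse` operation corresponding to steps S501-S504 described later:

```python
def next_decoder_input(t, mask, label_vecs, pred_vecs, fuse):
    # mask, label_vecs and pred_vecs are indexed by time step; `fuse`
    # stands in for the fusion operation of steps S501-S504 below
    if mask[t - 1] == 1:
        # previous labeled word does not need replacement: feed it as-is
        return label_vecs[t - 1]
    # previous labeled word needs replacement: fuse the predicted word
    # vector with the labeled word vector from the previous time step
    return fuse(pred_vecs[t - 1], label_vecs[t - 1])
```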
Step S402, determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text and the mask vector of the labeled text corresponding to the training text.
The process of determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text, and the mask vector of the labeled text corresponding to the training text may include: determining a prediction error rate according to the words not needing replacement in the labeled text corresponding to the training text and the predicted words corresponding to those words; determining the entropy of the probability distribution of the predicted words corresponding to the words needing replacement in the labeled text; and determining the prediction loss of the text normalization model according to the prediction error rate and that entropy. The words not needing replacement and the words needing replacement in the labeled text are determined according to the mask vector of the labeled text corresponding to the training text.
Specifically, the prediction loss of the text normalization model may be determined using a loss function of the following form (the original formula image is not reproduced here; the form below is reconstructed from the surrounding description, with α an assumed weighting coefficient):
L = ( Σ_i mask_i · 1[ŷ_i ≠ y_i] ) / ( Σ_i mask_i ) − α · H(P_fusion) (1)
where mask is the mask vector of the labeled text corresponding to the training text, ŷ_i and y_i are the predicted and labeled words at position i, the first term is the prediction error rate described above, and H(P_fusion) is the entropy of the probability distribution of the predicted words corresponding to the words needing replacement in the labeled text. This loss function forces the predicted words at positions where the mask is 1 to be consistent with the corresponding labeled words to the maximum extent, while at the same time the predicted words at positions where the mask is 0 retain diversity; that is, the model acquires the capability of replacing special words.
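A hedged PyTorch-style sketch of this loss (the cross-entropy surrogate for the error rate and the weight `alpha` are assumptions, not from the patent):

```python
import torch

def normalization_loss(probs, gold, mask, alpha=0.1):
    # probs: (T, V) predicted word distributions; gold: (T,) labeled
    # word ids; mask: (T,) 0/1 mask vector of the labeled text
    mask = mask.float()
    logp = probs.clamp_min(1e-9).log()
    # differentiable surrogate for the prediction error rate at
    # positions whose mask value is 1 (words kept as-is)
    nll = -logp.gather(-1, gold.unsqueeze(-1)).squeeze(-1)
    err = (nll * mask).sum() / mask.sum().clamp_min(1.0)
    # entropy of the predicted distributions at mask-0 positions;
    # subtracting it rewards diversity, i.e. the ability to replace
    # special words with near-synonyms
    ent = -(probs * logp).sum(-1)
    ent = (ent * (1.0 - mask)).sum() / (1.0 - mask).sum().clamp_min(1.0)
    return err - alpha * ent
```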
Step S403, updating the parameters of the text normalization model according to the prediction loss of the text normalization model.
The implementation of "predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time" mentioned in the above embodiments is described below.
Referring to fig. 5, a schematic flow chart of predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time is shown; the flow may include:
Step S501, calculating the cosine distance between the labeled word at the previous time and the predicted word at the previous time.
Specifically, the cosine distance λ between the labeled word at the previous time and the predicted word at the previous time can be calculated by the following formula:
λ = (s1 · s2) / (||s1|| ||s2||) (2)
where s1 is the word vector of the predicted word at the previous time and s2 is the word vector of the labeled word at the previous time.
Step S502, calculating the fusion gate vector of the predicted word at the previous time.
The fusion gate vector is used for controlling the degree to which the labeled word at the previous time and the predicted word at the previous time are fused: a high degree of fusion increases the probability that near-synonyms are produced, while a low degree of fusion avoids errors arising from non-synonymous words.
Specifically, the fusion gate vector g_t of the predicted word at the previous time can be calculated by the following formula:
g_t = σ(W · [s1; s2] + b) (3)
where W is a matrix, b is a vector, σ is an activation function (e.g., a sigmoid or tanh function), and g_t is a vector whose dimension is the same as that of the word vector of the predicted word.
Step S503, determining the fusion vector of the labeled word at the previous time and the predicted word at the previous time according to the cosine distance between them and the fusion gate vector of the predicted word at the previous time.
The fusion vector is mainly used for representing the fusion result of the predicted word and the labeled word; the fusion result contains both the information of the labeled word and the information of the predicted word, and the relevance between the two is increased during model training.
After the cosine distance λ between the labeled word at the previous time and the predicted word at the previous time and the fusion gate vector g_t of the predicted word at the previous time are obtained, the fusion vector of the labeled word and the predicted word at the previous time can be calculated by a formula of the following form (the original formula image is not reproduced here; the form below is a plausible reconstruction from the surrounding description):
s_fuse = φ(U · (s2 + λ · (g_t ⊙ s1))) (4)
where ⊙ denotes element-wise multiplication of vectors, U is a matrix used to transform the corresponding vector (the specific transformation can be set as required), the λ-weighted gating term determines the proportion of the predicted word fused into the labeled word (this proportion can be set manually or according to empirical values), and φ is an activation function.
Step S504, determining the predicted word at the target time according to the fusion vector.
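A hedged sketch of steps S501-S504 under the reconstruction in formula (4); `W`, `b` and `U` are learned parameters and `phi` the activation function, all assumed as described above rather than taken from the patent:

```python
import torch
import torch.nn.functional as F

def fuse(s1, s2, W, b, U, phi=torch.tanh):
    # s1: word vector of the predicted word at the previous time
    # s2: word vector of the labeled word at the previous time
    lam = F.cosine_similarity(s1, s2, dim=-1).unsqueeze(-1)         # formula (2)
    g = torch.sigmoid(F.linear(torch.cat([s1, s2], dim=-1), W, b))  # formula (3)
    # formula (4) as reconstructed above: gate the predicted word,
    # scale it by the cosine distance, then merge into the labeled word
    return phi(F.linear(s2 + lam * (g * s1), U))
```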
After the text normalization model is trained through the above training process, a text to be normalized can be normalized by using the trained text normalization model, so that a normalized text with stronger readability and logicality and clearer semantics can be obtained.
The following describes the text normalization apparatus provided in the embodiments of the present application; the text normalization apparatus described below and the text normalization method described above may be referred to correspondingly.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a text normalization apparatus according to an embodiment of the present application. As shown in fig. 6, the apparatus may include: a text acquisition module 601, a feature extraction module 602, and a text normalization module 603.
The text acquisition module 601 is configured to acquire a text to be normalized;
the feature extraction module 602 is configured to extract text normalization features from the text to be normalized, where the text normalization features include semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and the text normalization module 603 is configured to determine the normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model.
The text normalization apparatus provided by the embodiment of the application can effectively normalize a text by utilizing the text normalization features of the text to be normalized and the text normalization model, so as to obtain a normalized text with better readability, accuracy and logicality; the normalized text can help a reader accurately understand the meaning the text to be normalized originally intended to express, so the user experience is better.
In a possible implementation manner, the feature extraction module 602 in the text normalization apparatus provided in the above embodiment is specifically configured to obtain, for any sentence in the text to be normalized, the semantic features and generalization features of the sentence and splice them, the spliced features being used as the text normalization features of the sentence, so as to obtain the text normalization features of each sentence in the text to be normalized.
In a possible implementation manner, when obtaining the semantic features of a sentence, the feature extraction module in the text normalization apparatus provided in the above embodiment is specifically configured to, for any word in the sentence, obtain the word vector and part-of-speech vector of the word and splice them, the spliced vector being used as the feature vector of the word, so as to obtain the feature vector of each word in the sentence, where the part-of-speech vector of a word is a vector representing the part of speech of the word; and to splice the feature vectors of all words in the sentence, the spliced vector being used as the semantic features of the sentence.
In a possible implementation manner, the text normalization module in the text normalization apparatus provided in the above embodiment is specifically configured to input the text normalization features of each sentence in the text to be normalized into the text normalization model to obtain the normalized sentence corresponding to each sentence, and to compose the normalized text corresponding to the text to be normalized from the normalized sentences corresponding to the sentences in the text to be normalized.
The text normalization apparatus provided by the above embodiment further includes a text normalization model building module, which includes a training text acquisition module and a text normalization model training module.
The training text acquisition module is configured to acquire a training text from a pre-constructed training text set, where the training text set comprises a plurality of training texts, each training text corresponds to a labeled text, and the labeled text corresponding to a training text is the real normalized text corresponding to that training text;
and the text normalization model training module is configured to train the text normalization model by using the obtained training text and the corresponding labeled text.
In a possible implementation manner, the text normalization model training module includes: a feature extraction submodule, a mask vector determination submodule and a model training submodule.
The feature extraction submodule is configured to extract a text normalization feature from the training text to serve as the training text normalization feature.
The mask vector determining submodule is used for determining a mask vector of the labeled text corresponding to the training text, wherein the mask vector can represent words needing to be replaced and words not needing to be replaced in the labeled text corresponding to the training text.
And the model training submodule is used for training the text normalization model by utilizing the training text normalization features, the labeled text corresponding to the training text and the mask vector of the labeled text corresponding to the training text.
In a possible implementation manner, the mask vector determination submodule is specifically configured to determine the probability vector of the labeled text corresponding to the training text, and to determine the mask vector of the labeled text from that probability vector; the probability vector consists of the probability of the prefix sequence of each word in the labeled text corresponding to the training text, where the prefix sequence of a word is the word sequence composed of all words before that word.
In a possible implementation manner, when determining the mask vector of the labeled text corresponding to the training text from its probability vector, the mask vector determination submodule is specifically configured to normalize the probability vector of the labeled text to obtain a normalized probability vector, perform a first-order difference on the normalized probability vector to obtain a first-order difference result, and determine the mask vector of the labeled text from the first-order difference result.
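A minimal sketch of this pipeline follows. How the prefix-sequence probabilities are obtained (e.g., from a language model) and the thresholding rule that maps the first-order difference onto keep/replace decisions are both assumptions: the application specifies the normalization and differencing steps but not the final decision rule.

```python
import numpy as np

def mask_from_prefix_probs(prefix_probs, threshold=-0.1):
    """Turn prefix-sequence probabilities into a 0/1 mask vector
    (1 = word needs to be replaced, 0 = word is kept)."""
    p = np.asarray(prefix_probs, dtype=float)
    p_norm = p / p.sum()                        # normalize the probability vector
    diff = np.diff(p_norm, prepend=p_norm[0])   # first-order difference
    # Assumed decision rule: a sharp drop in prefix probability marks a
    # word that should be replaced.
    return (diff < threshold).astype(int)

print(mask_from_prefix_probs([0.30, 0.28, 0.05, 0.20, 0.17]))  # -> [0 0 1 0 0]
```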
In a possible implementation manner, the model training submodule includes a text prediction submodule, a prediction loss determination submodule, and a parameter updating submodule.
The text prediction submodule is used for predicting the normalized text corresponding to the training text by using the training text normalization features, the mask vector of the labeled text corresponding to the training text, and the text normalization model.
The prediction loss determination submodule is used for determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text, and the mask vector of the labeled text corresponding to the training text.
The parameter updating submodule is used for updating the parameters of the text normalization model according to the prediction loss of the text normalization model.
In a possible implementation manner, the text prediction submodule is specifically configured to perform word-by-word prediction using the training text normalization features, the mask vector of the labeled text, and the text normalization model: when predicting the word at each target time, if the mask vector of the labeled text corresponding to the training text indicates that the labeled word at the previous time does not need to be replaced, the word at the target time is predicted from the labeled word at the previous time; if the mask vector indicates that the labeled word at the previous time needs to be replaced, the word at the target time is predicted from both the labeled word at the previous time and the predicted word at the previous time, where the labeled words are the words in the labeled text corresponding to the training text. All predicted words together form the normalized text corresponding to the training text.
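This mask-controlled teacher forcing can be sketched as the decoding loop below. `decoder_step`, `embed`, and the start-of-sequence token are stand-ins (the application fixes the masking rule but not the network architecture), and the `fuse` rule is sketched after the next paragraph.

```python
def predict_words(features, labeled_words, mask, decoder_step, fuse, embed):
    """Word-by-word prediction with mask-controlled teacher forcing.

    labeled_words: words of the labeled text; mask[t] == 1 means the
    labeled word at time t needs to be replaced. Step inputs to the
    decoder are embedding vectors.
    """
    predicted, state = [], None
    for t in range(len(labeled_words)):
        if t == 0:
            step_input = embed("<bos>")               # assumed start-of-sequence token
        elif mask[t - 1] == 0:
            step_input = embed(labeled_words[t - 1])  # previous labeled word is trusted
        else:
            # Previous labeled word must be replaced: feed a fusion of the
            # labeled word and the word predicted at the previous time.
            step_input = fuse(embed(labeled_words[t - 1]), embed(predicted[t - 1]))
        word, state = decoder_step(features, step_input, state)
        predicted.append(word)
    return predicted  # together these form the predicted normalized text
```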
In a possible implementation manner, when predicting the word at the target time from the labeled word and the predicted word at the previous time, the text prediction submodule is specifically configured to calculate the cosine distance between the labeled word at the previous time and the predicted word at the previous time, calculate the fusion gate vector of the predicted word at the previous time, determine the fusion vector of the labeled word and the predicted word at the previous time according to the cosine distance and the fusion gate vector, and determine the predicted word at the target time according to the fusion vector, where the fusion gate vector controls the degree to which the labeled word at the previous time and the predicted word at the previous time are fused.
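A sketch of this fusion step, consistent with the loop above, follows. The gate parameterization and the exact mixing formula are assumptions: the application states only that the cosine distance and a fusion gate vector jointly control the degree of fusion.

```python
import numpy as np

DIM = 8
rng = np.random.default_rng(0)
W_g, b_g = rng.normal(size=(DIM, DIM)), rng.normal(size=DIM)  # assumed gate parameters

def fuse(y_label: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Fuse the embeddings of the previous labeled word and predicted word."""
    # Cosine similarity between the two words (the cosine distance named in
    # the text would be 1 minus this value).
    cos = y_label @ y_pred / (np.linalg.norm(y_label) * np.linalg.norm(y_pred) + 1e-8)
    # Fusion gate computed from the predicted word (sigmoid of an affine map).
    gate = 1.0 / (1.0 + np.exp(-(W_g @ y_pred + b_g)))
    # Assumed mixing rule: the more similar the two words and the more open
    # the gate, the more the labeled word dominates the fusion vector.
    alpha = cos * gate
    return alpha * y_label + (1.0 - alpha) * y_pred
```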
In a possible implementation manner, the prediction loss determination submodule is specifically configured to determine a prediction error rate from the words in the labeled text corresponding to the training text that do not need to be replaced and their corresponding predicted words, determine the entropy of the probability distributions of the predicted words corresponding to the words in the labeled text that need to be replaced, and determine the prediction loss of the text normalization model from the prediction error rate and that entropy; which words of the labeled text need to be replaced and which do not is determined from the mask vector of the labeled text corresponding to the training text.
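A sketch of this loss follows. Combining the two terms as a weighted sum (weight `alpha`) is an assumption; the application states only that the loss is determined from the error rate and the entropy.

```python
import numpy as np

def prediction_loss(labeled_words, predicted_words, predicted_probs, mask, alpha=1.0):
    """Error rate over kept words plus mean entropy of the predictive
    distributions over replaced words (mask[t] == 1 means "replace")."""
    keep = [t for t, m in enumerate(mask) if m == 0]
    repl = [t for t, m in enumerate(mask) if m == 1]
    # Error rate on words the mask says should be kept unchanged.
    error_rate = (
        sum(labeled_words[t] != predicted_words[t] for t in keep) / len(keep)
        if keep else 0.0)
    # Entropy of each predicted distribution at positions to be replaced;
    # minimizing it pushes the model toward confident replacements.
    entropy = 0.0
    for t in repl:
        p = np.clip(predicted_probs[t], 1e-12, 1.0)
        entropy -= float((p * np.log(p)).sum())
    entropy = entropy / len(repl) if repl else 0.0
    return error_rate + alpha * entropy
```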
An embodiment of the present application further provides a text normalization device. Referring to fig. 7, which shows a schematic structural diagram of the text normalization device, the device may include: at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704;
in the embodiment of the present application, there is at least one of each of the processor 701, the communication interface 702, the memory 703, and the communication bus 704, and the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704;
the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like;
the memory 703 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;
where the memory stores a program that the processor can call, the program being configured to:
acquiring a text to be normalized;
extracting text normalization features from the text to be normalized, wherein the text normalization features comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and determining the normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model.
Optionally, the detailed functions and extended functions of the program may be as described above.
An embodiment of the present application further provides a readable storage medium storing a program suitable for execution by a processor, the program being configured to:
acquiring a text to be normalized;
extracting text normalization features from the text to be normalized, wherein the text normalization features comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and determining the normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner, with each embodiment focusing on its differences from the others; for the parts that are the same or similar among the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A text normalization method, comprising:
acquiring a text to be normalized;
extracting text normalization features from the text to be normalized, wherein the text normalization features comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and determining a normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model.
2. The text normalization method according to claim 1, wherein the extracting text normalization features from the text to be normalized comprises:
for any sentence in the text to be normalized:
acquiring the semantic feature and the generalization feature of the sentence, concatenating the semantic feature and the generalization feature of the sentence, and taking the concatenated feature as the text normalization feature of the sentence;
so as to obtain the text normalization feature of each sentence in the text to be normalized.
3. The text normalization method according to claim 2, wherein the acquiring the semantic feature of the sentence comprises:
for any word in the sentence, acquiring a word vector and a part-of-speech vector of the word, concatenating the word vector and the part-of-speech vector of the word, and taking the concatenated vector as the feature vector of the word, so as to obtain the feature vector of each word in the sentence, wherein the part-of-speech vector of a word is a vector representing the part of speech of the word;
and concatenating the feature vectors of all words in the sentence, wherein the concatenated vector serves as the semantic feature of the sentence.
4. The text normalization method according to claim 2, wherein the determining a normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model comprises:
inputting the text normalization feature of each sentence in the text to be normalized into the text normalization model to obtain the normalized sentence corresponding to each sentence in the text to be normalized;
and composing the normalized text corresponding to the text to be normalized from the normalized sentences corresponding to the respective sentences in the text to be normalized.
5. The text normalization method according to claim 1, wherein the process of pre-constructing the text normalization model comprises:
acquiring a training text from a pre-constructed training text set, wherein the training text set comprises a plurality of training texts, each training text corresponds to a labeled text, and the labeled text corresponding to a training text is the true normalized text corresponding to that training text;
and training the text normalization model by using the acquired training text and its corresponding labeled text.
6. The text normalization method according to claim 5, wherein the training the text normalization model by using the acquired training text and its corresponding labeled text comprises:
extracting text normalization features from the training text, to serve as the training text normalization features;
determining a mask vector of the labeled text corresponding to the training text, wherein the mask vector indicates which words in the labeled text corresponding to the training text need to be replaced and which do not;
and training the text normalization model by using the training text normalization features, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text.
7. The text normalization method according to claim 6, wherein the determining a mask vector of the labeled text corresponding to the training text comprises:
determining a probability vector of the labeled text corresponding to the training text, wherein the probability vector consists of the probability of the prefix sequence of each word in the labeled text corresponding to the training text, and the prefix sequence of a word is the word sequence composed of all words before that word;
and determining the mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text.
8. The text normalization method according to claim 7, wherein the determining the mask vector of the labeled text corresponding to the training text according to the probability vector of the labeled text corresponding to the training text comprises:
normalizing the probability vector of the labeled text corresponding to the training text to obtain a normalized probability vector;
performing a first-order difference on the normalized probability vector to obtain a first-order difference result;
and determining the mask vector of the labeled text corresponding to the training text according to the first-order difference result.
9. The text normalization method according to claim 6, wherein the training the text normalization model by using the training text normalization features, the labeled text corresponding to the training text, and the mask vector of the labeled text corresponding to the training text comprises:
predicting the normalized text corresponding to the training text by using the training text normalization features, the mask vector of the labeled text corresponding to the training text, and the text normalization model;
determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text, and the mask vector of the labeled text corresponding to the training text;
and updating the parameters of the text normalization model according to the prediction loss of the text normalization model.
10. The text normalization method according to claim 9, wherein the predicting the normalized text corresponding to the training text by using the training text normalization features, the mask vector of the labeled text corresponding to the training text, and the text normalization model comprises:
performing word-by-word prediction by using the training text normalization features, the mask vector of the labeled text, and the text normalization model:
when predicting the word at each target time, if it is determined from the mask vector of the labeled text corresponding to the training text that the labeled word at the previous time does not need to be replaced, predicting the word at the target time according to the labeled word at the previous time; and if it is determined from the mask vector of the labeled text corresponding to the training text that the labeled word at the previous time needs to be replaced, predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time, wherein the labeled words are the words in the labeled text corresponding to the training text;
and composing the normalized text corresponding to the training text from all the predicted words.
11. The text normalization method according to claim 10, wherein the predicting the word at the target time according to the labeled word at the previous time and the predicted word at the previous time comprises:
calculating the cosine distance between the labeled word at the previous time and the predicted word at the previous time;
calculating a fusion gate vector of the predicted word at the previous time, wherein the fusion gate vector is used for controlling the degree to which the labeled word at the previous time and the predicted word at the previous time are fused;
determining a fusion vector of the labeled word at the previous time and the predicted word at the previous time according to the cosine distance and the fusion gate vector;
and determining the predicted word at the target time according to the fusion vector.
12. The text normalization method according to claim 9, wherein the determining the prediction loss of the text normalization model according to the labeled text corresponding to the training text, the predicted normalized text, and the mask vector of the labeled text corresponding to the training text comprises:
determining a prediction error rate according to the words that do not need to be replaced in the labeled text corresponding to the training text and the predicted words corresponding to those words;
determining the entropy of the probability distributions of the predicted words corresponding to the words that need to be replaced in the labeled text corresponding to the training text;
determining the prediction loss of the text normalization model according to the prediction error rate and the entropy of the probability distributions of the predicted words corresponding to the words that need to be replaced in the labeled text;
wherein the words that do not need to be replaced and the words that need to be replaced in the labeled text are determined according to the mask vector of the labeled text corresponding to the training text.
13. A text normalization apparatus, comprising: a text acquisition module, a feature extraction module, and a text normalization module;
the text acquisition module is used for acquiring a text to be normalized;
the feature extraction module is used for extracting text normalization features from the text to be normalized, wherein the text normalization features comprise semantic features capable of representing the semantics of the text to be normalized and generalization features capable of representing repeated parts in the text to be normalized;
and the text normalization module is used for determining a normalized text corresponding to the text to be normalized by using the text normalization features and a pre-established text normalization model.
14. A text normalization device, comprising: a memory and a processor;
the memory is used for storing a program;
and the processor is used for executing the program to implement the steps of the text normalization method according to any one of claims 1 to 12.
15. A readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the text normalization method according to any one of claims 1 to 12.
CN201911017291.4A 2019-10-24 2019-10-24 Text normalization method, device, equipment and storage medium Pending CN110765733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911017291.4A CN110765733A (en) 2019-10-24 2019-10-24 Text normalization method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110765733A true CN110765733A (en) 2020-02-07

Family

ID=69333368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911017291.4A Pending CN110765733A (en) 2019-10-24 2019-10-24 Text normalization method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110765733A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162111A1 (en) * 2006-12-28 2008-07-03 Srinivas Bangalore Sequence classification for machine translation
US20150112679A1 (en) * 2013-10-18 2015-04-23 Via Technologies, Inc. Method for building language model, speech recognition method and electronic apparatus
US20170351953A1 (en) * 2016-06-06 2017-12-07 Verisign, Inc. Systems, devices, and methods for improved affix-based domain name suggestion
CN107590121A (en) * 2016-07-08 2018-01-16 科大讯飞股份有限公司 Text-normalization method and system
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
US20180300295A1 (en) * 2017-04-14 2018-10-18 Digital Genius Limited Automated tagging of text
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report
CN110135446A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Method for text detection and computer storage medium
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction
CN110196894A (en) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 The training method and prediction technique of language model

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401032A (en) * 2020-03-09 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN111401032B (en) * 2020-03-09 2023-10-27 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN111950453A (en) * 2020-08-12 2020-11-17 北京易道博识科技有限公司 Optional-shape text recognition method based on selective attention mechanism
CN111950453B (en) * 2020-08-12 2024-02-13 北京易道博识科技有限公司 Random shape text recognition method based on selective attention mechanism
WO2022141855A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Text regularization method and apparatus, and electronic device and storage medium
US11842159B1 (en) * 2021-03-16 2023-12-12 Amazon Technologies, Inc. Interpreting a text classifier
RU2769427C1 (en) * 2021-04-05 2022-03-31 Анатолий Владимирович Буров Method for automated analysis of text and selection of relevant recommendations to improve readability thereof
CN113822019A (en) * 2021-09-22 2021-12-21 科大讯飞股份有限公司 Text normalization method, related equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110765733A (en) Text normalization method, device, equipment and storage medium
CN106973244B (en) Method and system for automatically generating image captions using weak supervision data
Mikolov et al. Advances in pre-training distributed word representations
CN110298019B (en) Named entity recognition method, device, equipment and computer readable storage medium
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
Chen et al. Joint learning of character and word embeddings
GB2547068B (en) Semantic natural language vector space
US9792534B2 (en) Semantic natural language vector space
CN106502985B (en) neural network modeling method and device for generating titles
US20180095946A1 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
CN108665506B (en) Image processing method, image processing device, computer storage medium and server
CN109960747B (en) Video description information generation method, video processing method and corresponding devices
CN111274394A (en) Method, device and equipment for extracting entity relationship and storage medium
CN111274829B (en) Sequence labeling method utilizing cross-language information
US20240005093A1 (en) Device, method and program for natural language processing
CN110309282B (en) Answer determination method and device
Arumugam et al. Hands-On Natural Language Processing with Python: A practical guide to applying deep learning architectures to your NLP applications
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
WO2015166606A1 (en) Natural language processing system, natural language processing method, and natural language processing program
CN114254637A (en) Summary generation method, device, equipment and storage medium
RU2712101C2 (en) Prediction of probability of occurrence of line using sequence of vectors
CN113705315A (en) Video processing method, device, equipment and storage medium
Noaman et al. Enhancing recurrent neural network-based language models by word tokenization
CN111859950A (en) Method for automatically generating lecture notes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination