CN112214965A - Case normalization method and apparatus, electronic device, and storage medium - Google Patents

Case normalization method and apparatus, electronic device, and storage medium

Info

Publication number
CN112214965A
CN112214965A (application CN202011134242.1A)
Authority
CN
China
Prior art keywords
case
participle
text
structured
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011134242.1A
Other languages
Chinese (zh)
Inventor
Ting Qi (戚婷)
Genshun Wan (万根顺)
Jianqing Gao (高建清)
Cong Liu (刘聪)
Zhiguo Wang (王智国)
Guoping Hu (胡国平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202011134242.1A priority Critical patent/CN112214965A/en
Publication of CN112214965A publication Critical patent/CN112214965A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/109Font handling; Temporal or kinetic typography
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Abstract

An embodiment of the invention provides a case normalization method and apparatus. The method includes: inputting a text to be normalized into a case normalization model to obtain the case format type of each segmented word in the text, as output by the model; and normalizing the text based on the case format type of each segmented word, obtaining the normalized text corresponding to the text to be normalized. The case normalization model is trained on sample texts to be normalized and the sample case format type of each segmented word in those texts. The model determines a context semantic representation and a case conversion coefficient for each segmented word in the text to be normalized, and determines the case format type of each segmented word from that representation and coefficient. The method and apparatus broaden the range of texts to which case normalization can be applied and improve its accuracy.

Description

Case normalization method and apparatus, electronic device, and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a case normalization method and apparatus, an electronic device, and a storage medium.
Background
The language habits and grammar conventions of certain languages require some words to take different case formats in different contexts. However, both text transcribed by speech recognition and manually written or edited text often fail to use the correct case format for a given word in a given context. Such text therefore needs case normalization so that it conforms to the habits and grammar conventions of the language in which it is written.
Current case normalization methods usually perform case conversion based on preset rules and a replacement word list. However, such methods are limited by the finite replacement list and thus have a narrow range of application; moreover, because words occur in many forms, such as singular/plural and possessive forms, the replacement list cannot cover every form of every word, so generalization is poor. In addition, simple replacement-based normalization can easily change a sentence's meaning, leaving the normalized text semantically wrong.
Disclosure of Invention
Embodiments of the invention provide a case normalization method and apparatus, an electronic device, and a storage medium, to overcome the narrow applicability and poor accuracy of case normalization in the prior art.
An embodiment of the invention provides a case normalization method, including the following steps:
inputting a text to be normalized into a case normalization model to obtain the case format type of each segmented word in the text, as output by the model;
normalizing the text based on the case format type of each segmented word, to obtain the normalized text corresponding to the text to be normalized;
wherein the case normalization model is trained on sample texts to be normalized and the sample case format type of each segmented word in those texts;
and wherein the model determines a context semantic representation and a case conversion coefficient for each segmented word in the text to be normalized, and determines the case format type of each segmented word from that representation and coefficient.
According to one embodiment, inputting the text to be normalized into the case normalization model to obtain the case format type of each segmented word specifically includes:
inputting each segmented word of the text into the context semantic representation layer of the model, to obtain the context semantic representation of each segmented word output by that layer;
inputting the context semantic representation of each segmented word into the case-conversion-coefficient calculation layer of the model, to obtain the case conversion coefficient of each segmented word output by that layer;
and inputting the context semantic representation and the case conversion coefficient of each segmented word into the sequence labeling layer of the model, to obtain the case format type of each segmented word output by that layer.
According to one embodiment, inputting each segmented word into the context semantic representation layer to obtain its context semantic representation specifically includes:
inputting each character of a given segmented word into the character encoding layer of the context semantic representation layer, to obtain the character encoding of each character output by that layer;
inputting the character encodings of the word's characters into the pooling layer of the context semantic representation layer, to obtain the word's pooled vector output by that layer;
and inputting the pooled vector of each segmented word into the context semantic extraction layer of the context semantic representation layer, to obtain the context semantic representation of each segmented word output by that layer.
According to one embodiment, inputting the context semantic representation and the case conversion coefficient of each segmented word into the sequence labeling layer to obtain its case format type specifically includes:
inputting each segmented word of the text into the label vector representation layer of the sequence labeling layer, to obtain the label vector representation of each segmented word output by that layer;
and inputting the context semantic representation, the label vector representation, and the case conversion coefficient of each segmented word into the label prediction layer of the sequence labeling layer, to obtain the case format type of each segmented word output by that layer.
According to one embodiment, the loss function of the case normalization model includes a case-conversion-coefficient loss function and a sequence labeling loss function;
the case-conversion-coefficient loss function serves to maximize the case conversion coefficients of sample words labeled uppercase and to minimize the case conversion coefficients of sample words labeled lowercase.
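As a concrete illustration, the coefficient loss can be sketched as follows. This is a minimal assumed formulation (the patent does not give the exact loss); the function name and the linear penalty are illustrative only: coefficients of words labeled uppercase (1) are pushed toward 1, and those labeled lowercase (0) toward 0.

```python
import numpy as np

def conversion_coefficient_loss(coeffs, labels):
    """Sketch of the case-conversion-coefficient loss: encourage large
    coefficients for words labeled uppercase (1) and small coefficients
    for words labeled lowercase (0)."""
    coeffs = np.asarray(coeffs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Maximizing coeffs where label == 1 is equivalent to minimizing
    # (1 - coeff); coeffs where label == 0 are minimized directly.
    return float(np.mean(labels * (1.0 - coeffs) + (1.0 - labels) * coeffs))
```

With perfectly separated coefficients the loss is zero; with coefficients exactly inverted it reaches its maximum of one.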
According to one embodiment, the loss function of the case normalization model further includes a sentence-meaning similarity loss function;
the sentence-meaning similarity loss function serves to make the normalized sample text as close in meaning as possible to the corresponding sample text to be normalized;
the sentence-meaning similarity is determined from the sentence-meaning feature representations of the sample text to be normalized and of the normalized sample text;
where each sentence-meaning feature representation is determined from the context semantic representations of the segmented words in the corresponding text.
According to one embodiment, normalizing the text to be normalized based on the case format type of each segmented word specifically includes:
if the case format type of a segmented word is uppercase, determining the word's normalization mode from a preset lowercase-to-uppercase mapping, where the normalization mode is either all-letters uppercase or first-letter uppercase.
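The lowercase-to-uppercase mapping can be illustrated with a toy table. Every entry and the helper name below are hypothetical, not taken from the patent; the table simply shows the two normalization modes (all-letters uppercase vs. first-letter uppercase):

```python
# Hypothetical lowercase-to-uppercase mapping; entries are illustrative.
CAPITALIZATION_MODE = {
    "nasa": "NASA",    # all-letters uppercase (acronym)
    "it": "IT",        # all-letters uppercase (information technology)
    "john": "John",    # first-letter uppercase
    "china": "China",  # first-letter uppercase
}

def capitalize(word):
    """Apply the preset conversion for a word predicted uppercase; fall
    back to first-letter capitalization for unlisted words."""
    return CAPITALIZATION_MODE.get(word, word.capitalize())
```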
An embodiment of the invention further provides a case normalization apparatus, including:
a case label determination unit for inputting the text to be normalized into the case normalization model to obtain the case format type of each segmented word in the text, as output by the model;
and a case normalization unit for normalizing the text based on the case format type of each segmented word, to obtain the normalized text corresponding to the text to be normalized;
wherein the case normalization model is trained on sample texts to be normalized and the sample case format type of each segmented word in those texts;
and wherein the model determines a context semantic representation and a case conversion coefficient for each segmented word, and determines the case format type of each segmented word from that representation and coefficient.
An embodiment of the invention further provides an electronic device including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, it implements the steps of any of the case normalization methods described above.
An embodiment of the invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the case normalization methods described above.
In the case normalization method and apparatus, electronic device, and storage medium provided by embodiments of the invention, the case normalization model determines a context semantic representation and a case conversion coefficient for each segmented word in the text to be normalized, and determines each word's case format type from them. The text is thus normalized without any preset replacement word list, which broadens the method's range of application. At the same time, the model fully exploits the context of the text to be normalized, so that each normalized word fits the overall context, improving normalization accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a case normalization method according to an embodiment of the invention;
FIG. 2 is a flow chart of the operation of a case normalization model according to an embodiment of the invention;
FIG. 3 is a flow chart of a context semantic representation method according to an embodiment of the invention;
FIG. 4 is a flow chart of a sequence labeling method according to an embodiment of the invention;
FIG. 5 is a flow chart of a case format type determination method according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a case normalization apparatus according to an embodiment of the invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Certain languages have habits and grammar conventions that require particular words to take different case formats in different contexts. For example, whether the English words "apple"/"Apple" and "it"/"IT", or the French words "français"/"Français", should be written in lowercase or uppercase can only be decided from the specific context. Likewise, some words take the uppercase format when used as proper nouns and the lowercase format otherwise: the English word "rose" is written "Rose" when used as a person's name and "rose" when it refers to the flower. However, both text transcribed by speech recognition and manually written or edited text often fail to use the correct case format for a given word in a given context. Such text therefore needs case normalization so that it conforms to the habits and grammar conventions of its language.
Current case normalization methods usually perform case conversion using preset rules plus a replacement word list. A typical preset rule splits the text to be normalized into sentences and converts the first letter of each sentence's first word to uppercase. The replacement list stores mappings from the all-lowercase form of certain words to a fixed uppercase form, such as english → English; the text to be normalized is then matched against the list, and matched words are converted to the corresponding uppercase form.
However, such a method is limited to its finite replacement list: it can convert only a small number of words, so its range of application is narrow. Moreover, because words occur in many forms, such as singular/plural and possessive forms, the list cannot cover every form of every word, so generalization is poor. For example, if the list contains only the mapping john → John, then when "john's" occurs in the text the match fails and no conversion is performed. In addition, simple replacement converts every matched word without considering the meaning of the whole text; if a word's uppercase and lowercase forms differ greatly in meaning, the sentence meaning changes and the normalized text becomes wrong. For example, normalizing the text "get out your best china and crystal" with the mapping china → China yields "get out your best China and crystal", changing the word's meaning from "porcelain" to the country "China" and corrupting the meaning of the whole sentence.
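The limitations just described are easy to reproduce with a minimal replacement-list baseline. The word list and function below are illustrative stand-ins for the prior-art approach, not the patent's method:

```python
# Toy replacement word list for the prior-art baseline (illustrative).
REPLACEMENTS = {"english": "English", "john": "John", "china": "China"}

def rule_based_truecase(text):
    """Prior-art style normalization: capitalize the sentence start and
    replace exact matches from a fixed word list."""
    words = text.split()
    out = []
    for i, w in enumerate(words):
        if i == 0:
            w = w.capitalize()  # preset rule: sentence-initial capital
        out.append(REPLACEMENTS.get(w, w))
    return " ".join(out)
```

The baseline misses "john's" (no exact match, poor generalization) and wrongly converts "china" to "China" in "get out your best china and crystal" (sentence-meaning error), exactly the two failure modes described above.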
In view of this, an embodiment of the invention provides a case normalization method. FIG. 1 is a flow chart of the method; as shown in FIG. 1, the method includes:
Step 110: input the text to be normalized into a case normalization model to obtain the case format type of each segmented word in the text, as output by the model.
The case normalization model is trained on sample texts to be normalized and the sample case format type of each segmented word in those texts.
The model determines a context semantic representation and a case conversion coefficient for each segmented word in the text to be normalized, and determines the case format type of each segmented word from that representation and coefficient.
Specifically, the text to be normalized is any text requiring case normalization, written in a language, such as English or French, that has conventions governing word case. It may be a transcript produced by speech recognition of a user's speech data, or manually written and edited text; it may be general-domain text or text customized to an application, which the embodiments of the invention do not limit.
Before the text is input into the case normalization model, it can be preprocessed: the text is split into clauses at terminal punctuation, and each clause is then segmented into words at its internal and final punctuation marks and spaces, yielding the segmented words of the text. To simplify subsequent normalization, the segmented words can be converted to all-lowercase. Note that, by convention, the first word of each clause is normally capitalized. The first letter of each clause's first word can therefore either be converted to uppercase during preprocessing and left untouched afterwards, or, after all words have been uniformly lowercased, be converted to uppercase during the subsequent normalization step.
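A minimal sketch of this preprocessing, under the assumption that clauses end at `.`, `!`, or `?` and that punctuation marks become separate tokens (as in the worked examples later in the text):

```python
import re

def preprocess(text):
    """Sketch of the preprocessing described above: split into clauses at
    terminal punctuation, tokenize on spaces/punctuation, lowercase."""
    clauses = [c.strip() for c in re.split(r'(?<=[.!?])\s+', text) if c.strip()]
    tokenized = []
    for clause in clauses:
        # Keep punctuation as separate tokens; apostrophes stay inside words.
        tokens = re.findall(r"[\w']+|[^\w\s]", clause)
        tokenized.append([t.lower() for t in tokens])
    return tokenized
```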
The text to be normalized is input into the case normalization model to determine the case format type of each of its segmented words, where a word's case format type indicates whether, in the context of this text, the word should take the uppercase or the lowercase format. Specifically, the model determines the context semantic representation and the case conversion coefficient of each segmented word from the context of the text. Here, a word's context semantic representation encodes the semantic information of the word and its context, and its case conversion coefficient represents the likelihood that the word needs case conversion in this context. The model then determines and outputs each word's case format type from its context semantic representation and case conversion coefficient.
Because a word's context semantic representation reflects, to a degree, the meaning the word should carry in the context of the text, and because its uppercase and lowercase forms may carry different meanings, the word's case format can be determined from that representation. For example, from the context semantic representation of "china" in the text "get out your best china and crystal", the model can infer that the word here likely means "porcelain" and therefore assign it the lowercase format. Conversely, from the context semantic representation of "rose" in the text "fortunately, rose has been rescued", the model can infer that the word is likely a person's name and assign it the uppercase format. The case conversion coefficient reflects how likely the word is to need conversion in this context, so by combining each word's context semantic representation with its conversion coefficient, the model can accurately determine each word's case format type, making each normalized word fit the overall context and avoiding sentence-meaning errors in the normalized text.
Before step 110 is executed, the case normalization model can be obtained by pre-training as follows: first, collect a large number of sample texts to be normalized together with their corresponding correctly cased sample texts; after preprocessing each (sentence splitting, word segmentation, lowercasing), label the sample case format type of each segmented word in the sample text to be normalized.
Optionally, the sample case format type can be 1 or 0, where 1 means the sample word takes the uppercase format and 0 the lowercase format. In particular, since the first word of each clause is always capitalized by rule, its sample case format type can simply be labeled 0. For example, after word segmentation and lowercasing, the sample text "We have proposed a new model" yields the word sequence "we/have/proposed/a/new/model", whose corresponding sample label sequence is "000000". Likewise, the sample text "However, IT is a very hard job!" yields the word sequence "however/,/it/is/a/very/hard/job/!", whose corresponding sample label sequence is "001000000".
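The labeling scheme above can be sketched as a small helper that derives labels from correctly cased tokens. The function name is illustrative; clause-initial tokens are labeled 0 because their capitalization is handled by rule:

```python
def make_labels(normalized_tokens):
    """Derive sample case-format labels from correctly cased tokens:
    1 = uppercase format, 0 = lowercase format. The clause-initial token
    is labeled 0 because its capital letter is restored by rule."""
    labels = []
    for i, tok in enumerate(normalized_tokens):
        if i == 0 or tok == tok.lower():
            labels.append(0)  # rule-handled, punctuation, or lowercase
        else:
            labels.append(1)  # any token containing an uppercase letter
    return labels
```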
The initial model is then trained on the sample texts to be normalized and the sample case format types of their segmented words, yielding the case normalization model. Optionally, during training the model parameters can be updated by batch, mini-batch, or stochastic gradient descent.
Step 120: normalize the text to be normalized based on the case format type of each segmented word, obtaining the corresponding normalized text.
Specifically, words whose case format type is uppercase are converted to the uppercase format, and words whose type is lowercase are converted to the lowercase format, yielding the normalized text corresponding to the text to be normalized. If the segmented words were already converted to all-lowercase during preprocessing, only the uppercase conversions need to be performed at this stage.
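A minimal sketch of this conversion step, assuming the tokens were lowercased during preprocessing; the optional `uppercase_map` argument is a hypothetical table for full-word capitalizations such as acronyms:

```python
def apply_labels(tokens, labels, uppercase_map=None):
    """Convert tokens whose predicted label is 1 (uppercase format);
    tokens are assumed already lowercased by preprocessing."""
    uppercase_map = uppercase_map or {}
    out = []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if i == 0 and tok.isalpha():
            tok = tok.capitalize()  # clause-initial rule
        elif lab == 1:
            tok = uppercase_map.get(tok, tok.capitalize())
        out.append(tok)
    return out
```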
In the method provided by this embodiment, the case normalization model determines the context semantic representation and case conversion coefficient of each segmented word in the text to be normalized and derives each word's case format type from them, so the text is normalized without any preset replacement word list, broadening the method's range of application. At the same time, the model fully exploits the context of the text, so each normalized word fits the overall context, improving normalization accuracy.
Based on the foregoing embodiment, FIG. 2 is a flow chart of the operation of the case normalization model; as shown in FIG. 2, step 110 specifically includes:
Step 111: input each segmented word of the text to be normalized into the context semantic representation layer of the model, obtaining the context semantic representation of each word output by that layer.
Specifically, the context semantic representation layer extracts, for each segmented word, the semantic information of the word and its context, producing the word's context semantic representation. This layer can be built on a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), or a variant thereof, which the embodiments of the invention do not limit.
Step 112: input the context semantic representation of each segmented word into the case-conversion-coefficient calculation layer of the model, obtaining the case conversion coefficient of each word output by that layer.
Specifically, considering that the more a word influences the meaning of the whole text, the more likely it is to require conversion to an uppercase form, the model includes a case-conversion-coefficient calculation layer that estimates, from each word's context semantic representation, the degree to which the word influences the sentence meaning, and uses this as the word's case conversion coefficient. Optionally, an attention mechanism can be used to compute the attention weight of each segmented word in the text: the larger the weight, the more important the word is in the text and the more its semantic information influences the sentence meaning. The attention weight of a word can therefore serve as its case conversion coefficient. Specifically, the attention weight of segmented word i can be computed by the following formulas:
e_i = q^T tanh(W h_i + b)
a_i = exp(e_i) / Σ_{j=1}^{n} exp(e_j)
where h_i is the context semantic representation of segmented word i, a_i is its attention weight, n is the number of segmented words in the text to be normalized, and W, b, and q are parameters learned by the case normalization model: q is the query vector of the attention mechanism, and W and b are its weight matrix and bias.
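The attention computation above can be sketched directly in NumPy. The shapes are assumptions consistent with the formula: `H` holds the n context representations h_i as rows, and `W`, `b`, `q` are the learned attention parameters:

```python
import numpy as np

def attention_weights(H, W, b, q):
    """Compute a_i = softmax_i(q^T tanh(W h_i + b)) over the n word
    representations in H (shape n x d); W is k x d, b and q are length k."""
    scores = np.tanh(H @ W.T + b) @ q  # unnormalized scores e_i, shape (n,)
    scores = scores - scores.max()     # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()                 # softmax-normalized weights a_i
```

The weights are positive and sum to one, so they can be used directly as case conversion coefficients.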
And step 113, inputting the context semantic representation and case conversion coefficient of each participle into a sequence labeling layer of a case and case regularization model to obtain the case and case format type of each participle output by the sequence labeling layer.
Specifically, the sequence labeling layer is configured to perform sequence labeling on each segmented word in the text to be normalized based on the context semantic representation and the case-case conversion coefficient of each segmented word, so as to obtain the case-case format type of each segmented word. The sequence labeling layer may be constructed based on any sequence labeling model, such as a Conditional Random Field (CRF) model, an LSTM model, and variations thereof, which is not specifically limited in this embodiment of the present invention.
The method provided by the embodiment of the invention extracts the context semantic representation of each participle in the text to be normalized and determines the case format type of each participle by combining the context semantic representation and the case conversion coefficient of each participle, thereby fully considering the degree of influence of each participle on the sentence meaning and improving the accuracy of case regularization.
Based on any of the above embodiments, fig. 3 is a flowchart illustrating a context semantic representation method provided by the embodiment of the present invention, as shown in fig. 3, step 111 specifically includes:
step 1111, inputting each character in any participle into a character coding layer of the context semantic representation layer to obtain a character code of each character in the participle output by the character coding layer;
step 1112, inputting the character code of each character in the participle into the pooling layer of the context semantic representation layer to obtain the pooling vector of the participle output by the pooling layer;
and 1113, inputting the pooled vector of each participle into a context semantic extraction layer of the context semantic expression layer to obtain the context semantic expression of each participle output by the context semantic extraction layer.
Specifically, considering that participles have rich morphological variations, such as singular and plural forms, possessive cases, tenses and the like, each character of any participle in the text to be normalized is input to the character coding layer to extract and encode the semantic information of each character in the participle, so as to obtain the character codes of the characters of the participle; in this way, the semantic information of the participle under its various morphological variations can be accurately extracted. The character coding layer may be constructed based on models such as CNN, LSTM, or Bi-directional Long Short-Term Memory (Bi-LSTM), and the embodiment of the present invention is not limited in this respect.
Then, the character code of each character in the participle is input to the pooling layer of the context semantic representation layer, so that the character codes of all characters in the participle are integrated and compressed into a vector with a fixed length, namely the pooling vector of the participle. Alternatively, the pooling layer may adopt a mean pooling manner or a maximum pooling manner, which is not particularly limited in the embodiment of the present invention.
Then, the context semantic extraction layer extracts each participle and semantic information of the context thereof based on the pooling vector of each participle to obtain the context semantic representation of each participle. The context semantic extraction layer may be constructed based on models such as CNN, LSTM, or BiLSTM, which is not specifically limited in this embodiment of the present invention.
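The three steps above (character coding, pooling, context extraction) can be sketched as follows. This is a toy NumPy stand-in under assumed shapes: random character embeddings replace the character coding layer, and a neighbour-window mix replaces the CNN/LSTM/BiLSTM layers the text actually allows:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical character embedding table standing in for the character coding layer.
EMB = {c: rng.normal(size=8) for c in "abcdefghijklmnopqrstuvwxyz"}

def char_encode(word):
    """Character coding layer: one code vector per character of the participle."""
    return np.stack([EMB[c] for c in word.lower()])

def pool(char_codes):
    """Pooling layer: compress variable-length character codes to one fixed vector."""
    return char_codes.mean(axis=0)   # mean pooling; max pooling is also allowed

def context_extract(P):
    """Toy context semantic extraction: mix each pooled vector with its
    neighbours (a real model would use CNN/LSTM/BiLSTM here)."""
    n = len(P)
    return np.stack([
        np.tanh(P[i] + 0.5 * P[max(0, i - 1)] + 0.5 * P[min(n - 1, i + 1)])
        for i in range(n)
    ])

words = ["we", "have", "proposed", "a", "new", "model"]
P = np.stack([pool(char_encode(w)) for w in words])  # one pooling vector per participle
H = context_extract(P)                               # one context representation per participle
```

Regardless of word length, every participle ends up as one fixed-width pooling vector, which is what lets the context layer operate over a uniform sequence.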
According to the method provided by the embodiment of the invention, the character code of each character in any participle is extracted, and the pooling vector of the participle is determined based on the character code of each character in any participle, so that the context semantic representation of each participle is extracted, and the accuracy of the context semantic representation is improved.
Based on any of the above embodiments, fig. 4 is a schematic flow chart of a sequence labeling method provided in the embodiment of the present invention, and as shown in fig. 4, step 113 specifically includes:
step 1131, inputting each participle in the text to be normalized into a sequence labeling vector representation layer of the sequence labeling layer, so as to obtain the sequence labeling vector representation of each participle output by the sequence labeling vector representation layer.
Specifically, since the context semantic representation of each participle is used to determine the case conversion coefficient of each participle, the context semantic representation layer, when extracting the context semantic representation of any participle, may pay more attention to the importance of the participle in the text to be normalized, that is, the degree of influence of the participle's semantic information on the sentence meaning. However, the focus of the sequence labeling task is to determine the case format type of each participle. The attention point of the context semantic representation layer therefore differs from that of the sequence labeling task, and sequence labeling performed only on the context semantic representations may perform poorly. Therefore, to improve the accuracy of sequence labeling, a sequence labeling vector representation layer is arranged in the sequence labeling layer to extract the sequence labeling vector representation of each participle in the text to be normalized. The sequence labeling vector representation of any participle also represents the semantic information of the participle and its context, but because the attention point of the sequence labeling vector representation layer is sequence labeling, the extracted representation is better suited to the sequence labeling task than the context semantic representation is. Optionally, the sequence labeling vector representation layer may adopt the same or a similar structure as the context semantic representation layer.
Step 1132, the context semantic representation, the sequence labeling vector representation and the case conversion coefficient of each participle are input to a label prediction layer of the sequence labeling layer, and the case format type of each participle output by the label prediction layer is obtained.
Specifically, the label prediction layer is used for determining the case format type of each participle based on the context semantic representation, the sequence labeling vector representation and the case conversion coefficient of each participle. The context semantic representation, the sequence labeling vector representation and the case conversion coefficient of any participle can be fused, and label prediction is carried out based on the fused result to obtain the case format type of the participle. Optionally, the context semantic representation and the sequence labeling vector representation of any participle may be fused first, the fused vector is subjected to nonlinear transformation, and then the fused vector is fused with the case transformation coefficient of the participle, so as to perform label prediction based on the fusion result. For example, the following formula can be used to perform label prediction for any segmented word:
$$y_t = \mathrm{softmax}\left(W_s \left(a_t \cdot \tanh\left(W_h \left[\tilde{h}_t; h_t\right] + b_h\right)\right)\right)$$

where, for the t-th participle in the text to be normalized, $\tilde{h}_t$ and $h_t$ are the sequence labeling vector representation and the context semantic representation of the participle, $[\tilde{h}_t; h_t]$ denotes their concatenation, $a_t$ is the case conversion coefficient of the participle, $W_s$, $W_h$ and $b_h$ are learnable parameters of the case regularization model, and $y_t$ is the case format type of the participle.
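A minimal NumPy sketch of this fusion step, with hypothetical dimensions and two labels (lowercase / capitalize); reading the label off a softmax distribution is an assumption of this sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_label(h_seq, h_ctx, a_t, Ws, Wh, bh):
    """Concatenate the sequence-labeling and context representations, apply a
    nonlinear transform, scale by the case conversion coefficient a_t, then
    project to label probabilities."""
    fused = np.tanh(Wh @ np.concatenate([h_seq, h_ctx]) + bh)
    return softmax(Ws @ (a_t * fused))

rng = np.random.default_rng(2)
d = 6
h_seq, h_ctx = rng.normal(size=d), rng.normal(size=d)
Ws = rng.normal(size=(2, d))        # projection to 2 labels
Wh = rng.normal(size=(d, 2 * d))    # fusion weight matrix
bh = rng.normal(size=d)             # fusion bias
y = predict_label(h_seq, h_ctx, 0.3, Ws, Wh, bh)  # label distribution
```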
The method provided by the embodiment of the invention extracts the sequence labeling vector representation of each participle in the text to be normalized and determines the case format type of each participle based on the context semantic representation, the sequence labeling vector representation and the case conversion coefficient of each participle, thereby improving the accuracy of sequence labeling.
Based on any of the above embodiments, the loss function of the case regularization model includes a case transform coefficient loss function and a sequence labeling loss function;
the case conversion coefficient loss function is used for maximizing the case conversion coefficient of the sample word with the upper case label and minimizing the dispersion degree of the case conversion coefficient of the sample word with the lower case label.
Specifically, in the case-structured model training process, the loss function includes two parts: case transform coefficient loss functions and sequence annotation loss functions. The sequence labeling loss function is used for reducing the difference between the case format type of each sample word segmentation in the sample text to be structured determined by the case-structured model and the sample case format type of each sample word segmentation in the sample text to be structured.
In addition, the higher the case conversion coefficients of the sample participles whose sample case format type is the uppercase format, the better. Meanwhile, sample participles whose sample case format type is the lowercase format need no capitalization conversion, so their degrees of influence on the sentence meaning can be regarded as roughly equal, and their case conversion coefficients should therefore be as uniform as possible. Accordingly, the embodiment of the present invention further sets a case conversion coefficient loss function, which maximizes the case conversion coefficients of the sample participles labeled uppercase and minimizes the dispersion of the case conversion coefficients of the sample participles labeled lowercase. For example, the case conversion coefficient loss function can be constructed using the following formula:
$$L_{coef} = -\sum_{i:\, lab_i = 1} \left| a_i \right| + \mathrm{Var}\left(\left\{ a_i \mid lab_i = 0 \right\}\right)$$

where the sample text to be normalized contains n sample participles, $lab_i \in \{0,1\}$ is the sample case format type of the i-th sample participle ($lab_i = 1$ indicates the uppercase format and $lab_i = 0$ the lowercase format), and $a_i$ is the case conversion coefficient of the i-th sample participle. The first term is the negated sum of the moduli of the case conversion coefficients of the sample participles whose sample case format type is the uppercase format, and the second term is the variance of the case conversion coefficients of the sample participles labeled lowercase.
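Under the reading above (reward coefficient mass on uppercase-labeled participles, penalize spread among lowercase-labeled ones), the loss can be sketched like this; the exact form in the patent's figure is not recoverable from the text, so this combination is an assumption:

```python
import numpy as np

def coef_loss(a, lab):
    """Case conversion coefficient loss: minimized when uppercase-labeled
    participles (lab_i = 1) carry large |a_i| and lowercase-labeled
    participles (lab_i = 0) have near-uniform coefficients."""
    a, lab = np.asarray(a, float), np.asarray(lab)
    cap_mass = np.abs(a[lab == 1]).sum()      # to be maximized
    low = a[lab == 0]
    spread = low.var() if low.size else 0.0   # to be minimized
    return -cap_mass + spread

# Weight concentrated on the capitalized word, flat elsewhere: low loss.
good = coef_loss([0.7, 0.1, 0.1, 0.1], [1, 0, 0, 0])
# Weight misplaced and uneven among lowercase words: higher loss.
bad = coef_loss([0.1, 0.5, 0.3, 0.1], [1, 0, 0, 0])
```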
Based on any of the above embodiments, the loss function of the case and case regularization model further includes a sentence meaning similarity loss function;
the sentence meaning similarity loss function is used for minimizing the sentence meaning similarity between the text to be structured of the sample and the text which is structured of the corresponding sample;
the sentence meaning similarity is determined based on the sentence meaning feature representation of the text to be structured of the sample and the sentence meaning feature representation of the text structured of the sample;
wherein, the sentence meaning characteristic representation is determined based on the context semantic representation of each participle in the corresponding text.
Specifically, if, in addition to the sentence-initial participle, the sample text contains sample participles whose sample case format type is the uppercase format, then the all-lowercase text obtained by converting those participles into the lowercase format generally differs greatly in semantics from the original, correctly cased sample normalized text. Therefore, in the embodiment of the present invention, the loss function of the case regularization model further includes a sentence meaning similarity loss function, which minimizes the sentence meaning similarity between the sample text to be normalized and the corresponding sample normalized text. The sentence meaning similarity is determined based on the sentence meaning feature representation of the sample text to be normalized and that of the sample normalized text, and each sentence meaning feature representation is determined based on the context semantic representations of the participles in the corresponding text. After training, the context semantic representation that the case regularization model obtains for any participle in a text to be normalized can thus reflect whether converting the participle into uppercase form would cause a large change in sentence meaning, which further improves the accuracy of the case conversion coefficients. Since the objective of the sentence meaning similarity loss function is to minimize the sentence meaning similarity between the sample text to be normalized and the corresponding sample normalized text, the loss function can be constructed using the following formula:
$$L_{sim} = \frac{S_{small} \cdot S_{ori}}{\left\| S_{small} \right\| \, \left\| S_{ori} \right\|}$$

where $S_{small}$ and $S_{ori}$ respectively denote the sentence meaning feature representation of the sample text to be normalized (the all-lowercase text) and of the sample normalized text, $\cdot$ denotes the vector dot product, and $\|\cdot\|$ denotes the vector norm.
Optionally, a sentence meaning similarity discrimination model can be constructed whose context semantic representation layer is shared with the case regularization model, and the two models are trained with shared parameters. Each sample participle in the sample text to be normalized can thus be input to the context semantic representation layer of the case regularization model to obtain the context semantic representation of each sample participle, and each sample normalized participle in the sample normalized text is likewise input to the same layer to obtain its context semantic representation. The sentence meaning similarity discrimination model then determines the sentence meaning feature representation of the sample text to be normalized and that of the sample normalized text from these two sets of context semantic representations, and computes the similarity between the two feature representations to obtain the sentence meaning similarity between the sample text to be normalized and the corresponding sample normalized text.
When determining the sentence meaning feature representation, the sentence meaning similarity discrimination model can fuse the context semantic representation of each participle in the corresponding text and then compress the text into a vector with fixed length. In order to highlight the participles which have a large influence on the sentence meaning in the corresponding text and weaken the interference caused by irrelevant participles, the attention weight of each participle in the corresponding text can be determined based on the attention mechanism, and then the context semantic representation of each participle is fused and compressed based on the attention weight of each participle to obtain the sentence meaning characteristic representation. For example, the sentence meaning characterization of the corresponding text may be determined using the following formula:
$$u_i = q^{\top} \tanh(W h_i + b)$$

$$a_i = \frac{\exp(u_i)}{\sum_{j=1}^{n} \exp(u_j)}$$

$$S = \sum_{i=1}^{n} a_i h_i$$

where the text contains n participles, $h_i$ is the context semantic representation of the i-th participle, $a_i$ is the attention weight of that participle, $S$ is the sentence meaning feature representation of the text, and $W$, $b$ and $q$ are learnable parameters in the case regularization model.
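Putting the two pieces together, an attention-pooled sentence feature fed into the cosine-style similarity above can be sketched in NumPy; the uniform attention weights and random representations are placeholders for illustration only:

```python
import numpy as np

def sentence_feature(H, a):
    """Attention-weighted sum of the participles' context semantic representations."""
    return (a[:, None] * H).sum(axis=0)

def similarity_loss(S_small, S_ori):
    """Cosine similarity between the two sentence feature representations;
    training drives this value down."""
    return (S_small @ S_ori) / (np.linalg.norm(S_small) * np.linalg.norm(S_ori))

rng = np.random.default_rng(3)
H_ori = rng.normal(size=(5, 8))     # representations from the normalized text
H_small = rng.normal(size=(5, 8))   # representations from the all-lowercase text
a = np.full(5, 0.2)                 # placeholder uniform attention weights
loss = similarity_loss(sentence_feature(H_small, a), sentence_feature(H_ori, a))
```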
Based on any of the embodiments, based on the case format type of each participle, the method for normalizing a text to be normalized specifically includes:
if the case format type of any participle is uppercase, determining the regularization way of the participle based on a preset capitalization conversion correspondence; wherein the regularization way is full-character capitalization or first-character capitalization.
Specifically, after the case format type of each participle is determined, if the case format type of any participle is uppercase, the participle needs to be converted into an uppercase format during regularization. There are, however, two possible regularization ways: full-character capitalization and first-character capitalization. To choose between them, a capitalization conversion correspondence may be constructed in advance, which records, for any participle, whether all of its characters or only its first character should be capitalized during capitalization conversion. The participle is then matched against the capitalization conversion correspondence to obtain its regularization way.
Optionally, since full-character-capitalized participles are few (for example, English abbreviations of some proper nouns, such as IT, APP and CT, appear in full-character-capitalized form), the capitalization conversion correspondence may be built either as a mapping between each full-character-capitalized participle and its lowercase form, or simply as a list of the lowercase forms of the full-character-capitalized participles; the embodiment of the present invention does not specifically limit this. If a participle exists in the capitalization conversion correspondence, its regularization way is full-character capitalization; if not, its regularization way is first-character capitalization.
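The lookup logic just described can be sketched as follows; the lexicon contents and label names are hypothetical examples, not an exhaustive correspondence:

```python
# Hypothetical capitalization conversion correspondence: the lowercase forms of
# participles that should be fully capitalized (e.g. acronyms).
ALL_CAPS = {"it", "app", "ct"}

def regularize(word, label):
    """Apply the regularization way: full-character capitalization if the word
    appears in the correspondence, first-character capitalization otherwise."""
    if label != "uppercase":
        return word
    return word.upper() if word in ALL_CAPS else word.capitalize()

examples = [regularize("it", "uppercase"),       # acronym: full-character caps
            regularize("beijing", "uppercase"),  # proper noun: first-character cap
            regularize("model", "lowercase")]    # lowercase label: unchanged
```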
Based on any of the above embodiments, fig. 5 is a schematic flow chart of a case format type determination method provided by an embodiment of the present invention, as shown in fig. 5, the method includes:
text to be structured is determined, such as "We have disposed a new model.
Inputting each character of each participle in the text to be structured into a character coding layer to obtain the character code of each character in each participle
Figure BDA0002736144710000181
Figure BDA0002736144710000182
And
Figure BDA0002736144710000183
wherein "We" is only exemplarily labeled in FIG. 5 ""have", "model", and ". -.
The character code of each character in each participle is input to the pooling layer to obtain the pooling vectors $p_1, p_2, \ldots, p_{n-1}, p_n$ of the participles output by the pooling layer.

The pooling vectors $p_1, p_2, \ldots, p_{n-1}, p_n$ of the participles are input to the context semantic extraction layer to obtain the context semantic representations $h_1, h_2, \ldots, h_{n-1}, h_n$ of the participles output by the context semantic extraction layer.

The context semantic representations $h_1, h_2, \ldots, h_{n-1}, h_n$ of the participles are input to the case conversion coefficient calculation layer to obtain the case conversion coefficients $a_1, a_2, \ldots, a_{n-1}, a_n$ of the participles output by the case conversion coefficient calculation layer.
Meanwhile, each character of each participle in the text to be normalized can be input to the sequence labeling vector representation layer: the sequence labeling character coding layer in the sequence labeling vector representation layer extracts the sequence labeling character codes of the characters in each participle (e.g. $\tilde{c}_i^1, \tilde{c}_i^2, \ldots$), the sequence labeling pooling layer determines the sequence labeling pooling vectors $\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_n$ of the participles, and the sequence labeling context semantic extraction layer then determines the sequence labeling vector representations $\tilde{h}_1, \tilde{h}_2, \ldots, \tilde{h}_n$ of the participles.
representing the context semantics of each participle by h1、h2、…、hn-1And hnSequence annotation vector representation
Figure BDA00027361447100001811
And
Figure BDA00027361447100001812
and case conversion coefficient a1、a2、…、an-1And anInputting the data into a label prediction layer to obtain the case format type l of each word segmentation output by the label prediction layer1、l2、…、ln-1And ln
The case regulating device provided by the embodiment of the present invention is described below, and the case regulating device described below and the case regulating method described above may be referred to correspondingly.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a case regulating device according to an embodiment of the present invention, and as shown in fig. 6, the device includes a case label determining unit 610 and a case regulating unit 620.
The case label determining unit 610 is configured to input the text to be structured to the case structured model, and obtain a case format type of each participle in the text to be structured output by the case structured model;
the case regularizing unit 620 is configured to regularize the text to be regularized based on the case format type of each segmented word to obtain a regularized text corresponding to the text to be regularized;
the case-and-case regularization model is obtained by training based on a sample text to be regularized and a sample case-and-case format type of each sample word in the sample text to be regularized;
the case regulation model is used for determining the context semantic representation and case conversion coefficient of each participle in the text to be regulated, and determining the case format type of each participle based on the context semantic representation and case conversion coefficient of each participle.
The device provided by the embodiment of the invention determines the context semantic representation and the case conversion coefficient of each participle in the text to be normalized through the case normalization model, and determines the case format type of each participle based on the context semantic representation and the case conversion coefficient of each participle, so that the case normalization is performed on the text to be normalized without presetting a replacement word list, and the application range of the case normalization method is expanded. Meanwhile, the case-structured model fully considers the context information of the text to be structured, so that each structured word is in accordance with the whole context of the text, and the accuracy of case-structured is improved.
Based on any of the above embodiments, the case label determining unit 610 specifically includes:
the context semantic representation unit is used for inputting each participle in the text to be structured into a context semantic representation layer of the case and case structured model to obtain the context semantic representation of each participle output by the context semantic representation layer;
the case and case conversion coefficient calculation unit is used for inputting the context semantic representation of each participle into a case and case conversion coefficient calculation layer of the case and case normalization model to obtain the case and case conversion coefficient of each participle output by the case and case conversion coefficient calculation layer;
and the sequence labeling unit is used for inputting the context semantic representation and the case conversion coefficient of each participle into a sequence labeling layer of the case-structured model to obtain the case format type of each participle output by the sequence labeling layer.
The device provided by the embodiment of the invention extracts the context semantic representation of each participle in the text to be normalized and determines the case format type of each participle by combining the context semantic representation and the case conversion coefficient of each participle, thereby fully considering the degree of influence of each participle on the sentence meaning and improving the accuracy of case regularization.
Based on any of the above embodiments, the context semantic representation unit specifically includes:
the character coding unit is used for inputting each character in any participle to a character coding layer of the context semantic representation layer to obtain the character code of each character in the participle output by the character coding layer;
the pooling unit is used for inputting the character code of each character in the participle into a pooling layer of the context semantic representation layer to obtain a pooling vector of the participle output by the pooling layer;
and the context semantic extraction unit is used for inputting the pooling vector of each participle into a context semantic extraction layer of the context semantic expression layer to obtain the context semantic expression of each participle output by the context semantic extraction layer.
The device provided by the embodiment of the invention extracts the character code of each character in any participle and determines the pooling vector of the participle based on the character code of each character in any participle, thereby extracting the context semantic representation of each participle and improving the accuracy of the context semantic representation.
Based on any of the above embodiments, the sequence labeling unit specifically includes:
the sequence labeling vector representation unit is used for inputting each participle in the text to be normalized into a sequence labeling vector representation layer of the sequence labeling layer to obtain the sequence labeling vector representation of each participle output by the sequence labeling vector representation layer;
and the label prediction unit is used for inputting the context semantic representation, the sequence labeling vector representation and the case and case conversion coefficient of each participle into a label prediction layer of the sequence labeling layer to obtain the case and case format type of each participle output by the label prediction layer.
The device provided by the embodiment of the invention extracts the sequence labeling vector representation of each participle in the text to be normalized and determines the case format type of each participle based on the context semantic representation, the sequence labeling vector representation and the case conversion coefficient of each participle, thereby improving the accuracy of sequence labeling.
Based on any of the above embodiments, the loss function of the case regularization model includes a case transform coefficient loss function and a sequence labeling loss function;
the case conversion coefficient loss function is used for maximizing the case conversion coefficient of the sample word with the upper case label and minimizing the dispersion degree of the case conversion coefficient of the sample word with the lower case label.
Based on any of the above embodiments, the loss function of the case and case regularization model further includes a sentence meaning similarity loss function;
the sentence meaning similarity loss function is used for minimizing the sentence meaning similarity between the text to be structured of the sample and the text which is structured of the corresponding sample;
the sentence meaning similarity is determined based on the sentence meaning feature representation of the text to be structured of the sample and the sentence meaning feature representation of the text structured of the sample;
wherein, the sentence meaning characteristic representation is determined based on the context semantic representation of each participle in the corresponding text.
Based on any of the embodiments, based on the case format type of each participle, the method for normalizing a text to be normalized specifically includes:
if the capital and lower case format type of any word segmentation is capital, determining the regular mode of the word segmentation based on the preset capital-to-capital conversion corresponding relation; wherein, the regular mode is full character capitalization or first character capitalization.
Fig. 7 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 7, the electronic device may include: a processor (processor) 710, a communication interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may call logic instructions in the memory 730 to perform a case regulating method comprising: inputting a text to be normalized to a case regularization model to obtain the case format type of each participle in the text to be normalized output by the case regularization model; normalizing the text to be normalized based on the case format type of each participle to obtain a normalized text corresponding to the text to be normalized; wherein the case regularization model is obtained by training based on a sample text to be normalized and the sample case format type of each sample participle in the sample text to be normalized; and the case regularization model is used for determining the context semantic representation and the case conversion coefficient of each participle in the text to be normalized, and determining the case format type of each participle based on the context semantic representation and the case conversion coefficient of each participle.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the case normalization method provided by the above method embodiments, the method comprising: inputting a text to be normalized into a case normalization model to obtain the case format type of each word segment in the text to be normalized output by the case normalization model; and normalizing the text to be normalized based on the case format type of each word segment to obtain a normalized text corresponding to the text to be normalized. The case normalization model is trained based on a sample text to be normalized and a sample case format type of each sample word segment in the sample text to be normalized; the case normalization model is used to determine a context semantic representation and a case conversion coefficient of each word segment in the text to be normalized, and to determine the case format type of each word segment based on the context semantic representation and the case conversion coefficient of each word segment.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program, where the computer program, when executed by a processor, performs the case normalization method provided in the foregoing embodiments, the method comprising: inputting a text to be normalized into a case normalization model to obtain the case format type of each word segment in the text to be normalized output by the case normalization model; and normalizing the text to be normalized based on the case format type of each word segment to obtain a normalized text corresponding to the text to be normalized. The case normalization model is trained based on a sample text to be normalized and a sample case format type of each sample word segment in the sample text to be normalized; the case normalization model is used to determine a context semantic representation and a case conversion coefficient of each word segment in the text to be normalized, and to determine the case format type of each word segment based on the context semantic representation and the case conversion coefficient of each word segment.
The above-described embodiments of the apparatus are merely illustrative. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly also by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk, or optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A case normalization method, comprising:
inputting a text to be normalized into a case normalization model to obtain a case format type of each word segment in the text to be normalized output by the case normalization model; and
normalizing the text to be normalized based on the case format type of each word segment to obtain a normalized text corresponding to the text to be normalized;
wherein the case normalization model is trained based on a sample text to be normalized and a sample case format type of each sample word segment in the sample text to be normalized;
and the case normalization model is used to determine a context semantic representation and a case conversion coefficient of each word segment in the text to be normalized, and to determine the case format type of each word segment based on the context semantic representation and the case conversion coefficient of each word segment.
2. The case normalization method according to claim 1, wherein inputting the text to be normalized into the case normalization model to obtain the case format type of each word segment in the text to be normalized output by the case normalization model specifically comprises:
inputting each word segment in the text to be normalized into a context semantic representation layer of the case normalization model to obtain the context semantic representation of each word segment output by the context semantic representation layer;
inputting the context semantic representation of each word segment into a case conversion coefficient calculation layer of the case normalization model to obtain the case conversion coefficient of each word segment output by the case conversion coefficient calculation layer; and
inputting the context semantic representation and the case conversion coefficient of each word segment into a sequence labeling layer of the case normalization model to obtain the case format type of each word segment output by the sequence labeling layer.
3. The case normalization method according to claim 2, wherein inputting each word segment in the text to be normalized into the context semantic representation layer of the case normalization model to obtain the context semantic representation of each word segment output by the context semantic representation layer specifically comprises:
inputting each character in any word segment into a character encoding layer of the context semantic representation layer to obtain a character code of each character in the word segment output by the character encoding layer;
inputting the character code of each character in the word segment into a pooling layer of the context semantic representation layer to obtain a pooling vector of the word segment output by the pooling layer; and
inputting the pooling vector of each word segment into a context semantic extraction layer of the context semantic representation layer to obtain the context semantic representation of each word segment output by the context semantic extraction layer.
4. The case normalization method according to claim 2, wherein inputting the context semantic representation and the case conversion coefficient of each word segment into the sequence labeling layer of the case normalization model to obtain the case format type of each word segment output by the sequence labeling layer specifically comprises:
inputting each word segment in the text to be normalized into a sequence labeling vector representation layer of the sequence labeling layer to obtain a sequence labeling vector representation of each word segment output by the sequence labeling vector representation layer; and
inputting the context semantic representation, the sequence labeling vector representation, and the case conversion coefficient of each word segment into a label prediction layer of the sequence labeling layer to obtain the case format type of each word segment output by the label prediction layer.
5. The case normalization method according to any one of claims 1 to 4, wherein the loss functions of the case normalization model include a case conversion coefficient loss function and a sequence labeling loss function;
the case conversion coefficient loss function is used to maximize the case conversion coefficients of sample word segments with uppercase labels and to minimize the dispersion of the case conversion coefficients of sample word segments with lowercase labels.
6. The case normalization method according to claim 5, wherein the loss functions of the case normalization model further include a sentence-meaning similarity loss function;
the sentence-meaning similarity loss function is used to maximize the sentence-meaning similarity between the sample text to be normalized and the corresponding normalized sample text;
the sentence-meaning similarity is determined based on the sentence-meaning feature representation of the sample text to be normalized and the sentence-meaning feature representation of the normalized sample text;
wherein each sentence-meaning feature representation is determined based on the context semantic representations of the word segments in the corresponding text.
7. The case normalization method according to any one of claims 1 to 4, wherein normalizing the text to be normalized based on the case format type of each word segment specifically comprises:
if the case format type of any word segment is uppercase, determining the normalization mode of the word segment based on a preset uppercase conversion correspondence, wherein the normalization mode is all-letters uppercase or first-letter uppercase.
8. A case normalization apparatus, comprising:
a case label determining unit, configured to input a text to be normalized into a case normalization model to obtain the case format type of each word segment in the text to be normalized output by the case normalization model; and
a case normalization unit, configured to normalize the text to be normalized based on the case format type of each word segment to obtain a normalized text corresponding to the text to be normalized;
wherein the case normalization model is trained based on a sample text to be normalized and a sample case format type of each sample word segment in the sample text to be normalized;
and the case normalization model is used to determine a context semantic representation and a case conversion coefficient of each word segment in the text to be normalized, and to determine the case format type of each word segment based on the context semantic representation and the case conversion coefficient of each word segment.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the case normalization method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the case normalization method according to any one of claims 1 to 7.
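The layered structure recited in claims 2 to 4 (a character encoding layer, a pooling layer, a context semantic extraction layer, a case conversion coefficient calculation layer, and a sequence labeling layer) can be sketched as a toy forward pass. This is an illustration only: the embedding scheme, dimensions, neighbor-averaging "context" step, and the final thresholding rule are all assumptions standing in for trained neural layers (e.g. an LSTM or Transformer encoder and a learned classifier), not the patent's implementation.

```python
import math

DIM = 8  # toy embedding dimension (assumption)

def char_encode(ch):
    """Character encoding layer: deterministic toy embedding per character."""
    return [math.sin(ord(ch) * (i + 1)) for i in range(DIM)]

def pool(vectors):
    """Pooling layer: mean-pool a list of vectors into one vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def context_extract(seg_vecs):
    """Context semantic extraction layer: average each segment vector with
    its neighbors (a crude stand-in for a recurrent/attention encoder)."""
    return [pool(seg_vecs[max(0, i - 1): i + 2]) for i in range(len(seg_vecs))]

def conversion_coefficient(vec):
    """Case conversion coefficient calculation layer: squash to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sum(vec)))

def label(ctx_vec, coeff, threshold=1.0):
    """Sequence labeling layer: toy rule combining the context semantic
    representation and the case conversion coefficient (a trained label
    prediction layer would replace this)."""
    return "UPPER" if sum(ctx_vec) / DIM + coeff > threshold else "LOWER"

def forward(segments):
    # word segment -> character codes -> pooling vector
    seg_vecs = [pool([char_encode(c) for c in seg]) for seg in segments]
    ctx = context_extract(seg_vecs)                       # context semantics
    coeffs = [conversion_coefficient(v) for v in ctx]     # conversion coeffs
    return [label(v, k) for v, k in zip(ctx, coeffs)]     # case format types

tags = forward(["i", "live", "in", "london"])
print(tags)  # one case format type per word segment
```

The point of the sketch is the data flow, one case format type per word segment, derived jointly from the context semantic representation and the case conversion coefficient, matching the decomposition in claims 2 to 4.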
CN202011134242.1A 2020-10-21 2020-10-21 Case regulating method and device, electronic equipment and storage medium Pending CN112214965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011134242.1A CN112214965A (en) 2020-10-21 2020-10-21 Case regulating method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011134242.1A CN112214965A (en) 2020-10-21 2020-10-21 Case regulating method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112214965A true CN112214965A (en) 2021-01-12

Family

ID=74056389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011134242.1A Pending CN112214965A (en) 2020-10-21 2020-10-21 Case regulating method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112214965A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN108829801A (en) * 2018-06-06 2018-11-16 大连理工大学 A kind of event trigger word abstracting method based on documentation level attention mechanism
CN109190131A (en) * 2018-09-18 2019-01-11 北京工业大学 A kind of English word and its capital and small letter unified prediction based on neural machine translation
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
CN110807324A (en) * 2019-10-09 2020-02-18 四川长虹电器股份有限公司 Video entity identification method based on IDCNN-crf and knowledge graph
CN111666758A (en) * 2020-04-15 2020-09-15 中国科学院深圳先进技术研究院 Chinese word segmentation method, training device and computer readable storage medium
CN111753532A (en) * 2020-06-29 2020-10-09 北京百度网讯科技有限公司 Western text error correction method and device, electronic equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KEITH RAYNER ET AL: "Semantic Preview Benefit in Reading English: The Effect of Initial Letter Capitalization", 《J EXP PSYCHOL HUM PERCEPT PERFORM》 *
ZHANG NAN ET AL: "A Joint Prediction Model of English Words and Their Case in Neural Machine Translation", 《Journal of Chinese Information Processing》, vol. 33, no. 3, pages 2 *

Similar Documents

Publication Publication Date Title
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN110532573B (en) Translation method and system
JP5901001B1 (en) Method and device for acoustic language model training
CN109685056B (en) Method and device for acquiring document information
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN111274394A (en) Method, device and equipment for extracting entity relationship and storage medium
CN110633577B (en) Text desensitization method and device
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN111046660B (en) Method and device for identifying text professional terms
JP5809381B1 (en) Natural language processing system, natural language processing method, and natural language processing program
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN111428485A (en) Method and device for classifying judicial literature paragraphs, computer equipment and storage medium
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
KR20230061001A (en) Apparatus and method for correcting text
CN113268576A (en) Deep learning-based department semantic information extraction method and device
CN113255331B (en) Text error correction method, device and storage medium
CN111597807A (en) Method, device and equipment for generating word segmentation data set and storage medium thereof
CN114722837A (en) Multi-turn dialog intention recognition method and device and computer readable storage medium
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN115906835A (en) Chinese question text representation learning method based on clustering and contrast learning
CN112214965A (en) Case regulating method and device, electronic equipment and storage medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230517

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.