CN113934823A - Statement rewriting method, device, system and storage medium - Google Patents


Info

Publication number
CN113934823A
Authority
CN
China
Prior art keywords
entity
model
rewriting
rewritten
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111280835.3A
Other languages
Chinese (zh)
Inventor
张晗
杜新凯
吕超
谷姗姗
黄莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Insurance Group Co Ltd filed Critical Sunshine Insurance Group Co Ltd


Classifications

    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/35 — Clustering; Classification
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/295 — Named entity recognition
    • G06F40/30 — Semantic analysis
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/084 — Backpropagation, e.g. using gradient descent

(All classifications fall under G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING, in the G06F Electric Digital Data Processing and G06N Computing Arrangements Based on Specific Computational Models subclasses.)


Abstract

The application provides a method, a device, a system and a storage medium for rewriting a sentence. The rewriting method comprises the following steps: acquiring at least one entity noun in a historical sentence according to a target entity recognition model; acquiring time information of each of the at least one entity noun; and inputting the sentence to be rewritten, the entity nouns and the time information into a target rewriting model, and acquiring the rewritten sentence output by the target rewriting model. Some embodiments of the application can complete the sentence to be rewritten according to the entity nouns extracted from the historical text and their generation times, effectively reducing the time consumed by rewriting and improving rewriting efficiency and quality.

Description

Statement rewriting method, device, system and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for rewriting a sentence.
Background
Multi-round interaction is widely used in the business scenarios of intelligent systems. In multi-round interaction, the machine system first preliminarily judges the user's intention in the man-machine conversation, then analyzes the user input it has obtained, and finally clarifies the user's instruction.
In everyday scenarios, users often replace content that has already been mentioned with demonstrative words, or omit it outright. To help an intelligent system accurately understand the user's intention, the prior art splices the context sentences with the sentence to be rewritten and inputs the result into a model for rewriting. However, when the context sentences are long, the model's data processing time increases significantly, making practical application difficult; meanwhile, directly feeding noisy context sentences into the model makes model training difficult and time-consuming.
Therefore, how to provide an efficient sentence rewriting method has become an urgent technical problem.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a system and a storage medium for rewriting a sentence, in which the entity nouns extracted from the historical context, their time information, and the sentence to be rewritten are input into a rewriting model to obtain the rewritten sentence, effectively reducing rewriting time and improving rewriting efficiency and quality.
In a first aspect, some embodiments of the present application provide a method for rewriting a sentence, including: acquiring at least one entity noun in the historical sentence according to the target entity recognition model; acquiring time information of each of the at least one entity noun; inputting the sentence to be rewritten, the entity nouns and the time information into a target rewriting model, and acquiring the rewritten sentence output by the target rewriting model; wherein the historical sentence is one or more sentences located before the sentence to be rewritten.
In the embodiments of the application, the rewritten sentence is obtained by inputting the entity nouns of the historical sentence obtained by the target entity recognition model, the times at which those entity nouns were obtained, and the sentence to be rewritten into the target rewriting model. Compared with the prior-art scheme of splicing the historical sentences with the sentence to be rewritten before inputting them into a model, the training process of the rewriting model is faster (the entity nouns this method inputs are much shorter than the full historical sentences input during prior-art training), and the rewriting accuracy is higher (this method additionally uses the time information of the entity nouns during rewriting).
In some embodiments, the target entity recognition model is trained by: preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences; dividing the preprocessed data into a first training data set and a first verification data set; training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified; and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and obtaining the target entity identification model.
According to the embodiment of the application, the target entity recognition model is obtained by training and verifying the initial entity recognition model. And denoising the acquired historical sentences before training, so that the length of the input text is reduced, and the difficulty and time of model training are reduced. When the sentence is rewritten, the model can acquire the entity nouns needing to be perfected, and the final sentence rewriting quality is ensured.
In some embodiments, the target rewriting model is trained by: inputting the predicted entity nouns, the time information of the predicted entity nouns, and the sentences to be rewritten included in the second training data into the rewriting model to be trained, and training it to obtain a rewriting model to be verified; and confirming, according to the second verification data set, that the rewriting model to be verified passes verification, thereby obtaining the target rewriting model.
In the embodiments of the application, the initial rewriting model is trained and verified with the obtained predicted entity nouns, their time information and the second training data, so as to obtain the target rewriting model. The two models are trained end to end, and the resulting target rewriting model improves the accuracy of sentence rewriting and reduces the time consumed by rewriting.
In some embodiments, there are a plurality of entity nouns, and the rewriting method further includes: coding the plurality of entity nouns respectively to obtain a plurality of different entity noun coding marks, wherein each entity noun corresponds to one entity noun coding mark. Inputting the sentence to be rewritten, the entity nouns and the time information into the target rewriting model and acquiring the rewritten sentence output by the target rewriting model includes: splitting the sentence to be rewritten and marking each object obtained by the splitting with a code to be rewritten, to obtain a coding mark sequence to be rewritten; screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring the entity noun coding mark corresponding to each target entity noun; acquiring an insertion position and/or a replacement position of the at least one entity noun in the sentence to be rewritten; inserting the entity noun coding marks corresponding to the target entity nouns into the insertion positions and/or replacement positions in the coding mark sequence to be rewritten, to obtain a rewriting mark sequence; and outputting the rewriting mark sequence.
The method can extract the entity nouns and the time information, accurately position the positions needing to be rewritten and has high rewriting quality.
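The screening-and-insertion procedure above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name and the idea of passing a precomputed position are assumptions, and in the actual method the insertion/replacement positions and the time-screened target entity come from the target rewriting model.

```python
def rewrite_with_entity(tokens, entity, position, mode="insert"):
    """Insert (or substitute) the selected target entity noun at a
    predicted position in the split sentence. Illustrative sketch only:
    in the patent, the position and the choice of entity (screened by
    time information) are produced by the target rewriting model."""
    out = tokens[:]
    if mode == "insert":
        out[position:position] = list(entity)  # insert before `position`
    else:
        out[position:position + 1] = list(entity)  # replace one split unit
    return out
```

For example, inserting the most recent entity noun at position 0 prepends it to the sentence to be rewritten, which is exactly the completion behaviour described above.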
In some embodiments, the splitting the to-be-rewritten sentence includes: and splitting the sentence to be rewritten by taking the Chinese character as a splitting unit.
According to the embodiment of the application, the follow-up rewriting position can be accurately positioned by splitting a single font.
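The character-level split and position marking can be sketched as follows; a minimal illustration under the assumption that each Chinese character is one split unit, with illustrative helper and mark names:

```python
def split_and_mark(sentence):
    """Split a sentence to be rewritten into single characters (the split
    unit named above) and attach a position-based code to each character,
    yielding a coding-mark sequence to be rewritten. Mark names E1, E2,
    ... are illustrative."""
    return [(f"E{i + 1}", ch) for i, ch in enumerate(sentence)]
```

Because every character carries its own position code, a later insertion or replacement can be located exactly.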
In some embodiments, the entity recognition model to be verified and the rewriting model to be verified are validated by the following loss function:

L = -\sum_{i=1}^{k} y_i \log(p_i) - \sum_{j=1}^{n} \hat{y}_j \log(\hat{p}_j)

where L is the loss function; k is the number of entity sample classes of the entity recognition model to be verified; y_i is the label classification value of the i-th entity sample class; p_i is the probability that the entity recognition model to be verified outputs the i-th entity sample class; n is the number of sample classification labels of the rewriting model to be verified; \hat{y}_j is the label classification value of the j-th sample classification label; and \hat{p}_j is the probability that the rewriting model to be verified outputs the j-th sample classification label.
According to the embodiment of the application, whether the model training can be finished or not is confirmed through the loss function, and the accuracy of the model is improved.
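The loss above is the sum of two cross-entropy terms. A minimal numeric sketch, assuming one-hot label vectors as defined in the claim (the function name is illustrative):

```python
import math

def joint_loss(entity_labels, entity_probs, rewrite_labels, rewrite_probs):
    """Sum of two cross-entropy terms: the entity-recognition loss over
    the k entity sample classes and the rewrite-model loss over the n
    sample classification labels."""
    l_entity = -sum(y * math.log(p)
                    for y, p in zip(entity_labels, entity_probs) if y)
    l_rewrite = -sum(y * math.log(p)
                     for y, p in zip(rewrite_labels, rewrite_probs) if y)
    return l_entity + l_rewrite
```

Because the labels are one-hot, each term reduces to the negative log-probability the respective model assigns to the correct class.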
In a second aspect, some embodiments of the present application provide a data processing method, including: performing semantic understanding, question searching or emotion recognition on a rewritten sentence obtained by the method of any embodiment of the first aspect, to obtain a semantic understanding result, a question searching result or an emotion recognition result, respectively.
In a third aspect, some embodiments of the present application provide a device for rewriting a sentence, including: an entity noun recognition module configured to acquire at least one entity noun in the historical sentence according to the target entity recognition model; an entity noun time acquisition module configured to acquire time information of each of the at least one entity noun; and a rewriting module configured to input the sentence to be rewritten, the entity nouns and the time information into a target rewriting model and acquire the rewritten sentence output by the target rewriting model; wherein the historical sentence is one or more sentences located before the sentence to be rewritten.
In a fourth aspect, some embodiments herein provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective methods according to any of the embodiments of the first and second aspects.
In a fifth aspect, some embodiments of the present application provide one or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective methods according to any of the embodiments of the first and second aspects.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of a training method of a target entity recognition model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for training a target-rewriting model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for training an entity recognition model and a rewrite model to obtain a target entity recognition model and a target rewrite model according to an embodiment of the present disclosure;
FIG. 4 is a diagram of a model structure for obtaining a target entity recognition model and a target rewrite model based on a Bi-LSTM + CRF model and a BERT model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for rewriting a statement provided in an embodiment of the present application;
FIG. 6 is a diagram of a model structure of a sentence rewriting method provided in an embodiment of the present application;
fig. 7 is a block diagram illustrating a sentence rewriting apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
In a related technical example, a multi-turn dialogue rewriting model performs reference resolution and omission completion on the current text sentence according to the text of the historical context and the currently input text sentence, obtaining a rewritten result. Because the text of the historical context contains noise or is long, the difficulty and time of model training increase greatly.
As the above analysis shows, rewriting sentences with the traditional method yields low accuracy, and training the model is difficult. In view of this, some embodiments of the present application input the recognized entity nouns of the historical sentence, together with the time information of these entity nouns, into the rewriting model to rewrite the sentence to be rewritten. Because the embodiments input the entity nouns rather than the entire historical sentence into the rewriting model, and also obtain the time information of the entity nouns, the rewriting result is more accurate: a full historical sentence contains more interfering information than its entity nouns, which necessarily reduces rewriting accuracy, and compared with schemes that ignore time information, additionally considering the time information of the entity nouns further improves the accuracy of the rewritten sentence.
It can be understood that, in some embodiments of the present application, in order to further improve the accuracy of the rewrite statement, the entity recognition model and the rewrite model may be trained in an end-to-end manner in the model training stage, and the two models are optimized simultaneously, so that the target entity recognition model and the target rewrite model are obtained after the training is finished. Then, when there is a sentence to be rewritten, the target entity recognition model may be first input into one or more adjacent historical sentences before the sentence to be rewritten to recognize and obtain the entity nouns, and the generation time information of each recognized entity noun may be counted. Then, the entity nouns identified by the target entity identification model, the time information of the entity nouns and the sentence to be rewritten are combined with the target rewriting model to rewrite the sentence to be rewritten, so that the rewriting efficiency and the rewriting accuracy are improved.
It should be noted that some embodiments of the present application are applicable to various dialogue scenarios. For example, the sentence rewriting method of some embodiments may be applied to a human-computer dialogue scenario, where the sentences input by the user are rewritten and the rewritten sentences provide the system with clearer, more complete sentence information. As another example, the method of other embodiments may be applied to a human-to-human online conversation scenario as an assistant, where the rewritten sentences provide the user with clear, complete sentence information as a reference for correctly understanding the semantics.
First, a process of training the entity recognition model and the rewrite model to obtain a target entity recognition model having a noun recognition function and a target rewrite model having a sentence rewrite function will be described.
Referring to FIG. 1, FIG. 1 illustrates a flow diagram of a method for training a target entity recognition model in some embodiments of the present application.
In some embodiments of the present application, a method of training a target entity recognition model may include: s110, preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences. And S120, dividing the preprocessed data into a first training data set and a first verification data set. S130, training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified. S140, according to the first verification data set, confirming that the entity identification model to be verified passes verification, and then obtaining the target entity identification model.
It can be understood that, in order to ensure that the training process can be smoothly ended to obtain the target entity recognition model, an entity recognition loss function needs to be predefined, the entity recognition loss function can obtain the entity recognition loss according to the difference between the predicted entity nouns and the real entity nouns, and when the entity recognition loss does not meet the set condition, the parameters of the entity recognition model can be adjusted through modes such as back propagation. And when the entity recognition loss obtained after the model is trained for multiple times reaches the set threshold value requirement, the training process of the entity recognition model can be stopped to obtain the target entity recognition model. That is, S130 may further include confirming that the process of training the initial entity recognition model may be ended according to the entity recognition loss, resulting in the entity recognition model to be verified.
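The stopping rule described here — keep adjusting parameters by back propagation until the entity recognition loss meets the set threshold — can be written as a plain training loop. This is a hedged sketch: the `run_epoch` callback and the threshold are assumed placeholders, and real training would perform backpropagation inside `run_epoch`.

```python
def train_until_converged(run_epoch, loss_threshold, max_epochs=100):
    """Run training epochs until the loss meets the set threshold.
    run_epoch is assumed to perform one epoch of training (including
    back propagation) and return the resulting loss value."""
    epoch, loss = 0, float("inf")
    for epoch in range(max_epochs):
        loss = run_epoch()
        if loss <= loss_threshold:
            break  # loss meets the set threshold: stop training
    return epoch, loss
```

The same loop shape applies later to the rewriting model, with the classification-task cross entropy as the monitored loss.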
Referring to FIG. 2, FIG. 2 illustrates a flow diagram of a method of training an object-rewrite model in some embodiments of the present application.
The training method for the target rewriting model provided by the embodiment of the application can comprise the following steps: and S210, inputting the predicted entity nouns obtained in the step S130, the time information of the predicted entity nouns and the to-be-rewritten sentences included in the second training data into the to-be-rewritten model, and training the to-be-rewritten model to obtain the to-be-verified rewritten model. And S220, confirming that the rewriting model to be verified passes verification according to the second verification data set, and obtaining the target rewriting model.
It can be understood that, in order to ensure that the training process can be smoothly ended to obtain the target rewrite model, a loss function needs to be defined in advance, the loss function can obtain the rewrite loss according to the difference between the predicted rewrite statement and the actual rewrite statement, and when the rewrite loss does not meet the set condition, the parameters of the rewrite model can be adjusted by back propagation or the like. The rewriting loss obtained after the rewriting model is trained for a plurality of times will reach the set threshold requirement, and at this time, the training process of the rewriting model can be terminated to obtain the target rewriting model. That is, in some embodiments of the present application, to ensure the quality of the rewrite of the trained model. S210 may further include confirming that the process of training the initial rewrite model can be ended according to the cross entropy of classification task loss, and obtaining the rewrite model to be verified.
In some embodiments of the present application, the entity recognition loss in S130 and the classification task loss in S210 may be determined by the following loss function, from which the entity recognition model to be verified and the rewriting model to be verified are obtained and the model parameters are adjusted according to the loss value:

L = -\sum_{i=1}^{k} y_i \log(p_i) - \sum_{j=1}^{n} \hat{y}_j \log(\hat{p}_j)

where L is the loss function; k is the number of entity sample classes of the entity recognition model to be verified; y_i is the label classification value of the i-th entity sample class; p_i is the probability that the entity recognition model to be verified outputs the i-th entity sample class; n is the number of sample classification labels of the rewriting model to be verified; \hat{y}_j is the label classification value of the j-th sample classification label; and \hat{p}_j is the probability that the rewriting model to be verified outputs the j-th sample classification label.

If the labeling result of a sample is i, then y_i is 1, otherwise it is 0. If the labeling result of a sample is j, then \hat{y}_j is 1, otherwise it is 0.
The specific process of training the model is specifically described below by taking the Bi-LSTM + CRF model and the BERT model as examples.
As shown in FIG. 3, some embodiments of the present application provide a method for training an entity recognition model and a rewrite model to obtain a target entity recognition model and a target rewrite model.
The method of fig. 3 includes:
s310, historical text data is collected.
Collecting original historical statement data, manually labeled entity nouns, statements to be rewritten and rewritten labeled statements from a service-related system log.
And S320, preprocessing the historical text data.
And preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences.
As an example, since the original historical sentences collected in S310 contain meaningless special characters, spaces and garbled characters, S320 may remove this noise using regular expressions. If the text length of an original historical sentence acquired in S310 exceeds the set threshold, S320 truncates it using a Python script with a segmentation function.
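A minimal sketch of this preprocessing under stated assumptions: the particular regular expression, the helper name and the 128-character threshold are all illustrative, since the patent does not fix them.

```python
import re

MAX_LEN = 128  # assumed value for the set length threshold

def preprocess(sentence):
    """Remove meaningless special characters, spaces and garbled
    characters with a regular expression, then segment sentences
    longer than the threshold, as described for S320."""
    # keep CJK characters, letters, digits and basic punctuation
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。？！,.?!]", "", sentence)
    # truncate/segment over-length sentences into threshold-sized pieces
    return [cleaned[i:i + MAX_LEN]
            for i in range(0, len(cleaned), MAX_LEN)] or [""]
```

The character-class whitelist is one common way to express "remove noise"; a production system would tune it to the actual log data.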
S330, preparing a data set.
And dividing the data processed in the step S320 into a training data set and a verification data set according to a set proportion, and respectively using the training data set and the verification data set for training and verifying the model.
As an example, in a human-machine dialogue, both the training data set and the verification data set contain a plurality of pieces of sample data. Each piece of sample data comprises a sentence to be rewritten, at least one historical sentence, at least one entity noun and a rewritten annotation sentence.
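The split in S330 can be sketched as follows; the 8:2 ratio is an assumed example of the "set proportion", and the seeded shuffle is an added convention for reproducibility:

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the preprocessed sample data and divide it into a
    training data set and a verification data set by a set proportion."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Each element of `samples` would be one piece of sample data as described above (sentence to be rewritten, historical sentences, entity nouns, rewritten annotation sentence).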
And S340, training the model.
And training the entity recognition model and the rewrite model by using the training data set obtained in the step S330 to obtain the entity recognition model to be verified and the rewrite model to be verified. And then, confirming that the entity identification model to be verified and the rewrite model to be verified pass verification by using the verification data set, and obtaining a target entity identification model and a target rewrite model.
As an example, please refer to fig. 4. Fig. 4 is a model structure diagram for obtaining the target entity recognition model and the target rewriting model based on a Bi-LSTM + CRF model (Bidirectional Long Short-Term Memory + Conditional Random Field, as a specific example of an entity recognition model) and a language representation model (Bidirectional Encoder Representations from Transformers, BERT for short, as an example of a rewriting model structure), according to some embodiments of the present application. The specific training procedure is described below using a training sample from human-computer interaction as an example.
In the first step, the first step is that,
and converting the historical statements into a vector form and inputting the vector form into the Bi-LSTM + CRF model to be trained. The process illustratively includes: and respectively converting the text information corresponding to the historical sentences into a vector form which can be read by a computer by taking the Chinese characters as splitting units. As shown in fig. 4, the history statement includes the above question sentence and the above system reply, where the above question sentence is: "consult with e guaranty", the above system reverts to: "has been upgraded to sunlight I Bay", the corresponding converted input vector is: the vectors are respectively characterized as "EConsult、EQuery、EFollowed by、 Ee、EHealth-care product”、“EHas already been used for、EWarp beam、ELifting of wine、EStage、EBecome into、EYang (Yang)、ELight (es)、Ei、EHealth-care product”。
The sentence to be rewritten is converted into a vector form and input into the rewrite model to be trained (i.e. BERT of FIG. 4). As shown in fig. 4, the sentence to be rewritten is input as a question: "what advantage" is, the sentence is converted into computer readable vector form by using Chinese character as splitting unit to obtain "EIs provided with、ESundries、EChinese character' Tao、 ESuperior food、EDot”。
Secondly, acquiring an Entity coding vector with time sequence information output by the Bi-LSTM + CRF model, namely Named Entity Recognition (NER): B-P (representing the protection with E) and I-P (representing the protection with sunshine I), thereby obtaining two entity nouns, and carrying out entity noun marking on the entity nouns to obtain Eentity1(i.e., B-P: Sa with E) and Eentity2(i.e., I-P: Sun I Bao) and then proceeding with a noun-coded mark, i.e., E, based on location1And E2Finally, the entity noun code sequence E will be composed11. It is understood that the occurrence time B-P (representing the "with e" guaranty) of the two nouns is earlier than the occurrence time I-P (representing the "sunshine" guaranty), i.e. the obtained occurrence time information of the two nouns is t1 and t2, respectively, thent1 corresponds to a time earlier than the time corresponding to t 2.
Third, based on the input order of the characters of "what advantage does it have", each Chinese character is position-coded, where CLS denotes the start-of-coding mark and SEP denotes the end mark, yielding the position codes E_1, E_2, E_3, E_4, E_5, E_6, E_7. The sentence is then split, and each object obtained by the splitting is given a coding mark to be rewritten, producing the coding mark sequence to be rewritten E_0.
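Wrapping the split characters with the CLS start mark and SEP end mark, as in the third step, can be sketched like this; the bracketed marker strings are conventions borrowed from BERT tokenisers, not mandated by the patent:

```python
def build_rewrite_input(chars):
    """Wrap the split characters with start (CLS) and end (SEP) marks
    and assign a position code 1..n to every resulting token."""
    tokens = ["[CLS]"] + list(chars) + ["[SEP]"]
    return list(zip(tokens, range(1, len(tokens) + 1)))

seq = build_rewrite_input("ABCDE")  # 5 placeholder characters
print(seq[0], seq[-1])  # ('[CLS]', 1) ('[SEP]', 7)
```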
Fourthly, the coding mark sequence to be rewritten E_0, the timing information of the entity nouns, and the entity noun code sequence E_11 obtained in the second step are input into the BERT model (as an example of the rewrite model).
Fifthly, the rewriting mark sequence output by the BERT model is acquired. As can be seen in the figure, the entity noun whose coding mark in the entity noun code sequence E_11 is E_2 is inserted in front of the sentence to be rewritten (because the occurrence time of this entity noun is closer to the sentence being rewritten), yielding the predicted rewritten sentence "what advantage does Sunshine i-Bao have".
Sixthly, the rewritten sentence output by the BERT model is compared against the rewritten annotation sentence in the training data set to obtain the cross entropy as the classification task loss. If the value of the cross entropy is judged to be not smaller than the set threshold, the parameters of the entity recognition model and of the rewrite model are adjusted and the training process is repeated.
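The stop-or-continue decision of the sixth step can be sketched with a plain cross-entropy computation; the threshold value is hypothetical, and the convention that training continues while the loss is still high follows standard practice rather than the patent's exact wording:

```python
import math

def cross_entropy(labels, probs, eps=1e-12):
    """Categorical cross-entropy between one-hot labels and predicted
    probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(labels, probs))

def should_continue_training(labels, probs, threshold=0.1):
    """Keep adjusting parameters while the loss is at or above the
    (hypothetical) threshold."""
    return cross_entropy(labels, probs) >= threshold

# A confident, correct prediction yields a small loss, so training stops.
print(should_continue_training([0, 1, 0], [0.01, 0.98, 0.01]))  # False
```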
Seventhly, when the classification task loss (the cross entropy) indicates that training can be finished, the training of the BERT model and of the entity recognition model ends, yielding a BERT model to be verified and an entity recognition model to be verified. Then, using the verification data set, it is confirmed that the BERT model to be verified and the entity recognition model to be verified pass verification, yielding the target BERT model and the target entity recognition model.
It can be understood that, in order to subsequently use the trained target entity recognition model and target rewrite model, the model parameters after training need to be saved.
The following exemplarily explains a specific process of the sentence rewriting method provided in the embodiment of the present application in combination with the trained target entity recognition model and the target rewriting model.
Referring to fig. 5, fig. 5 is a flowchart of a sentence rewriting method provided in the embodiment of the present application.
The method for rewriting a statement provided by the embodiment of the present application may include the following steps: S510, acquiring at least one entity noun in the historical statement according to the target entity recognition model; S520, acquiring time information of each entity noun in the at least one entity noun; S530, inputting the statement to be rewritten, the entity information (i.e. the entity nouns acquired in S510) and the time information into a target rewrite model, and acquiring the rewritten statement output by the target rewrite model; wherein the historical statement is one statement or a plurality of statements located before the statement to be rewritten.
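The three steps S510-S530 form a pipeline that can be sketched as follows; the model call signatures and the toy stand-in models are assumptions made purely to show the data flow:

```python
def rewrite_sentence(history, sentence, entity_model, rewrite_model):
    """Pipeline of S510-S530: recognise entity nouns in the history,
    attach their occurrence order as time information, then let the
    rewrite model combine the most recent entity with the sentence."""
    entities = entity_model(history)                  # S510
    timed = [(t, e) for t, e in enumerate(entities)]  # S520: order = time
    return rewrite_model(sentence, timed)             # S530

# Toy stand-ins: the entity model returns the two product names from the
# example; the rewrite model prefixes the most recent entity.
entity_model = lambda hist: ["e-Bao", "Sunshine i-Bao"]
rewrite_model = lambda s, timed: max(timed)[1] + s

result = rewrite_sentence(["..."], " has what advantage?",
                          entity_model, rewrite_model)
print(result)  # Sunshine i-Bao has what advantage?
```

A real deployment would substitute the trained target entity recognition model and target rewrite model for the two lambdas.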
The above process is exemplarily set forth below.
The target entity recognition model and the target rewrite model involved in S510 are obtained by training in the manner shown in fig. 1, fig. 2 or fig. 4. Some embodiments of the present application may also obtain the two models with a training process different from that shown in fig. 1 and fig. 2; the training process shown there serves only as a specific example.
It is understood that at least one original history sentence is collected in advance before executing S510, for example, in some embodiments of the present application, at least one original history sentence of S510 is collected from the related dialog service system. Since the original history sentences collected in advance may have a problem of noise or long sentences, the original history sentences may need to be denoised or truncated before S510 is executed. It will be appreciated that in some embodiments of the present application, the collected original history statements may be both denoised and truncated.
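A minimal sketch of the denoising and truncation mentioned above; treating runs of whitespace as the "noise" to remove and a fixed character budget as the truncation length are both assumptions:

```python
import re

def preprocess(sentences, max_len=128):
    """Denoise (collapse whitespace, drop empties) and truncate any
    sentence longer than max_len characters."""
    cleaned = []
    for s in sentences:
        s = re.sub(r"\s+", " ", s).strip()
        if len(s) > max_len:
            s = s[:max_len]
        if s:
            cleaned.append(s)
    return cleaned

out = preprocess(["  hello\tworld  ", "x" * 200], max_len=20)
print(out[0], len(out[1]))  # hello world 20
```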
In some embodiments of the present application, a historical statement is one statement or a plurality of statements that precede the statement to be rewritten. That is, some embodiments of the present application select historical statements generated earlier than the statement to be rewritten. For example, if the generation time of the statement to be rewritten is t, one or more historical statements generated before time t are selected, in order from nearest to farthest from t.
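The near-to-far selection of history sentences can be sketched as follows; the timestamped log format is an assumption:

```python
def select_history(timed_sentences, t, n=2):
    """Pick the n sentences generated before time t, nearest first."""
    earlier = [(ts, s) for ts, s in timed_sentences if ts < t]
    earlier.sort(key=lambda x: t - x[0])  # distance from t, ascending
    return [s for _, s in earlier[:n]]

log = [(1, "q1"), (3, "a1"), (5, "q2"), (8, "a2")]
print(select_history(log, t=9, n=2))  # ['a2', 'q2']
```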
In order to accurately extract the entity nouns, obtain the time at which each entity occurred, and accurately locate the position to be rewritten, some embodiments of the present application need to obtain the generation time of each entity noun. For example, in some embodiments a plurality of entity nouns with timing information may be obtained through the target entity recognition model, where the timing information is the time at which each entity noun was generated; alternatively, the generation time of the entity nouns may be acquired by reading timing information cached in the system. It should be noted that, for clarity of explanation, S530 is exemplarily illustrated below in conjunction with the model structure diagram of fig. 6 and may further include the following steps.
The method comprises the steps of firstly, coding a plurality of entity nouns of a historical sentence obtained by adopting a target entity recognition model respectively to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark.
For example, the two history sentences closest to the sentence to be rewritten are selected and input into the trained target Bi-LSTM + CRF model (as a specific example of the target entity recognition model), obtaining two entity nouns. That is, the entity nouns indicated by E_entity1 and E_entity2 shown in fig. 6 are "e-Bao" and "Sunshine i-Bao" respectively. The entity code of "e-Bao" is marked as E_1 and that of "Sunshine i-Bao" as E_2. The two entity nouns then compose the entity coding sequence E_11.
And secondly, splitting the statement to be rewritten and marking each object obtained by splitting with a coding mark to be rewritten to obtain a coding mark sequence to be rewritten.
For example, the sentence to be rewritten is "what advantage does it have". As shown in fig. 6, the 5 characters are split, and each object receives a coding mark to be rewritten, i.e. a position code: D_1 (corresponding to "CLS", start of encoding), D_2 through D_6 (corresponding to the five characters in order), and D_7 (corresponding to "SEP", end of encoding). The position codes then compose the coding mark sequence to be rewritten E_0.
And thirdly, screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns.
For example, fig. 6 shows the marks E_1 and E_2 obtained in the first step.
And fourthly, acquiring the insertion position and/or the replacement position of at least one entity noun in the sentence to be rewritten. And inserting the entity noun coding mark corresponding to the target entity noun into the insertion position and/or the replacement position included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence.
For example, the coding mark sequence to be rewritten E_0 and the entity coding sequence E_11 are input into the trained target BERT model (as a specific example of the target rewrite model). As can be seen from fig. 6, the target BERT model inserts the entity noun whose entity coding mark in E_11 is E_2 before the position code D_2.
And fifthly, outputting the rewriting mark sequence.
For example, as can be seen from fig. 6, the rewritten sentence finally output by the target BERT model is "what advantage does Sunshine i-Bao have".
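The final insertion step can be sketched on the translated example; in the real system the target BERT model predicts the insertion index, which is supplied directly here for illustration:

```python
def apply_insertion(sentence, entity, index):
    """Insert the selected entity noun before character position `index`
    of the sentence to be rewritten."""
    return sentence[:index] + entity + sentence[index:]

# Inserting the more recent entity before the first character reproduces
# the rewritten sentence of the example (English rendering).
result = apply_insertion(" has what advantage", "Sunshine i-Bao", 0)
print(result)  # Sunshine i-Bao has what advantage
```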
It should be noted that the target entity recognition model and the target rewrite model may also be obtained by training models with language recognition and processing functions other than the Bi-LSTM + CRF model and the BERT model.
As can be seen from the above, in some embodiments of the present application, the target entity identification model and the target rewrite model obtained by training may be used jointly, and the sentence to be rewritten is rewritten according to the history sentence. In addition, the target entity recognition model and the target rewrite model separate entity recognition and rewrite into two parts, so in other embodiments of the present application, the target entity recognition model and the target rewrite model may also be used separately.
In addition, an embodiment of the present application further provides a data processing method which, when executed, implements the following: performing semantic understanding, question searching or emotion recognition on the rewritten statement obtained by the method of any embodiment in fig. 5, obtaining a semantic understanding result, a question searching result or an emotion recognition result, respectively.
As can be seen from the above, the content expressed by the rewritten statement is more complete and clear, making it easier for a system or a person to acquire the important information during semantic understanding, question retrieval or emotion recognition.
Referring to fig. 7, fig. 7 is a block diagram of a rewriting apparatus for a statement provided in an embodiment of the present application. It should be understood that the rewriting apparatus corresponds to the method embodiment of fig. 5 and can perform the steps of that method embodiment; for its specific functions, reference may be made to the description above, and detailed description is omitted here as appropriate to avoid redundancy.
The rewriting apparatus of fig. 7 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or be fixed in the rewriting apparatus. The rewriting apparatus includes: an entity noun identification module 710, an entity noun time acquisition module 720 and a rewriting module 730.
The entity noun identification module 710 may be configured to: and acquiring at least one entity noun in the history statement according to the target entity recognition model. The entity noun time acquisition module 720 may be configured to: and acquiring time information of each entity noun in the at least one entity noun. The rewrite module 730 may be configured to: inputting the statement to be rewritten, the entity information and the time information into a target rewriting model, and acquiring a rewritten statement output by the target rewriting model; wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
In some embodiments of the present application, the rewriting apparatus of the sentence of fig. 7 may further include a first training module and a second training module (not shown in the figure), wherein the first training module may be configured to: preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences; dividing the preprocessed data into a first training data set and a first verification data set; training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified; and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and obtaining the target entity identification model.
The second training module may be configured to: inputting the predicted entity nouns, the time information of the predicted entity nouns and the statements to be rewritten included in second training data into a rewriting model to be trained, and training the rewriting model to be trained to obtain a rewriting model to be verified; and according to the second verification data set, confirming that the rewriting model to be verified passes verification, and obtaining the target rewriting model.
In some embodiments of the present application, the rewrite module 730 may be further configured to: respectively coding the plurality of entity nouns to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark; the number of the entity nouns is multiple. And splitting the sentence to be rewritten and marking each object obtained by splitting the sentence to be rewritten with a code to be rewritten (namely splitting the sentence to be rewritten by taking the Chinese character as a splitting unit) to obtain a code marking sequence to be rewritten. And screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns. And acquiring the insertion position and/or the replacement position of the at least one entity noun in the statement to be rewritten. And inserting the entity noun coding mark corresponding to the target entity noun into the insertion position and/or the replacement position included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence. And outputting the rewriting mark sequence.
In some embodiments of the present application, the first training module or the second training module may be further configured to: confirm the entity identification model to be verified and the rewrite model to be verified through the following loss function:

L = -∑_{i=1}^{k} y_i · log(p_i) - ∑_{j=1}^{n} y'_j · log(p'_j)

wherein L is the loss function; k is the number of entity classes of the entity identification model to be verified; y_i is the tag classification value of the i-th class entity of the entity identification model to be verified; p_i is the probability that the entity identification model to be verified outputs the i-th class entity; n is the number of classification labels of the rewrite model to be verified; y'_j is the label classification value of the j-th classification label of the rewrite model to be verified; and p'_j is the probability that the rewrite model to be verified outputs the j-th classification label.
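A numeric sketch of this two-term loss, summing one cross-entropy over the k entity classes and one over the n rewrite classification labels; the toy label and probability vectors are assumptions:

```python
import math

def joint_loss(entity_labels, entity_probs,
               rewrite_labels, rewrite_probs, eps=1e-12):
    """Sum of two cross-entropy terms: one over the entity classes of
    the entity recognition model, one over the classification labels of
    the rewrite model."""
    l_entity = -sum(y * math.log(p + eps)
                    for y, p in zip(entity_labels, entity_probs))
    l_rewrite = -sum(y * math.log(p + eps)
                     for y, p in zip(rewrite_labels, rewrite_probs))
    return l_entity + l_rewrite

loss = joint_loss([1, 0], [0.9, 0.1], [0, 1, 0], [0.1, 0.8, 0.1])
print(round(loss, 4))  # 0.3285 = -ln(0.9) - ln(0.8)
```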
Some embodiments of the present application also provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any of the embodiments in fig. 5.
Some embodiments of the present application also provide one or more computer-storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any of the embodiments in fig. 5.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A rewrite method for a sentence, the rewrite method comprising:
acquiring at least one entity noun in a history sentence according to the target entity recognition model;
acquiring time information of each entity noun in the at least one entity noun;
inputting a statement to be rewritten, the entity information and the time information into a target rewriting model, and acquiring a rewritten statement output by the target rewriting model;
wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
2. The rewriting method of claim 1, wherein the target entity recognition model is trained by:
preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences;
dividing the preprocessed data into a first training data set and a first verification data set;
training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified;
and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and then obtaining the target entity identification model.
3. The rewriting method of claim 2, wherein the target rewrite model is trained by:
inputting the predicted entity nouns, the time information of the predicted entity nouns and the statements to be rewritten included in second training data into a rewriting model to be trained, and training the rewriting model to be trained to obtain a rewriting model to be verified;
and according to the second verification data set, confirming that the rewriting model to be verified passes verification, and obtaining the target rewriting model.
4. The rewriting method of claim 1, wherein the number of said entity nouns is plural, wherein,
the rewriting method further includes: respectively coding the plurality of entity nouns to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark;
the inputting the statement to be rewritten, the entity information and the time information into a target rewriting model and obtaining the rewritten statement output by the target rewriting model comprises:
splitting the statement to be rewritten and marking each object obtained by splitting with a code to be rewritten to obtain a code marking sequence to be rewritten;
screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns;
acquiring an insertion position and/or a replacement position of the at least one entity noun in the statement to be rewritten;
inserting entity noun coding marks corresponding to the target entity nouns into the insertion positions and/or replacement positions included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence;
and outputting the rewriting mark sequence.
5. The rewriting method of claim 4, wherein splitting the statement to be rewritten comprises:
and splitting the sentence to be rewritten by taking the Chinese character as a splitting unit.
6. The rewriting method according to any one of claims 1-5, wherein the entity identification model to be verified and the rewrite model to be verified are confirmed through the following loss function:

L = -∑_{i=1}^{k} y_i · log(p_i) - ∑_{j=1}^{n} y'_j · log(p'_j)

wherein L is the loss function; k is the number of entity sample classes of the entity identification model to be verified; y_i is the label classification value of the i-th class entity sample; p_i is the probability that the entity identification model to be verified outputs the i-th class entity sample; n is the number of sample classification labels of the rewrite model to be verified; y'_j is the label classification value of the j-th sample classification label; and p'_j is the probability that the rewrite model to be verified outputs the j-th sample classification label.
7. A data processing method, characterized in that executing the data processing method realizes: performing semantic understanding, question searching or emotion recognition on the rewritten statement obtained according to the method of any one of claims 1-6, to obtain a semantic understanding result, a question searching result or an emotion recognition result, respectively.
8. A rewriting apparatus of a sentence, characterized in that the rewriting apparatus comprises:
the entity noun recognition module is configured to acquire at least one entity noun in the historical sentence according to the target entity recognition model;
the entity noun time acquisition module is configured to acquire time information of each entity noun in the at least one entity noun;
the rewriting module is configured to input a statement to be rewritten, the entity information and the time information into a target rewriting model, and acquire a rewritten statement output by the target rewriting model;
wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
9. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-7.
10. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-7.
CN202111280835.3A 2021-11-01 2021-11-01 Statement rewriting method, device, system and storage medium Pending CN113934823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111280835.3A CN113934823A (en) 2021-11-01 2021-11-01 Statement rewriting method, device, system and storage medium


Publications (1)

Publication Number Publication Date
CN113934823A true CN113934823A (en) 2022-01-14

Family

ID=79285128



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination