CN113934823A - Statement rewriting method, device, system and storage medium - Google Patents


Info

Publication number
CN113934823A
Authority
CN
China
Prior art keywords
entity
model
rewriting
rewritten
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111280835.3A
Other languages
Chinese (zh)
Inventor
张晗
杜新凯
吕超
谷姗姗
黄莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Insurance Group Co Ltd filed Critical Sunshine Insurance Group Co Ltd


Classifications

    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/35 — Clustering; Classification
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F40/295 — Named entity recognition
    • G06F40/30 — Semantic analysis
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/084 — Backpropagation, e.g. using gradient descent

(All classifications fall under G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING, in the G06F Electric Digital Data Processing and G06N Computing Arrangements Based on Specific Computational Models subclasses.)


Abstract

The application provides a method, a device, a system and a storage medium for rewriting a sentence. The rewriting method comprises the following steps: acquiring at least one entity noun in a historical sentence according to a target entity recognition model; acquiring time information of each of the at least one entity noun; and inputting the sentence to be rewritten, the entity nouns and the time information into a target rewriting model, and acquiring the rewritten sentence output by the target rewriting model. Some embodiments of the application can complete the sentence to be rewritten according to the entity nouns extracted from the historical text and their generation times, effectively reducing the time consumed by rewriting and improving rewriting efficiency and quality.

Description

Statement rewriting method, device, system and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a system, and a storage medium for rewriting a sentence.
Background
Multi-round interaction is widely used in the business scenarios of intelligent systems. In multi-round interaction, the machine system first preliminarily judges the user's intention in the man-machine conversation, then analyzes the user input it has obtained, and finally clarifies the user's instruction.
In everyday scenarios, users often replace content that has already been mentioned with demonstrative words, or omit it outright. To help an intelligent system accurately understand the user's intention, the prior art splices the context sentences with the sentence to be rewritten and inputs the result into a model for rewriting. However, when the context sentences are long, the model's data processing time increases significantly, making practical application difficult; meanwhile, directly feeding noisy context sentences into the model makes model training difficult and time-consuming.
Therefore, how to provide an efficient sentence rewriting method has become an urgent technical problem.
Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, a system and a storage medium for rewriting a sentence, in which the entity nouns extracted from the historical context, their time information, and the sentence to be rewritten are input into a rewriting model to obtain the rewritten sentence, effectively reducing rewriting time and improving rewriting efficiency and quality.
In a first aspect, some embodiments of the present application provide a method for rewriting a sentence, including: acquiring at least one entity noun in the historical sentence according to the target entity recognition model; acquiring time information of each of the at least one entity noun; inputting the sentence to be rewritten, the entity nouns and the time information into a target rewriting model, and acquiring the rewritten sentence output by the target rewriting model; wherein the historical sentence is one or more sentences located before the sentence to be rewritten.
In the embodiments of the application, the rewritten sentence is obtained by inputting the entity nouns of the historical sentence obtained by the target entity recognition model, the times at which those entity nouns were obtained, and the sentence to be rewritten into the target rewriting model. Compared with the prior-art scheme of splicing the historical sentences with the sentence to be rewritten before inputting them into a model, the training process of the rewriting model is faster (the entity nouns this method inputs are much shorter than the full historical sentences input during prior-art training), and the rewriting accuracy is higher (this method additionally uses the time information of the entity nouns during rewriting).
In some embodiments, the target entity recognition model is trained by: preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences; dividing the preprocessed data into a first training data set and a first verification data set; training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified; and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and obtaining the target entity identification model.
According to the embodiment of the application, the target entity recognition model is obtained by training and verifying the initial entity recognition model. And denoising the acquired historical sentences before training, so that the length of the input text is reduced, and the difficulty and time of model training are reduced. When the sentence is rewritten, the model can acquire the entity nouns needing to be perfected, and the final sentence rewriting quality is ensured.
In some embodiments, the target rewriting model is trained by: inputting the predicted entity nouns, the time information of the predicted entity nouns, and the sentences to be rewritten included in the second training data into the rewriting model to be trained, and training it to obtain a rewriting model to be verified; and confirming, according to the second verification data set, that the rewriting model to be verified passes verification, thereby obtaining the target rewriting model.
In the embodiments of the application, the initial rewriting model is trained and verified with the obtained predicted entity nouns, their time information and the second training data, so as to obtain the target rewriting model. The two models are trained end to end, and the resulting target rewriting model improves the accuracy of sentence rewriting and reduces the time consumed by rewriting.
In some embodiments, there are a plurality of entity nouns, and the rewriting method further includes: coding the plurality of entity nouns respectively to obtain a plurality of different entity noun coding marks, wherein each entity noun corresponds to one entity noun coding mark. Inputting the sentence to be rewritten, the entity nouns and the time information into the target rewriting model and acquiring the rewritten sentence output by the target rewriting model includes: splitting the sentence to be rewritten and marking each object obtained by the splitting with a code to be rewritten, to obtain a coding mark sequence to be rewritten; screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring the entity noun coding mark corresponding to each target entity noun; acquiring an insertion position and/or a replacement position of the at least one entity noun in the sentence to be rewritten; inserting the entity noun coding marks corresponding to the target entity nouns into the insertion positions and/or replacement positions in the coding mark sequence to be rewritten, to obtain a rewriting mark sequence; and outputting the rewriting mark sequence.
The method can extract the entity nouns and the time information, accurately position the positions needing to be rewritten and has high rewriting quality.
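The screening-and-insertion procedure above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name and the idea of passing a precomputed position are assumptions, and in the actual method the insertion/replacement positions and the time-screened target entity come from the target rewriting model.

```python
def rewrite_with_entity(tokens, entity, position, mode="insert"):
    """Insert (or substitute) the selected target entity noun at a
    predicted position in the split sentence. Illustrative sketch only:
    in the patent, the position and the choice of entity (screened by
    time information) are produced by the target rewriting model."""
    out = tokens[:]
    if mode == "insert":
        out[position:position] = list(entity)  # insert before `position`
    else:
        out[position:position + 1] = list(entity)  # replace one split unit
    return out
```

For example, inserting the most recent entity noun at position 0 prepends it to the sentence to be rewritten, which is exactly the completion behaviour described above.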
In some embodiments, the splitting the to-be-rewritten sentence includes: and splitting the sentence to be rewritten by taking the Chinese character as a splitting unit.
According to the embodiment of the application, the follow-up rewriting position can be accurately positioned by splitting a single font.
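The character-level split and position marking can be sketched as follows; a minimal illustration under the assumption that each Chinese character is one split unit, with illustrative helper and mark names:

```python
def split_and_mark(sentence):
    """Split a sentence to be rewritten into single characters (the split
    unit named above) and attach a position-based code to each character,
    yielding a coding-mark sequence to be rewritten. Mark names E1, E2,
    ... are illustrative."""
    return [(f"E{i + 1}", ch) for i, ch in enumerate(sentence)]
```

Because every character carries its own position code, a later insertion or replacement can be located exactly.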
In some embodiments, the entity recognition model to be verified and the rewriting model to be verified are validated by the following loss function:

L = -\sum_{i=1}^{k} y_i \log(p_i) - \sum_{j=1}^{n} \hat{y}_j \log(\hat{p}_j)

where L is the loss function; k is the number of entity sample classes of the entity recognition model to be verified; y_i is the label classification value of the i-th entity sample class; p_i is the probability that the entity recognition model to be verified outputs the i-th entity sample class; n is the number of sample classification labels of the rewriting model to be verified; \hat{y}_j is the label classification value of the j-th sample classification label; and \hat{p}_j is the probability that the rewriting model to be verified outputs the j-th sample classification label.
According to the embodiment of the application, whether the model training can be finished or not is confirmed through the loss function, and the accuracy of the model is improved.
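The loss above is the sum of two cross-entropy terms. A minimal numeric sketch, assuming one-hot label vectors as defined in the claim (the function name is illustrative):

```python
import math

def joint_loss(entity_labels, entity_probs, rewrite_labels, rewrite_probs):
    """Sum of two cross-entropy terms: the entity-recognition loss over
    the k entity sample classes and the rewrite-model loss over the n
    sample classification labels."""
    l_entity = -sum(y * math.log(p)
                    for y, p in zip(entity_labels, entity_probs) if y)
    l_rewrite = -sum(y * math.log(p)
                     for y, p in zip(rewrite_labels, rewrite_probs) if y)
    return l_entity + l_rewrite
```

Because the labels are one-hot, each term reduces to the negative log-probability the respective model assigns to the correct class.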
In a second aspect, some embodiments of the present application provide a data processing method, including: performing semantic understanding, question searching or emotion recognition on a rewritten sentence obtained by the method of any embodiment of the first aspect, to obtain a semantic understanding result, a question searching result or an emotion recognition result, respectively.
In a third aspect, some embodiments of the present application provide a device for rewriting a sentence, including: an entity noun recognition module configured to acquire at least one entity noun in the historical sentence according to the target entity recognition model; an entity noun time acquisition module configured to acquire time information of each of the at least one entity noun; and a rewriting module configured to input the sentence to be rewritten, the entity nouns and the time information into a target rewriting model and acquire the rewritten sentence output by the target rewriting model; wherein the historical sentence is one or more sentences located before the sentence to be rewritten.
In a fourth aspect, some embodiments herein provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective methods according to any of the embodiments of the first and second aspects.
In a fifth aspect, some embodiments of the present application provide one or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective methods according to any of the embodiments of the first and second aspects.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a flowchart of a training method of a target entity recognition model according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for training a target-rewriting model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for training an entity recognition model and a rewrite model to obtain a target entity recognition model and a target rewrite model according to an embodiment of the present disclosure;
FIG. 4 is a diagram of a model structure for obtaining a target entity recognition model and a target rewrite model based on a Bi-LSTM + CRF model and a BERT model according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for rewriting a statement provided in an embodiment of the present application;
FIG. 6 is a diagram of a model structure of a sentence rewriting method provided in an embodiment of the present application;
fig. 7 is a block diagram illustrating a sentence rewriting apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
In a related technical example, a multi-turn dialogue rewriting model performs reference resolution and omission completion on the current text sentence according to the text of the historical context and the currently input text sentence, obtaining a rewritten result. Because the text of the historical context contains noise or is long, the difficulty and time of model training increase greatly.
As the above analysis shows, rewriting sentences with the traditional method yields low accuracy, and training the model is difficult. In view of this, some embodiments of the present application input the recognized entity nouns of the historical sentence, together with the time information of these entity nouns, into the rewriting model to rewrite the sentence to be rewritten. Because the embodiments input the entity nouns rather than the entire historical sentence into the rewriting model, and also obtain the time information of the entity nouns, the rewriting result is more accurate: a full historical sentence contains more interfering information than its entity nouns, which necessarily reduces rewriting accuracy, and compared with schemes that ignore time information, additionally considering the time information of the entity nouns further improves the accuracy of the rewritten sentence.
It can be understood that, in some embodiments of the present application, in order to further improve the accuracy of the rewrite statement, the entity recognition model and the rewrite model may be trained in an end-to-end manner in the model training stage, and the two models are optimized simultaneously, so that the target entity recognition model and the target rewrite model are obtained after the training is finished. Then, when there is a sentence to be rewritten, the target entity recognition model may be first input into one or more adjacent historical sentences before the sentence to be rewritten to recognize and obtain the entity nouns, and the generation time information of each recognized entity noun may be counted. Then, the entity nouns identified by the target entity identification model, the time information of the entity nouns and the sentence to be rewritten are combined with the target rewriting model to rewrite the sentence to be rewritten, so that the rewriting efficiency and the rewriting accuracy are improved.
It should be noted that some embodiments of the present application are applicable to various dialogue scenarios. For example, the sentence rewriting method of some embodiments may be applied to a human-computer dialogue scenario, where the sentences input by the user are rewritten and the rewritten sentences provide the system with clearer, more complete sentence information. As another example, the method of other embodiments may be applied to a human-to-human online conversation scenario as an assistant, where the rewritten sentences provide the user with clear, complete sentence information as a reference for correctly understanding the semantics.
First, a process of training the entity recognition model and the rewrite model to obtain a target entity recognition model having a noun recognition function and a target rewrite model having a sentence rewrite function will be described.
Referring to FIG. 1, FIG. 1 illustrates a flow diagram of a method for training a target entity recognition model in some embodiments of the present application.
In some embodiments of the present application, a method of training a target entity recognition model may include: s110, preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences. And S120, dividing the preprocessed data into a first training data set and a first verification data set. S130, training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified. S140, according to the first verification data set, confirming that the entity identification model to be verified passes verification, and then obtaining the target entity identification model.
It can be understood that, in order to ensure that the training process can be smoothly ended to obtain the target entity recognition model, an entity recognition loss function needs to be predefined, the entity recognition loss function can obtain the entity recognition loss according to the difference between the predicted entity nouns and the real entity nouns, and when the entity recognition loss does not meet the set condition, the parameters of the entity recognition model can be adjusted through modes such as back propagation. And when the entity recognition loss obtained after the model is trained for multiple times reaches the set threshold value requirement, the training process of the entity recognition model can be stopped to obtain the target entity recognition model. That is, S130 may further include confirming that the process of training the initial entity recognition model may be ended according to the entity recognition loss, resulting in the entity recognition model to be verified.
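The stopping rule described here — keep adjusting parameters by back propagation until the entity recognition loss meets the set threshold — can be written as a plain training loop. This is a hedged sketch: the `run_epoch` callback and the threshold are assumed placeholders, and real training would perform backpropagation inside `run_epoch`.

```python
def train_until_converged(run_epoch, loss_threshold, max_epochs=100):
    """Run training epochs until the loss meets the set threshold.
    run_epoch is assumed to perform one epoch of training (including
    back propagation) and return the resulting loss value."""
    epoch, loss = 0, float("inf")
    for epoch in range(max_epochs):
        loss = run_epoch()
        if loss <= loss_threshold:
            break  # loss meets the set threshold: stop training
    return epoch, loss
```

The same loop shape applies later to the rewriting model, with the classification-task cross entropy as the monitored loss.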
Referring to FIG. 2, FIG. 2 illustrates a flow diagram of a method of training an object-rewrite model in some embodiments of the present application.
The training method for the target rewriting model provided by the embodiment of the application can comprise the following steps: and S210, inputting the predicted entity nouns obtained in the step S130, the time information of the predicted entity nouns and the to-be-rewritten sentences included in the second training data into the to-be-rewritten model, and training the to-be-rewritten model to obtain the to-be-verified rewritten model. And S220, confirming that the rewriting model to be verified passes verification according to the second verification data set, and obtaining the target rewriting model.
It can be understood that, in order to ensure that the training process can be smoothly ended to obtain the target rewrite model, a loss function needs to be defined in advance, the loss function can obtain the rewrite loss according to the difference between the predicted rewrite statement and the actual rewrite statement, and when the rewrite loss does not meet the set condition, the parameters of the rewrite model can be adjusted by back propagation or the like. The rewriting loss obtained after the rewriting model is trained for a plurality of times will reach the set threshold requirement, and at this time, the training process of the rewriting model can be terminated to obtain the target rewriting model. That is, in some embodiments of the present application, to ensure the quality of the rewrite of the trained model. S210 may further include confirming that the process of training the initial rewrite model can be ended according to the cross entropy of classification task loss, and obtaining the rewrite model to be verified.
In some embodiments of the present application, the entity recognition loss in S130 and the classification task loss in S210 may be determined by the following loss function, from which the entity recognition model to be verified and the rewriting model to be verified are obtained and the model parameters are adjusted according to the loss value:

L = -\sum_{i=1}^{k} y_i \log(p_i) - \sum_{j=1}^{n} \hat{y}_j \log(\hat{p}_j)

where L is the loss function; k is the number of entity sample classes of the entity recognition model to be verified; y_i is the label classification value of the i-th entity sample class; p_i is the probability that the entity recognition model to be verified outputs the i-th entity sample class; n is the number of sample classification labels of the rewriting model to be verified; \hat{y}_j is the label classification value of the j-th sample classification label; and \hat{p}_j is the probability that the rewriting model to be verified outputs the j-th sample classification label.

If the labeling result of a sample is i, then y_i is 1, otherwise it is 0. If the labeling result of a sample is j, then \hat{y}_j is 1, otherwise it is 0.
The specific process of training the model is specifically described below by taking the Bi-LSTM + CRF model and the BERT model as examples.
As shown in FIG. 3, some embodiments of the present application provide a method for training an entity recognition model and a rewrite model to obtain a target entity recognition model and a target rewrite model.
The method of fig. 3 includes:
s310, historical text data is collected.
Collecting original historical statement data, manually labeled entity nouns, statements to be rewritten and rewritten labeled statements from a service-related system log.
And S320, preprocessing the historical text data.
And preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences.
As an example, since the original historical sentences collected in S310 contain meaningless special characters, spaces and garbled characters, S320 may remove this noise using regular expressions. If the text length of an original historical sentence acquired in S310 exceeds the set threshold, S320 truncates it using a Python script with a segmentation function.
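A minimal sketch of this preprocessing under stated assumptions: the particular regular expression, the helper name and the 128-character threshold are all illustrative, since the patent does not fix them.

```python
import re

MAX_LEN = 128  # assumed value for the set length threshold

def preprocess(sentence):
    """Remove meaningless special characters, spaces and garbled
    characters with a regular expression, then segment sentences
    longer than the threshold, as described for S320."""
    # keep CJK characters, letters, digits and basic punctuation
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9，。？！,.?!]", "", sentence)
    # truncate/segment over-length sentences into threshold-sized pieces
    return [cleaned[i:i + MAX_LEN]
            for i in range(0, len(cleaned), MAX_LEN)] or [""]
```

The character-class whitelist is one common way to express "remove noise"; a production system would tune it to the actual log data.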
S330, preparing a data set.
And dividing the data processed in the step S320 into a training data set and a verification data set according to a set proportion, and respectively using the training data set and the verification data set for training and verifying the model.
As an example, in a human-machine dialogue, both the training data set and the verification data set contain a plurality of pieces of sample data. Each piece of sample data comprises a sentence to be rewritten, at least one historical sentence, at least one entity noun and a rewritten annotation sentence.
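The split in S330 can be sketched as follows; the 8:2 ratio is an assumed example of the "set proportion", and the seeded shuffle is an added convention for reproducibility:

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the preprocessed sample data and divide it into a
    training data set and a verification data set by a set proportion."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Each element of `samples` would be one piece of sample data as described above (sentence to be rewritten, historical sentences, entity nouns, rewritten annotation sentence).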
And S340, training the model.
And training the entity recognition model and the rewrite model by using the training data set obtained in the step S330 to obtain the entity recognition model to be verified and the rewrite model to be verified. And then, confirming that the entity identification model to be verified and the rewrite model to be verified pass verification by using the verification data set, and obtaining a target entity identification model and a target rewrite model.
As an example, please refer to fig. 4. Fig. 4 is a model structure diagram for obtaining the target entity recognition model and the target rewriting model based on a Bi-LSTM + CRF model (Bidirectional Long Short-Term Memory + Conditional Random Field, as a specific example of an entity recognition model) and a language representation model (Bidirectional Encoder Representations from Transformers, BERT for short, as an example of a rewriting model structure), according to some embodiments of the present application. The specific training procedure is described below using a training sample from human-computer interaction as an example.
In the first step, the first step is that,
and converting the historical statements into a vector form and inputting the vector form into the Bi-LSTM + CRF model to be trained. The process illustratively includes: and respectively converting the text information corresponding to the historical sentences into a vector form which can be read by a computer by taking the Chinese characters as splitting units. As shown in fig. 4, the history statement includes the above question sentence and the above system reply, where the above question sentence is: "consult with e guaranty", the above system reverts to: "has been upgraded to sunlight I Bay", the corresponding converted input vector is: the vectors are respectively characterized as "EConsult、EQuery、EFollowed by、 Ee、EHealth-care product”、“EHas already been used for、EWarp beam、ELifting of wine、EStage、EBecome into、EYang (Yang)、ELight (es)、Ei、EHealth-care product”。
The sentence to be rewritten is converted into a vector form and input into the rewrite model to be trained (i.e. BERT of FIG. 4). As shown in fig. 4, the sentence to be rewritten is input as a question: "what advantage" is, the sentence is converted into computer readable vector form by using Chinese character as splitting unit to obtain "EIs provided with、ESundries、EChinese character' Tao、 ESuperior food、EDot”。
Secondly, acquiring an Entity coding vector with time sequence information output by the Bi-LSTM + CRF model, namely Named Entity Recognition (NER): B-P (representing the protection with E) and I-P (representing the protection with sunshine I), thereby obtaining two entity nouns, and carrying out entity noun marking on the entity nouns to obtain Eentity1(i.e., B-P: Sa with E) and Eentity2(i.e., I-P: Sun I Bao) and then proceeding with a noun-coded mark, i.e., E, based on location1And E2Finally, the entity noun code sequence E will be composed11. It is understood that the occurrence time B-P (representing the "with e" guaranty) of the two nouns is earlier than the occurrence time I-P (representing the "sunshine" guaranty), i.e. the obtained occurrence time information of the two nouns is t1 and t2, respectively, thent1 corresponds to a time earlier than the time corresponding to t 2.
Third, based on the input order of the characters of "what advantage does it have", each Chinese character is position-coded, where CLS denotes the start-of-coding mark and SEP denotes the end mark, yielding the position codes E_1, E_2, E_3, E_4, E_5, E_6, E_7. The sentence is then split, and each object obtained by the splitting is given a coding mark to be rewritten, producing the coding mark sequence to be rewritten E_0.
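Wrapping the split characters with the CLS start mark and SEP end mark, as in the third step, can be sketched like this; the bracketed marker strings are conventions borrowed from BERT tokenisers, not mandated by the patent:

```python
def build_rewrite_input(chars):
    """Wrap the split characters with start (CLS) and end (SEP) marks
    and assign a position code 1..n to every resulting token."""
    tokens = ["[CLS]"] + list(chars) + ["[SEP]"]
    return list(zip(tokens, range(1, len(tokens) + 1)))

seq = build_rewrite_input("ABCDE")  # 5 placeholder characters
print(seq[0], seq[-1])  # ('[CLS]', 1) ('[SEP]', 7)
```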
Fourthly, the coding mark sequence to be rewritten E_0, the timing information of the entity nouns, and the entity noun code sequence E_11 obtained in the second step are input into the BERT model (as an example of the rewrite model).
Fifthly, the rewriting mark sequence output by the BERT model is acquired. As can be seen in the figure, the entity noun whose coding mark in the entity noun code sequence E_11 is E_2 is inserted in front of the sentence to be rewritten (because the occurrence time of this entity noun is closer to the sentence being rewritten), yielding the predicted rewritten sentence "what advantage does Sunshine i-Bao have".
Sixthly, the rewritten sentence output by the BERT model is compared against the rewritten annotation sentence in the training data set to obtain the cross entropy as the classification task loss. If the value of the cross entropy is judged to be not smaller than the set threshold, the parameters of the entity recognition model and of the rewrite model are adjusted and the training process is repeated.
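The stop-or-continue decision of the sixth step can be sketched with a plain cross-entropy computation; the threshold value is hypothetical, and the convention that training continues while the loss is still high follows standard practice rather than the patent's exact wording:

```python
import math

def cross_entropy(labels, probs, eps=1e-12):
    """Categorical cross-entropy between one-hot labels and predicted
    probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(labels, probs))

def should_continue_training(labels, probs, threshold=0.1):
    """Keep adjusting parameters while the loss is at or above the
    (hypothetical) threshold."""
    return cross_entropy(labels, probs) >= threshold

# A confident, correct prediction yields a small loss, so training stops.
print(should_continue_training([0, 1, 0], [0.01, 0.98, 0.01]))  # False
```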
Seventhly, when the classification task loss (the cross entropy) indicates that training can be finished, the training of the BERT model and of the entity recognition model ends, yielding a BERT model to be verified and an entity recognition model to be verified. Then, using the verification data set, it is confirmed that the BERT model to be verified and the entity recognition model to be verified pass verification, yielding the target BERT model and the target entity recognition model.
It can be understood that, in order to subsequently use the trained target entity recognition model and target rewrite model, the model parameters after training need to be saved.
The following exemplarily explains a specific process of the sentence rewriting method provided in the embodiment of the present application in combination with the trained target entity recognition model and the target rewriting model.
Referring to fig. 5, fig. 5 is a flowchart of a sentence rewriting method provided in the embodiment of the present application.
The method for rewriting a statement provided by the embodiment of the present application may include the following steps: S510, acquiring at least one entity noun in the historical statement according to the target entity recognition model; S520, acquiring time information of each entity noun in the at least one entity noun; S530, inputting the statement to be rewritten, the entity information (i.e. the entity nouns acquired in S510) and the time information into a target rewrite model, and acquiring the rewritten statement output by the target rewrite model; wherein the historical statement is one statement or a plurality of statements located before the statement to be rewritten.
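The three steps S510-S530 form a pipeline that can be sketched as follows; the model call signatures and the toy stand-in models are assumptions made purely to show the data flow:

```python
def rewrite_sentence(history, sentence, entity_model, rewrite_model):
    """Pipeline of S510-S530: recognise entity nouns in the history,
    attach their occurrence order as time information, then let the
    rewrite model combine the most recent entity with the sentence."""
    entities = entity_model(history)                  # S510
    timed = [(t, e) for t, e in enumerate(entities)]  # S520: order = time
    return rewrite_model(sentence, timed)             # S530

# Toy stand-ins: the entity model returns the two product names from the
# example; the rewrite model prefixes the most recent entity.
entity_model = lambda hist: ["e-Bao", "Sunshine i-Bao"]
rewrite_model = lambda s, timed: max(timed)[1] + s

result = rewrite_sentence(["..."], " has what advantage?",
                          entity_model, rewrite_model)
print(result)  # Sunshine i-Bao has what advantage?
```

A real deployment would substitute the trained target entity recognition model and target rewrite model for the two lambdas.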
The above process is exemplarily set forth below.
The target entity recognition model and the target rewrite model involved in S510 are obtained by training in the manner shown in fig. 1, fig. 2 or fig. 4. Some embodiments of the present application may also obtain the two models with a training process different from that shown in fig. 1 and fig. 2; the training process shown there serves only as a specific example.
It is understood that at least one original history sentence is collected in advance before executing S510, for example, in some embodiments of the present application, at least one original history sentence of S510 is collected from the related dialog service system. Since the original history sentences collected in advance may have a problem of noise or long sentences, the original history sentences may need to be denoised or truncated before S510 is executed. It will be appreciated that in some embodiments of the present application, the collected original history statements may be both denoised and truncated.
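A minimal sketch of the denoising and truncation mentioned above; treating runs of whitespace as the "noise" to remove and a fixed character budget as the truncation length are both assumptions:

```python
import re

def preprocess(sentences, max_len=128):
    """Denoise (collapse whitespace, drop empties) and truncate any
    sentence longer than max_len characters."""
    cleaned = []
    for s in sentences:
        s = re.sub(r"\s+", " ", s).strip()
        if len(s) > max_len:
            s = s[:max_len]
        if s:
            cleaned.append(s)
    return cleaned

out = preprocess(["  hello\tworld  ", "x" * 200], max_len=20)
print(out[0], len(out[1]))  # hello world 20
```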
In some embodiments of the present application, a historical statement is one statement or a plurality of statements that precede the statement to be rewritten. That is, some embodiments of the present application select historical statements generated earlier than the statement to be rewritten. For example, if the generation time of the statement to be rewritten is t, one or more historical statements generated before time t are selected, in order from nearest to farthest from t.
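The near-to-far selection of history sentences can be sketched as follows; the timestamped log format is an assumption:

```python
def select_history(timed_sentences, t, n=2):
    """Pick the n sentences generated before time t, nearest first."""
    earlier = [(ts, s) for ts, s in timed_sentences if ts < t]
    earlier.sort(key=lambda x: t - x[0])  # distance from t, ascending
    return [s for _, s in earlier[:n]]

log = [(1, "q1"), (3, "a1"), (5, "q2"), (8, "a2")]
print(select_history(log, t=9, n=2))  # ['a2', 'q2']
```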
In order to accurately extract the entity nouns, obtain the time at which each entity occurred, and accurately locate the position to be rewritten, some embodiments of the present application need to obtain the generation time of each entity noun. For example, in some embodiments a plurality of entity nouns with timing information may be obtained through the target entity recognition model, where the timing information is the time at which each entity noun was generated; alternatively, the generation time of the entity nouns may be acquired by reading timing information cached in the system. It should be noted that, for clarity of explanation, S530 is exemplarily illustrated below in conjunction with the model structure diagram of fig. 6 and may further include the following steps.
The method comprises the steps of firstly, coding a plurality of entity nouns of a historical sentence obtained by adopting a target entity recognition model respectively to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark.
For example, the two history sentences closest to the sentence to be rewritten are selected and input into the trained target Bi-LSTM + CRF model (as a specific example of the target entity recognition model), obtaining two entity nouns. That is, the entity nouns indicated by E_entity1 and E_entity2 shown in fig. 6 are "e-Bao" and "Sunshine i-Bao" respectively. The entity code of "e-Bao" is marked as E_1 and that of "Sunshine i-Bao" as E_2. The two entity nouns then compose the entity coding sequence E_11.
And secondly, splitting the statement to be rewritten and marking each object obtained by splitting with a coding mark to be rewritten to obtain a coding mark sequence to be rewritten.
For example, the sentence to be rewritten is "what advantage does it have". As shown in fig. 6, the 5 characters are split, and each object receives a coding mark to be rewritten, i.e. a position code: D_1 (corresponding to "CLS", start of encoding), D_2 through D_6 (corresponding to the five characters in order), and D_7 (corresponding to "SEP", end of encoding). The position codes then compose the coding mark sequence to be rewritten E_0.
And thirdly, screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns.
For example, fig. 6 shows the marks E_1 and E_2 obtained in the first step.
And fourthly, acquiring the insertion position and/or the replacement position of at least one entity noun in the sentence to be rewritten. And inserting the entity noun coding mark corresponding to the target entity noun into the insertion position and/or the replacement position included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence.
For example, the coding mark sequence to be rewritten E_0 and the entity coding sequence E_11 are input into the trained target BERT model (as a specific example of the target rewrite model). As can be seen from fig. 6, the target BERT model inserts the entity noun whose entity coding mark in E_11 is E_2 before the position code D_2.
And fifthly, outputting the rewriting mark sequence.
For example, as can be seen from fig. 6, the rewritten sentence finally output by the target BERT model is "what advantage does Sunshine i-Bao have".
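The final insertion step can be sketched on the translated example; in the real system the target BERT model predicts the insertion index, which is supplied directly here for illustration:

```python
def apply_insertion(sentence, entity, index):
    """Insert the selected entity noun before character position `index`
    of the sentence to be rewritten."""
    return sentence[:index] + entity + sentence[index:]

# Inserting the more recent entity before the first character reproduces
# the rewritten sentence of the example (English rendering).
result = apply_insertion(" has what advantage", "Sunshine i-Bao", 0)
print(result)  # Sunshine i-Bao has what advantage
```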
It should be noted that the target entity recognition model and the target rewrite model may also be obtained by training models with language recognition and processing functions other than the Bi-LSTM + CRF model and the BERT model.
As can be seen from the above, in some embodiments of the present application, the target entity identification model and the target rewrite model obtained by training may be used jointly, and the sentence to be rewritten is rewritten according to the history sentence. In addition, the target entity recognition model and the target rewrite model separate entity recognition and rewrite into two parts, so in other embodiments of the present application, the target entity recognition model and the target rewrite model may also be used separately.
In addition, an embodiment of the present application further provides a data processing method which, when executed, implements the following: performing semantic understanding, question searching or emotion recognition on the rewritten statement obtained by the method of any embodiment in fig. 5, obtaining a semantic understanding result, a question searching result or an emotion recognition result, respectively.
As can be seen from the above, the content expressed by the rewritten statement is more complete and clear, making it easier for a system or a person to acquire the important information during semantic understanding, question retrieval or emotion recognition.
Referring to fig. 7, fig. 7 is a block diagram of a rewriting apparatus for a statement provided in an embodiment of the present application. It should be understood that the rewriting apparatus corresponds to the method embodiment of fig. 5 and can perform the steps of that method embodiment; for its specific functions, reference may be made to the description above, and detailed description is omitted here as appropriate to avoid redundancy.
The rewriting apparatus of fig. 7 includes at least one software functional module that can be stored in a memory in the form of software or firmware, or be fixed in the rewriting apparatus. The rewriting apparatus includes: an entity noun identification module 710, an entity noun time acquisition module 720 and a rewriting module 730.
The entity noun identification module 710 may be configured to: and acquiring at least one entity noun in the history statement according to the target entity recognition model. The entity noun time acquisition module 720 may be configured to: and acquiring time information of each entity noun in the at least one entity noun. The rewrite module 730 may be configured to: inputting the statement to be rewritten, the entity information and the time information into a target rewriting model, and acquiring a rewritten statement output by the target rewriting model; wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
In some embodiments of the present application, the rewriting apparatus of the sentence of fig. 7 may further include a first training module and a second training module (not shown in the figure), wherein the first training module may be configured to: preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences; dividing the preprocessed data into a first training data set and a first verification data set; training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified; and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and obtaining the target entity identification model.
The second training module may be configured to: inputting the predicted entity nouns, the time information of the predicted entity nouns and the statements to be rewritten included in second training data into a rewriting model to be trained, and training the rewriting model to be trained to obtain a rewriting model to be verified; and according to the second verification data set, confirming that the rewriting model to be verified passes verification, and obtaining the target rewriting model.
In some embodiments of the present application, the rewrite module 730 may be further configured to: respectively coding the plurality of entity nouns to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark; the number of the entity nouns is multiple. And splitting the sentence to be rewritten and marking each object obtained by splitting the sentence to be rewritten with a code to be rewritten (namely splitting the sentence to be rewritten by taking the Chinese character as a splitting unit) to obtain a code marking sequence to be rewritten. And screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns. And acquiring the insertion position and/or the replacement position of the at least one entity noun in the statement to be rewritten. And inserting the entity noun coding mark corresponding to the target entity noun into the insertion position and/or the replacement position included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence. And outputting the rewriting mark sequence.
In some embodiments of the present application, the first training module or the second training module may be further configured to: confirm the entity identification model to be verified and the rewrite model to be verified through the following loss function:

L = -∑_{i=1}^{k} y_i · log(p_i) - ∑_{j=1}^{n} y'_j · log(p'_j)

wherein L is the loss function; k is the number of entity classes of the entity identification model to be verified; y_i is the tag classification value of the i-th class entity of the entity identification model to be verified; p_i is the probability that the entity identification model to be verified outputs the i-th class entity; n is the number of classification labels of the rewrite model to be verified; y'_j is the label classification value of the j-th classification label of the rewrite model to be verified; and p'_j is the probability that the rewrite model to be verified outputs the j-th classification label.
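A numeric sketch of this two-term loss, summing one cross-entropy over the k entity classes and one over the n rewrite classification labels; the toy label and probability vectors are assumptions:

```python
import math

def joint_loss(entity_labels, entity_probs,
               rewrite_labels, rewrite_probs, eps=1e-12):
    """Sum of two cross-entropy terms: one over the entity classes of
    the entity recognition model, one over the classification labels of
    the rewrite model."""
    l_entity = -sum(y * math.log(p + eps)
                    for y, p in zip(entity_labels, entity_probs))
    l_rewrite = -sum(y * math.log(p + eps)
                     for y, p in zip(rewrite_labels, rewrite_probs))
    return l_entity + l_rewrite

loss = joint_loss([1, 0], [0.9, 0.1], [0, 1, 0], [0.1, 0.8, 0.1])
print(round(loss, 4))  # 0.3285 = -ln(0.9) - ln(0.8)
```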
Some embodiments of the present application also provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any of the embodiments in fig. 5.
Some embodiments of the present application also provide one or more computer-storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any of the embodiments in fig. 5.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A rewrite method for a sentence, the rewrite method comprising:
acquiring at least one entity noun in a history sentence according to the target entity recognition model;
acquiring time information of each entity noun in the at least one entity noun;
inputting a statement to be rewritten, the entity information and the time information into a target rewriting model, and acquiring a rewritten statement output by the target rewriting model;
wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
2. The rewriting method of claim 1, wherein the target entity recognition model is trained by:
preprocessing the obtained original historical sentences to obtain preprocessed data, wherein the preprocessing comprises removing noise in the original historical sentences and/or segmenting sentences with the length larger than a set threshold value in the original historical sentences;
dividing the preprocessed data into a first training data set and a first verification data set;
training the constructed initial entity recognition model according to the data in the first training data set to obtain a predicted entity noun and an entity recognition model to be verified;
and according to the first verification data set, confirming that the entity identification model to be verified passes verification, and then obtaining the target entity identification model.
3. The rewriting method of claim 2, wherein the target rewrite model is trained by:
inputting the predicted entity nouns, the time information of the predicted entity nouns and the statements to be rewritten included in second training data into a rewriting model to be trained, and training the rewriting model to be trained to obtain a rewriting model to be verified;
and according to the second verification data set, confirming that the rewriting model to be verified passes verification, and obtaining the target rewriting model.
4. The rewriting method of claim 1, wherein the number of said entity nouns is plural, wherein,
the rewriting method further includes: respectively coding the plurality of entity nouns to obtain a plurality of different entity noun coding marks, wherein one entity noun corresponds to one entity noun coding mark;
the inputting the statement to be rewritten, the entity information and the time information into a target rewriting model and obtaining the rewritten statement output by the target rewriting model comprises:
splitting the statement to be rewritten and marking each object obtained by splitting with a code to be rewritten to obtain a code marking sequence to be rewritten;
screening out at least one target entity noun from the plurality of entity nouns according to the time information, and acquiring entity noun coding marks corresponding to all the target entity nouns;
acquiring an insertion position and/or a replacement position of the at least one entity noun in the statement to be rewritten;
inserting entity noun coding marks corresponding to the target entity nouns into the insertion positions and/or replacement positions included in the coding mark sequence to be rewritten to obtain a rewriting mark sequence;
and outputting the rewriting mark sequence.
5. The rewriting method of claim 4, wherein splitting the statement to be rewritten comprises:
and splitting the sentence to be rewritten by taking the Chinese character as a splitting unit.
6. The rewriting method according to any one of claims 1-5, wherein the entity identification model to be verified and the rewrite model to be verified are confirmed through the following loss function:

L = -∑_{i=1}^{k} y_i · log(p_i) - ∑_{j=1}^{n} y'_j · log(p'_j)

wherein L is the loss function; k is the number of entity sample classes of the entity identification model to be verified; y_i is the label classification value of the i-th class entity sample; p_i is the probability that the entity identification model to be verified outputs the i-th class entity sample; n is the number of sample classification labels of the rewrite model to be verified; y'_j is the label classification value of the j-th sample classification label; and p'_j is the probability that the rewrite model to be verified outputs the j-th sample classification label.
7. A data processing method, characterized in that executing the data processing method realizes: performing semantic understanding, question searching or emotion recognition on the rewritten statement obtained according to the method of any one of claims 1-6, to obtain a semantic understanding result, a question searching result or an emotion recognition result, respectively.
8. A rewriting apparatus of a sentence, characterized in that the rewriting apparatus comprises:
the entity noun recognition module is configured to acquire at least one entity noun in the historical sentence according to the target entity recognition model;
the entity noun time acquisition module is configured to acquire time information of each entity noun in the at least one entity noun;
the rewriting module is configured to input a statement to be rewritten, the entity information and the time information into a target rewriting model, and acquire a rewritten statement output by the target rewriting model;
wherein the history statement is a statement or a plurality of statements located before the statement to be rewritten.
9. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-7.
10. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-7.
CN202111280835.3A 2021-11-01 2021-11-01 Statement rewriting method, device, system and storage medium Pending CN113934823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111280835.3A CN113934823A (en) 2021-11-01 2021-11-01 Statement rewriting method, device, system and storage medium


Publications (1)

Publication Number Publication Date
CN113934823A true CN113934823A (en) 2022-01-14

Family

ID=79285128



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination