WO2023035883A1 - Method, device and medium for consistency detection of documents and abstracts

Method, device and medium for consistency detection of documents and abstracts

Info

Publication number
WO2023035883A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
abstract
document
target
information
Prior art date
Application number
PCT/CN2022/112869
Other languages
English (en)
French (fr)
Inventor
陈家泽
曾致远
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023035883A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/35 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Definitions

  • Exemplary embodiments of the present disclosure generally relate to the field of computers, and in particular, to a method, device, and computer-readable storage medium for consistency detection of documents and abstracts.
  • Text summarization generates a simplified version of a source document while preserving the important information in the source document.
  • Document summarization is a branch of text generation techniques and is not constrained to the text that appears verbatim in the source document, so the generated summary has greater flexibility and expressive power.
  • Many research efforts have therefore developed various summary generation models to realize automatic summary generation.
  • In a first aspect of the present disclosure, a method for consistency detection between a document and an abstract is provided. The method includes determining a first sample and first annotation information, the first annotation information indicating that a first abstract included in the first sample is inconsistent with a first document, wherein at least one text unit among the plurality of text units of the first abstract is marked as inconsistent with the first document.
  • The method also includes generating a first adversarial sample by applying disturbance information to the first sample, the disturbance information being applied to the first document in the first sample and to the text units of the first abstract other than the at least one text unit.
  • The method also includes training a consistency detection model according to a training target based on at least the first sample, the first adversarial sample, and the first annotation information. The consistency detection model is configured to detect whether an abstract is consistent with a document, and the training target is configured so that the differences between the first annotation information and the detection results of the consistency detection model for the first sample and the first adversarial sample are all within a predetermined threshold.
  • In a second aspect of the present disclosure, an electronic device is provided. The electronic device includes at least one processing unit and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit.
  • The instructions, when executed by the at least one processing unit, cause the device to perform the following actions: determining a first sample and first annotation information, the first annotation information indicating that a first abstract included in the first sample is inconsistent with a first document, wherein at least one text unit of the plurality of text units included in the first abstract is marked as inconsistent with the first document; generating a first adversarial sample by applying disturbance information to the first sample, the disturbance information being applied to the first document in the first sample and to the text units of the first abstract other than the at least one text unit; and training a consistency detection model according to a training target based on at least the first sample, the first adversarial sample, and the first annotation information, the consistency detection model being configured to detect whether an abstract is consistent with a document, and the training target being configured so that the differences between the first annotation information and the detection results of the consistency detection model for the first sample and the first adversarial sample are all within a predetermined threshold.
  • In a third aspect of the present disclosure, an apparatus for consistency detection between a document and an abstract is provided. The apparatus includes: a determination module configured to determine a first sample and first annotation information, the first annotation information indicating that a first abstract included in the first sample is inconsistent with a first document, wherein at least one text unit among the plurality of text units of the first abstract is marked as inconsistent with the first document; an adversarial generation module configured to generate a first adversarial sample by applying disturbance information to the first sample, the disturbance information being applied to the first document in the first sample and to the text units of the first abstract other than the at least one text unit; and a training module configured to train a consistency detection model according to a training target based on at least the first sample, the first adversarial sample, and the first annotation information, the consistency detection model being configured to detect whether an abstract is consistent with a document, and the training target being configured so that the differences between the first annotation information and the detection results of the consistency detection model for the first sample and the first adversarial sample are all within a predetermined threshold.
  • In a fourth aspect of the present disclosure, a computer-readable storage medium is provided.
  • A computer program is stored on the medium, and when the program is executed by a processor, the method of the first aspect is implemented.
  • Figure 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented
  • FIG. 2 shows an architecture for training a consistency detection model according to some embodiments of the present disclosure
  • Figure 3 shows an example of a source document and abstract according to some embodiments of the present disclosure
  • FIG. 4 shows an architecture for applying a consistency detection model according to some embodiments of the present disclosure
  • FIG. 5 illustrates an example of error tracking for summaries according to some embodiments of the present disclosure
  • Fig. 6 shows a flow chart of a process for checking consistency between a document and an abstract according to some embodiments of the present disclosure
  • Fig. 7 shows a block diagram of an apparatus for consistency detection between a document and an abstract according to some embodiments of the present disclosure.
  • Figure 8 shows a block diagram of a device capable of implementing various embodiments of the present disclosure.
  • A model can learn the relationship between corresponding inputs and outputs from training data, so that after training is completed a corresponding output can be generated for a given input.
  • the generation of the model may be based on machine learning techniques.
  • Deep learning is a machine learning algorithm that uses multiple layers of processing units to process input and provide corresponding output.
  • a neural network model is an example of a deep learning based model.
  • A "model" may also be referred to herein as a "machine learning model", "learning model", "machine learning network", or "learning network", and these terms are used interchangeably herein.
  • a “neural network” is a machine learning network based on deep learning.
  • a neural network is capable of processing input and providing a corresponding output, which generally includes an input layer and an output layer and one or more hidden layers between the input layer and the output layer.
  • Neural networks used in deep learning applications typically include many hidden layers, increasing the depth of the network.
  • the layers of the neural network are connected in sequence so that the output of the previous layer is provided as the input of the subsequent layer, where the input layer receives the input of the neural network, and the output of the output layer serves as the final output of the neural network.
  • Each layer of a neural network consists of one or more nodes (also known as processing nodes or neurons), each of which processes input from the previous layer.
  • machine learning can roughly include three phases, namely training phase, testing phase and application phase (also known as inference phase).
  • In the training phase, a given model can be trained using a large amount of training data, iteratively updating the parameter values until the model can obtain, from the training data, consistent inferences that meet the expected objective.
  • a model can be thought of as being able to learn associations from inputs to outputs (also known as input-to-output mappings) from the training data.
  • the parameter values of the trained model are determined.
  • In the testing phase, test inputs are applied to the trained model to test whether the model can provide correct outputs, thereby determining the performance of the model.
  • In the application phase, the model can be used to process actual inputs and determine the corresponding outputs based on the parameter values obtained through training.
  • FIG. 1 shows a block diagram of an environment 100 in which various implementations of the present disclosure can be practiced.
  • The environment 100 involves a consistency detection model 105 configured to detect whether an abstract is consistent with a document.
  • the environment 100 includes a model training system 110 and a model application system 120 .
  • The model training system 110 is configured to use a plurality of training samples 112-1, 112-2, ..., 112-N and a set of annotation information 114 to train the consistency detection model 105, where N is an integer greater than or equal to 1. For ease of discussion, these samples are collectively referred to as samples 112.
  • Each sample 112 includes a document 113 and an abstract 115 .
  • Annotation information set 114 includes annotation information for sample 112 , which indicates whether the abstract in sample 112 is consistent with the document.
  • the samples 112 used to train the model may include one or more positive samples and one or more negative samples.
  • The consistency detection model 105 can learn from the positive samples what characteristics indicate that an abstract and a document are consistent with each other, and can learn from the negative samples what characteristics indicate that an abstract and a document are inconsistent with each other.
  • document refers to an object that partially or fully represents text in natural language. Documents can be in any electronic format as long as the textual information can be extracted. In the subsequent processing, the text in the document is used as the processing object. Each document can contain multiple text units.
  • summary refers to a simplified version of a document that expresses important information in the document more concisely and with less text.
  • Each abstract can consist of multiple text units.
  • text unit refers to a unit processed in a natural language processing task, and its granularity can be changed and set according to the application.
  • a text unit may include a word, a phrase, a symbol, a combination of the foregoing, or any other unit that may appear in a natural language expression.
  • units of text are also referred to as tokens.
  • the parameter values of the consistency detection model 105 may be initialized, or pre-trained parameter values may be obtained through a pre-training process. Through the training process, the parameter values of the consistency detection model 105 are updated and adjusted. After the training is completed, the consistency detection model 105 has trained parameter values. Based on such parameter values, the consistency detection model 105 can be used to implement the task of consistency detection between abstracts and documents.
  • model application system 120 receives input source documents 132 and target abstracts 134 .
  • the model application system 120 may be configured to utilize the trained consistency detection model 105 to perform consistency detection for the source document 132 and the target abstract 134 .
  • the model training system 110 and the model application system 120 may be any systems with computing capabilities, such as various computing devices/systems, terminal devices, servers, and the like.
  • Terminal equipment can be any type of mobile terminal, fixed terminal or portable terminal, including mobile phones, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, media computers, multimedia tablets, or any combination of the foregoing , including accessories and peripherals for these devices, or any combination thereof.
  • Servers include, but are not limited to, mainframes, edge computing nodes, computing devices in cloud environments, and the like.
  • model training system 110 and model application system 120 may be integrated in the same system or device. Embodiments of the present disclosure are not limited in this regard.
  • an improved document and abstract consistency detection scheme is proposed.
  • the adversarial data augmentation training method is used to construct the adversarial negative samples.
  • Adversarial negative examples are usually generated by applying perturbation information to negative examples.
  • a more effective way of adversarial data augmentation is proposed to construct adversarial negative samples.
  • using negative samples and adversarial negative samples to train a consistency detection model enables the model to better detect and track parts of summaries that are inconsistent with documents.
  • FIG. 2 shows an example of an architecture 200 for training the consistency detection model 105 according to some embodiments of the present disclosure.
  • the architecture 200 of FIG. 2 may be implemented in the model training system 110 of FIG. 1 .
  • Each module/component in the architecture 200 may be implemented by hardware, software, firmware or any combination thereof.
  • samples 202 and 204 for training the consistency detection model 105 are shown.
  • The samples 202 and 204 are to be used to train the consistency detection model 105 and may, for example, be included among the samples 112 shown in FIG. 1.
  • the samples 202 and 204 may be expressed in the form of a text sequence, which includes multiple text units of the document and multiple text units of the abstract concatenated.
  • The text sequences of samples 202 and 204 also include a symbol [CLS] at the starting position, used to indicate the start of the text sequence, and a symbol [SEP] inserted between the document and the abstract to separate them.
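  • As a concrete illustration of this input format, the following sketch (a hypothetical example assuming a Hugging Face-style tokenizer; the "roberta-base" name and the example strings are not from the disclosure) shows how a document and an abstract might be concatenated into a single padded sequence whose special tokens play the roles of [CLS] and [SEP].

```python
# Hypothetical sketch: build the input sequence "[CLS] <document> [SEP] <abstract>".
# The tokenizer name is an assumption; any encoder tokenizer with start/separator
# tokens (RoBERTa uses <s> and </s> in place of [CLS]/[SEP]) works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

document = "Davis was appointed interim head coach in June 2010 ..."
abstract = "Davis was named interim head coach in June 2010."

# Passing the document and abstract as a text pair inserts the special tokens
# automatically; padding gives every sample the same predetermined length.
encoded = tokenizer(
    document,
    abstract,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])
```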
  • the sample 202 shown includes a document 211 and an abstract 213 , wherein the abstract 213 is inconsistent with the document 211 , that is, the sample 202 is a negative sample for training the consistency detection model 105 , also called an inconsistent sample.
  • The shown sample 204 includes a document 215 and an abstract 217, wherein the abstract 217 is consistent with the document 215; that is, the sample 204 is a positive sample for training the consistency detection model 105, also called a consistent sample.
  • The consistency between the document and the abstract of each sample is indicated by the annotation information set 116.
  • Text units of documents and summaries in different samples may or may not be the same.
  • The text sequences obtained by concatenating the documents and abstracts of different samples may be padded to a predetermined number of text units.
  • the text sequence of each sample may be provided to the embedding layer 210 so that the embedding layer 210 outputs an embedded representation corresponding to each sample.
  • embedded representation refers to a vectorized representation of a text sequence, where each text unit and other special symbols (e.g., [CLS] and [SEP]) in the text sequence can be transformed into corresponding vectors.
  • The vectorized representation of the overall text sequence can be in the form of a multidimensional vector. In this way, subsequent processing can be performed on the basis of the vectorized representation.
  • different text units or symbols can be transformed into different vectors.
  • embedding layer 210 determines embedded representation 212 for sample 202 and embedded representation 214 for sample 204 .
  • the embedding layer 210 may use predetermined text units and a mapping table of symbols and vectors to determine the embedded representation, or may use a machine learning model, such as a language model, to extract features of text sequences as embedded representations. Embodiments of the present disclosure are not limited in this respect.
  • the positive samples and negative samples (eg, samples 204 and 202 ) used to train the consistency detection model 105 may be obtained from a database, or obtained from other data sources. In practical applications, there may be more positive samples in existing data sources, that is, documents and summaries that are consistent with each other.
  • negative samples can also be constructed based on existing positive samples, so as to obtain artificial training data. In this way, the huge cost caused by artificially generating or marking negative samples can be avoided, and based on the supervision information of only positive samples, negative samples and their supervision information can be obtained quickly, effectively and at low cost for model training.
  • sample 202 in FIG. 2 with inconsistent documents 211 and abstracts 213 is generated from a positive sample (eg, sample 204).
  • the consistency between the abstract 217 and the document 215 can be broken by modifying one or more text units in the abstract 217 to obtain the inconsistent sample 202 .
  • In this case, the document 211 is the same as the document 215 in the sample 204 (denoted s), while the abstract 213 (denoted t') is obtained by modifying the abstract 217 (denoted t).
  • the annotation information set 116 not only records the annotation information of the existing sample 204 , but also supplements the annotation information of the newly generated sample 202 , the annotation information indicates that the document 211 is inconsistent with the abstract 213 .
  • Consistency of summary 217 with document 215 can be broken in a number of ways. Some example approaches are described below.
  • one or more text units in summary 217 may be modified by means of entity replacement. Specifically, entities in the abstract 217 may be replaced with entities of the same type in the document 215 to obtain the abstract 213 . In some examples, one or more entities in summary 217 may be replaced.
  • entity refers to a thing or a concept. Each entity may be represented by one or more textual units (eg, words, phrases), etc. Entities can be classified by type as people, roles, objects, events, etc.
  • For example, an entity (e.g., a person's name) in the abstract 217 may be replaced with another entity (e.g., another person's name) of the same type appearing in the document 215.
  • In some embodiments, the other entity of the same type may be randomly selected from the document 215.
  • the similarity between entities can be measured, for example, by a text-based distance algorithm.
  • the threshold similarity can be configured as required.
  • one or more textual units in abstract 217 may additionally or alternatively be modified by pronoun replacement.
  • the pronoun in the abstract 217 may be replaced with another pronoun to obtain the abstract 213 .
  • Each pronoun can be represented by one or more textual units (eg, words, phrases), etc.
  • Another pronoun may be a pronoun that grammatically matches the sentence in which the pronoun is located in the abstract 217 to avoid grammatical errors in the revised abstract.
  • the pronouns "he” and “his” can be interchanged with the pronouns “she” and “her”, respectively, the pronouns "they", “we", “you”, etc. can be interchanged, etc.
  • one or more pronouns in abstract 217 may be replaced.
  • the pronouns to be replaced in summary 217 may be randomly selected.
  • one or more units of text in abstract 217 may additionally or alternatively be modified by positive-negative modification.
  • For example, affirmative verbs in the abstract 217 may be changed to negative verbs, and/or negative verbs may be changed to affirmative verbs, to obtain the abstract 213.
  • Each verb may be represented by one or more textual units (eg, words, phrases), etc.
  • In some embodiments, the affirmative and negative forms of auxiliary verbs, such as be-verbs and modal verbs (e.g., should, could, would), may also be specifically modified.
  • one or more verbs in abstract 217 may be replaced.
  • the verbs to be replaced in summary 217 may be randomly selected.
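  • The three modification strategies described above (entity replacement, pronoun replacement, and affirmative/negative verb modification) could be sketched as follows. This is a hypothetical illustration: the use of spaCy for entity recognition and the small swap tables are assumptions, not part of the disclosure.

```python
# Hypothetical sketch of constructing an inconsistent abstract from a consistent one.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # any English NER pipeline would do

PRONOUN_SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
                 "they": "we", "we": "they"}
NEGATION_SWAPS = {"is": "is not", "was": "was not", "can": "cannot",
                  "should": "should not", "is not": "is", "was not": "was"}

def replace_entity(abstract: str, document: str) -> str:
    """Replace one entity in the abstract with a same-type entity from the document."""
    abs_ents = list(nlp(abstract).ents)
    doc_ents = list(nlp(document).ents)
    for a_ent in abs_ents:
        candidates = [d for d in doc_ents
                      if d.label_ == a_ent.label_ and d.text != a_ent.text]
        if candidates:
            return abstract.replace(a_ent.text, random.choice(candidates).text, 1)
    return abstract

def replace_pronoun(abstract: str) -> str:
    """Swap one randomly chosen pronoun for a grammatically compatible one."""
    tokens = abstract.split()
    idxs = [i for i, t in enumerate(tokens) if t.lower() in PRONOUN_SWAPS]
    if idxs:
        i = random.choice(idxs)
        tokens[i] = PRONOUN_SWAPS[tokens[i].lower()]
    return " ".join(tokens)

def flip_verb_polarity(abstract: str) -> str:
    """Change an affirmative verb form to a negative one, or vice versa."""
    for src, dst in NEGATION_SWAPS.items():
        if f" {src} " in abstract:
            return abstract.replace(f" {src} ", f" {dst} ", 1)
    return abstract
```

  • The text units modified by such functions are exactly the units that would be marked as inconsistent with the original document for use in subsequent training.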
  • Figure 3 shows examples of source documents and abstracts according to some embodiments of the present disclosure.
  • a document 310 is presented along with an abstract 320 that is in fact consistent with the document 310 .
  • Abstract 320 correctly summarizes key information in document 310 , such as bolded and underlined sentences in document 310 . Therefore, the document 310 and the abstract 320 can constitute a consistent positive sample.
  • the entity "Davis" in the abstract 320 is replaced with another entity "Harry” appearing in the document 320 by means of entity replacement, thereby obtaining the modified abstract 330 .
  • the facts described by the abstract 330 are no longer consistent with the document 310, so the document 310 and the abstract 330 can constitute an inconsistent negative sample.
  • Some example ways of modifying summaries in positive samples to construct summaries in negative samples are described above.
  • other ways to modify the summary 217 may also be applied to construct a summary inconsistent with the original document.
  • one or more text units in the abstract 217 may be modified in one or more ways.
  • a modified text unit obtained by modifying a text unit in summary 217 is what caused summary 213 to be inconsistent with the original document, so that text unit or units may be marked as inconsistent with the original document. Such labels will be used in subsequent model training.
  • In some embodiments, one or more negative samples can also be obtained from existing databases or constructed manually for training the consistency detection model 105. In some embodiments, corresponding negative samples may not be constructed for some positive samples. Embodiments of the present disclosure are not limited in this respect.
  • the robustness of the consistency detection model 105 can also be improved by means of adversarial enhancement.
  • Models trained on the basis of ordinary positive and negative samples, especially artificially constructed samples, can often give correct results for simple inputs, but their robustness to the complex situations that may occur in practical applications is often not high. Adversarial enhancement can therefore improve the robustness of the trained model to complex samples.
  • the way of adversarial enhancement is to apply interference information to existing samples (positive samples and negative samples) to obtain adversarial samples. Due to the addition of noise information, adversarial examples are different from simple existing examples.
  • The model is required to learn from the adversarial samples, so that it also outputs, for the adversarial samples, the same detection results as for the existing samples. For example, for an adversarial positive sample constructed from an existing positive sample, the model is required to be able to judge that the abstract in the adversarial positive sample is consistent with the document, while for an adversarial negative sample constructed from an existing negative sample, the model is required to be able to judge that the abstract in the sample is inconsistent with the document.
  • the model trained in this way can also give correct detection results when faced with complex inputs that vary in practical applications.
  • Adversarial augmentation is often used in machine learning applications.
  • Conventionally, the perturbation information is applied uniformly to the entire sample.
  • the inventors of the present application found that, in the task of checking the consistency between the document and the abstract, such a perturbation method is disadvantageous in terms of improving the detection accuracy and tracking the erroneous parts in the abstract. Therefore, according to the embodiments of the present disclosure, an improved way of adversarial enhancement is proposed. The following will first discuss how to determine the perturbation information used to generate adversarial samples, and then discuss the improved adversarial enhancement method.
  • perturbation information may be determined for both positive and negative samples.
  • the sample 202 and the sample 204 are taken as examples for illustration.
  • the sample 202 and the sample 204 can be respectively applied to the consistency detection model 105 , specifically, the embedding representation 212 corresponding to the sample 202 and the embedding representation 214 corresponding to the sample 204 are input to the consistency detection model 105 .
  • the consistency detection model 105 processes the embedded representations 212 and 214 with the current parameter values to give corresponding detection results.
  • the detection result for the sample 202 indicates whether the abstract 213 in the sample 202 is consistent with the document 211
  • The detection result for the sample 204 indicates whether the abstract 217 in the sample 204 is consistent with the document 215.
  • the current detection result reflects the initial or intermediate learning detection capability of the consistency detection model 105 . Note that the training process of the model is an iterative process, and the detection ability of the model will continue to improve during the iterative process.
  • the consistency detection model 105 may include a feature extraction part and a result prediction part.
  • the feature extraction part is used to extract feature representations related to documents and summaries from embedded representations.
  • the feature extraction part can be implemented by using various machine learning models, neural networks, etc. suitable for feature extraction of text, such as Roberta model, various encoder models, etc.
  • the result prediction part is used to determine the prediction result based on the features extracted by the feature extraction part, that is, whether the input abstract is consistent with the document.
  • the result prediction part can be implemented as a linear layer, such as a softmax layer, etc.
  • The output of the consistency detection model 105 is a binary classification output, that is, one of two prediction results: consistent or inconsistent.
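  • A minimal sketch of such a model is shown below, assuming a RoBERTa-style encoder as the feature extraction part and a linear layer as the result prediction part with a two-way output; the class and variable names are illustrative and not taken from the disclosure.

```python
# Minimal sketch of a consistency detection model: encoder + linear classifier.
import torch
import torch.nn as nn
from transformers import AutoModel

class ConsistencyDetector(nn.Module):
    def __init__(self, encoder_name: str = "roberta-base", num_labels: int = 2):
        super().__init__()
        # feature extraction part
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # result prediction part (a softmax can be applied to the logits)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids=None, attention_mask=None, inputs_embeds=None):
        # Accepting inputs_embeds makes it possible to feed perturbed embedded
        # representations (adversarial samples) directly into the encoder.
        outputs = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask,
                               inputs_embeds=inputs_embeds)
        cls_repr = outputs.last_hidden_state[:, 0]  # representation at the start symbol
        return self.classifier(cls_repr)            # two-way logits: consistent / inconsistent
```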
  • the detection results generated by the consistency detection model 105 for each sample are provided to the loss function calculation module 220 .
  • the loss function calculation module 220 is configured to determine the difference between the detection result generated by the consistency detection model 105 for each sample and the annotation information for the sample in the annotation information set 116 based on the annotation information set 116 .
  • The differences can be expressed in the form of a loss function, such as the cross-entropy loss, which can be expressed as $\mathcal{L}(e, \theta, Y)$, where e represents a sample (specifically, the embedded representation of the sample), $\theta$ represents the current parameter values of the consistency detection model 105, Y represents the annotation information of the sample, and Y ∈ {consistent, inconsistent}.
  • The loss function $\mathcal{L}(e, \theta, Y)$ measures the difference between the prediction result given by the consistency detection model 105 for the sample e based on the current parameter values and the true result given by the annotation information Y.
  • the training target is configured to reduce or minimize the difference between the detection result of the consistency detection model 105 on the sample and the label information, for example, reduce it to a certain predetermined threshold (set as required).
  • Based on such a training target, the parameter values of the consistency detection model 105 can be updated so that the loss function $\mathcal{L}(e, \theta, Y)$ is reduced or minimized.
  • the architecture 200 includes a parameter update module 230 configured to update the parameter values of the consistency detection model 105 according to the training target. Therefore, the loss function may be provided to parameter update module 230 for parameter value update of the model.
  • At the beginning of training, the consistency detection model 105 may not be able to accurately predict the consistency between the document and the abstract in an input sample. As the parameter values are continuously updated, the detection ability of the model improves, and the value of the loss function is continuously reduced.
  • various training methods such as stochastic gradient descent, can be used to update model parameters, so as to determine how to update the parameter values of the model.
  • When determining the adversarial sample of a given sample, the total disturbance information to be applied to the sample can be determined based on the difference between the detection result for the original sample and the annotation information, that is, based on the loss function $\mathcal{L}(e, \theta, Y)$. The adversarial sample is then generated based on the total disturbance information and the original sample.
  • the total disturbance information for each sample may be determined by the disturbance determination module 240 .
  • the total disturbance information may be represented as a perturbation vector, which includes a vector applied to each text unit or special symbol of a sample's text sequence (eg, sample 202 or 204 ).
  • The total interference information can be determined as the worst-case interference vector that maximizes the loss function; that is, the total interference information is expected to interfere with or hinder the correct detection of the adversarial sample by the consistency detection model 105, so as to enhance the detection ability of the consistency detection model 105 for adversarial samples.
  • The determination of the total interference information for a sample may be expressed as:
    $$v^{*} = \arg\max_{\|v\| \le \epsilon} \mathcal{L}(e + v, \theta, Y) \qquad (1)$$
    where $\epsilon$ represents the norm bound of the total interference information, which can be a predetermined value; $e + v$ represents the adversarial sample obtained after applying the interference information v to the sample e; and $\arg\max(\cdot)$ indicates that the interference information v that maximizes the loss function is determined as the total interference information for the sample e.
  • The total interference information can be determined from equation (1) through various approximations.
  • In some embodiments, the total disturbance information can be calculated using the Fast Gradient Value (FGV) algorithm, which can be expressed as:
    $$g = \nabla_{e} \, \mathcal{L}(e, \theta, Y), \qquad v = \epsilon \cdot \frac{g}{\|g\|_{2}} \qquad (2)$$
    where the gradient g is the first-order derivative of the loss function $\mathcal{L}(e, \theta, Y)$ with respect to the sample e, representing the direction in which the loss function grows fastest, and $g / \|g\|_{2}$ represents the normalization of the gradient g.
  • Based on the loss function, for example using formula (2), the disturbance determination module 240 determines the interference information 242 for the sample 202 and obtains the total interference information 252 for the sample 202 through normalization. The disturbance determination module 240 may similarly determine the interference information 244 for the sample 204 and determine the total interference information 254 for the sample 204 through normalization.
  • The total interference information determined for a sample contains the interference vectors applied to the individual text units of the sample.
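  • Under the assumptions of the ConsistencyDetector sketch above, an FGV-style computation of the total interference information might look as follows: the gradient of a cross-entropy loss with respect to the embedded representation is normalized and scaled by a norm bound epsilon. Function and variable names are illustrative.

```python
# Hypothetical sketch of formula (2): g = dL/de, v = epsilon * g / ||g||.
import torch
import torch.nn.functional as F

def total_interference(model, inputs_embeds, attention_mask, labels, epsilon=1.0):
    inputs_embeds = inputs_embeds.clone().detach().requires_grad_(True)
    logits = model(attention_mask=attention_mask, inputs_embeds=inputs_embeds)
    loss = F.cross_entropy(logits, labels)              # loss L(e, theta, Y)
    grad = torch.autograd.grad(loss, inputs_embeds)[0]  # gradient g with respect to e
    norm = torch.linalg.vector_norm(grad, dim=(1, 2), keepdim=True)
    return (epsilon * grad / (norm + 1e-12)).detach()   # normalized, scaled perturbation v
```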
  • In the embodiments of the present disclosure, for a negative sample, the part of the interference information to be applied to the text units marked as inconsistent in the abstract is filtered out, and the interference information is applied only to the other text units of the negative sample. That is, for negative samples, the text units in the abstract that are marked as inconsistent with the document are not perturbed.
  • The sample 202 is a negative sample; therefore, the total interference information 252 is filtered by a filter vector 262 to obtain filtered interference information 272.
  • The filter vector 262 may be composed of 0s and 1s, where a value of 0 is applied to the interference vectors in the total interference information 252 that correspond to the text units marked as inconsistent in the abstract 213, and a value of 1 is applied to the interference vectors that correspond to the other text units of the document 211 and the abstract 213. As a result, the interference vectors corresponding to the text units marked as inconsistent in the abstract 213 are no longer included in the filtered interference information 272.
  • When constructing the negative sample 202 from a positive sample, the text unit(s) in the abstract 213 that are modified from the abstract 217 can be marked, so that such marking information can be used directly when filtering.
  • text units inconsistent with the document 211 in the summary 213 may be manually or automatically marked in other ways.
  • The filtered interference information 272 is applied to the sample 202, for example to the embedded representation e corresponding to the sample 202, to obtain the embedded representation 216 of the adversarial sample corresponding to the sample 202, denoted e'.
  • For the positive sample 204, the normalized total interference information 254 can be applied directly to the embedded representation e corresponding to the sample 204, resulting in the embedded representation 218 of the adversarial sample corresponding to the sample 204, likewise denoted e'. That is, for positive samples, all text units of the document and the abstract may be disturbed.
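  • The masking of the perturbation for negative samples could be sketched as follows: positions of text units marked as inconsistent receive a filter value of 0, all other positions receive 1, and the adversarial embedded representation is obtained as e plus the filtered perturbation (for positive samples the filter is all ones). The variable names are illustrative assumptions.

```python
# Hypothetical sketch of applying a filter vector of 0s and 1s to the perturbation.
import torch

def adversarial_embedding(inputs_embeds, perturbation, inconsistent_positions=None):
    # one filter value per position in the concatenated document/abstract sequence
    mask = torch.ones(inputs_embeds.shape[:2], device=inputs_embeds.device)
    if inconsistent_positions is not None:      # negative sample: mask marked text units
        mask[:, inconsistent_positions] = 0.0
    return inputs_embeds + mask.unsqueeze(-1) * perturbation   # e' = e + filtered v
```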
  • Adversarial examples of positive and negative examples can also be applied to the consistency detection model 105 for constructing additional loss functions.
  • the embedded representations 216 and 218 of the adversarial samples corresponding to the samples 202 and 204 can be input to the consistency detection model 105 respectively, so that the consistency detection model 105 uses the current parameter values to process the embedded representations respectively 216 and 218 to give corresponding detection results.
  • the detection result for the embedded representation 216 indicates whether the abstract in the adversarial example corresponding to the sample 202 is consistent with the document, that is, whether the disturbed abstract 213 is consistent with the disturbed document 211 .
  • the detection result for the embedded representation 218 indicates whether the abstract in the adversarial example corresponding to the sample 204 is consistent with the document, that is, whether the disturbed abstract 217 is consistent with the disturbed document 215 .
  • the label information of the adversarial example is consistent with the label information of the original sample.
  • In this way, the consistency detection model 105 has higher robustness and can still give, for abstracts and documents modified by the disturbance information, the same detection results as for the undisturbed abstracts and documents.
  • The detection results generated by the consistency detection model 105 for each adversarial sample are likewise provided to the loss function calculation module 220.
  • the loss function calculation module 220 is configured to determine the difference between the detection result generated by the consistency detection model 105 for each adversarial example and the label information of the original sample corresponding to the adversarial example in the label information set 116 based on the label information set 116 .
  • Such differences can be expressed in the form of a loss function, such as the cross-entropy loss, which can be expressed as an adversarial loss function $\mathcal{L}_{adv}(e', \theta, Y)$, where e' represents an adversarial sample (specifically, the embedded representation of the adversarial sample), $\theta$ represents the current parameter values of the consistency detection model 105, Y represents the annotation information of the original sample e corresponding to the adversarial sample, and Y ∈ {consistent, inconsistent}.
  • the training target is configured to reduce or minimize the difference between the detection result of the consistency detection model 105 on the adversarial samples and the label information, for example, within a certain predetermined threshold (set as required).
  • Based on such a training target, the parameter values of the consistency detection model 105 can be updated so that the adversarial loss function $\mathcal{L}_{adv}$ is reduced or minimized. Therefore, the adversarial loss function may also be provided to the parameter update module 230 for updating the parameter values of the model.
  • The parameter update module 230 can update the parameter values of the model based on the two loss functions to achieve the overall training goal, that is, so that the difference between the detection results of the consistency detection model 105 for the original samples and the annotation information is reduced or minimized, and the difference between the detection results for the adversarial samples and the annotation information is also reduced or minimized.
  • The total loss function used by the parameter update module 230 for updating the model parameter values can be expressed as:
    $$\mathcal{L}_{total} = \alpha \, \mathcal{L}(e, \theta, Y) + (1 - \alpha) \, \mathcal{L}_{adv}(e', \theta, Y) \qquad (3)$$
    where $\alpha$ is a predetermined value between 0 and 1 for weighing the two loss functions.
  • The parameter update module 230 can use various training methods, such as stochastic gradient descent, to update the model parameters so that the total loss function $\mathcal{L}_{total}$ is reduced to within a predetermined threshold or minimized.
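  • A single training step combining the two loss functions with a weighting factor between 0 and 1, as in the total loss above, might be sketched as follows; the optimizer choice and the value of alpha are assumptions.

```python
# Hypothetical sketch of one update step with the weighted total loss of formula (3).
import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs_embeds, adv_embeds, attention_mask, labels, alpha=0.5):
    logits = model(attention_mask=attention_mask, inputs_embeds=inputs_embeds)
    adv_logits = model(attention_mask=attention_mask, inputs_embeds=adv_embeds)
    loss = F.cross_entropy(logits, labels)           # loss on the original samples
    adv_loss = F.cross_entropy(adv_logits, labels)   # adversarial loss, same annotation
    total = alpha * loss + (1.0 - alpha) * adv_loss  # weighted total loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```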
  • By masking the interference information for the inconsistent text units in the abstracts of negative samples, the consistency detection model remains sensitive to those inconsistent text units. This not only improves the consistency detection model's ability to accurately detect consistency, but also enables the consistency detection model to better track the erroneous parts of a summary, thereby providing automatic error tracking capability.
  • Such error tracking capability is achieved by using the back-propagated gradient g. The following analyzes how such error tracking is realized.
  • In Equation (3), since the adversarial loss function is determined using adversarial samples, and the degree of difference between the adversarial samples and the annotation information may be higher than the difference between the original samples and the annotation information, Equation (3) can be approximately simplified as $\mathcal{L}_{total} \approx \mathcal{L}_{adv}(e', \theta, Y)$. During training, for negative samples, since the perturbations of the inconsistent text units are masked (i.e., not applied in the adversarial samples), changes in these text units lead to larger changes in the overall loss function; that is, the consistency detection model 105 maintains sensitivity to these text units.
  • the trained consistency detection model 105 can be provided to the model application system 120 for use in judging the consistency between the input source document 132 and the target abstract 134 .
  • FIG. 4 illustrates an architecture 400 for applying the consistency detection model 105 according to some embodiments of the present disclosure.
  • the architecture 400 of FIG. 4 may be implemented in the model application system 120 of FIG. 1 .
  • Each module/component in architecture 400 may be implemented by hardware, software, firmware or any combination thereof.
  • The source document 132 and the target abstract 134 can form a text sequence 402, which includes the text units of the source document 132 and the target abstract 134, and also includes the special symbol [CLS] indicating the start of the text sequence and the special symbol [SEP] used to separate the document and the abstract.
  • the text sequence 402 is provided to the embedding layer 210 , which converts the text sequence 402 into a corresponding embedded representation 412 .
  • the corresponding embedded representation 412 may be input to the consistency detection model 105 .
  • Consistency detection model 105 processes embedding representation 412 using the trained parameter values to obtain target detection results 415 indicating whether target abstract 134 is consistent with source document 132 .
  • the trained consistency detection model 105 can also provide error tracking capabilities.
  • The architecture 400 includes an error tracking module 420 that provides the error tracking functionality. If the target detection result 415 indicates that the target abstract 134 is inconsistent with the source document 132, the error tracking module 420 is activated.
  • The error tracking module 420 determines a plurality of rates of change of the target detection result 415 relative to a plurality of target text units in the target abstract 134.
  • the calculation of the rate of change may include calculating the gradient of the target detection result 415 with respect to the plurality of target text units in the target summary 134 .
  • Specifically, the error tracking module 420 can calculate a cross-entropy loss, similar to the loss function $\mathcal{L}$, based on the embedded representation 412 corresponding to the text sequence 402, the current parameter values of the model (i.e., the parameter values after training), and the target detection result 415. Then, individual gradients of the cross-entropy loss are calculated with respect to each target text unit in the target abstract 134.
  • the gradient distribution (ie, the distribution of rates of change) of these text units may indicate the degree to which each text unit contributes to the inconsistency between the target abstract 134 and the source document 132 .
  • Based on the determined rates of change (e.g., the gradient of each text unit), the error tracking module 420 selects text units with higher rates of change from the target abstract 134 and determines the selected text units to be erroneous text units in the target abstract 134. In some embodiments, the error tracking module 420 may select the top k text units with the highest rates of change (k being an integer greater than or equal to 1) and mark these text units as erroneous. In some embodiments, the error tracking module 420 may provide error prompt information 422 to indicate the one or more text units in the target abstract 134 that are determined to be erroneous.
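  • Gradient-based error tracking could be sketched as follows, under the assumptions of the earlier model sketch; using the L2 norm of each position's gradient as its rate of change and the value of k are illustrative choices.

```python
# Hypothetical sketch: flag the top-k abstract positions with the largest gradients.
import torch
import torch.nn.functional as F

def track_errors(model, inputs_embeds, attention_mask, abstract_positions, k=5):
    inputs_embeds = inputs_embeds.clone().detach().requires_grad_(True)
    logits = model(attention_mask=attention_mask, inputs_embeds=inputs_embeds)
    pred = logits.argmax(dim=-1)                 # target detection result
    loss = F.cross_entropy(logits, pred)         # cross-entropy at the predicted label
    grad = torch.autograd.grad(loss, inputs_embeds)[0]
    scores = grad.norm(dim=-1).squeeze(0)        # one rate of change per sequence position
    abstract_scores = scores[abstract_positions] # restrict to the target abstract
    top = abstract_scores.topk(min(k, len(abstract_positions))).indices
    return [abstract_positions[i] for i in top.tolist()]
```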
  • the error prompt information 422 can be provided to the user, so that the user can quickly understand which text units in the target abstract 134 are wrong, thus causing the target abstract 134 to be inconsistent with the source document 132 .
  • the inconsistent part may also be indicated to the user by marking (highlighting, underlining, dotted box, etc.) on the text unit in the target summary 520 .
  • Figure 5 illustrates an example of error tracking for summaries according to some embodiments of the present disclosure.
  • a source document 510 and a target abstract 520 are presented.
  • In this example, it is predetermined that the text units with the top 5 rates of change are extracted and marked as inconsistent text units.
  • Based on the rate of change of each text unit in the target abstract 520, it can be determined that the words "day for June 2010" and "holiday" are erroneous summary extractions, resulting in inconsistency with the facts described in the source document 510.
  • FIG. 6 shows a flowchart of a process 600 for consistency detection between documents and abstracts according to some embodiments of the present disclosure.
  • Process 600 may be implemented at model training system 110 and/or model application system 120 .
  • a first sample and first annotation information are determined, the first annotation information indicating that the first abstract included in the first sample is inconsistent with the first document. At least one textual unit of the plurality of textual units of the first abstract is marked as inconsistent with the first document.
  • In some embodiments, when determining the first sample, the first sample may be generated based on a sample in which the document is consistent with the abstract. Specifically, a second sample including the first document and a second abstract, together with second annotation information, may be acquired, the second annotation information indicating that the second abstract is consistent with the first document.
  • the first abstract is generated by modifying at least one text unit in the second abstract, and the first document and the first abstract are combined into a first sample.
  • the first annotation information may also be generated to indicate that the first document is inconsistent with the first abstract.
  • the modified at least one text unit included in the first abstract is marked as inconsistent with the first document.
  • In some embodiments, when generating the first abstract, an entity in the second abstract may be replaced with another entity of the same type in the first document. In some embodiments, alternatively or additionally, when generating the first abstract, a pronoun in the second abstract is replaced with another pronoun. In some embodiments, alternatively or additionally, when generating the first abstract, verbs in the affirmative form in the second abstract are modified to the negative form, and/or verbs in the negative form in the second abstract are modified to the affirmative form.
  • a first adversarial example is generated by applying disturbance information to the first example.
  • The disturbance information is applied to the first document in the first sample and to the text units of the first abstract other than the at least one text unit.
  • In some embodiments, the disturbance information to be applied may be determined by applying the first sample to the consistency detection model to obtain a first detection result output by the consistency detection model, the first detection result indicating whether the first document in the first sample is consistent with the first abstract.
  • Total interference information for the first sample is determined based on a first difference between the first detection result and the first annotation information.
  • An information portion to be applied to at least one text unit marked as inconsistent in the first abstract is filtered out from the total noise information to obtain noise information. In this way, for the first sample containing inconsistent documents and abstracts, noise information will not be applied to the inconsistent text units.
  • A consistency detection model is trained according to a training target based on at least the first sample, the first adversarial sample, and the first annotation information. The consistency detection model is configured to detect whether an abstract is consistent with a document, and the training target is configured so that the differences between the first annotation information and the detection results of the consistency detection model for the first sample and the first adversarial sample are all within a predetermined threshold.
  • In some embodiments, the first sample and the first adversarial sample can be respectively applied to the consistency detection model to obtain the first detection result and a second detection result output by the consistency detection model.
  • the first detection result indicates whether the first document in the first sample is consistent with the first abstract
  • The second detection result indicates whether the first document is consistent with the disturbed first abstract.
  • the parameter values of the consistency detection model are updated based on at least a first difference between the first detection result and the first annotation information and a second difference between the second detection result and the first annotation information.
  • In some embodiments, the consistency detection model is also trained using samples whose documents and abstracts are consistent.
  • the third sample and third annotation information may be determined, where the third annotation information indicates that the third document included in the third sample is consistent with the third abstract.
  • A third adversarial sample is generated by applying disturbance information to the third document and the third abstract. The consistency detection model may also be trained according to the training target based on the third sample, the third adversarial sample, and the third annotation information, the training target also being configured so that the differences between the third annotation information and the detection results of the consistency detection model for the third sample and the third adversarial sample are all within a predetermined threshold.
  • The trained consistency detection model can be applied to detect the consistency between documents and summaries. Specifically, in some embodiments, a source document and a target abstract are obtained and applied to the trained consistency detection model to obtain a target detection result output by the consistency detection model, the target detection result indicating whether the target abstract is consistent with the source document.
  • the trained consistency detection model can also provide error tracking capabilities. Specifically, if the target detection result indicates that the target abstract is inconsistent with the source document, a plurality of change rates of the target detection result relative to the plurality of target text units in the target abstract is determined. Based on the plurality of rates of change, at least one target text unit is selected from the plurality of target text units, the at least one target text unit has a greater rate of change than other text units in the target summary. In some embodiments, error prompt information may be provided to indicate that at least one target text unit in the target abstract is wrong.
  • Fig. 7 shows a block diagram of an apparatus 700 for checking consistency between a document and an abstract according to some embodiments of the present disclosure.
  • Apparatus 700 may be implemented as or included in model training system 110 and/or model application system 120 .
  • Each module/component in the device 700 may be implemented by hardware, software, firmware or any combination thereof.
  • The apparatus 700 includes a determination module 710 configured to determine a first sample and first annotation information, the first annotation information indicating that a first abstract included in the first sample is inconsistent with a first document, wherein at least one text unit of the plurality of text units included in the first abstract is marked as inconsistent with the first document.
  • The apparatus 700 further includes an adversarial generation module 720 configured to generate a first adversarial sample by applying disturbance information to the first sample, the disturbance information being applied to the first document in the first sample and to the text units of the first abstract other than the at least one text unit.
  • The apparatus 700 also includes a training module 730 configured to train a consistency detection model according to a training target based on at least the first sample, the first adversarial sample, and the first annotation information, the consistency detection model being configured to detect whether an abstract is consistent with a document, and the training target being configured so that the differences between the first annotation information and the detection results of the consistency detection model for the first sample and the first adversarial sample are all within a predetermined threshold.
  • the determining module 710 includes: an obtaining module configured to obtain a second sample including the first document and the second abstract and second annotation information, the second annotation information indicating that the second abstract is consistent with the first document; an abstract generation module configured to generate the first abstract by modifying at least one text unit in the second abstract; a sample composition module configured to compose the first document and the first abstract into a first sample; and an annotation generation module, It is configured to generate first annotation information to indicate that the first document is inconsistent with the first abstract.
  • the modified at least one text unit included in the first abstract is marked as inconsistent with the first document.
  • the digest generation module is configured to modify at least one textual unit in the second digest by at least one of: replacing an entity in the second digest with another entity of the same type in the first document, Replace the pronoun in the second abstract with another pronoun, modify the affirmative form of the verb in the second abstract to a negative form of the verb, and modify the negative form of the verb in the second abstract to an affirmative form of the verb.
  • In some embodiments, the apparatus 700 further includes a disturbance determination module configured to determine the disturbance information to be applied by: applying the first sample to the consistency detection model to obtain a first detection result output by the consistency detection model, the first detection result indicating whether the first document in the first sample is consistent with the first abstract; determining total disturbance information for the first sample based on a first difference between the first detection result and the first annotation information; and filtering out, from the total disturbance information, the information portion to be applied to the at least one text unit marked as inconsistent in the first abstract, to obtain the disturbance information.
  • the training module 730 includes: a sample application module configured to apply the first sample and the first adversarial sample to the consistency detection model, so as to obtain a first detection result and a second detection result output by the consistency detection model, the first detection result indicating whether the first document in the first sample is consistent with the first abstract, and the second detection result indicating whether the first document is consistent with the disturbed first abstract; and a parameter update module configured to update parameter values of the consistency detection model based at least on a first difference between the first detection result and the first annotation information and a second difference between the second detection result and the first annotation information.
  • the training module 730 further includes: a sample determination module configured to determine a third sample and third annotation information, the third annotation information indicating that a third document included in the third sample is consistent with a third abstract; another adversarial sample generation module configured to generate a third adversarial sample by applying disturbance information to the third document and the third abstract; and a further training module configured to train the consistency detection model according to the training objective also based on the third sample, the third adversarial sample, and the third annotation information, the training objective also being configured such that the differences between the third annotation information and the detection results of the consistency detection model for the third sample and the third adversarial sample are both within the predetermined threshold.
  • the apparatus 700 further includes a document and abstract obtaining module configured to obtain a source document and a target abstract, and a model application module configured to apply the source document and the target abstract to the trained consistency detection model to obtain a target detection result output by the consistency detection model, the target detection result indicating whether the target abstract is consistent with the source document.
  • the apparatus 700 further includes: a change rate determination module configured to, if the target detection result indicates that the target abstract is inconsistent with the source document, determine a plurality of change rates of the target detection result with respect to a plurality of target text units in the target abstract;
  • a text unit selection module configured to select, based on the plurality of change rates, at least one target text unit from the plurality of target text units, the at least one target text unit having a greater change rate than the other text units in the target abstract; and an error prompt module configured to provide error prompt information indicating that the at least one target text unit in the target abstract is erroneous.
  • FIG. 8 shows a block diagram illustrating a computing device 800 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the computing device 800 shown in FIG. 8 is exemplary only and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device 800 shown in FIG. 8 can be used to implement the model training system 110 and/or the model application system 120 of FIG. 1 .
  • computing device 800 is in the form of a general-purpose computing device.
  • Components of computing device 800 may include, but are not limited to, one or more processors or processing units 810, memory 820, storage devices 830, one or more communication units 840, one or more input devices 850, and one or more output devices 860.
  • the processing unit 810 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 820.
  • In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of the computing device 800.
  • Computing device 800 typically includes a plurality of computer storage media. Such media can be any available media that is accessible by computing device 800, including but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 820 may be volatile memory (e.g., registers, cache, random access memory (RAM)), nonvolatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage device 830 may be removable or non-removable media, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the computing device 800.
  • Computing device 800 may further include additional removable/non-removable, volatile/nonvolatile storage media.
  • Although not shown in FIG. 8, a disk drive for reading from or writing to a removable, nonvolatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from or writing to a removable, nonvolatile optical disc may be provided.
  • In these cases, each drive may be connected to the bus (not shown) by one or more data media interfaces.
  • Memory 820 may include a computer program product 825 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure.
  • the communication unit 840 enables communication with other computing devices through the communication medium. Additionally, the functionality of the components of computing device 800 may be implemented in a single computing cluster or as a plurality of computing machines capable of communicating via communication links. Accordingly, computing device 800 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another network node.
  • Input device 850 may be one or more input devices, such as a mouse, keyboard, trackball, and the like.
  • Output device 860 may be one or more output devices, such as a display, speakers, printer, or the like.
  • The computing device 800 may also communicate, as needed, through the communication unit 840 with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the computing device 800, or with any device (e.g., a network card, a modem, etc.) that enables the computing device 800 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • a computer-readable storage medium on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the methods described above.
  • a computer program product tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions, and the computer-executable instructions are executed by a processor to implement the method described above.
  • These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium. The instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other equipment implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical functions.
  • In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Abstract

According to embodiments of the present disclosure, a method, device, apparatus, and storage medium for consistency detection of a document and an abstract are provided. The method includes determining a first sample and first annotation information, the first annotation information indicating that a first abstract included in the first sample is inconsistent with a first document, at least one text unit in the first abstract being marked as inconsistent with the first document. The method further includes generating a first adversarial sample by applying disturbance information to the first sample, the disturbance information being applied to the first sample and to the text units of the first abstract other than the at least one text unit. The method further includes training a consistency detection model according to a training objective based at least on the first sample, the first adversarial sample, and the first annotation information. In this way, the resulting trained model can better detect and track the portions of an abstract that are inconsistent with the document.

Description

用于文档和摘要的一致性检测的方法、设备和介质
本申请要求2021年9月13日递交的,标题为“用于文档和摘要的一致性检测的方法、设备和介质”、申请号为CN202111070769.7的中国发明专利申请的优先权。
技术领域
本公开的示例实施例总体涉及计算机领域,特别地涉及用于文档和摘要的一致性检测的方法、设备和计算机可读存储介质。
背景技术
文本摘要提取是生成源文档的简化版本,同时保留源文档中的重要信息。文档摘要提取是文本生成技术的一个分支,不受到源文档中出现的文本的约束。因此,摘要在生成时具有较大的灵活度和较强的生成能力。当前已有很多研究开发出各种摘要生成模型,来实现自动摘要生成。
然而,摘要生成工作面临的挑战在于摘要的简洁性与源文档中事实的一致性之间的权衡。摘要越简洁,摘要中出现事实性错误的概率可能越高。具有事实性错误的摘要是不可取的。因此,期望能够准确、有效地检测摘要与文档的一致性,进而还可以验证摘要生成模型的可靠性和可用性。
发明内容
根据本公开的示例实施例,提供了一种用于文档和摘要的一致性检测的方案。
在本公开的第一方面,提供了一种用户引导的方法。该方法包括确定第一样本和第一标注信息,第一标注信息指示第一样本包括的第 一摘要与第一文档不一致,第一摘要的多个文本单元中的至少一个文本单元被标记为与第一文档不一致。该方法还包括通过向第一样本施加干扰信息来生成第一对抗样本,干扰信息被施加到第一样本以及第一摘要中除至少一个文本单元之外的其他文本单元。该方法还包括至少基于第一样本、第一对抗样本和第一标注信息,根据训练目标来训练一致性检测模型,一致性检测模型被配置为检测摘要与文档是否一致,训练目标被配置为使一致性检测模型对第一样本和第一对抗样本的检测结果与第一标注信息之间的差异均在预定阈值内。
在本公开的第二方面,提供了一种电子设备。该设备包括至少一个处理单元;以及至少一个存储器,至少一个存储器被耦合到至少一个处理单元并且存储用于由至少一个处理单元执行的指令。指令在由至少一个处理单元执行时使设备执行以下动作:确定第一样本和第一标注信息,第一标注信息指示第一样本包括的第一摘要与第一文档不一致,第一摘要的多个文本单元中的至少一个文本单元被标记为与第一文档不一致;通过向第一样本施加干扰信息来生成第一对抗样本,干扰信息被施加到第一样本以及第一摘要中除至少一个文本单元之外的其他文本单元;以及至少基于第一样本、第一对抗样本和第一标注信息,根据训练目标来训练一致性检测模型,一致性检测模型被配置为检测摘要与文档是否一致,训练目标被配置为使一致性检测模型对第一样本和第一对抗样本的检测结果与第一标注信息之间的差异均在预定阈值内。
在本公开的第三方面,提供了一种用于文档和摘要的一致性检测的装置。该装置包括:确定模块,被配置为确定第一样本和第一标注信息,第一标注信息指示第一样本包括的第一摘要与第一文档不一致,第一摘要的多个文本单元中的至少一个文本单元被标记为与第一文档不一致;对抗生成模块,被配置为通过向第一样本施加干扰信息来生成第一对抗样本,干扰信息被施加到第一样本以及第一摘要中除至少一个文本单元之外的其他文本单元;以及训练模块,被配置为至少基于第一样本、第一对抗样本和第一标注信息,根据训练目标来训 练一致性检测模型,一致性检测模型被配置为检测摘要与文档是否一致,训练目标被配置为使一致性检测模型对第一样本和第一对抗样本的检测结果与第一标注信息之间的差异均在预定阈值内。
在本公开的第四方面,提供了一种计算机可读存储介质。介质上存储有计算机程序,程序被处理器执行时实现第一方面的方法。
应当理解,本发明内容部分中所描述的内容并非旨在限定本公开的实施例的关键特征或重要特征,也不用于限制本公开的范围。本公开的其它特征将通过以下的描述而变得容易理解。
附图说明
结合附图并参考以下详细说明,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。在附图中,相同或相似的附图标记表示相同或相似的元素,其中:
图1示出了本公开的实施例能够在其中实现的示例环境的示意图;
图2示出了根据本公开的一些实施例的用于训练一致性检测模型的架构;
图3示出了根据本公开的一些实施例的源文档和摘要的示例;
图4示出了根据本公开的一些实施例的用于应用一致性检测模型的架构;
图5示出了根据本公开的一些实施例的对摘要的错误追踪的示例;
图6示出了根据本公开的一些实施例的用于文档与摘要的一致性检测的过程的流程图;
图7示出了根据本公开的一些实施例的用于文档与摘要的一致性检测的装置的框图;以及
图8示出了能够实施本公开的多个实施例的设备的框图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中示出了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反,提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
在本公开的实施例的描述中,术语“包括”及其类似用语应当理解为开放性包含,即“包括但不限于”。术语“基于”应当理解为“至少部分地基于”。术语“一个实施例”或“该实施例”应当理解为“至少一个实施例”。术语“一些实施例”应当理解为“至少一些实施例”。下文还可能包括其他明确的和隐含的定义。
如本文中所使用的,术语“模型”可以从训练数据中学习到相应的输入与输出之间的关联,从而在训练完成后可以针对给定的输入,生成对应的输出。模型的生成可以基于机器学习技术。深度学习是一种机器学习算法,通过使用多层处理单元来处理输入和提供相应输出。神经网络模型是基于深度学习的模型的一个示例。在本文中,“模型”也可以被称为“机器学习模型”、“学习模型”、“机器学习网络”或“学习网络”,这些术语在本文中可互换地使用。
“神经网络”是一种基于深度学习的机器学习网络。神经网络能够处理输入并且提供相应输出,其通常包括输入层和输出层以及在输入层与输出层之间的一个或多个隐藏层。在深度学习应用中使用的神经网络通常包括许多隐藏层,从而增加网络的深度。神经网络的各个层按顺序相连,从而前一层的输出被提供作为后一层的输入,其中输入层接收神经网络的输入,而输出层的输出作为神经网络的最终输出。神经网络的每个层包括一个或多个节点(也称为处理节点或神经元),每个节点处理来自上一层的输入。
通常,机器学习大致可以包括三个阶段,即训练阶段、测试阶段和应用阶段(也称为推理阶段)。在训练阶段,给定的模型可以使用大量的训练数据进行训练,不断迭代更新参数值,直到模型能够从训 练数据中获取一致的满足预期目标的推理。通过训练,模型可以被认为能够从训练数据中学习从输入到输出之间的关联(也称为输入到输出的映射)。训练后的模型的参数值被确定。在测试阶段,将测试输入应用到训练后的模型,测试模型是否能够提供正确的输出,从而确定模型的性能。在应用阶段,模型可以被用于基于训练得到的参数值,对实际的输入进行处理,确定对应的输出。
如前文提及的,期望检测摘要与文档是否一致。当前存在一些方案用于检测或提升摘要与文档的一致性。一些方案专注于利用信息提取工具,从文档和摘要分别提取事实,并通过比较所提取的事实来判断文档与摘要是否一致。然而,这样的方案依赖于对信息提取工具的准确性。还有一些方案提出利用自然语言推理或问答模型来进行事实检查,通过设计与文档相关的问题,并验证是否能够从摘要中找到正确答案,由此来检测摘要的一致性。然而,问答机制的准确性依赖于对关键句子的标识,但文档和摘要的文本长度不同,导致难以保证问答的可靠性。
此外,还提出了通过训练一致性检测模型,来学习文档与摘要之间的一致性相关的特性。这样的方案更可靠、稳定。然而,当前的训练方案所训练的模型仍然具有很多改进的需要。
示例环境
图1示出了能够实施本公开的多个实现的环境100的框图。在图1的环境100中,期望训练和使用这样的模型,即一致性检测模型105,该模型被配置用于检测摘要与文档是否一致。
在本文中,摘要与文档的一致指的是摘要中不存在与文档表述的事实的错误或偏差,即摘要不具有事实性错误。通常,人在阅读文档时能够全面了解其中呈现的事实,而由于简化的关系,摘要可能会存在事实性错误。特别地,在一些应用中,可能存在通过模型自动生成的文档摘要。这样的摘要与文档的一致性检测更需要关注。摘要与文档的一致性也能够用于衡量摘要生成模型的可靠性和可用性。
如图1所示,环境100包括模型训练系统110和模型应用系统120。在图1的示例实施例以及下文将会描述的一些示例实施例中,模型训练系统110被配置利用多个训练样本112-1、112-2、……、112-N和标注信息集114来训练一致性检测模型105,其中N为大于等于1的整数。为便于讨论,这些样本通常为样本112。每个样本112包括文档113和摘要115。标注信息集114包括针对样本112的标注信息,其指示样本112中的摘要与文档是否一致。用于训练模型的样本112可以包括一个或多个正(positive)样本和一个或多个负(negative)样本。正样本中的摘要与文档一致,而负样本中的摘要与文档不一致。一致性检测模型105可以从正样本中学习到具有何种特性的摘要与文档是彼此相一致的,并可以从负样本中学习到具有何种特性的摘要与文档是彼此不一致的。
在本文中,“文档”指的是部分或全部呈现自然语言形式的文本的对象。文档可以具有任何电子格式,只要可以提取其中的文本信息即可。在后续处理中,以文档中的文本作为处理对象。每个文档可以包括多个文本单元。
在本文中,“摘要”指的是文档的简化版本,其以更简洁、更少的文本来表述文档中的重要信息。每个摘要可以包括多个文本单元。
在本文中,“文本单元”指的是在自然语言处理任务中处理的单元,并且其粒度可以根据应用而改变和设置。例如,文本单元可以包括词、短语、符号、前述的组合,或者任何其他在自然语言表达中会出现的单元。在一些示例中,文本单元也被称为令牌(token)。
在训练前,一致性检测模型105的参数值可以是被初始化的,或 者是可以通过预训练过程而获得经预训练的参数值。经过训练过程,一致性检测模型105的参数值被更新和调整。在训练完成后,一致性检测模型105具有训练后的参数值。基于这样的参数值,一致性检测模型105能够被用于实现摘要与文档的一致性检测任务。
在图1中,模型应用系统120接收输入的源文档132和目标摘要134。模型应用系统120可以被配置为利用训练后的一致性检测模型105来执行针对源文档132和目标摘要134的一致性检测。
在图1中,模型训练系统110和模型应用系统120可以是任何具有计算能力的系统,例如各种计算设备/系统、终端设备、服务器等。终端设备可以是任意类型的移动终端、固定终端或便携式终端,包括移动手机、台式计算机、膝上型计算机、笔记本计算机、上网本计算机、平板计算机、媒体计算机、多媒体平板、或者前述各项的任意组合,包括这些设备的配件和外设或者其任意组合。服务器包括但不限于大型机、边缘计算节点、云环境中的计算设备,等等。
应当理解,图1示出的环境中的部件和布置仅是示例,适于用于实现本公开所描述的示例实施例的计算系统可以包括一个或多个不同的部件、其他部件和/或不同的布置方式。例如,虽然被示出为是分离的,但模型训练系统110和模型应用系统120可以集成在相同系统或设备。本公开的实施例在此方面不受限制。
以下将继续参考附图,分别描述模型训练和模型应用的示例实施例。
模型训练架构
根据本公开的实施例，提出了一种改进的文档和摘要的一致性检测方案。根据该方案，在训练一致性检测模型时，针对用于训练的负样本，即摘要与文档不匹配的样本，利用对抗数据增强的训练方式，构建对抗负样本。对抗负样本通常是通过向负样本施加扰动信息来生成的。在本公开的实施例中，提出了一种更有效的对抗数据增强方式来构建对抗负样本。由此，利用负样本和对抗负样本来训练一致性检测模型，使该模型能够更好地检测和追踪摘要中与文档不一致的部分。
图2示出了根据本公开的一些实施例的用于训练一致性检测模型105的架构200的示例。图2的架构200可以被实现在图1的模型训练系统110中。架构200中的各个模块/组件可以由硬件、软件、固件或者它们的任意组合来实现。
如图2所示,示出了用于训练一致性检测模型105的样本202和204。样本202和204要被用于训练一致性检测模型105,例如可以被包括图1所示的样本112。样本202和204可以被表示为文本序列的形式,该文本序列包括文档的多个文本单元和摘要的多个文本单元级联而成。此外,为了区别,在一些实施例中,样本202和204的文本序列还包括位于起始位置的符号[CLS],用于指示文本序列的起始,以及插入在文档和摘要之间的符号[SEP],用于分隔文档和摘要。
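The following Python sketch (not part of the original disclosure) illustrates one way to assemble the concatenated input sequence described above; the whitespace tokenizer, the special-symbol strings, and the padding length are placeholder assumptions rather than the actual implementation.

```python
from typing import List

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def build_input_sequence(document: str, summary: str, max_len: int = 32) -> List[str]:
    """Concatenate document and summary tokens as: [CLS] document [SEP] summary."""
    doc_tokens = document.split()              # placeholder whitespace tokenizer
    sum_tokens = summary.split()
    seq = [CLS] + doc_tokens + [SEP] + sum_tokens
    seq = seq[:max_len]                        # truncate if too long
    seq += [PAD] * (max_len - len(seq))        # pad so all samples share one length
    return seq

print(build_input_sequence("the cat sat on the mat", "a cat sat on a mat", max_len=16))
```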
在图2中，所示出的样本202包括文档211和摘要213，其中摘要213与文档211不一致，即样本202是用于训练一致性检测模型105的负样本，也称为不一致样本。所示出的样本204包括文档215和摘要217，其中摘要217与文档215一致，即样本204是用于训练一致性检测模型105的正样本，也称为一致样本。每个样本的文档与摘要的一致性由标注信息集116指示。
为讨论的目的,样本202(被表示为x p)中的文档211可以被表示为s={s 1,s 2,...s Ls1},其中s n表示文档211中的第n个文本单元(或令牌),n=1,2,……Ls1,Ls1表示文档211中的文本单元的数目。样本202中的摘要213可以被表示为t’={t’ 1,t’ 2,...t’ Lt1},其中t’ n表示摘要213中的第n个文本单元(或令牌),n=1,2,……Lt1,Lt1表示摘要213中的文本单元的数目。
样本204(被表示为x n)中的文档215可以被表示为s={s 1,s 2,...s Ls2},其中s n表示文档215中的第n个文本单元(或令牌),n=1,2,……Ls2,Ls2表示文档215中的文本单元的数目。样本204中的摘要217可以被表示为t={t 1,t 2,...t Lt2},其中t n表示摘要217中的第n个文本 单元(或令牌),n=1,2,……Lt2,Lt2表示摘要217中的文本单元的数目。
不同样本中的文档与摘要的文本单元的数目可以相同或不相同。在一些实施例中，为了后续处理方便，可以通过填充方式，使不同样本中的文档和摘要级联得到的文本序列所包括的文本单元的数目等于预定数目。
注意,虽然图2中仅示出了两个样本202和204,在训练一致性检测模型105时,可能需要更多的样本。这些样本不再一一示出。
如图2所示,每个样本的文本序列可以被提供到嵌入层210,以由嵌入层210输出每个样本对应的嵌入表示。在本文中,“嵌入表示”指的是文本序列的向量化表示,其中文本序列中的每个文本单元和其他特殊符号(例如,[CLS]和[SEP])可以被转换到对应的向量。文本序列的总体的向量化表示可以是多维向量形式。这样,后续处理可以在向量化表示的基础上进行。在嵌入表示的生成中,不同文本单元或符号可以被转换到不同的向量。
假设样本的文本序列被表示为x={x 1,x 2,...x Lx},其中x i表示第i个文本单元,i=1,2,……Lx,Lx表示样本中文档和摘要的文本单元以及特殊符号的总数目。嵌入层210生成的嵌入表示为e,其包括e i=E(x i),其中e i指示文本序列中第i个文本单元或符号转换后的向量。在图2中,嵌入层210对样本202确定嵌入表示212,针对样本204确定嵌入表示214。
在一些实施例中,嵌入层210可以利用预定的文本单元和符号与向量的映射表来执行确定嵌入表示,或者可以利用机器学习模型,例如语言模型等,来提取文本序列的特征作为嵌入表示。本公开的实施例在此方面不受限制。
在一些实施例中,用于训练一致性检测模型105的正样本和负样本(例如,样本204和202)可以从数据库获得,或者从其他数据源获得。在实际应用中,已有数据源中可能存在较多的正样本,即彼此一致的文档与摘要。为了扩充用于训练一致性检测模型105的样本,在一些实施例中,还可以基于已有的正样本来构建负样本,从而获得 人造训练数据。这样可以避免人工生成或标记负样本所导致的巨大成本,并且可以实现在仅有正样本的监督信息的基础上,也能够快速有效且低成本地获得负样本及其监督信息用于模型训练。
在生成人造训练数据的实施例中，假设图2中具有不一致的文档211和摘要213的样本202是从正样本（例如，样本204）生成的。假设正样本204被表示为x n={s,t}，其中s表示文档215，t表示摘要217。在生成样本202时，可以通过修改摘要217中的一个或多个文本单元，来破坏摘要217与文档215的一致性，以获得不一致的样本202。在这样的实施例中，样本202可以被表示为x p={s,t'}，文档211与样本204中的文档215相同，即s，摘要213（即，t'）是摘要217（即，t）的修改后的版本。在生成人造训练数据的实施例中，标注信息集116不仅记录已有的样本204的标注信息，还补充新生成的样本202的标注信息，该标注信息指示文档211与摘要213不一致。
可以通过多种方式来破坏摘要217与文档215的一致性。下文将描述一些示例方式。
在一些实施例中,可以通过实体替换的方式来修改摘要217中的一个或多个文本单元。具体地,可以将摘要217中的实体替换为文档215中具有相同类型的实体,以得到摘要213。在一些示例中,可以替换摘要217中的一个或多个实体。在文本中,“实体”指的是事物或概念。每个实体可以由一个或多个文本单元(例如,单词、词组)等表示。实体可以按类型划分为人、角色、对象、事件等。在修改摘要217时,可以将摘要217中存在的实体(例如,人名)替换为文档215中出现的相同类型的另一实体(例如,另一人名)。在一些实施例中,可以从文档215中随机选择相同类型的另一实体。在一些实施例中,为了降低由于近义词、同义词等导致的误差,还可以计算摘要217中要替换的实体与从文档215中随机选择的相同类型的多个实体之间的相似度,并且利用文档215中具有相似度大于阈值相似度的实体来替换摘要217中的实体。实体之间的相似度例如可以基于文本的距离算法来衡量。阈值相似度可以根据需要配置。
在一些实施例中,附加地或备选地,可以通过代词替换的方式来修改摘要217中的一个或多个文本单元。具体地,可以将摘要217中的代词替换为另一代词,以得到摘要213。每个代词可以由一个或多个文本单元(例如,单词、词组)等表示。另一代词可以是与摘要217中的代词所在的句子语法匹配的代词,以避免修改后的摘要存在语法错误。例如,可以将代词“他”和“他的”分别与代词“她”和“她的”相互替换,将代词“他们”、“我们”、“你们”等相互替换,等等。在一些实施例中,可以替换摘要217中的一个或多个代词。在一些实施例中,可以随机选择摘要217中要被替换的代词。
在一些实施例中，附加地或备选地，可以通过肯定-否定修改方式来修改摘要217中的一个或多个文本单元。具体地，可以将摘要217中肯定形式的动词修改为否定形式的动词，和/或将否定形式的动词修改为肯定形式的动词，以得到摘要213。每个动词可以由一个或多个文本单元（例如，单词、词组）等表示。通过将动词在肯定形式与否定形式之间更改，会更改摘要217所描述的事实，从而使修改后得到的摘要与原始的文档215不一致。在一些实施例中，在诸如英语等拉丁语系的语言中，还可以具体修改助动词，例如be动词、情态动词（例如，should、could、would等）的肯定形式和否定形式。在一些实施例中，可以替换摘要217中的一个或多个动词。在一些实施例中，可以随机选择摘要217中要被替换的动词。
图3示出了根据本公开的一些实施例的源文档和摘要的示例。在该示例中，给出了文档310以及与文档310在事实上一致的摘要320。摘要320正确概述文档310中的关键信息，例如文档310中加粗带下划线的句子。因此，文档310和摘要320可以组成一致性的正样本。为了破坏文档310与摘要320的一致性，通过实体替换的方式，将摘要320中的人名实体“Davis”替换为文档310中出现的另一个人名实体“Harry”，从而得到修改后的摘要330。摘要330所描述的事实与文档310不再一致，因此文档310和摘要330可以组成不一致的负样本。
以上描述了修改正样本的摘要中的一些示例方式,以构造负样本中的摘要。在其他实施例中,还可以应用其他方式来修改摘要217,以构建与原始文档不一致的摘要。在一些实施例中,对于同一摘要217,可以利用一种或多种方式来修改摘要217中的一个或多个文本单元。
在摘要213中,通过修改摘要217中的文本单元所得到的修改后的文本单元是导致摘要213与原始文档不一致的原因,因此那个或那些文本单元可以被标记为与原始文档不一致。这样的标记在后续模型训练中将被使用。
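As an illustration of how an inconsistent (negative) sample might be constructed from a consistent one while recording which text units were modified, the sketch below performs a same-type entity swap; the entity inputs, the toy sentence, and the random selection policy are assumptions for demonstration only.

```python
import random
from typing import Dict, List, Tuple

def make_negative_summary(
    summary_tokens: List[str],
    doc_entities: Dict[str, List[str]],       # entity type -> entities appearing in the document
    summary_entities: List[Tuple[int, str]],  # (token index in summary, entity type)
) -> Tuple[List[str], List[int]]:
    """Corrupt a consistent summary by swapping one entity for a same-type entity
    from the document, and record which token index was modified."""
    corrupted = list(summary_tokens)
    modified: List[int] = []
    idx, ent_type = random.choice(summary_entities)
    candidates = [e for e in doc_entities.get(ent_type, []) if e != corrupted[idx]]
    if candidates:
        corrupted[idx] = random.choice(candidates)
        modified.append(idx)   # this token is now marked as inconsistent with the document
    return corrupted, modified

# Toy example in the spirit of FIG. 3: "Davis" is replaced by another person name from the document.
summary = ["Davis", "announced", "the", "decision"]
print(make_negative_summary(summary,
                            doc_entities={"PERSON": ["Davis", "Harry"]},
                            summary_entities=[(0, "PERSON")]))
```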
在一些实施例中,除从已有的正样本构建负样本之外或者作为备选的方案,还可以从已有的数据库或者通过人工的方式获得一个或多个负样本用于训练一致性检测模型105。在一些实施例中,对于某些正样本,还可以不构建对应的负样本。本公开的实施例在此方面不受限制。
在一些实施例中,除了基于正样本和负样本来训练一致性检测模型105之外,还可以利用对抗增强的方式,来提高一致性检测模型105的鲁棒性。通常,在一般的正样本和负样本基础上,特别是在人工构造的样本上训练出来的模型,往往对简单的输入能够给出正确的结果,但对于实际应用中可能出现的复杂情况的鲁棒性不高。因此,对抗增强的方式能够提高所训练的模型对复杂样本鲁棒性。
笼统来说,对抗增强的方式是向已有样本(正样本和负样本)施加干扰信息,以获得对抗样本。由于干扰信息的加入,对抗样本区别于简单的已有样本。在模型训练时,要求模型针对对抗样本进行学习,以针对对抗样本也能够输出与已有样本相同的检测结果。例如,对于从已有的正样本构建的对抗正样本,要求模型能够判断对抗正样本中的摘要与文档相一致,而对于从已有的负样本构建的对抗负样本,要求模型能够判断对抗负样本中的摘要与文档不一致。通过这种方式训练出的模型,在面对实际应用中变化的复杂输入时,也能够给出正确的检测结果。
对抗增强在机器学习应用中常被使用。然而,在常规方案中,对于正样本和负样本,均以相同方式将扰动信息完全施加到样本中。本申请的发明人发现,在涉及文档与摘要的一致性检测的任务中,在提高检测的准确性以及对于摘要中错误部分的追踪方面,这样的扰动施加方式是不利的。因此,根据本公开的实施例,提出了改进的对抗增强方式。下文将首先讨论如何确定用于生成对抗样本的扰动信息,然后讨论改进的对抗增强方式。
在一些实施例中,可以针对正样本和负样本均确定扰动信息。仍参考图2,以样本202和样本204为例说明。可以将样本202和样本204分别应用到一致性检测模型105,具体地将样本202对应的嵌入表示212和样本204对应的嵌入表示214输入到一致性检测模型105。一致性检测模型105利用当前的参数值来处理嵌入表示212和214,以给出相应的检测结果。针对样本202的检测结果指示样本202中的摘要213与文档211是否一致,针对样本204的检测结果指示样本204中的摘要215与文档217是否一致。当前的检测结果反映一致性检测模型105初始的或者中间学习到的检测能力。注意,模型的训练过程是一个迭代过程,在迭代过程中模型的检测能力会不断提高。
在一些实施例中,一致性检测模型105可以包括特征提取部分和结果预测部分。特征提取部分用于从嵌入表示提取与文档和摘要相关的特征表示。特征提取部分可以被认为是对文本序列的编码过程,特征表示可以被表示为r i=f(E(x i)),其中f(·)表示特征提取处理,r i表示针对输入的文本序列中第i个文本单元或特殊符号x i提取的特征表示。在一些实施例中,特征提取部分可以利用各种适合用于对文本进行特征提取的机器学习模型、神经网络等来实现,例如Roberta模型,各种编码器模型等。
结果预测部分用于基于特征提取部分提取的特征来确定预测结果，即输入的摘要与文档是否一致。在一些实施例中，结果预测部分可以被实现为线性层，例如softmax层等。一致性检测模型105的输出是二分类输出，即一致与不一致两种预测结果。
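A minimal sketch of this two-part structure (feature extraction followed by a linear prediction layer) is given below; the small Transformer encoder only stands in for a pretrained encoder such as RoBERTa, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConsistencyDetector(nn.Module):
    """Feature extraction part plus a linear result-prediction part (2-way output)."""
    def __init__(self, vocab_size: int = 30000, dim: int = 128,
                 num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # embedding layer
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # stand-in encoder
        self.classifier = nn.Linear(dim, num_classes)  # linear layer; softmax applied outside

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(token_ids)            # (batch, seq_len, dim)
        r = self.encoder(e)                  # contextual features for every position
        return self.classifier(r[:, 0])      # decision taken at the [CLS] position

model = ConsistencyDetector()
dummy_ids = torch.randint(0, 30000, (2, 32))       # two toy sequences of token ids
print(model(dummy_ids).softmax(dim=-1))             # consistent / inconsistent probabilities
```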
由一致性检测模型105针对各个样本生成的检测结果被提供给损失函数计算模块220。损失函数计算模块220被配置为基于标注信息集116，确定一致性检测模型105针对每个样本生成的检测结果与标注信息集116中针对该样本的标注信息之间的差异。在一些实施例中，这样的差异可以被表示为损失函数的形式，例如交叉熵损失，可以被表示为 L(e, Y; θ)，其中e表示一个样本（具体是样本的嵌入表示），θ表示一致性检测模型105的当前参数值，Y表示样本的标注信息，Y∈{一致、不一致}。损失函数 L(e, Y; θ) 用于衡量一致性检测模型105基于当前的参数值，对样本e给出的预测结果与标注信息Y给出的真实结果之间的差异。
在模型训练过程中，训练目标被配置为使一致性检测模型105对样本的检测结果与标注信息之间的差异降低或最小化，例如降低到某个预定阈值（根据需要设置）内。这样的训练目标可以通过更新一致性检测模型105的参数值，从而使损失函数 L(e, Y; θ) 降低或最小化来实现。具体地，架构200中包括参数更新模块230，其被配置为根据该训练目标来更新一致性检测模型105的参数值。因此，损失函数 L(e, Y; θ) 可以被提供给参数更新模块230以用于模型的参数值更新。
在训练的初始阶段时,由于参数值不理想,一致性检测模型105可能还不能够准确预测输入的样本中文档与摘要的一致性,随着参数值不断更新,模型的检测能力得到提高,从而损失函数的值会被不断降低。
基于损失函数来执行模型训练时,可以利用各种训练方法,例如随机梯度下降法等来更新模型参数,从而确定如何更新模型的参数值。
在一些实施例中，在确定某个样本的对抗样本时，可以基于原始的样本的检测结果与标注信息之间的差异，即基于损失函数 L(e, Y; θ)，来确定要施加到样本的总干扰信息。基于总干扰信息和原始的样本，来生成对抗样本。
在图2中，可以由扰动确定模块240来确定各个样本的总干扰信息。总干扰信息可以被表示为扰动向量（perturbation vector），其包括被施加到样本的文本序列（例如，样本202或204）的每个文本单元或特殊符号上的向量。在一些实施例中，总干扰信息可以被确定为能够最大化损失函数 L(e, Y; θ) 的最差干扰向量，也就是说，期望总干扰信息能够干扰或阻碍一致性检测模型105对对抗样本的正确检测，以便增强一致性检测模型105对对抗样本的检测能力。
在一些示例中，针对样本的总干扰信息的确定可以被表示为如下：
v̂ = argmax_{‖v‖≤ε} L(e+v, Y; θ)    式(1)
其中 v̂ 表示针对样本e确定的总干扰信息，ε表示总干扰信息的范数界，其可以是预定值，e+v表示向样本e施加干扰信息v后得到的对抗样本，arg max(·)表示在损失函数 L(e+v, Y; θ) 最大化的情况下所得到的干扰信息v，其被确定为针对样本e确定的总干扰信息 v̂。
考虑到一致性检测模型105的复杂性，可能难以准确计算总干扰信息 v̂。在一些实施例中，可以通过各种近似的方式，从式(1)确定总干扰信息 v̂。在一些实现中，可以利用快速梯度值（Fast Gradient Value，FGV）算法来计算总干扰信息 v̂，这可以被表示为如下：
v̂ ≈ ε·g/‖g‖，其中 g = ∇_e L(e, Y; θ)    式(2)
在式(2)中，梯度g是损失函数 L(e, Y; θ) 的一阶微分，其表示损失函数关于样本e的快速变化，即快速增长的方向；g/‖g‖ 表示对梯度g的归一化，其中‖g‖表示梯度g的范数。这样的归一化可以确保总干扰信息的近似值 v̂ 更合理。
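The sketch below shows one possible PyTorch rendering of this FGV-style approximation, computing the gradient of the cross-entropy loss with respect to the sample embeddings and scaling its normalized value by ε; the `forward_from_embeddings` callable (the part of the model after the embedding layer) is an assumed interface, not the original implementation.

```python
import torch
import torch.nn.functional as F

def fgv_perturbation(embeddings: torch.Tensor, label: torch.Tensor,
                     forward_from_embeddings, epsilon: float = 1.0) -> torch.Tensor:
    """Approximate the worst-case perturbation as v ≈ ε · g / ‖g‖, where g is the
    gradient of the cross-entropy loss with respect to the sample embeddings."""
    e = embeddings.detach().clone().requires_grad_(True)
    logits = forward_from_embeddings(e)        # model part after the embedding layer
    loss = F.cross_entropy(logits, label)      # L(e, Y; θ)
    g, = torch.autograd.grad(loss, e)          # first-order gradient with respect to e
    return (epsilon * g / (g.norm() + 1e-12)).detach()

# Toy usage with a linear "model" acting on mean-pooled embeddings.
toy_head = torch.nn.Linear(8, 2)
e = torch.randn(1, 10, 8)
v_hat = fgv_perturbation(e, torch.tensor([1]), lambda x: toy_head(x.mean(dim=1)))
print(v_hat.shape)   # same shape as the embeddings
```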
如图2所示，扰动确定模块240基于损失函数 L(e, Y; θ)，例如利用式(2)，确定针对样本202的干扰信息242并通过归一化，获得针对样本202的总干扰信息252。扰动确定模块240可以类似地确定针对样本204的干扰信息244，并通过归一化确定针对样本204的总干扰信息254。
针对一个样本确定的总干扰信息 v̂ 包括被施加到样本中的各个文本单元的干扰向量。根据本公开的实施例，在确定针对负样本的对抗样本时，过滤出要被施加到负样本的摘要中被标记为不一致的文本单元的干扰信息部分，并对负样本中的其他文本单元施加干扰信息。也就是说，对于负样本，摘要中被标记为与文档不一致的那个/那些文本单元将不会被施加干扰。
在图2的示例中，样本202是负样本，因此，总干扰信息252将由过滤向量262过滤，以得到过滤后的干扰信息272 ṽ。过滤向量262可以由0和1组成，其中0的值被施加到总干扰信息252中与摘要213中被标记为不一致的文本单元对应的干扰向量，1的值被施加到总干扰信息252中与文档211以及摘要213中的其他文本单元对应的干扰向量。因此，干扰信息272中不再包括与摘要213中被标记为不一致的文本单元对应的干扰向量。
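A possible implementation of such a 0/1 filter vector is sketched below: positions of summary tokens marked as inconsistent receive 0 so that no perturbation is applied there, while all other positions keep the perturbation; the shapes and names are illustrative assumptions.

```python
import torch

def filter_perturbation(v_hat: torch.Tensor, inconsistent_positions, seq_len: int) -> torch.Tensor:
    """Apply the 0/1 filter vector: zero the perturbation at summary tokens marked as
    inconsistent, keep it on the document and the remaining summary tokens."""
    keep = torch.ones(seq_len, 1)                    # 1 = perturb this position, 0 = leave it alone
    keep[list(inconsistent_positions)] = 0.0
    return v_hat * keep                              # broadcast over the embedding dimension

# Example: 10 positions, where positions 7 and 8 are the corrupted summary tokens.
v_hat = torch.randn(10, 16)
v_tilde = filter_perturbation(v_hat, inconsistent_positions=[7, 8], seq_len=10)
adversarial_embeddings = torch.randn(10, 16) + v_tilde   # e plus the filtered perturbation
print(v_tilde[7].abs().sum(), v_tilde[0].abs().sum())
```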
在一些实施例中，在从正样本构造负样本202时，可以标记摘要213中哪个/哪些文本单元是从摘要217修改得到的，这样在过滤时可以直接利用这样的标记信息。在一些实施例中，如果从已有数据源直接获得负样本202，可以通过其他方式，人工或自动地标记出摘要213中与文档211不一致的文本单元。
干扰信息272 ṽ 被施加到样本202，例如被施加到样本202对应的嵌入表示e，得到样本202的对抗样本对应的嵌入表示216，其被表示为 e+ṽ。
在一些实施例中，对于正样本，例如样本204，归一化后得到的总干扰信息254 v̂ 可以被直接施加到样本204对应的嵌入表示e，得到样本204的对抗样本对应的嵌入表示218，其被表示为 e+v̂。也就是说，对于正样本，文档和摘要的各个文本单元均可能被干扰。
正样本和负样本的对抗样本也可以被应用一致性检测模型105,以用于构建另外的损失函数。例如,如图2所示,样本202和204对应的对抗样本的嵌入表示216和218可以分别被输入到一致性检测模 型105,以由一致性检测模型105利用当前的参数值来分别处理嵌入表示216和218,以给出相应的检测结果。针对嵌入表示216的检测结果指示样本202对应的对抗样本中的摘要与文档是否一致,即,被干扰后的摘要213与被干扰后的文档211是否一致。针对嵌入表示218的检测结果指示样本204对应的对抗样本中的摘要与文档是否一致,即,被干扰后的摘要217与被干扰后的文档215是否一致。
对抗样本的标注信息与原样本的标注信息一致。换言之,期望一致性检测模型105具有更高的鲁棒性,对于被干扰信息改动后的摘要和文档,仍然能够给出与未被干扰之前的摘要和样本相同的检测结果。
由一致性检测模型105针对各个样本生成的检测结果被提供给损失函数计算模块220。损失函数计算模块220被配置为基于标注信息集116，确定一致性检测模型105针对每个对抗样本生成的检测结果与标注信息集116中针对对抗样本对应的原始样本的标注信息之间的差异。在一些实施例中，这样的差异可以被表示为损失函数的形式，例如交叉熵损失，这可以被表示为对抗损失函数 L(e′, Y; θ)，其中e′表示一个对抗样本（具体是对抗样本的嵌入表示），θ表示一致性检测模型105的当前参数值，Y表示对抗样本对应的原始样本e的标注信息，Y∈{一致、不一致}。
在模型训练过程中，训练目标被配置为使一致性检测模型105对对抗样本的检测结果与标注信息之间的差异降低或最小化，例如降低到某个预定阈值（根据需要设置）内。这样的训练目标可以通过更新一致性检测模型105的参数值，从而使损失函数 L(e′, Y; θ) 降低或最小化来实现。因此，损失函数 L(e′, Y; θ) 可以被提供给参数更新模块230以用于模型的参数值更新。
因此，参数更新模块230可以基于两种损失函数来更新模型的参数值，以达到总的训练目标，即使一致性检测模型105对原始样本的检测结果与标注信息之间的差异降低或最小化，并且对抗样本的检测结果与标注信息之间的差异也降低或最小化。参数更新模块230用于模型参数值更新的总损失函数可以被表示为：
L_total = α·L(e, Y; θ) + (1−α)·L(e′, Y; θ)    式(3)
其中α是在0和1之间的预定值，用于权衡两个损失函数。
基于损失函数来执行模型训练时，参数更新模块230可以利用各种训练方法，例如随机梯度下降法等来更新模型参数，以使总损失函数 L_total 降低到预定阈值以内或者最小化。
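The following sketch shows one training step under the convex-combination reading of equation (3), L_total = α·L(e, Y; θ) + (1−α)·L(e′, Y; θ); the `model_head` callable and the optimizer are assumed inputs, and the weighting form is an interpretation rather than a confirmed detail of the original scheme.

```python
import torch
import torch.nn.functional as F

def training_step(model_head, optimizer, e_clean, e_adv, label, alpha: float = 0.5) -> float:
    """One parameter update combining the clean and adversarial cross-entropy losses."""
    loss_clean = F.cross_entropy(model_head(e_clean), label)   # original sample
    loss_adv = F.cross_entropy(model_head(e_adv), label)       # adversarial sample, same label
    loss_total = alpha * loss_clean + (1.0 - alpha) * loss_adv
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()                                           # e.g. stochastic gradient descent
    return loss_total.item()

# Toy usage with a linear head over mean-pooled embeddings.
head = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(head.parameters(), lr=0.1)
e = torch.randn(4, 10, 8)
e_adv = e + 0.1 * torch.randn_like(e)
label = torch.randint(0, 2, (4,))
print(training_step(lambda x: head(x.mean(dim=1)), opt, e, e_adv, label))
```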
通常，干扰信息的施加和对抗样本的使用是为了让模型对于输入的鲁棒性提高，但这也会降低模型对于输入的敏感性。根据本公开的实施例，通过掩盖掉针对负样本的摘要中不一致的文本单元的干扰信息，使一致性检测模型对于摘要中不一致的文本单元仍然保持敏感性。这不仅能够提高一致性检测模型对于一致性的准确检测能力，而且还可以使一致性检测模型能够更好地追踪摘要中的错误部分，从而获得自动的错误追踪能力。
这样的错误追踪能力是通过使用后向传播的梯度g来实现的。下面来分析如何能够实现这样的错误追踪。
对式(3)，由于损失函数 L(e′, Y; θ) 是利用对抗样本来确定的，其中的对抗样本与标注信息之间的差异程度可能会高于原始的样本与标注信息之间的差异程度，因此 L(e′, Y; θ) ≥ L(e, Y; θ)，式(3)中的总损失 L_total 主要由对抗损失项 L(e′, Y; θ) 决定。
在训练过程中，针对负样本，由于不一致的文本单元的扰动被掩蔽（即未被施加到对抗样本），这些文本单元的变化会导致总损失函数的更大变化，也就是说，一致性检测模型105会保持对这些文本单元的敏感度。因此，这些文本单元的变化会导致检测结果的更大变化，相应地，在利用损失函数计算梯度时，可以观察到损失函数相对于不一致的文本单元的梯度g较高，因为损失函数 L_total 相对这些不一致的文本单元的变化率更大。这个现象能够在模型应用阶段被利用，从而帮助在模型应用阶段实现对不一致的摘要中的错误进行标记或提醒。这在下文中将详细描述。
模型应用架构
以上讨论的对一致性检测模型105的训练。训练后的一致性检测模型105可以被提供到模型应用系统120中使用,以用于对输入的源文档132和目标摘要134进行一致性的判断。
图4示出了根据本公开的一些实施例的用于应用一致性检测模型105的架构400。图4的架构400可以被实现在图1的模型应用系统120中。架构400中的各个模块/组件可以由硬件、软件、固件或者它们的任意组合来实现。
如图4所示,源文档132和目标摘要134可以形成文本序列402,其包括源文档132和目标摘要134的文本单元,并且还包括指示文本序列起始的特殊符号[CLS]和用于分隔文档和摘要的特殊符号[SEP]。源文档132可以被表示为s={s 1,s 2,...s Ls},其中s n表示源文档132中的第n个文本单元(或令牌),n=1,2,……Ls,Ls表示源文档132中的文本单元的数目。目标摘要134可以被表示为t={t 1,t 2,...t Lt},其中t n表示目标摘要134中的第n个文本单元(或令牌),n=1,2,……Lt,Lt表示目标摘要134中的文本单元的数目。
文本序列402被提供给嵌入层210,其将文本序列402转换为对应的嵌入表示412。对应的嵌入表示412可以被输入到一致性检测模型105。一致性检测模型105利用训练后的参数值,处理嵌入表示412,以获得目标检测结果415,其指示目标摘要134与源文档132是否一致。
如以上提及的，所训练的一致性检测模型105还能够提供错误追踪能力。具体地，架构400包括错误追踪模块420，其提供错误追踪的功能。如果目标检测结果415指示目标摘要134与源文档132不一致，那么错误追踪模块420被激活。错误追踪模块420确定目标检测结果415相对目标摘要134中的多个目标文本单元的多个变化率。在一些示例中，变化率的计算可以包括计算目标检测结果415相对目标摘要134中的多个目标文本单元的梯度。错误追踪模块420可以基于文本序列402对应的嵌入表示412、模型的当前参数值（即训练后的参数值）以及目标检测结果415，计算交叉熵损失，类似于损失函数 L(e, Y; θ)。然后，计算该交叉熵损失相对目标摘要134中的各个目标文本单元的各个梯度。这些文本单元的梯度分布（即变化率的分布）可以指示每个文本单元对于目标摘要134与源文档132的不一致性的贡献程度。
在一些实施例中，错误追踪模块420基于所确定的变化率，例如各个文本单元的梯度，从目标摘要134中选择具有较高变化率的文本单元，并将所选择的文本单元确定为是目标摘要134中的错误文本单元。在一些实施例中，错误追踪模块420可以选择变化率最高的前k个文本单元（k是大于等于1的整数），并将这些文本单元标记为是错误的。在一些实施例中，错误追踪模块420可以提供错误提示信息422，以指示目标摘要134中被确定为错误的一个或多个文本单元。
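A gradient-based error-tracking step of this kind might look like the sketch below, which ranks summary positions by the norm of the loss gradient at their embeddings and returns the top-k positions; the batch-of-one shapes and the `forward_from_embeddings` interface are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def track_summary_errors(embeddings: torch.Tensor, predicted_label: torch.Tensor,
                         forward_from_embeddings, summary_positions, top_k: int = 5):
    """Rank summary tokens by the gradient magnitude of the detection loss at their
    embeddings and return the top-k positions as likely inconsistent tokens."""
    e = embeddings.detach().clone().requires_grad_(True)       # shape (1, seq_len, dim)
    loss = F.cross_entropy(forward_from_embeddings(e), predicted_label)
    g, = torch.autograd.grad(loss, e)                          # per-position gradients
    rates = g[0, summary_positions].norm(dim=-1)               # change rate per summary token
    k = min(top_k, rates.numel())
    top = rates.topk(k).indices
    return [summary_positions[i] for i in top.tolist()]

# Toy usage: the summary occupies positions 12..19 of a 20-token sequence.
head = torch.nn.Linear(8, 2)
e = torch.randn(1, 20, 8)
positions = list(range(12, 20))
print(track_summary_errors(e, torch.tensor([1]), lambda x: head(x.mean(dim=1)), positions, top_k=3))
```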
错误提示信息422可以被提供给用户,从而使用户能够快速了解目标摘要134中哪些文本单元是错误的,从而导致目标摘要134与源文档132不一致。在一些实施例中,还可以通过对目标摘要520中的文本单元的各种标注(高亮、加下划线、虚框等)方式,向用户指示存在不一致的部分。
图5示出了根据本公开的一些实施例的对摘要的错误追踪的示例。在图5的示例中，给出了源文档510和目标摘要520。在该示例中，预定提取变化率在前5的文本单元，将其标记为不一致的文本单元。通过对目标摘要520中各个文本单元的变化率的确定，可以确定其中的词“day for June 2010”和“holiday”是错误的概括提取，从而导致与源文档510描述的事实存在不一致。
示例过程
图6示出了根据本公开的一些实施例的用于文档与摘要的一致性检测的过程600的流程图。过程600可以被实现在模型训练系统110和/或模型应用系统120处。
在框610,确定第一样本和第一标注信息,第一标注信息指示第一样本包括的第一摘要与第一文档不一致。第一摘要的多个文本单元中的至少一个文本单元被标记为与第一文档不一致。
在一些实施例中,在确定第一样本时,可以基于文档与摘要一致的样本来生成第一样本。具体地,可以获取包括第一文档和第二摘要的第二样本和第二标注信息,第二标注信息指示第二摘要与第一文档一致。通过修改第二摘要中的至少一个文本单元来生成第一摘要,并将第一文档和第一摘要组成第一样本。还可以生成第一标注信息,以指示第一文档与第一摘要不一致。在一些实施例中,第一摘要中包括的被修改后的至少一个文本单元被标记为与第一文档不一致。
在一些实施例中,在生成第一摘要时,可以将第二摘要中的实体替换为第一文档中具有相同类型的另一实体。在一些实施例中,备选地或附加地,在生成第一摘要时,将第二摘要中的代词替换为另一代词。在一些实施例中,备选地或附加地,在生成第一摘要时,将第二摘要中的肯定形式的动词修改为否定形式的动词,和/或将第二摘要中的否定形式的动词修改为肯定形式的动词。
在框620,通过向第一样本施加干扰信息来生成第一对抗样本。干扰信息被施加到第一样本以及第一摘要中除至少一个文本单元之外的其他文本单元。
在一些实施例中,可以通过以下来确定要施加的干扰信息:将第一样本应用于一致性检测模型,以获得一致性检测模型输出的第一检测结果,第一检测结果指示第一样本中的第一文档与第一摘要是否一致。基于第一检测结果与第一标注信息之间的第一差异,确定针对第一样本的总干扰信息。从总干扰信息中过滤出要施加到第一摘要中被标记为不一致的至少一个文本单元的信息部分,以获得干扰信息。这样,对于包含不一致的文档和摘要的第一样本,干扰信息不会被施加到不一致的文本单元。
在框630,至少基于第一样本、第一对抗样本和第一标注信息,根据训练目标来训练一致性检测模型,一致性检测模型被配置为检测 摘要与文档是否一致,训练目标被配置为使一致性检测模型对第一样本和第一对抗样本的检测结果与第一标注信息之间的差异均在预定阈值内。
在一些实施例中,在训练一致性检测模型上,可以将第一样本和第一对抗样本分别应用于一致性检测模型,以分别获得一致性检测模型输出的第一检测结果和第二检测结果。第一检测结果指示第一样本中的第一文档与第一摘要是否一致,第二检测结果指示第一文档与第一干扰摘要是否一致。至少基于第一检测结果与第一标注信息之间的第一差异和第二检测结果与第一标注信息之间的第二差异来更新一致性检测模型的参数值。
在一些实施例中,还利用具有一致的文本和摘要的样本来训练一致性检测模型。具体地,可以确定第三样本和第三标注信息,第三标注信息指示第三样本包括的第三文档与第三摘要一致。通过向第三文档和第三摘要施加干扰信息来生成第三对抗样本。还可以基于第三样本、第三对抗样本和第三标注信息,根据训练目标来训练一致性检测模型,训练目标还被配置为使一致性检测模型对第三样本和第三对抗样本的检测结果与第三标注信息之间的差异均在预定阈值内。
训练后的一致性检测模型可以被应用于检测文档与摘要的一致性。具体地,在一些实施例中,获得源文档和目标摘要,并且将源文档和目标摘要应用于训练后的一致性检测模型,以获得一致性检测模型输出的目标检测结果,目标检测结果指示目标摘要与源文档是否一致。
在一些实施例中,训练后的一致性检测模型还可以提供错误追踪能力。具体地,如果目标检测结果指示目标摘要与源文档不一致,确定目标检测结果相对目标摘要中的多个目标文本单元的多个变化率。基于多个变化率,从多个目标文本单元中选择至少一个目标文本单元,至少一个目标文本单元的变化率比目标摘要中的其他文本单元的变化率更大。在一些实施例中,可以提供错误提示信息,以指示目标摘要中的至少一个目标文本单元是错误的。
示例装置和设备
图7示出了根据本公开的一些实施例的用于文档与摘要的一致性检测的装置700的框图。装置700可以被实现为或者被包括在模型训练系统110和/或模型应用系统120中。装置700中的各个模块/组件可以由硬件、软件、固件或者它们的任意组合来实现。
如图所示,装置700包括确定模块710,被配置为确定第一样本和第一标注信息,第一标注信息指示第一样本包括的第一摘要与第一文档不一致,第一摘要的多个文本单元中的至少一个文本单元被标记为与第一文档不一致。装置700还包括对抗生成模块720,被配置为通过向第一样本施加干扰信息来生成第一对抗样本,干扰信息被施加到第一样本以及第一摘要中除至少一个文本单元之外的其他文本单元。装置700还包括训练模块730,被配置为至少基于第一样本、第一对抗样本和第一标注信息,根据训练目标来训练一致性检测模型,一致性检测模型被配置为检测摘要与文档是否一致,训练目标被配置为使一致性检测模型对第一样本和第一对抗样本的检测结果与第一标注信息之间的差异均在预定阈值内。
在一些实施例中,确定模块710包括:获取模块,被配置为获取包括第一文档和第二摘要的第二样本和第二标注信息,第二标注信息指示第二摘要与第一文档一致;摘要生成模块,被配置为通过修改第二摘要中的至少一个文本单元来生成第一摘要;样本组成模块,被配置为将第一文档和第一摘要组成第一样本;以及标注生成模块,被配置为生成第一标注信息,以指示第一文档与第一摘要不一致。
在一些实施例中,第一摘要中包括的被修改后的至少一个文本单元被标记为与第一文档不一致。
在一些实施例中,摘要生成模块被配置为通过以下至少一项来修改第二摘要中的至少一个文本单元:将第二摘要中的实体替换为第一文档中具有相同类型的另一实体,将第二摘要中的代词替换为另一代词,将第二摘要中的肯定形式的动词修改为否定形式的动词,以及将 第二摘要中的否定形式的动词修改为肯定形式的动词。
在一些实施例中,装置700还包括干扰确定模块,被配置为通过以下来确定要施加的干扰信息:将第一样本应用于一致性检测模型,以获得一致性检测模型输出的第一检测结果,第一检测结果指示第一样本中的第一文档与第一摘要是否一致;基于第一检测结果与第一标注信息之间的第一差异,确定针对第一样本的总干扰信息;以及从总干扰信息中过滤出要施加到第一摘要中被标记为不一致的至少一个文本单元的信息部分,以获得干扰信息。
在一些实施例中,模型训练模块720包括:样本应用模块,被配置为将第一样本和第一对抗样本分别应用于一致性检测模型,以分别获得一致性检测模型输出的第一检测结果和第二检测结果,第一检测结果指示第一样本中的第一文档与第一摘要是否一致,第二检测结果指示第一文档与第一干扰摘要是否一致;以及参数更新模块,被配置为至少基于第一检测结果与第一标注信息之间的第一差异和第二检测结果与第一标注信息之间的第二差异来更新一致性检测模型的参数值。
在一些实施例中,模型训练模块720还包括:样本确定模块,被配置为确定第三样本和第三标注信息,第三标注信息指示第三样本包括的第三文档与第三摘要一致;另一对抗样本生成模块,被配置为通过向第三文档和第三摘要施加干扰信息来生成第三对抗样本;以及另外的模型训练模块,被配置为还基于第三样本、第三对抗样本和第三标注信息,根据训练目标来训练一致性检测模型,训练目标还被配置为使一致性检测模型对第三样本和第三对抗样本的检测结果与第三标注信息之间的差异均在预定阈值内。
在一些实施例中,装置700还包括文档和摘要获得模块,被配置为获得源文档和目标摘要;以及模型应用模块,被配置为将源文档和目标摘要应用于训练后的一致性检测模型,以获得一致性检测模型输出的目标检测结果,目标检测结果指示目标摘要与源文档是否一致。
在一些实施例中,装置700还包括:变化率确定模块,被配置为 如果目标检测结果指示目标摘要与源文档不一致,确定目标检测结果相对目标摘要中的多个目标文本单元的多个变化率;文本单元选择模块,被配置为基于多个变化率,从多个目标文本单元中选择至少一个目标文本单元,至少一个目标文本单元的变化率比目标摘要中的其他文本单元的变化率更大;以及错误提示模块,被配置为提供错误提示信息,以指示目标摘要中的至少一个目标文本单元是错误的。
图8示出了示出了其中可以实施本公开的一个或多个实施例的计算设备800的框图。应当理解,图8所示出的计算设备800仅仅是示例性的,而不应当构成对本文所描述的实施例的功能和范围的任何限制。图8所示出的计算设备800可以用于实现图1的模型训练系统110和/或模型应用系统120。
如图8所示,计算设备800是通用计算设备的形式。计算设备800的组件可以包括但不限于一个或多个处理器或处理单元810、存储器820、存储设备830、一个或多个通信单元840、一个或多个输入设备850以及一个或多个输出设备860。处理单元810可以是实际或虚拟处理器并且能够根据存储器820中存储的程序来执行各种处理。在多处理器系统中,多个处理单元并行执行计算机可执行指令,以提高计算设备800的并行处理能力。
计算设备800通常包括多个计算机存储介质。这样的介质可以是计算设备800可访问的任何可以获得的介质,包括但不限于易失性和非易失性介质、可拆卸和不可拆卸介质。存储器820可以是易失性存储器(例如寄存器、高速缓存、随机访问存储器(RAM))、非易失性存储器(例如,只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、闪存)或它们的某种组合。存储设备830可以是可拆卸或不可拆卸的介质,并且可以包括机器可读介质,诸如闪存驱动、磁盘或者任何其他介质,其可以能够用于存储信息和/或数据(例如用于训练的训练数据)并且可以在计算设备800内被访问。
计算设备800可以进一步包括另外的可拆卸/不可拆卸、易失性/非易失性存储介质。尽管未在图8中示出,可以提供用于从可拆卸、 非易失性磁盘(例如“软盘”)进行读取或写入的磁盘驱动和用于从可拆卸、非易失性光盘进行读取或写入的光盘驱动。在这些情况中,每个驱动可以由一个或多个数据介质接口被连接至总线(未示出)。存储器820可以包括计算机程序产品825,其具有一个或多个程序模块,这些程序模块被配置为执行本公开的各种实施例的各种方法或动作。
通信单元840实现通过通信介质与其他计算设备进行通信。附加地,计算设备800的组件的功能可以以单个计算集群或多个计算机器来实现,这些计算机器能够通过通信连接进行通信。因此,计算设备800可以使用与一个或多个其他服务器、网络个人计算机(PC)或者另一个网络节点的逻辑连接来在联网环境中进行操作。
输入设备850可以是一个或多个输入设备,例如鼠标、键盘、追踪球等。输出设备860可以是一个或多个输出设备,例如显示器、扬声器、打印机等。计算设备800还可以根据需要通过通信单元840与一个或多个外部设备(未示出)进行通信,外部设备诸如存储设备、显示设备等,与一个或多个使得用户与计算设备800交互的设备进行通信,或者与使得计算设备800与一个或多个其他计算设备通信的任何设备(例如,网卡、调制解调器等)进行通信。这样的通信可以经由输入/输出(I/O)接口(未示出)来执行。
根据本公开的示例性实现方式,提供了一种计算机可读存储介质,其上存储有计算机可执行指令,其中计算机可执行指令被处理器执行以实现上文描述的方法。根据本公开的示例性实现方式,还提供了一种计算机程序产品,计算机程序产品被有形地存储在非瞬态计算机可读介质上并且包括计算机可执行指令,而计算机可执行指令被处理器执行以实现上文描述的方法。
这里参照根据本公开实现的方法、装置、设备和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理单元,从而生产出一种机器,使得这些指令在通过计算机或其他可编程数据处理装置的处理单元执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
可以把计算机可读程序指令加载到计算机、其他可编程数据处理装置、或其他设备上,使得在计算机、其他可编程数据处理装置或其他设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其他可编程数据处理装置、或其他设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本公开的多个实现的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
以上已经描述了本公开的各实现,上述说明是示例性的,并非穷尽性的,并且也不限于所公开的各实现。在不偏离所说明的各实现的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各 实现的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其他普通技术人员能理解本文公开的各个实现方式。

Claims (20)

  1. 一种用于文档与摘要的一致性检测的方法,包括:
    确定第一样本和第一标注信息,所述第一标注信息指示所述第一样本包括的第一摘要与第一文档不一致,所述第一摘要的多个文本单元中的至少一个文本单元被标记为与所述第一文档不一致;
    通过向所述第一样本施加干扰信息来生成第一对抗样本,所述干扰信息被施加到所述第一样本以及所述第一摘要中除所述至少一个文本单元之外的其他文本单元;以及
    至少基于所述第一样本、所述第一对抗样本和所述第一标注信息,根据训练目标来训练一致性检测模型,所述一致性检测模型被配置为检测摘要与文档是否一致,所述训练目标被配置为使所述一致性检测模型对所述第一样本和所述第一对抗样本的检测结果与所述第一标注信息之间的差异均在预定阈值内。
  2. 根据权利要求1所述的方法,其中确定所述第一样本和所述第一标注信息包括:
    获取包括所述第一文档和第二摘要的第二样本和第二标注信息,所述第二标注信息指示所述第二摘要与所述第一文档一致;
    通过修改所述第二摘要中的至少一个文本单元来生成所述第一摘要;
    将所述第一文档和所述第一摘要组成所述第一样本;以及
    生成所述第一标注信息,以指示所述第一文档与所述第一摘要不一致。
  3. 根据权利要求2所述的方法,其中所述第一摘要中包括的被修改后的所述至少一个文本单元被标记为与所述第一文档不一致。
  4. 根据权利要求2或3所述的方法,其中生成所述第一摘要包括:
    通过以下至少一项来修改所述第二摘要中的至少一个文本单元:
    将所述第二摘要中的实体替换为所述第一文档中具有相同类型 的另一实体,
    将所述第二摘要中的代词替换为另一代词,
    将所述第二摘要中的肯定形式的动词修改为否定形式的动词,以及
    将所述第二摘要中的否定形式的动词修改为肯定形式的动词。
  5. 根据权利要求1至4中任一项所述的方法,还包括通过以下来确定要施加的所述干扰信息:
    将所述第一样本应用于所述一致性检测模型,以获得所述一致性检测模型输出的第一检测结果,所述第一检测结果指示所述第一样本中的所述第一文档与所述第一摘要是否一致;
    基于所述第一检测结果与所述第一标注信息之间的第一差异,确定针对所述第一样本的总干扰信息;以及
    从所述总干扰信息中过滤出要施加到所述第一摘要中被标记为不一致的所述至少一个文本单元的信息部分,以获得所述干扰信息。
  6. 根据权利要求1至5中任一项所述的方法,其中训练所述一致性检测模型包括:
    将所述第一样本和所述第一对抗样本分别应用于所述一致性检测模型,以分别获得所述一致性检测模型输出的第一检测结果和第二检测结果,所述第一检测结果指示所述第一样本中的所述第一文档与所述第一摘要是否一致,所述第二检测结果指示所述第一文档与所述第一干扰摘要是否一致;以及
    至少基于所述第一检测结果与所述第一标注信息之间的第一差异和所述第二检测结果与所述第一标注信息之间的第二差异来更新所述一致性检测模型的参数值。
  7. 根据权利要求1至6中任一项所述的方法,其中训练所述一致性检测模型还包括:
    确定第三样本和第三标注信息,所述第三标注信息指示所述第三样本包括的第三文档与第三摘要一致;
    通过向所述第三文档和所述第三摘要施加干扰信息来生成第三 对抗样本;以及
    还基于所述第三样本、所述第三对抗样本和所述第三标注信息,根据所述训练目标来训练所述一致性检测模型,所述训练目标还被配置为使所述一致性检测模型对所述第三样本和所述第三对抗样本的检测结果与所述第三标注信息之间的差异均在所述预定阈值内。
  8. 根据权利要求1至7中任一项所述的方法,还包括:
    获得源文档和目标摘要;以及
    将所述源文档和所述目标摘要应用于训练后的所述一致性检测模型,以获得所述一致性检测模型输出的目标检测结果,所述目标检测结果指示所述目标摘要与所述源文档是否一致。
  9. 根据权利要求8所述的方法,还包括:
    如果所述目标检测结果指示所述目标摘要与所述源文档不一致,确定所述目标检测结果相对所述目标摘要中的多个目标文本单元的多个变化率;
    基于所述多个变化率,从所述多个目标文本单元中选择至少一个目标文本单元,所述至少一个目标文本单元的变化率比所述目标摘要中的其他文本单元的变化率更大;以及
    提供错误提示信息,以指示所述目标摘要中的所述至少一个目标文本单元是错误的。
  10. 一种电子设备,包括:
    至少一个处理单元;以及
    至少一个存储器,所述至少一个存储器被耦合到所述至少一个处理单元并且存储用于由所述至少一个处理单元执行的指令,所述指令在由所述至少一个处理单元执行时使所述设备执行以下动作:
    确定第一样本和第一标注信息,所述第一标注信息指示所述第一样本包括的第一摘要与第一文档不一致,所述第一摘要的多个文本单元中的至少一个文本单元被标记为与所述第一文档不一致;
    通过向所述第一样本施加干扰信息来生成第一对抗样本,所述干扰信息被施加到所述第一样本以及所述第一摘要中除所述至少 一个文本单元之外的其他文本单元;以及
    至少基于所述第一样本、所述第一对抗样本和所述第一标注信息,根据训练目标来训练一致性检测模型,所述一致性检测模型被配置为检测摘要与文档是否一致,所述训练目标被配置为使所述一致性检测模型对所述第一样本和所述第一对抗样本的检测结果与所述第一标注信息之间的差异均在预定阈值内。
  11. 根据权利要求10所述的设备,其中确定所述第一样本和所述第一标注信息包括:
    获取包括所述第一文档和第二摘要的第二样本和第二标注信息,所述第二标注信息指示所述第二摘要与所述第一文档一致;
    通过修改所述第二摘要中的至少一个文本单元来生成所述第一摘要;
    将所述第一文档和所述第一摘要组成所述第一样本;以及
    生成所述第一标注信息,以指示所述第一文档与所述第一摘要不一致。
  12. 根据权利要求11所述的设备,其中所述第一摘要中包括的被修改后的所述至少一个文本单元被标记为与所述第一文档不一致。
  13. 根据权利要求11或12所述的设备,其中生成所述第一摘要包括:
    通过以下至少一项来修改所述第二摘要中的至少一个文本单元:
    将所述第二摘要中的实体替换为所述第一文档中具有相同类型的另一实体,
    将所述第二摘要中的代词替换为另一代词,
    将所述第二摘要中的肯定形式的动词修改为否定形式的动词,以及
    将所述第二摘要中的否定形式的动词修改为肯定形式的动词。
  14. 根据权利要求10至13中任一项所述的设备,还包括通过以下来确定要施加的所述干扰信息:
    将所述第一样本应用于所述一致性检测模型,以获得所述一致性 检测模型输出的第一检测结果,所述第一检测结果指示所述第一样本中的所述第一文档与所述第一摘要是否一致;
    基于所述第一检测结果与所述第一标注信息之间的第一差异,确定针对所述第一样本的总干扰信息;以及
    从所述总干扰信息中过滤出要施加到所述第一摘要中被标记为不一致的所述至少一个文本单元的信息部分,以获得所述干扰信息。
  15. 根据权利要求10至14中任一项所述的设备,其中训练所述一致性检测模型包括:
    将所述第一样本和所述第一对抗样本分别应用于所述一致性检测模型,以分别获得所述一致性检测模型输出的第一检测结果和第二检测结果,所述第一检测结果指示所述第一样本中的所述第一文档与所述第一摘要是否一致,所述第二检测结果指示所述第一文档与所述第一干扰摘要是否一致;以及
    至少基于所述第一检测结果与所述第一标注信息之间的第一差异和所述第二检测结果与所述第一标注信息之间的第二差异来更新所述一致性检测模型的参数值。
  16. 根据权利要求10至15中任一项所述的设备,其中训练所述一致性检测模型还包括:
    确定第三样本和第三标注信息,所述第三标注信息指示所述第三样本包括的第三文档与第三摘要一致;
    通过向所述第三文档和所述第三摘要施加干扰信息来生成第三对抗样本;以及
    还基于所述第三样本、所述第三对抗样本和所述第三标注信息,根据所述训练目标来训练所述一致性检测模型,所述训练目标还被配置为使所述一致性检测模型对所述第三样本和所述第三对抗样本的检测结果与所述第三标注信息之间的差异均在所述预定阈值内。
  17. 根据权利要求10至16中任一项所述的设备,其中所述动作还包括:
    获得源文档和目标摘要;以及
    将所述源文档和所述目标摘要应用于训练后的所述一致性检测模型,以获得所述一致性检测模型输出的目标检测结果,所述目标检测结果指示所述目标摘要与所述源文档是否一致。
  18. 根据权利要求17所述的设备,其中所述动作还包括:
    如果所述目标检测结果指示所述目标摘要与所述源文档不一致,确定所述目标检测结果相对所述目标摘要中的多个目标文本单元的多个变化率;
    基于所述多个变化率,从所述多个目标文本单元中选择至少一个目标文本单元,所述至少一个目标文本单元的变化率比所述目标摘要中的其他文本单元的变化率更大;以及
    提供错误提示信息,以指示所述目标摘要中的所述至少一个目标文本单元是错误的。
  19. 一种用于文档和摘要的一致性检测的装置,包括
    确定模块,被配置为确定第一样本和第一标注信息,所述第一标注信息指示所述第一样本包括的第一摘要与第一文档不一致,所述第一摘要的多个文本单元中的至少一个文本单元被标记为与所述第一文档不一致;
    对抗生成模块,被配置为通过向所述第一样本施加干扰信息来生成第一对抗样本,所述干扰信息被施加到所述第一样本以及所述第一摘要中除所述至少一个文本单元之外的其他文本单元;以及
    训练模块,被配置为至少基于所述第一样本、所述第一对抗样本和所述第一标注信息,根据训练目标来训练一致性检测模型,所述一致性检测模型被配置为检测摘要与文档是否一致,所述训练目标被配置为使所述一致性检测模型对所述第一样本和所述第一对抗样本的检测结果与所述第一标注信息之间的差异均在预定阈值内。
  20. 一种计算机可读存储介质,其上存储有计算机程序,所述程序被处理器执行时实现根据权利要求1至9中任一项所述的方法。
PCT/CN2022/112869 2021-09-13 2022-08-16 用于文档和摘要的一致性检测的方法、设备和介质 WO2023035883A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111070769.7 2021-09-13
CN202111070769.7A CN113779199B (zh) 2021-09-13 2021-09-13 用于文档和摘要的一致性检测的方法、设备、装置和介质

Publications (1)

Publication Number Publication Date
WO2023035883A1 true WO2023035883A1 (zh) 2023-03-16

Family

ID=78843370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/112869 WO2023035883A1 (zh) 2021-09-13 2022-08-16 用于文档和摘要的一致性检测的方法、设备和介质

Country Status (2)

Country Link
CN (1) CN113779199B (zh)
WO (1) WO2023035883A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779199B (zh) * 2021-09-13 2022-12-27 北京有竹居网络技术有限公司 用于文档和摘要的一致性检测的方法、设备、装置和介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130221A1 (en) * 2017-11-02 2019-05-02 Royal Bank Of Canada Method and device for generative adversarial network training
CN110347819A (zh) * 2019-06-21 2019-10-18 同济大学 一种基于正负样本对抗训练的文本摘要生成方法
CN110991181A (zh) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 用于增强已标注样本的方法和设备
CN111783451A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 用于增强文本样本的方法和装置
CN113204958A (zh) * 2021-05-26 2021-08-03 天九共享网络科技集团有限公司 文档摘要生成方法、装置、存储介质及电子设备
CN113779199A (zh) * 2021-09-13 2021-12-10 北京有竹居网络技术有限公司 用于文档和摘要的一致性检测的方法、设备、装置和介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595629B (zh) * 2018-04-24 2021-08-06 北京慧闻科技发展有限公司 用于答案选择系统的数据处理方法及应用
CN111078892B (zh) * 2019-11-25 2023-05-23 百度在线网络技术(北京)有限公司 对抗样本生成方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130221A1 (en) * 2017-11-02 2019-05-02 Royal Bank Of Canada Method and device for generative adversarial network training
CN110347819A (zh) * 2019-06-21 2019-10-18 同济大学 一种基于正负样本对抗训练的文本摘要生成方法
CN110991181A (zh) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 用于增强已标注样本的方法和设备
CN111783451A (zh) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 用于增强文本样本的方法和装置
CN113204958A (zh) * 2021-05-26 2021-08-03 天九共享网络科技集团有限公司 文档摘要生成方法、装置、存储介质及电子设备
CN113779199A (zh) * 2021-09-13 2021-12-10 北京有竹居网络技术有限公司 用于文档和摘要的一致性检测的方法、设备、装置和介质

Also Published As

Publication number Publication date
CN113779199A (zh) 2021-12-10
CN113779199B (zh) 2022-12-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22866356

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 18558157

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE