WO2018214486A1 - Method, apparatus, and terminal for generating a multi-document summary - Google Patents

Method, apparatus, and terminal for generating a multi-document summary

Info

Publication number
WO2018214486A1
WO2018214486A1 (application PCT/CN2017/116658; CN2017116658W)
Authority: WO — WIPO (PCT)
Prior art keywords: candidate, phrase, importance, candidate sentence, speech
Application number: PCT/CN2017/116658
Other languages: English (en), French (fr)
Inventors: 李丕绩 (Piji Li), 吕正东 (Zhengdong Lu), 李航 (Hang Li)
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2018214486A1
Priority to US 16/688,090, published as US10929452B2

Classifications

    • G06F16/345 — Summarisation for human users (information retrieval of unstructured textual data; browsing; visualisation)
    • G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F40/20 — Natural language analysis
    • G06F40/211 — Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/258 — Heading extraction; Automatic titling; Numbering
    • G06F40/279 — Recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 — Semantic analysis
    • G06N20/00 — Machine learning
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods
    • G06N3/088 — Non-supervised learning, e.g. competitive learning

Definitions

  • The embodiments of the present invention relate to the field of data processing, and in particular to a method, an apparatus, and a terminal for generating a multi-document summary (automatic Multi-Document Summarization, MDS).
  • A known method for generating a summary is: training a corpus with a deep neural network model to obtain word vector representations of feature words; obtaining a candidate sentence set from the corpus according to preset query words; and computing the similarity of the candidate sentences according to the word vector representations of the feature words.
  • Because this method computes the similarity of different candidate sentences in the candidate sentence set from the word vector representations of the feature words, inaccurate feature-word extraction directly degrades the accuracy of the candidate-sentence similarity, which in turn leaves more redundant information in the generated document summary.
  • The present application provides a method, an apparatus, and a terminal for generating a multi-document summary, to solve the problem of redundant information in document summaries generated by the prior art.
  • In a first aspect, the present application provides a method for generating a multi-document summary, including: acquiring a candidate sentence set, the candidate sentence set including the candidate sentences contained in each of a plurality of candidate documents about the same event; training each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in a preset network model, to obtain the importance of each candidate sentence, where the importance of a candidate sentence corresponds to the modulus of one row vector of the cascaded attention matrix, and the cascaded attention matrix is output in the process of optimizing the reconstruction error function with the unsupervised learning model; the importance of a candidate sentence indicates how important the meaning expressed by that sentence is within the plurality of candidate documents; selecting, according to the importance of each candidate sentence, phrases that meet a preset condition from the candidate sentence set as a summary phrase set; and obtaining a summary of the plurality of candidate documents according to the summary phrase set.
  • In this method, because the cascaded attention mechanism considers, when generating the next state of the target sequence, which fragments of the source sequence it should attend to, decoding accuracy is improved and candidate sentences of high importance receive emphasis, while the reconstruction error function reaches its extreme value during unsupervised learning. The cascaded attention mechanism can therefore fuse the attention information of each candidate sentence across the different semantic dimensions of the preset network model, improving the accuracy of the sentence-importance estimates; when phrases meeting the preset condition are then selected from the candidate sentence set as the summary phrase set according to these importances, the redundancy of the summary phrase set is reduced, avoiding the problem of excessive redundant information in the generated document summary.
  • In one implementation, training each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model to obtain the importance of each candidate sentence includes: obtaining m vectors for describing the event according to the preset network model; and optimizing the reconstruction error function, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors for describing the event, and the candidate matrix. When the value of the reconstruction error function is smallest, the modulus of each row vector of the cascaded attention matrix output by the preset network model is taken as the importance of one candidate sentence.
  • The reconstruction error function involves: the relationship between each candidate sentence and the m vectors used to describe the event, the candidate matrix, and the weight corresponding to the candidate matrix, where the candidate matrix is a matrix of m rows × n columns, m and n are positive integers, and n is the number of words included in the plurality of candidate documents. The purpose of the reconstruction error function is to reconstruct each candidate sentence in the candidate sentence set using the m output vectors; a small error indicates that the m vectors extracted from the candidate sentences carry almost all the important information of the event. The key step of this extraction is that the cascaded attention matrix determines which candidate sentences are focused on, so the modulus of each of its row vectors can be regarded as the importance of a candidate sentence.
  • In one implementation, selecting phrases as the summary phrase set includes: filtering out the words in each candidate sentence that do not conform to a preset rule, to obtain each filtered candidate sentence; extracting at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set; calculating, according to the importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from that sentence; and selecting from the phrase set, according to these phrase importances, the first part-of-speech phrases and second part-of-speech phrases that satisfy the preset condition as the summary phrase set.
  • Selecting from the phrase set only the first and second part-of-speech phrases that satisfy the preset condition further prevents the selected summary phrase set from introducing redundant information.
  • In one implementation, obtaining each filtered candidate sentence includes: filtering out the noise in each candidate sentence to obtain a candidate word set corresponding to each candidate sentence, where each candidate sentence includes a plurality of words and each word corresponds to an importance; and, according to the importance of each word, filtering out the words in the candidate word set whose importance is lower than a preset threshold, to obtain each filtered candidate sentence.
  • Filtering out the words whose importance is lower than the preset threshold, in combination with the word importances, further prevents redundant words from being introduced into each candidate sentence.
  • In one implementation, before the words whose importance is lower than the preset threshold are filtered out of the candidate word set corresponding to each candidate sentence, the method further includes: training each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model, to obtain the importance of each of the plurality of different words included in the plurality of candidate documents.
  • In one implementation, training each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model to obtain the importance of each of the plurality of different words includes: optimizing the reconstruction error function, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors for describing the event, and the candidate matrix; when the value of the reconstruction error function is smallest, the modulus of each column vector of the candidate matrix is taken as the importance of one word, yielding the importance of every word.
  • In one implementation, calculating, according to the importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from each candidate sentence includes: acquiring the word frequency of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase; and calculating the phrase importances according to the word frequency of each phrase and the importance of the candidate sentence in which each phrase is located.
  • In one implementation, selecting from the phrase set, according to the phrase importances corresponding to each candidate sentence, the first part-of-speech phrases and second part-of-speech phrases that satisfy the preset condition as the summary phrase set includes: inputting the importance of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase, together with the similarity between the individual phrases, into an integer linear programming function; and, when the integer linear programming function takes its extreme value, determining the candidate weight of each phrase and the connection weight of the similarity between the phrases. The candidate weight of a phrase is used to determine whether that phrase satisfies the preset condition; the connection weight is used to determine whether two similar phrases are selected at the same time.
  • In a second aspect, an embodiment of the present invention provides an apparatus for generating a multi-document summary, including: an obtaining unit, configured to acquire a candidate sentence set, the candidate sentence set including the candidate sentences contained in each of a plurality of candidate documents about the same event; an estimating unit, configured to train each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in a preset network model, to obtain the importance of each candidate sentence, where the importance of a candidate sentence corresponds to the modulus of one row vector of the cascaded attention matrix output while the preset network model optimizes the reconstruction error function with the unsupervised learning model, and the importance of a candidate sentence indicates how important the meaning expressed by that sentence is within the plurality of candidate documents; a selecting unit, configured to select, according to the importance of each candidate sentence, phrases that meet a preset condition from the candidate sentence set as a summary phrase set; and a generating unit, configured to obtain a summary of the plurality of candidate documents according to the summary phrase set.
  • In one implementation, the acquiring unit is further configured to obtain, according to the preset network model, the m vectors for describing the event; the reconstruction error function is optimized, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors, and the candidate matrix, and when its value is smallest the modulus of each row vector of the cascaded attention matrix output by the preset network model is taken as the importance of one candidate sentence.
  • In one implementation, the apparatus provided by the embodiment of the present invention further includes: a filtering unit, configured to filter out the words in each candidate sentence that do not conform to the preset rule, to obtain each filtered candidate sentence; and an extracting unit, configured to extract at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set.
  • The estimating unit is further configured to calculate, according to the importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from that sentence; the selecting unit is further configured to select from the phrase set, according to these phrase importances, the first part-of-speech phrases and second part-of-speech phrases that satisfy the preset condition as the summary phrase set.
  • In one implementation, the filtering unit is specifically configured to: filter out the noise in each candidate sentence to obtain a candidate word set corresponding to each candidate sentence, where each candidate sentence includes a plurality of words and each word corresponds to an importance; and filter out, according to the importance of each word, the words in the candidate word set whose importance is lower than the preset threshold, to obtain each filtered candidate sentence.
  • In one implementation, the estimating unit is further configured to train each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model, to obtain the importance of each of the plurality of different words included in the plurality of candidate documents.
  • The estimating unit is further configured to optimize the reconstruction error function, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors for describing the event, and the candidate matrix; when the value of the reconstruction error function is smallest, the modulus of each column vector of the candidate matrix is taken as the importance of one word, yielding the importance of every word.
  • The acquiring unit is further configured to acquire the word frequency of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase; the estimating unit is further configured to calculate, according to the word frequency of each phrase and the importance of the candidate sentence in which each phrase is located, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from each candidate sentence.
  • The acquiring unit is specifically configured to input the importance of each phrase, together with the similarity between the individual phrases, into the integer linear programming function, and, when the integer linear programming function takes its extreme value, determine the candidate weight of each phrase and the connection weight of the similarity between the phrases; the selecting unit is specifically configured to determine, according to the candidate weight of each phrase and the connection weight of the similarity between the phrases, the phrases that satisfy the preset condition, where the candidate weight of a phrase is used to determine whether that phrase satisfies the preset condition.
  • In a third aspect, an embodiment of the present invention provides a terminal, where the terminal includes a processor, a memory, a system bus, and a communication interface. The memory is configured to store computer-executable instructions, and the processor and the memory are connected through the system bus. When the terminal runs, the processor executes the computer-executable instructions stored in the memory, causing the terminal to perform the multi-document summary generation method described in the first aspect and its possible implementations.
  • In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium including instructions that, when run on a terminal, cause the terminal to perform the multi-document summary generation method described in the first aspect and its possible implementations.
  • FIG. 1 is a schematic structural diagram 1 of an apparatus for generating multiple document digests according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart 1 of a method for generating multiple document digests according to an embodiment of the present invention
  • FIG. 3 is a schematic flowchart 2 of a method for generating multiple document digests according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart 3 of a method for generating multiple document digests according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart 4 of a method for generating multiple document digests according to an embodiment of the present invention
  • FIG. 6 is a schematic structural diagram 2 of a device for generating multiple document digests according to an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram 3 of a device for generating multiple document digests according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
  • FIG. 1 is a schematic structural diagram of a multi-document summary generating apparatus according to an embodiment of the present invention.
  • The apparatus includes: a data processing module 101, an importance estimation module 102 connected to the data processing module 101, and a digest generation module 103 coupled to the importance estimation module 102.
  • The candidate sentences input to the importance estimation module 102 are not limited to x_1, x_2, x_3, and x_4; there may be more than these four. The embodiment of the present invention is described by taking x_1, x_2, x_3, and x_4 as an example.
  • The importance estimation module 102 is built by modeling the cascaded attention mechanism (Cascaded Attention Mechanism) together with the unsupervised learning model. It iteratively trains the N candidate sentences input by the data processing module 101, converging within a maximum of 300 iterations, and finally outputs the importance of each candidate sentence and the importance of each word; the sentence importances are used to determine the final summary phrase set, and the word importances are used to filter redundant information out of each candidate sentence.
  • The importance estimation module 102 applies the cascaded attention mechanism while estimating candidate-sentence importance within a data-reconstruction framework. Because the hidden-layer vectors and the output-layer vectors of the preset network model belong to different vector spaces and represent different semantics, different attention calculation methods are introduced for the different semantic representations. This further improves the estimation of candidate-sentence importance: the candidate sentences or phrases of highest importance are finally extracted to form the summary phrase set and, subsequently, the multi-document summary, so that the redundancy of the final summary is reduced and the generated summary covers the main content of the event more accurately.
  • the importance estimation module 102 provided by the embodiment of the present invention can improve the decoding effect by modeling the cascading attention mechanism, and the information of the cascading attention mechanism matrix can be used to estimate the importance of each candidate sentence.
  • This application proposes a cascading attention mechanism, which aims to fuse the attention information of different semantic dimensions to further improve the accuracy of sentence importance estimation.
  • The importance estimation module 102 operates in two phases: a Reader phase, also referred to as the encoding phase, and a Recaller phase, also referred to as the decoding phase.
  • The initial sentence vectors are based on the bag-of-words model, which suffers from sparseness, inaccurate semantic description, the curse of dimensionality, and similar problems. The reading process therefore first maps each sentence into a hidden layer of a neural network to generate a dense embedded vector representation, and then uses a Recurrent Neural Network (RNN) coding model established in the encoding layer (Enc layer) to map all the candidate sentences to new states. The state at the last moment is taken as the global variable c_g of the event; c_g thus reflects the information of all the candidate documents about the event, and is then passed into the decoding phase.
  • RNN: Recurrent Neural Networks
  • Enc layer: Encoding layer
  • In the encoding phase, the importance estimation module 102 maps each candidate sentence into the hidden layer of the encoding stage, so that each candidate sentence is represented by a dense embedded vector (h denotes the hidden layer).
  • The importance estimation module 102 can then further encode all the candidate sentences, represented by their dense embedded vectors, into a single vector by using the RNN model established in the coding layer; this vector serves as the global semantic vector c_g reflecting the plurality of candidate documents.
  • The mapping logic of the RNN model is: h_t^e = f(x_t, h_{t-1}^e), where the superscript e denotes the RNN model of the coding stage, and f(·) is a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU) model, or a plain RNN model. Here h_t^e represents the state vector of each candidate sentence at the t-th moment of the encoding phase, x_t is the embedded vector input at the t-th moment, and h_{t-1}^e is the state vector at the (t-1)-th moment of the encoding phase.
  • LSTM: Long Short-Term Memory
  • GRU: Gated Recurrent Unit
  • In the example of FIG. 1, the candidate sentences x_1, x_2, x_3, and x_4 are each first mapped to their embedded vector representations, from which the state vector of each candidate sentence at the t-th moment of the encoding phase is calculated. Because the cascaded attention mechanism considers, when generating the next state of the target sequence, which fragments of the source sequence it should attend to, the state vector of candidate sentence x_1 at time t is obtained by entering its embedded vector into the f(·) model; the state vector of x_2 at time t is obtained by entering its embedded vector together with the state of x_1 into the f(·) model; and the state vectors of the remaining candidate sentences follow in the same way, so they are not repeated here.
  • The embodiment of the present invention takes the RNN model as an example: the RNN model maps all sentences to new states and takes the state at the last moment as the global variable c_g of the event. As shown in FIG. 1, c_g is obtained by entering the embedded vector of the last candidate sentence and the state vector of candidate sentence x_3 at time t into the f(·) model, so c_g reflects the information of all the candidate documents about the event; processing then enters the decoding phase (a code sketch of this encoding pass follows).
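  • The following minimal sketch illustrates the encoding pass described above. It is not the patent's implementation: the dimensions, the plain tanh recurrent cell standing in for f(·) (which the patent allows to be an LSTM, GRU, or RNN cell), and the random inputs are all illustrative assumptions.

```python
# Sketch of the Reader/encoding phase: each candidate sentence vector x_i is
# embedded into a dense representation, a recurrent cell f(.) consumes the
# embeddings one sentence at a time, and the last state is the global vector c_g.
import numpy as np

rng = np.random.default_rng(0)
n_words, embed_dim, state_dim, n_sents = 1000, 64, 32, 4

W_embed = rng.normal(scale=0.1, size=(embed_dim, n_words))   # embedding map
W_in = rng.normal(scale=0.1, size=(state_dim, embed_dim))    # input-to-state
W_rec = rng.normal(scale=0.1, size=(state_dim, state_dim))   # state-to-state

def f(x_t, h_prev):
    """One step of the recurrent model f(.): h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    return np.tanh(W_in @ x_t + W_rec @ h_prev)

X = rng.random((n_sents, n_words))            # bag-of-words vectors x_1..x_4
embedded = [W_embed @ x for x in X]           # dense embedded vector per sentence
h = np.zeros(state_dim)
encoder_states = []
for x_t in embedded:                          # h_t^e = f(x_t, h_{t-1}^e)
    h = f(x_t, h)
    encoder_states.append(h)
c_g = encoder_states[-1]                      # global event vector: last state
```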
  • Dec layer: decoding layer
  • The decoding phase outputs m vectors capable of describing m different aspects of the event's information.
  • A decoding model based on the RNN model is established in the decoding layer, of the standard recurrent form h_t^d = f(y_{t-1}, h_{t-1}^d), where the superscript d denotes the decoding layer and o denotes the output layer.
  • In the output layer, a further mapping is added: the output layer remaps the hidden-layer vector to a vector of the dictionary dimension that can represent the aspect information of the event, for example y_t = σ(W^o h_t^d + b^o).
  • The preset network model is established from the cascaded attention mechanism and the unsupervised learning model.
  • The source vectors, that is, the N candidate sentences, are encoded by the RNN model of the coding layer into an intermediate vector of fixed dimension; the RNN model of the decoding layer then decodes ("translates") this intermediate vector into the target vectors, for example the m aspect vectors.
  • Establishing the preset network model with the cascaded attention mechanism improves the decoding effect, and the modulus of each row vector in the cascaded attention matrix can be used to estimate the importance of a sentence.
  • The present application introduces the cascaded attention mechanism in the hidden layer of the decoding stage, where the attention weight of the decoder state over the i-th encoder state is computed with the concat (additive) scoring form, a_{t,i} ∝ exp(v_a^T · tanh(W_a [h_t^d ; h_i^e])).
  • The cascaded attention mechanism is introduced not only in the hidden layer of the decoding stage but also in the output layer of the decoding stage, where the dot scoring form is used and the attention information of the hidden layer is integrated, for example as a weighted combination of the two attention distributions.
  • λ_a is the weight of the attention information, and the model learns it automatically.
  • The concat method is used in the hidden layer of the decoding stage and the dot method in the output layer, which further improves the accuracy of the candidate-sentence importance estimation (a code sketch follows).
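  • A sketch of the two scoring forms named above, with the hidden-layer ("concat") and output-layer ("dot") attention distributions fused by a weight λ_a. The convex-combination fusion and all tensors are illustrative assumptions rather than the patent's exact formulas.

```python
# Sketch of cascaded attention: concat (additive) scores in the decoder hidden
# layer, dot scores in the output layer, fused with a learned weight lambda_a.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
state_dim, n_sents = 32, 4
enc_states = rng.normal(size=(n_sents, state_dim))    # h_i^e for each sentence
h_dec = rng.normal(size=state_dim)                    # decoder hidden state h_t^d
h_out = rng.normal(size=state_dim)                    # decoder output-layer state

W_a = rng.normal(scale=0.1, size=(state_dim, 2 * state_dim))
v_a = rng.normal(scale=0.1, size=state_dim)
lam_a = 0.5                                           # learned during training

# concat score: v_a^T tanh(W_a [h_t^d ; h_i^e])
concat_scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([h_dec, h_e]))
                          for h_e in enc_states])
# dot score: h_t^o . h_i^e
dot_scores = enc_states @ h_out

attn_hidden = softmax(concat_scores)
attn_output = softmax(dot_scores)
attn_fused = lam_a * attn_hidden + (1 - lam_a) * attn_output  # cascaded attention
```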
  • The cascaded attention mechanism provides only part of the components and parameters of the preset network model; the present application solves for the parameters through the unsupervised learning model, as follows.
  • The training goal of the model is to reconstruct the initial N sentence vectors X from the m subject-aspect vectors Y, which is an unsupervised data-reconstruction process. The goal of training is to minimize the reconstruction error, for example J = (1/N) Σ_{i=1}^{N} ||x_i − x̂_i||², where x̂_i is the reconstruction of candidate sentence x_i from the m aspect vectors.
  • The modulus of the row vector of the output-layer cascaded attention matrix corresponding to each sentence is used as the sentence-importance score, and the modulus of each column vector of the candidate matrix Y output by the output layer is used as the word-importance score.
  • The candidate matrix Y is the matrix constructed by using the m vectors as row vectors, with the n words as its columns (a code sketch of the score extraction follows).
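  • The sketch below shows how, under the conventions just described, the sentence and word importance scores are read off once training has minimized the reconstruction error. The squared-error objective and the random matrices are illustrative stand-ins for the trained model.

```python
# Sketch: row i of the output-layer cascaded attention matrix A corresponds to
# candidate sentence x_i; column j of the candidate matrix Y (m aspect vectors
# x n words) corresponds to word j.
import numpy as np

rng = np.random.default_rng(2)
N, m, n = 4, 3, 1000                      # sentences, aspect vectors, vocabulary
X = rng.random((N, n))                    # candidate sentence vectors
Y = rng.random((m, n))                    # candidate matrix: m aspect vectors
A = rng.random((N, m))                    # cascaded attention matrix (per sentence)

X_hat = A @ Y                             # reconstruct each x_i from the m aspects
J = np.sum((X - X_hat) ** 2) / N          # reconstruction error to be minimized

sentence_importance = np.linalg.norm(A, axis=1)   # modulus of each row vector
word_importance = np.linalg.norm(Y, axis=0)       # modulus of each column vector
```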
  • The digest generation module 103 is configured to: remove redundant information from the plurality of candidate documents to obtain the summary phrase set, combine the summary phrase set into summary sentences according to a preset combination manner, and obtain and output the summary of the plurality of candidate documents.
  • In removing redundant information from the plurality of candidate documents, the digest generation module 103 performs two processes. The first is coarse-grained sentence filtering, that is, filtering out the more obvious noise in each candidate sentence according to empirical rules.
  • The second is fine-grained sentence filtering: each coarsely filtered candidate sentence is parsed into a syntax tree by a syntactic parser, noun phrases and verb phrases are extracted from the syntax tree of each candidate sentence, and the importance of these noun and verb phrases is calculated based on the importance of each candidate sentence. Finally, an Integer Linear Programming (ILP) model selects phrases while ensuring grammatical correctness: phrases whose importance does not satisfy the preset requirement are removed from the syntax tree of each candidate sentence, and phrases whose importance satisfies the requirement are retained. Because the ILP model does not select phrases of unsatisfactory importance, it further filters the redundancy in each candidate sentence at a fine-grained level.
  • The apparatus for generating a multi-document summary shown in FIG. 1 may, in actual use, include more components than those shown in FIG. 1, which is not limited by the embodiment of the present invention.
  • a method for generating a multi-document summary is implemented by a device for generating a multi-document summary as shown in FIG. 1, the method comprising:
  • S101: the multi-document summary generating device obtains a candidate sentence set, where the candidate sentence set includes the candidate sentences contained in each of a plurality of candidate documents about the same event.
  • the multiple candidate documents in the embodiment of the present invention are related to the same event.
  • the embodiment of the present invention does not limit the event.
  • The embodiment of the present invention does not limit the event; any documents related to the same event can be used as the candidate documents of the present application. The plurality of candidate documents may be news reports about the same event, or other articles related to the same event, which is not limited by the embodiment of the present invention.
  • the embodiment of the present invention takes the news report of the same event as an example, and the event may be a news report of “a certain earthquake” or the like.
  • the number of the plurality of candidate documents may be set as needed in the actual use, which is not limited by the embodiment of the present invention.
  • For example, the number of candidate documents may be 10 to 20.
  • each candidate sentence included in the candidate sentence set in the embodiment of the present invention is represented in the form of a vector.
  • Each candidate sentence can be represented by an n-dimensional vector, where n is the number of words included in the plurality of candidate documents (see the sketch below).
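  • A minimal sketch of this n-dimensional representation, assuming simple whitespace tokenization and raw count vectors (the patent does not prescribe a particular weighting):

```python
# Sketch: each candidate sentence becomes a vector over the n distinct words
# of all candidate documents.
from collections import Counter

candidate_sentences = [
    "an armed man walked into an amish school",
    "the man sent the boys outside",
]
vocab = sorted({w for s in candidate_sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}       # word -> dimension

def to_vector(sentence):
    counts = Counter(sentence.split())
    return [counts.get(w, 0) for w in vocab]      # n-dimensional count vector

X = [to_vector(s) for s in candidate_sentences]
```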
  • S102: the multi-document summary generating device trains each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model, to obtain the importance of each candidate sentence; the importance of a candidate sentence corresponds to the modulus of one row vector of the cascaded attention matrix.
  • The cascaded attention matrix is output in the process of optimizing the reconstruction error function with the unsupervised learning model; the importance of a candidate sentence indicates how important the meaning expressed by that sentence is within the plurality of candidate documents.
  • All the candidate sentences in the candidate sentence set may be input into the importance estimation module shown in FIG. 1 for iterative training, converging within a maximum of 300 iterations.
  • The modulus of each row vector of the cascaded attention matrix is taken as the importance of one candidate sentence.
  • S103: the multi-document summary generating device selects, according to the importance of each candidate sentence, phrases that meet the preset condition from the candidate sentence set as the summary phrase set.
  • S104: the multi-document summary generating device obtains the summary of the plurality of candidate documents according to the summary phrase set.
  • The embodiment of the present invention provides a method for generating a multi-document summary that trains each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and the unsupervised learning model in the preset network model, obtaining the importance of each candidate sentence in the set. Because the cascaded attention mechanism considers, when generating the next state of the target sequence, which fragments of the source sequence it should attend to, decoding accuracy is improved, candidate sentences of high importance receive emphasis, and the reconstruction error function reaches its extreme value during the unsupervised learning process.
  • The attention mechanism can fuse the attention information of each candidate sentence across the different semantic dimensions of the preset network model, improving the accuracy of the sentence-importance estimates; when phrases meeting the preset condition are selected from the candidate sentence set as the summary phrase set according to the importance of each candidate sentence, the redundancy in the summary phrase set is reduced, avoiding the problem of excessive redundant information in the generated document summary.
  • Step S102 provided by the embodiment of the present invention may be specifically implemented by steps S105 and S106 as shown in FIG. 3:
  • S105: the multi-document summary generating device acquires the m vectors for describing the event according to the preset network model.
  • S106: the multi-document summary generating device optimizes the reconstruction error function, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors for describing the event, and the candidate matrix.
  • The importance of each candidate sentence is obtained by taking the modulus of each row vector of the cascaded attention matrix of the preset network model as the importance of one candidate sentence. The reconstruction error function involves the relationship between each candidate sentence and the m vectors used to describe the event, the candidate matrix, and the weight corresponding to the candidate matrix; the candidate matrix is a matrix of m rows × n columns, where m and n are positive integers and n is the number of words in the plurality of candidate documents.
  • The reconstruction error function reconstructs the initial N sentence vectors x_i from the m vectors; the training target J is optimized in the process of running the unsupervised learning model, and when the value of the reconstruction error function is smallest, the modulus of each row vector of the cascaded attention matrix of the preset network model is taken as the importance of one candidate sentence.
  • Step S103 in the embodiment of the present invention may be implemented by steps S107-S110 as shown in FIG. 3:
  • S107: the multi-document summary generating device filters out the words in each candidate sentence that do not conform to the preset rule, obtaining each filtered candidate sentence.
  • The apparatus provided by the embodiment of the present invention is further configured to parse each filtered candidate sentence into a corresponding syntax tree by using a syntactic parser.
  • The parser may construct the syntax tree of each candidate sentence by semantic analysis of the candidate sentences in the plurality of candidate documents, decomposing each candidate sentence into a plurality of phrases, each decomposed phrase forming a branch of the syntax tree.
  • The syntactic parser in the embodiment of the present invention may be an internal device of the multi-document summary generating apparatus, that is, the apparatus itself includes the parser; the parser may also be external to the apparatus. For example, the apparatus may obtain the syntax tree of each candidate sentence through a parser requested over the network, which is not limited by the embodiment of the present invention.
  • The multi-document summary generating device may acquire the phrase set of each candidate sentence from all the phrases included in its syntax tree; the phrase set of each candidate sentence may include noun phrases, verb phrases, numeral phrases, adjective phrases, and the like, the specific part-of-speech types depending on the phrases contained in each candidate sentence, which is not limited by the embodiment of the present invention.
  • the multi-document summary generating device may acquire at least one first part-of-speech phrase and at least one second part-of-speech phrase from the phrase set of each candidate sentence.
  • parsing tools may be used to parse each candidate sentence into a syntax tree to obtain a phrase set of each candidate sentence.
  • the preset rule in the embodiment of the present invention may be set according to experience or actual requirements, which is not limited by the embodiment of the present invention.
  • The apparatus filters out the words in the candidate sentence set that do not conform to the preset rule; that is, the obvious noise in each candidate sentence is filtered out, for example "A certain newspaper reported that", "Some TV station reported that", "...he said", and so on.
  • Step S107 in the embodiment of the present invention may be specifically implemented by steps S107a and S107b:
  • S107a: the multi-document summary generating device filters out the noise in each candidate sentence, obtaining a candidate word set corresponding to each candidate sentence; each candidate sentence includes a plurality of words, and each word corresponds to an importance.
  • the apparatus for generating multiple document digests in the embodiment of the present invention filters out noise in each candidate sentence according to an empirical rule.
  • S107b: the multi-document summary generating device filters out, according to the importance of each word, the words in the candidate word set corresponding to each candidate sentence whose importance is lower than the preset threshold, obtaining each filtered candidate sentence (a code sketch follows).
  • The preset threshold is not limited in the embodiment of the present invention and may be set as needed in actual use; however, to avoid introducing noise into the summary phrase set that finally composes the summary, the preset threshold may be set relatively high.
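  • A sketch of the two filtering stages of step S107. The regular-expression noise rules and the threshold value 0.1 are illustrative assumptions, and word_importance stands for the column norms of the candidate matrix Y computed in step S111.

```python
# Sketch: coarse-grained rule-based noise removal, then fine-grained removal of
# words whose learned importance falls below a threshold.
import re

NOISE_PATTERNS = [r"^\s*\w+ (newspaper|tv station) reported that\s*",
                  r",?\s*he said\.?$"]

def coarse_filter(sentence):
    for pat in NOISE_PATTERNS:
        sentence = re.sub(pat, "", sentence, flags=re.IGNORECASE)
    return sentence.strip()

def fine_filter(sentence, word_importance, threshold=0.1):
    # keep only words whose importance reaches the threshold
    return " ".join(w for w in sentence.split()
                    if word_importance.get(w, 0.0) >= threshold)

importance = {"earthquake": 0.9, "reported": 0.02, "damage": 0.7}
s = coarse_filter("Some TV station reported that earthquake damage spread")
s = fine_filter(s, importance)   # -> "earthquake damage"
```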
  • S108: the multi-document summary generating device extracts at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set.
  • FIG. 5 shows a syntax tree structure of a candidate sentence.
  • A candidate sentence is parsed into a syntax tree that includes noun phrases (NP) and verb phrases (VP).
  • A noun phrase may include an article, an adjective (JJ), and a noun (NN); for example, in FIG. 5 the indefinite article "An" and the noun "man".
  • Two VPs in the syntax tree of a candidate sentence may also be connected by a coordinating conjunction (CC).
  • The specific types of verb phrase (VP) are not enumerated in the embodiment of the present invention; a verb phrase may be composed of, for example, a verb plus a prepositional phrase (PP), or a verb plus a noun phrase, such as "walked into an Amish school" in FIG. 5 (NNS in FIG. 5 denotes a plural noun).
  • The obtained verb phrases further include: "sent the boys outside", "tied up and shot the girls", "killing three of them" (a code sketch of phrase extraction follows).
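  • A sketch of extracting NP and VP phrases from a parsed candidate sentence such as the one in FIG. 5. It assumes a bracketed parse is already available from a syntactic parser and uses nltk.Tree only to walk the tree; the parse string shown is a hand-written illustration.

```python
# Sketch: collect NP and VP subtrees from a constituency parse.
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (DT An) (JJ armed) (NN man))"
    "   (VP (VBD walked) (PP (IN into) (NP (DT an) (JJ Amish) (NN school)))))")

def extract_phrases(tree, labels=("NP", "VP")):
    phrases = []
    for subtree in tree.subtrees(lambda t: t.label() in labels):
        phrases.append((subtree.label(), " ".join(subtree.leaves())))
    return phrases

print(extract_phrases(parse))
# [('NP', 'An armed man'), ('VP', 'walked into an Amish school'),
#  ('NP', 'an Amish school')]
```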
  • S109: the multi-document summary generating device calculates, according to the importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from that sentence; the at least one first part-of-speech phrase and the at least one second part-of-speech phrase belong to the phrase set.
  • step S109 can be specifically implemented by using step S109a and step S109b:
  • S109a: the multi-document summary generating device acquires the word frequency, within the plurality of candidate documents, of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase.
  • The "word frequency" in the embodiment of the present invention refers to the sum of the frequencies with which a word occurs in each of the plurality of candidate documents.
  • S109b: the multi-document summary generating device calculates, according to the word frequency of each phrase and the importance of the candidate sentence in which each phrase is located, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from each candidate sentence.
  • The importance of a phrase inherits the importance of the candidate sentence in which it is located (that is, the attention value of that candidate sentence), weighted by the word frequencies of the words in the phrase. Consistent with the symbols below, it can be written as S_i = a_i · ( Σ_{t ∈ P_i} tf(t) / Σ_{t ∈ Topic} tf(t) ), where i denotes the number of the phrase, S_i denotes the importance of the phrase numbered i, a_i denotes the importance of the candidate sentence containing the phrase numbered i, tf(t) denotes the word frequency of word t, Topic denotes all the words in the plurality of candidate documents about the same event, and P_i denotes the phrase numbered i (a code sketch follows).
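  • A sketch of steps S109a/S109b under the formula reconstructed above; the frequency counts and the attention value are illustrative.

```python
# Sketch: a phrase inherits the attention value a_i of its candidate sentence,
# scaled by the relative frequency of its words across all candidate documents.
from collections import Counter

def phrase_importance(phrase_words, sentence_attention, tf, topic_total):
    return sentence_attention * sum(tf[w] for w in phrase_words) / topic_total

tf = Counter({"amish": 12, "school": 30, "man": 25, "walked": 8})
topic_total = sum(tf.values())                 # sum of tf over all topic words
a_i = 0.83                                     # attention value of the sentence
s = phrase_importance(["walked", "amish", "school"], a_i, tf, topic_total)
```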
  • the importance of a candidate sentence is used to measure the degree of importance of the information or content represented by the candidate sentence in expressing the semantics of the candidate document in which it is located.
  • the first part-of-speech phrase in the embodiment of the present invention may be a noun part-of-speech phrase (abbreviation: noun phrase), and the second part-of-speech phrase may be a verb part-of-speech phrase (abbreviation: verb phrase).
  • the present application may also include other part-of-speech phrases, such as adjective phrases, numeral phrases, and the like, depending on the phrases contained in the plurality of candidate documents, which are not limited herein. It can be understood that in natural language processing, noun phrases actually contain pronouns, and pronouns are considered as one type of nouns.
  • Noun phrase (NP) selection: the subject of each candidate sentence is composed of noun phrases, and such noun phrases are selected as candidate subjects for generating new sentences; for example, "An armed man" can be selected as a noun phrase in FIG. 5.
  • Verb phrase (VP) selection: the predicate (verb-object structure) of a sentence is composed of verb phrases, and such verb phrases are selected as candidate predicates for generating new sentences; for example, from "walked into an Amish school, sent the boys outside and tied up and shot the girls, killing three of them" in FIG. 5, the phrases "walked into an Amish school", "sent the boys outside", and "tied up and shot the girls, killing three of them" are selected.
  • S110: the multi-document summary generating device selects from the phrase set, according to the phrase importances corresponding to each candidate sentence, the first part-of-speech phrases and second part-of-speech phrases that satisfy the preset condition as the summary phrase set.
  • step S110 can be specifically implemented by:
  • S110a: the multi-document summary generating device inputs the importance of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase, together with the similarity between the individual phrases, into the integer linear programming function; when the integer linear programming function takes its extreme value, the candidate weight of each phrase and the connection weight of the similarity between the phrases are determined. The candidate weight of a phrase is used to determine whether that phrase satisfies the preset condition.
  • S110b: the multi-document summary generating device determines the phrases satisfying the preset condition according to the candidate weight of each phrase and the connection weight of the similarity between the phrases.
  • The preset condition includes constraints on the phrase features and on the similarity between the phrases in a phrase combination; phrases that do not meet the preset condition are eliminated, and the first part-of-speech phrases and second part-of-speech phrases satisfying the preset condition are retained as the summary phrase set.
  • A candidate weight of 1 for a phrase indicates that, when the integer linear programming function takes its extreme value, the phrase satisfies the preset condition; a candidate weight of 0 indicates that, when the integer linear programming function takes its extreme value, the phrase does not satisfy the preset condition.
  • The similarity between two phrases indicates the redundancy of the phrases within the plurality of candidate documents; by constraining the phrase features and the similarity between the phrases, the preset condition filters the phrases by importance and redundancy.
  • Consistent with the symbols below, the integer linear programming function may take the form: maximize Σ_i λ_i·S_i − Σ_{i<j} λ_ij·R_ij·(S_i + S_j).
  • Here P_i represents the phrase numbered i, P_j represents the phrase numbered j, S_i represents the importance parameter value of the phrase P_i, S_j represents the importance parameter value of the phrase P_j, R_ij represents the similarity between the phrase P_i and the phrase P_j, λ_ij represents the connection weight of the similarity between the phrase P_i and the phrase P_j, and λ_i represents the candidate weight of the phrase numbered i.
  • The candidate weight of a phrase is used to determine whether that phrase satisfies the preset condition; the connection weight is used to determine whether two similar phrases are selected at the same time.
  • The similarity between phrases measures the degree of semantic similarity between them.
  • The pairwise similarity between verb phrases and between noun phrases may be calculated with, for example, the cosine similarity or the Jaccard index (a code sketch of the ILP selection follows).
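  • A sketch of the phrase-selection ILP using the open-source PuLP solver (the patent does not name a solver). The linearization constraints tying the pair variables λ_ij to the candidate variables λ_i are a standard modeling device and an assumption here, as is the toy data.

```python
# Sketch: maximize sum_i lam_i*S_i - sum_{i<j} lam_ij*R_ij*(S_i+S_j)
# with binary candidate weights lam_i and connection weights lam_ij.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary

S = [0.9, 0.7, 0.6]                           # phrase importances S_i
R = {(0, 1): 0.8, (0, 2): 0.1, (1, 2): 0.2}   # pairwise similarities R_ij

prob = LpProblem("phrase_selection", LpMaximize)
lam = [LpVariable(f"lam_{i}", cat=LpBinary) for i in range(len(S))]
lam_pair = {ij: LpVariable(f"lam_{ij[0]}_{ij[1]}", cat=LpBinary) for ij in R}

prob += (lpSum(lam[i] * S[i] for i in range(len(S)))
         - lpSum(lam_pair[ij] * R[ij] * (S[ij[0]] + S[ij[1]]) for ij in R))

for (i, j), var in lam_pair.items():
    prob += var >= lam[i] + lam[j] - 1   # pair variable fires when both selected
    prob += var <= lam[i]
    prob += var <= lam[j]

prob.solve()
selected = [i for i in range(len(S)) if lam[i].value() == 1]
```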
  • The method provided by the embodiment of the present invention further includes:
  • S111: the multi-document summary generating device optimizes the reconstruction error function, in the process of running the unsupervised learning model, according to each candidate sentence, the m vectors for describing the event, and the candidate matrix; when the value of the reconstruction error function is smallest, the modulus of each column vector of the candidate matrix is taken as the importance of one word, yielding the importance of every word.
  • In the implementation of step S111, a constraint on the candidate matrix Y is applied such that the output vector y is as sparse as possible.
  • step S104 in the embodiment of the present invention may be implemented in the following manner:
  • the summary phrase set is combined in a preset combination manner to obtain a summary of a plurality of candidate documents.
  • the preset combination mode in the embodiment of the present invention may be an existing combination mode, or may be other combination manners, which is not limited by the embodiment of the present invention.
  • step S104 can be specifically implemented by using steps S112-S113:
  • S112: the multi-document summary generating device sorts the plurality of phrases included in the summary phrase set according to the order in which each phrase appears in the candidate sentences of the plurality of candidate documents, to obtain summary sentences.
  • S113: the multi-document summary generating device arranges the summary sentences according to the earliest order in which their verb phrases appear in the plurality of candidate documents, obtaining the summary of the plurality of candidate documents.
  • The method further includes:
  • S114: for a summary sentence that includes a plurality of verb phrases, the multi-document summary generating device adds conjunctions between the verb phrases of the summary sentence (a code sketch follows).
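  • A sketch of steps S112-S114: phrases are ordered by their earliest occurrence in the candidate documents and multiple verb phrases are joined with a conjunction; the helper names and the single-document corpus are illustrative.

```python
# Sketch: order selected verb phrases by earliest position and join them.
def first_occurrence(phrase, documents):
    """Earliest character offset of the phrase across all candidate documents."""
    hits = [d.find(phrase) for d in documents if phrase in d]
    return min(hits) if hits else float("inf")

def compose_sentence(noun_phrase, verb_phrases, documents):
    ordered = sorted(verb_phrases, key=lambda p: first_occurrence(p, documents))
    return noun_phrase + " " + ", and ".join(ordered) + "."

docs = ["An armed man walked into an Amish school and sent the boys outside."]
print(compose_sentence("An armed man",
                       ["sent the boys outside", "walked into an Amish school"],
                       docs))
# -> "An armed man walked into an Amish school, and sent the boys outside."
```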
  • Standard English validation data sets exist for multi-document summarization, such as the DUC 2007 data set and the TAC 2011 data set.
  • The method for generating a multi-document summary provided by the embodiment of the present invention is applied to the DUC 2007 data set and the TAC 2011 data set to determine the effect of the extracted multi-document summaries.
  • DUC 2007 has 45 topics, with 20 news documents per topic and 4 manually written reference summaries; the summary length is limited to 250 words.
  • TAC 2011 has 44 topics, with 10 news documents per topic and 4 manually written reference summaries; the summary length is limited to 100 words.
  • The evaluation metric is the F-measure of ROUGE (Recall-Oriented Understudy for Gisting Evaluation); a code sketch of the scoring follows.
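  • For reference, the ROUGE F-measure can be computed with, for example, the open-source rouge-score package (an assumption; the patent does not name a tool), as sketched below with placeholder texts.

```python
# Sketch: ROUGE-1 and ROUGE-2 F-measures between a reference summary and a
# generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
reference = "an armed man walked into an amish school and shot the girls"
generated = "an armed man shot the girls in an amish school"
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure)
```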
  • the evaluation results are shown in Table 1 and Table 2.
  • Table 1 shows the summary results of applying the method provided by the embodiment of the present invention to the DUC 2007 data set, and Table 2 shows the summary results of applying the method to the TAC 2011 data set.
  • Tables 1 and 2 compare the summaries generated by the present technique on the DUC 2007 and TAC 2011 data sets with the best other unsupervised multi-document summarization models; the results show that the multi-document summary generation method provided by the present application achieves the best results on all indicators, improving the effect of multi-document summarization.
  • The method for generating a multi-document summary is also capable of estimating the importance of the words included in the plurality of candidate documents.
  • Four topics were selected from the TAC 2011 data set, namely "Finland Shooting", "Heart Disease", "HIV Infection Africa", and "Pet Food Recall". For each topic, the 10 words with the highest scores in the dictionary dimensions of the output vector are selected, as shown in Table 3 below:
  • The first 10 words of each topic accurately reflect the main content of that topic, so the method provided by the embodiment of the present invention performs well at predicting word importance.
  • This application also selected several typical topics from the TAC 2011 data set (for example the topic "VTech Shooting" and the topic "Oil Spill South Korea"; the specific content of each topic can be obtained from the TAC 2011 data set and is not repeated here), and compared the multi-document summaries generated by the method provided by the embodiment of the present invention with the manually annotated multi-document summaries, as shown in Table 4 and Table 5:
  • to implement the above functions, the apparatus for generating multiple document summaries includes hardware structures and/or software modules corresponding to each function.
  • combined with the apparatus and method steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in hardware or in a combination of hardware and computer software. Whether a function is executed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
  • in the embodiments of the present invention, the apparatus for generating multiple document digests may be divided into function modules according to the above method examples.
  • each function may correspond to one function module, or two or more functions may be integrated into one processing module.
  • the integrated module may be implemented in the form of hardware or in the form of a software function module. It should be noted that the division of modules in the embodiments of the present invention is schematic and merely a logical function division; other divisions are possible in actual implementations.
  • FIG. 6 is a schematic diagram of a possible structure of the apparatus for generating multiple document digests involved in the foregoing embodiments.
  • as shown in FIG. 6, the apparatus includes an acquiring unit 601, an estimating unit 602, a selecting unit 603 and a generating unit 604. The acquiring unit 601 supports the apparatus in performing steps S101 and S105 in the foregoing embodiments; the estimating unit 602 supports steps S102, S106, S109 (specifically, for example, S109a and S109b) and S111; the selecting unit 603 supports steps S103 and S110 (S110a, S110b); and the generating unit 604 supports step S104 (specifically, S112, S113 and S114). The apparatus may further include a filtering unit 605 for supporting step S107 (specifically, for example, S107a and S107b) and an extracting unit 606 for supporting step S108.
  • the generating unit 604 corresponds to the summary generation module 103 in the apparatus shown in FIG. 1, and the acquiring unit 601, the estimating unit 602 and the selecting unit 603 correspond to the importance estimation module 102 shown in FIG. 1.
  • FIG. 7 shows a possible logical structure of the apparatus for generating multiple document digests involved in the foregoing embodiments.
  • the apparatus for generating multiple document summaries includes: a processing module 512 and a communication module 513.
  • the processing module 512 is configured to control and manage the actions of the apparatus; for example, it performs steps S101, S105, S102, S106, S109 (specifically, for example, S109a and S109b), S111, S103, S110 (S110a, S110b), S104 (specifically, S112, S113 and S114), S107 (specifically, for example, S107a and S107b) and S108 in the foregoing embodiments.
  • the communication module 513 is configured to support communication between the apparatus and other devices.
  • the apparatus may further include a storage module 511 for storing the program code and data of the apparatus.
  • the processing module 512 may be a processor or a controller, for example a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
  • the processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
  • the communication module 513 can be a communication interface or the like.
  • the storage module 511 can be a memory.
  • when the processing module 512 is a processor, the communication module 513 is a communication interface and the storage module 511 is a memory, the apparatus for generating multiple document digests according to the embodiment of the present invention may be the terminal shown in FIG. 8.
  • FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
  • the terminal includes: a processor 301, a communication interface 302, a memory 304, and a bus 303.
  • the communication interface 302, the processor 301, and the memory 304 are connected to each other through a bus 303.
  • the bus 303 may be a PCI bus or an EISA bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 8, but it does not mean that there is only one bus or one type of bus.
  • the memory 304 is used to store program codes and data of the terminal.
  • the communication interface 302 is used to support the terminal to communicate with other devices.
  • the processor 301 is configured to support the terminal in executing the program code stored in the memory 304, using the stored data, to implement the multi-document summary generation method provided by the embodiment of the present invention.
  • an embodiment of the present invention provides a computer-readable storage medium storing instructions; when the instructions are run on a terminal, the apparatus for generating multiple document digests performs steps S101, S105, S102, S106, S109 (specifically, for example, S109a and S109b), S111, S103, S110 (S110a, S110b), S104 (specifically, S112, S113 and S114), S107 (specifically, for example, S107a and S107b) and S108 in the foregoing embodiments.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation there may be other divisions: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application may, in essence or in the part contributing to the prior art, be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes any medium that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus and terminal for multi-document summary generation, relating to the field of data processing, for solving the prior-art problem that generated document summaries contain considerable redundant information. The method includes: acquiring a candidate sentence set (S101); training each candidate sentence in the set with the cascaded attention mechanism and the unsupervised learning model of a preset network model to obtain the importance of each candidate sentence (S102); selecting, according to the importance of each candidate sentence, phrases meeting a preset condition from the candidate sentence set as a summary phrase set (S103); and obtaining the summary of the multiple candidate documents from the summary phrase set (S104).

Description

Method, apparatus and terminal for multi-document summary generation
This application claims priority to Chinese Patent Application No. 201710369694.X, filed with the Chinese Patent Office on May 23, 2017 and entitled "Method, apparatus and terminal for multi-document summary generation", which is incorporated herein by reference in its entirety.
Technical field
The embodiments of the present invention relate to the field of data processing, and in particular to a method, apparatus and terminal for multi-document summary generation.
Background
Automatic multi-document summarization (MDS) takes multiple candidate documents on the same topic (for example, a news event) as input and, by analysing and processing them, automatically generates a summary text of a specified length that best captures the central idea of the news event, so that its important information is extracted quickly and concisely.
In the prior art, one summary generation method is as follows: train a corpus with a deep neural network model to obtain word-vector representations of feature words; obtain a candidate sentence set from the corpus according to preset query words; derive the semantic similarity between candidate sentences from the feature-word vectors to obtain the similarity between pairs of candidate sentences and build a sentence graph model; compute candidate sentence weights on that graph; and finally generate the document summary with a maximal marginal relevance algorithm.
However, this method computes the similarity of candidate sentences from the word-vector representations of feature words, so inaccurate feature-word extraction directly degrades the accuracy of the sentence similarities and leaves considerable redundant information in the subsequently generated document summary.
Summary of the invention
This application provides a method, apparatus and terminal for multi-document summary generation, to solve the prior-art problem that generated document summaries contain considerable redundant information.
To this end, this application adopts the following technical solutions:
According to a first aspect, this application provides a multi-document summary generation method, including: acquiring a candidate sentence set that includes the candidate sentences of each of multiple candidate documents about one event; training each candidate sentence in the set with the cascaded attention mechanism and the unsupervised learning model of a preset network model to obtain the importance of each candidate sentence, where the importance of one candidate sentence corresponds to the norm of one row vector of a cascaded attention matrix, the cascaded attention matrix being output by the preset network model while the unsupervised learning model optimises a reconstruction error function, and the importance of a candidate sentence indicates how important its meaning is within the multiple candidate documents; selecting, according to the importance of each candidate sentence, phrases meeting a preset condition from the candidate sentence set as a summary phrase set; and obtaining the summary of the multiple candidate documents from the summary phrase set.
The embodiments of the present invention provide a multi-document summary generation method in which each candidate sentence in the set is trained with the cascaded attention mechanism and unsupervised learning model of a preset network model to obtain its importance. Because the target sequence of the cascaded attention mechanism, when generating its next state, looks for the supporting segments in the source sequence, decoding accuracy improves: highly important candidate sentences receive focused treatment, and only then does the reconstruction error function reach its extremum during unsupervised learning. The cascaded attention mechanism can therefore fuse the attention information of each candidate sentence across the different semantic dimensions of the preset network model and improve the accuracy of sentence importance estimation. When phrases meeting the preset condition are then selected from the candidate sentence set as the summary phrase set according to sentence importance, redundancy in the summary phrase set is reduced, avoiding the problem of generated document summaries containing considerable redundant information.
With reference to the first aspect, in a first possible implementation, training each candidate sentence with the cascaded attention mechanism and unsupervised learning model to obtain its importance includes: obtaining, from the preset network model, m vectors describing the event; optimising the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors and a candidate matrix; and, when the reconstruction error function takes its minimum value, taking the norm of each row vector of the cascaded attention matrix output by the preset network model as the importance of one candidate sentence. The reconstruction error function involves the relation between each candidate sentence and the m vectors, the candidate matrix, and the weight of the candidate matrix; the candidate matrix has m rows and n columns, where m and n are positive integers and n is the number of words in the multiple candidate documents. The purpose of the reconstruction error function is to reconstruct every candidate sentence from the m output vectors: a small error means that the m vectors extracted from the candidate sentence set carry nearly all the important information of the event, and the key extraction step is the cascaded attention matrix deciding which candidate sentences to focus on, so the norm of each of its row vectors can serve as the importance of one candidate sentence.
With reference to the first aspect or its first possible implementation, in a second possible implementation, selecting phrases meeting the preset condition as the summary phrase set includes: filtering out of each candidate sentence the words that do not comply with preset rules, obtaining each filtered candidate sentence; extracting at least one phrase of a first part of speech and at least one phrase of a second part of speech from the syntax tree of each filtered candidate sentence to form a phrase set; computing, from the respective importance of each candidate sentence, the importance of the phrases extracted from it; and selecting from the phrase set, according to those phrase importances, the first and second part-of-speech phrases satisfying the preset condition as the summary phrase set. Filtering the candidate sentences by preset rules and then selecting phrases by sentence importance further prevents redundant information from entering the selected summary phrase set.
With reference to any one of the first aspect to its second possible implementation, in a third possible implementation, the filtering includes: filtering the noise out of each candidate sentence to obtain each sentence's candidate word set, each candidate sentence containing multiple words, each word having an importance; and filtering out of each candidate word set, according to the importance of each word, the words whose importance is below a preset threshold, obtaining each filtered candidate sentence. Filtering low-importance words further keeps redundant words out of every candidate sentence.
With reference to any one of the first aspect to its third possible implementation, in a fourth possible implementation, before the words whose importance is below the preset threshold are filtered out, the method further includes: training each candidate sentence with the cascaded attention mechanism and unsupervised learning model of the preset network model to obtain the importance of each of the different words in the multiple candidate documents.
With reference to any one of the first aspect to its fourth possible implementation, in a fifth possible implementation, obtaining the importance of each word includes: optimising the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors describing the event and the candidate matrix; and, when the reconstruction error function takes its minimum value, taking the norm of each column vector of the candidate matrix as the importance of one word.
With reference to any one of the first aspect to its fifth possible implementation, in a sixth possible implementation, computing the phrase importances includes: obtaining the term frequency of each of the extracted phrases; and computing the importance of each phrase from its term frequency and the importance of the candidate sentence it belongs to.
With reference to any one of the first aspect to its sixth possible implementation, in a seventh possible implementation, selecting the phrases satisfying the preset condition includes: inputting the importance of each phrase and the similarities between phrases into an integer linear programming function; when the function takes its extremum, determining the candidate weight of each phrase and the link weights of the inter-phrase similarities; and determining the phrases satisfying the preset condition from those weights. A phrase's candidate weight determines whether it satisfies the preset condition; the link weights determine whether similar phrases are selected together.
According to a second aspect, an embodiment of the present invention provides a multi-document summary generation apparatus, including: an acquiring unit configured to acquire a candidate sentence set that includes the candidate sentences of each of multiple candidate documents about one event; an estimating unit configured to train each candidate sentence in the set with the cascaded attention mechanism and unsupervised learning model of a preset network model to obtain the importance of each candidate sentence, where the importance of one candidate sentence corresponds to the norm of one row vector of the cascaded attention matrix output by the preset network model while the unsupervised learning model optimises the reconstruction error function, and sentence importance indicates how important the meaning expressed by the sentence is in the multiple candidate documents; a selecting unit configured to select, according to the importance of each candidate sentence, phrases meeting a preset condition from the candidate sentence set as a summary phrase set; and a generating unit configured to obtain the summary of the multiple candidate documents from the summary phrase set.
With reference to the second aspect, in a first possible implementation, the acquiring unit is further configured to optimise the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors describing the event and the candidate matrix, and, when the reconstruction error function takes its minimum value, to take the norm of each row vector of the cascaded attention matrix output by the preset network model as the importance of one candidate sentence. The reconstruction error function involves the relation between each candidate sentence and the m vectors describing the event, the candidate matrix and its corresponding weight; the candidate matrix has m rows and n columns, where m and n are positive integers and n is the number of words in the multiple candidate documents.
With reference to the second aspect or its first possible implementation, in a second possible implementation, the apparatus further includes: a filtering unit configured to filter out of each candidate sentence the words that do not comply with preset rules, obtaining each filtered candidate sentence; and an extracting unit configured to extract at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set. The estimating unit is further configured to compute, from the respective importance of each candidate sentence, the importance of the phrases extracted from it; the selecting unit is specifically configured to select from the phrase set, according to those phrase importances, the first and second part-of-speech phrases satisfying the preset condition as the summary phrase set.
With reference to any one of the second aspect to its second possible implementation, in a third possible implementation, the filtering unit is specifically configured to filter the noise out of each candidate sentence to obtain each sentence's candidate word set, each candidate sentence containing multiple words each with an importance, and to filter out of each candidate word set, according to word importance, the words whose importance is below the preset threshold, obtaining each filtered candidate sentence.
With reference to any one of the second aspect to its third possible implementation, in a fourth possible implementation, the estimating unit is further configured to train each candidate sentence with the cascaded attention mechanism and unsupervised learning model of the preset network model to obtain the importance of each of the different words in the multiple candidate documents.
With reference to any one of the second aspect to its fourth possible implementation, in a fifth possible implementation, the estimating unit is further specifically configured to optimise the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors describing the event and the candidate matrix, and, when the reconstruction error function takes its minimum value, to take the norm of each column vector of the candidate matrix as the importance of one word.
With reference to any one of the second aspect to its fifth possible implementation, in a sixth possible implementation, the acquiring unit is further configured to obtain the term frequency of each of the extracted phrases, and the estimating unit is further configured to compute each phrase's importance from its term frequency and the importance of the candidate sentence it belongs to.
With reference to any one of the second aspect to its sixth possible implementation, in a seventh possible implementation, the acquiring unit is specifically configured to input the importance of each phrase and the inter-phrase similarities into the integer linear programming function and, when the function takes its extremum, to determine the candidate weight of each phrase and the link weights of the inter-phrase similarities; the selecting unit is specifically configured to determine, from those weights, the phrases satisfying the preset condition, where a phrase's candidate weight determines whether it satisfies the preset condition and the link weights determine whether similar phrases are selected together.
According to a third aspect, an embodiment of the present invention provides a terminal including a processor, a memory, a system bus and a communication interface, where the memory stores computer-executable instructions and the processor is connected to the memory through the system bus; when the terminal runs, the processor executes the computer-executable instructions stored in the memory so that the terminal performs the multi-document summary generation method described in the first aspect to its seventh possible implementation.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium including instructions which, when run on a terminal, cause the terminal to perform the multi-document summary generation method described in the first aspect to its seventh possible implementation.
Brief description of the drawings
FIG. 1 is a first schematic structural diagram of a multi-document summary generation apparatus according to an embodiment of the present invention;
FIG. 2 is a first schematic flowchart of a multi-document summary generation method according to an embodiment of the present invention;
FIG. 3 is a second schematic flowchart of the multi-document summary generation method according to an embodiment of the present invention;
FIG. 4 is a third schematic flowchart of the multi-document summary generation method according to an embodiment of the present invention;
FIG. 5 is a fourth schematic flowchart of the multi-document summary generation method according to an embodiment of the present invention;
FIG. 6 is a second schematic structural diagram of the multi-document summary generation apparatus according to an embodiment of the present invention;
FIG. 7 is a third schematic structural diagram of the multi-document summary generation apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed description
As shown in FIG. 1, an embodiment of the present invention provides a multi-document summary generation apparatus including: a data processing module 101, an importance estimation module 102 connected to the data processing module 101, and a summary generation module 103 connected to the importance estimation module 102.
The data processing module 101 converts each of the multiple candidate documents about one event, whose summary is to be generated, into candidate sentences to obtain a candidate sentence set D; it then generates a dictionary of size V from all words of the multiple candidate documents; finally, it represents each candidate sentence as a V-dimensional vector $x_j$ ($j = 1, \dots, N$, N being the number of candidate sentences in set D) and inputs each such vector into the importance estimation module 102, for example candidate sentences x1, x2, x3 and x4 shown in FIG. 1. It will be appreciated that, in practice, the candidate sentences input to the importance estimation module 102 are not limited to x1-x4 and may well exceed them; the embodiment uses x1-x4 only as an example.
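The conversion performed by the data processing module 101 is a plain bag-of-words encoding. A minimal sketch of that preprocessing in Python follows; the sentence splitter and tokenizer are placeholders (assumptions), since the text does not prescribe specific tools:

```python
import re
from collections import Counter

def sentences_to_vectors(documents):
    """Split each candidate document into sentences and map every sentence
    to a V-dimensional bag-of-words vector, where V is the number of
    distinct words across all candidate documents."""
    sentences = []
    for doc in documents:
        # naive sentence split; a real system would use a proper tokenizer
        sentences.extend(s.strip() for s in re.split(r"[.!?]", doc) if s.strip())

    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: k for k, w in enumerate(vocab)}

    vectors = []
    for s in sentences:
        v = [0] * len(vocab)
        for w, c in Counter(s.lower().split()).items():
            v[index[w]] = c        # identical words are counted once in the vocab
        vectors.append(v)
    return vectors, vocab
```

In this representation, the dictionary size V equals the number of distinct words over all candidate documents, matching the V-dimensional vectors $x_j$ fed to the importance estimation module.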
The importance estimation module 102 is modelled with a cascaded attention mechanism and an unsupervised learning model. It mainly performs iterative training on the N candidate sentences input by the data processing module 101, converging within at most 300 iterations, and finally outputs the importance of each candidate sentence and of each word; sentence importance is used to finally determine the summary phrase set, and word importance is used to filter redundant information out of each candidate sentence.
The importance estimation module 102 is based on a data reconstruction framework and introduces the cascaded attention mechanism while estimating candidate sentence importance. Because the hidden-layer vectors and output-layer vectors of the preset network model belong to different vector spaces and carry different semantics, different cascaded attention computations are introduced for the different semantic representations, which further improves the sentence importance estimation method. The candidate sentences or phrases of highest importance are eventually extracted to form the summary phrase set for the subsequent multi-document summary, so the redundancy in the final multi-document summary is reduced and the generated summary can cover the main content expressed by the event more accurately.
By modelling the cascaded attention mechanism, the importance estimation module 102 provided by the embodiments improves decoding, and the information in the cascaded attention matrix can be used to estimate the importance of each candidate sentence. This application proposes the cascaded attention mechanism to fuse attention information across different semantic dimensions and further improve the accuracy of sentence importance estimation.
As shown in FIG. 1, the importance estimation module 102 has two stages: a Reader (encoding) stage and a Recaller (decoding) stage.
1. Reading stage
Each candidate sentence in the set D initially has a bag-of-words vector representation, which suffers from problems such as sparsity, imprecise semantic description and the curse of dimensionality. The reading process therefore first maps each sentence into a neural-network hidden layer to produce a dense embedded vector representation, then uses the encoding model built in the encoding layer (Enc layer) based on a recurrent neural network (RNN) to map all candidate sentences to new states, taking the state at the last time step as the global event variable $c_g$; at this point $c_g$ reflects the information of all the multiple candidate documents about the event, and the process then enters the decoding stage.
The specific encoding-stage process is as follows.
First, the importance estimation module 102 maps each candidate sentence into the hidden layer of the encoding stage, representing each candidate sentence as a dense embedded vector:
$x_j^{i} = \tanh(W x_j + b)$
where $i$ denotes the input stage, $j$ is the candidate sentence index ($j = 1, \dots, N$, N being the number of candidate sentences in set D), $W$ and $b$ are the neural-network parameters of the hidden layer, and $H$ denotes the hidden layer.
Second, the RNN model built in the encoding layer further encodes all densely embedded candidate sentences into one vector, which becomes the global semantic vector $c_g$ reflecting the multiple candidate documents. As can be seen from FIG. 1, the mapping logic of the RNN model is
$h_t^{e} = f\left(x_t^{i},\, h_{t-1}^{e}\right)$
where $e$ denotes the encoding-stage RNN model and $f(.)$ is a long short-term memory (LSTM) model, a gated recurrent unit (GRU) model or an RNN model; $h_t^{e}$ is the state vector of each candidate sentence at time step $t$ of the encoding stage, $x_t^{i}$ is the embedded vector of each candidate sentence at time step $t$ of the input stage, and $h_{t-1}^{e}$ is the state vector at time step $t-1$ of the encoding stage.
For example, as shown by the encoding layer in FIG. 1, candidate sentences x1, x2, x3 and x4 are each mapped by the formula $x_j^{i} = \tanh(W x_j + b)$ to embedded representations $x_1^{i}, x_2^{i}, x_3^{i}, x_4^{i}$. The state vector of each candidate sentence at time step $t$ of the encoding stage is then computed from these embedded representations. Because the target sequence of the cascaded attention mechanism, when generating the next state, looks for the supporting segments in the source sequence, the state vector of candidate sentence x1 at time $t$, $h_1^{e}$, is obtained by feeding $x_1^{i}$ (and the initial state) into the $f(.)$ model; the state vector of candidate sentence x2 at time $t$, $h_2^{e}$, is obtained by feeding $x_2^{i}$ and $h_1^{e}$ into the $f(.)$ model; the state vectors of the other candidate sentences follow the same pattern as that of x2 and are not repeated here.
The embodiments take the RNN model as an example: the RNN model maps all sentences to new states $h_1^{e}, \dots, h_4^{e}$ and takes the state at the last time step as the global event variable $c_g$. As shown in FIG. 1, $c_g$ is obtained by feeding $x_4^{i}$ and the state vector $h_3^{e}$ of candidate sentence x3 at time $t$ into the $f(.)$ model, so $c_g$ reflects the information of all the multiple candidate documents about the event; the process then enters the decoding stage.
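The reading stage can be sketched as follows; this is a simplified stand-in in Python/NumPy, with a plain tanh recurrence taking the place of the LSTM/GRU/RNN choice of $f(.)$, and all weight names being illustrative assumptions:

```python
import numpy as np

def encode(sentence_vecs, W_emb, b_emb, W_h, U_h, b_h):
    """Reader stage: embed each bag-of-words sentence vector, then run a
    plain tanh recurrence (a stand-in for the LSTM/GRU/RNN f(.)) and
    return the last state as the global event vector c_g."""
    h = np.zeros(b_h.shape)
    for x in sentence_vecs:
        x_i = np.tanh(W_emb @ x + b_emb)        # dense embedding x^i_t
        h = np.tanh(W_h @ x_i + U_h @ h + b_h)  # h^e_t = f(x^i_t, h^e_{t-1})
    return h                                    # c_g

# tiny usage example with random weights (V = 100 words, d = 16 dims)
V, d = 100, 16
rng = np.random.default_rng(0)
c_g = encode([rng.random(V) for _ in range(4)],
             rng.random((d, V)), rng.random(d),
             rng.random((d, d)), rng.random((d, d)), rng.random(d))
```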
2. Decoding stage
This stage mainly decodes the global variable $c_g$ generated by the encoding stage, in the decoding layer (dec layer), into m vectors describing information about m different aspects of the event; a decoding model based on the RNN model is built in the decoding layer.
Here m is much smaller than the number N of candidate sentences in the multiple candidate documents: the N input candidate sentences are to be reconstructed as far as possible from the m condensed vectors, so these m output vectors must contain the most important information. The aim is to decode only the most important information so that the original input can be reconstructed. The decoding model built in the decoding layer is likewise based on the RNN model:
$h_t^{d} = f\left(y_{t-1},\, h_{t-1}^{d}\right)$
where $d$ denotes the decoding layer and $o$ the output layer; a further mapping is then added:
$h_t^{o} = \tanh\left(W^{o} h_t^{d} + b^{o}\right)$
Finally, the output layer re-maps the hidden vector to a dictionary-dimension vector able to describe one aspect of the event:
$y_t = \sigma\left(W^{v} h_t^{o} + b^{v}\right)$
for example y1 and y2 shown in FIG. 1.
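A matching sketch of the Recaller stage, under the same assumptions (plain tanh cells, illustrative weight names, and a sigmoid as the dictionary-dimension output nonlinearity):

```python
import numpy as np

def decode(c_g, m, W_d, U_d, b_d, W_o, b_o, W_v, b_v):
    """Recaller stage: unroll the decoder RNN for m steps from the global
    vector c_g; each step yields one dictionary-sized aspect vector y_k."""
    h_d = c_g
    y = np.zeros(W_v.shape[0])                # previous output, starts empty
    aspects = []
    for _ in range(m):
        h_d = np.tanh(W_d @ y + U_d @ h_d + b_d)      # h^d_t = f(y_{t-1}, h^d_{t-1})
        h_o = np.tanh(W_o @ h_d + b_o)                # hidden -> output-layer space
        y = 1.0 / (1.0 + np.exp(-(W_v @ h_o + b_v)))  # dictionary-dim aspect vector
        aspects.append(y)
    return aspects
```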
In summary, the preset network model is built with the cascaded attention mechanism and the unsupervised learning model: the encoder-layer RNN first encodes the source vectors (i.e. the N candidate sentences) into an intermediate vector $c_g$ of fixed dimension, and the decoder-layer RNN then decodes it into the target vectors, for example $y_1, \dots, y_m$.
Building the preset network model with the cascaded attention mechanism improves decoding, and the norm of each row vector of the cascaded attention matrix can be used to estimate sentence importance. This application proposes the cascaded attention mechanism to fuse attention information across different semantic dimensions and further improve the accuracy of sentence importance estimation. First, this application introduces the cascaded attention mechanism in the hidden layer of the decoding stage; the attention is computed as
$a_{t,s} = \dfrac{\exp\left(\mathrm{score}\left(h_t^{d}, x_s^{i}\right)\right)}{\sum_{s'} \exp\left(\mathrm{score}\left(h_t^{d}, x_{s'}^{i}\right)\right)}$
where the score(.) function computes the attention relation between the target vector $h_t^{d}$ and the source vector $x_s^{i}$, and $x_s^{i}$ denotes the s-th candidate sentence of the input stage.
The hidden-layer vector of the decoding stage is then updated according to the cascaded attention matrix:
$c_t = \sum_s a_{t,s}\, x_s^{i}$
$\tilde{h}_t^{d} = \tanh\left(W_c \left[c_t; h_t^{d}\right]\right)$
The cascaded attention mechanism is introduced not only in the hidden layer of the decoding stage: this application also introduces it in the output layer of the decoding stage and fuses it with the attention information of the hidden layer. Specifically, the output-layer attention weights and context are computed analogously,
$a_{t,s}^{o} = \dfrac{\exp\left(\mathrm{score}\left(h_t^{o}, x_s^{i}\right)\right)}{\sum_{s'} \exp\left(\mathrm{score}\left(h_t^{o}, x_{s'}^{i}\right)\right)}$
$c_t^{o} = \sum_s a_{t,s}^{o}\, x_s^{i}$
and the output-layer representation is then updated by fusing the hidden-layer attention context with the output-layer context, where $\lambda_a$ is the weight of the attention information and is learned automatically by the model.
For the score(.) function, this application may adopt the following three computations:
$\mathrm{score}\left(h_t, x_s\right) = \begin{cases} h_t^{\top} x_s & \text{(dot)} \\ h_t^{\top} W_a\, x_s & \text{(general)} \\ v_a^{\top} \tanh\left(W_a \left[h_t; x_s\right]\right) & \text{(concat)} \end{cases}$
Comparative experiments show that using the concat method in the hidden layer of the decoding stage and the dot method in the output layer of the decoding stage further improves the accuracy of candidate sentence importance estimation.
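The three candidate score(.) functions and the softmax normalisation of the attention weights can be written directly; the snippet below follows the formulas above, and the keyword interface is an illustrative assumption:

```python
import numpy as np

def score(h_t, x_s, method, W_a=None, v_a=None):
    """The three candidate score(.) functions; per the comparison above,
    'concat' suits the decoder hidden layer and 'dot' the output layer."""
    if method == "dot":
        return h_t @ x_s
    if method == "general":
        return h_t @ (W_a @ x_s)
    if method == "concat":
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, x_s]))
    raise ValueError(method)

def attention(h_t, sources, method, **kw):
    """Softmax-normalised attention weights a_{t,s} over the source sentences."""
    e = np.array([score(h_t, x_s, method, **kw) for x_s in sources])
    e = np.exp(e - e.max())        # numerically stable softmax
    return e / e.sum()
```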
The cascaded attention mechanism is only one component of the preset network model and part of its parameters. To solve for every parameter of the preset network model once the attention mechanism is introduced, this application learns the parameters with an unsupervised learning model, as follows.
The training objective of the model is to reconstruct the initial N sentence vectors X from the m topic-aspect vectors Y; this is an unsupervised data reconstruction process whose training objective is to minimise the reconstruction error:
$J = \min \sum_{j=1}^{N} \left\| x_j - \hat{x}_j \right\|_2^2$
where $\hat{x}_j$ denotes the reconstruction of $x_j$ from the aspect vectors Y.
After training, the norm of the row of the output-layer cascaded attention matrix corresponding to each sentence is used as the sentence importance score, and the norm of each column vector of the candidate matrix Y output by the output layer is used as the word importance score; the candidate matrix Y is the matrix built with the m vectors as row vectors and the n words as column vectors.
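A compact sketch of this training signal and of how the two importance scores are read off after training; the l1 regulariser used to keep Y sparse is an assumption, since the text only states that the candidate matrix carries a weight and that the outputs should stay sparse:

```python
import numpy as np

def reconstruction_error(X, Y, A, lam=0.1):
    """J = sum_j ||x_j - A_j Y||^2 + lam * ||Y||_1: reconstruct the N
    sentence vectors (X: N x V) from the m aspect vectors (Y: m x V)
    through the attention matrix (A: N x m); the l1 term keeps Y sparse."""
    return np.sum((X - A @ Y) ** 2) + lam * np.abs(Y).sum()

def importances(A, Y):
    """Row norms of the cascaded attention matrix give sentence importance;
    column norms of the candidate matrix Y give word importance."""
    sentence_scores = np.linalg.norm(A, axis=1)
    word_scores = np.linalg.norm(Y, axis=0)
    return sentence_scores, word_scores
```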
The summary generation module 103 is mainly used to remove redundant information from the multiple candidate documents to obtain the summary phrase set, to combine the summary phrase set into summary sentences in a preset combination manner, and to obtain and output the summary of the multiple candidate documents.
When removing redundant information from the multiple candidate documents, the summary generation module 103 mainly runs two processes. The first is coarse-grained sentence filtering: obvious noise in each candidate sentence is filtered out according to empirical rules. The second is fine-grained sentence filtering: each candidate sentence compressed by the coarse-grained pass is parsed into its syntax tree with a syntactic parser; noun phrases and verb phrases are extracted from each sentence's syntax tree; the importance of those noun and verb phrases is computed from the importance of their candidate sentence; and finally, while guaranteeing grammatical correctness, phrases are selected with an integer linear programming (ILP) model, deleting from each syntax tree the phrases whose importance does not meet the preset requirement and keeping those whose importance does. Because the ILP model does not select phrases of insufficient importance into the summary, it further filters the redundancy of each candidate sentence at a fine-grained level.
It will be appreciated that, in practice, the multi-document summary generation apparatus shown in FIG. 1 may include more components than shown in FIG. 1; the embodiments of the present invention do not limit this.
As shown in FIG. 2, an embodiment of the present invention provides a multi-document summary generation method, performed by the multi-document summary generation apparatus shown in FIG. 1, including:
S101. The multi-document summary generation apparatus acquires a candidate sentence set that includes the candidate sentences of each of multiple candidate documents about one event.
The multiple candidate documents in the embodiments concern the same event, which the embodiments do not limit; in practice, any set of multiple candidate documents about one event can serve as the base files from which this application extracts the summary. The multiple candidate documents may be news reports about one event or other articles about one event, which is likewise not limited.
For example, the embodiments take the case where the multiple candidate documents are all news reports about one event, for instance reports on "such-and-such earthquake".
Specifically, the number of candidate documents can be set as needed in practice and is not limited by the embodiments; for example, 10 to 20 candidate documents.
It will be appreciated that each candidate sentence in the candidate sentence set of the embodiments is represented in vector form.
For example, each candidate sentence can be represented as an n-dimensional vector, where n is the number of words in the multiple candidate documents.
It should be noted that, because the multiple candidate documents concern the same event, the same word may occur in each of the multiple candidate documents, or several times within one candidate document; when counting the words of the multiple candidate documents, identical words are counted once. For example, if the word "such-and-such earthquake" occurs 10 times across the multiple candidate documents and the remaining words (all distinct) number 50, the multiple candidate documents contain 51 words.
S102. The multi-document summary generation apparatus trains each candidate sentence in the set with the cascaded attention mechanism and unsupervised learning model of the preset network model to obtain the importance of each candidate sentence; the importance of one candidate sentence corresponds to the norm of one row vector of the cascaded attention matrix, the cascaded attention matrix being output by the preset network model while the unsupervised learning model optimises the reconstruction error function; sentence importance indicates how important the meaning expressed by the sentence is within the multiple candidate documents.
Specifically, in practice, all candidate sentences of the set (in vector form) can be input into the importance estimation module shown in FIG. 1 for iterative training, converging within at most 300 iterations; in the output of the importance estimation module, the norm of each row vector of the cascaded attention matrix is taken as the importance of one candidate sentence.
S103. The multi-document summary generation apparatus selects, according to the importance of each candidate sentence, phrases meeting the preset condition from the candidate sentence set as the summary phrase set.
S104. The multi-document summary generation apparatus obtains the summary of the multiple candidate documents from the summary phrase set.
The embodiments of the present invention thus provide a multi-document summary generation method in which each candidate sentence is trained with the cascaded attention mechanism and unsupervised learning model of the preset network model to obtain its importance. Because the target sequence of the cascaded attention mechanism, when generating its next state, looks for the supporting segments in the source sequence, decoding accuracy improves: highly important candidate sentences receive focused treatment, and only then does the reconstruction error function reach its extremum during unsupervised learning. The cascaded attention mechanism can therefore fuse the attention information of each candidate sentence across the different semantic dimensions of the preset network model and improve the accuracy of sentence importance estimation, so that when phrases meeting the preset condition are selected from the candidate sentence set as the summary phrase set according to sentence importance, redundancy in the summary phrase set is reduced, avoiding the problem of generated document summaries containing considerable redundant information.
Optionally, as shown in FIG. 3, step S102 of the embodiments may be implemented by steps S105 and S106 shown in FIG. 3:
S105. The multi-document summary generation apparatus obtains, from the preset network model, m vectors describing the event.
S106. The multi-document summary generation apparatus optimises the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors describing the event and the candidate matrix; when the reconstruction error function takes its minimum value, the norm of each row vector of the cascaded attention matrix output by the preset network model is taken as the importance of one candidate sentence, giving the importance of each candidate sentence. The reconstruction error function involves the relation between each candidate sentence and the m vectors describing the event, the candidate matrix and the weight corresponding to the candidate matrix; the candidate matrix has m rows and n columns, where m and n are positive integers and n is the number of words in the multiple candidate documents.
Optionally, the reconstruction error function is
$J = \min \sum_{j=1}^{N} \left\| x_j - \hat{x}_j \right\|_2^2$
reconstructing the initial N sentence vectors $x_j$ from the m vectors; the objective J is trained during the unsupervised learning process, and when the reconstruction error function takes its minimum value the norm of each row vector of the cascaded attention matrix output by the preset network model is taken as the importance of one candidate sentence.
To further improve the precision of the selected summary phrase set, in step S103 the embodiments first pre-filter the candidate sentences of the set according to preset rules, and then, on the basis of the pre-filtered candidate sentences, select the phrases meeting the preset condition as the summary phrase set according to the importance of each candidate sentence. With reference to FIG. 2, step S103 of the embodiments may be implemented by steps S107-S110 shown in FIG. 3:
S107. The multi-document summary generation apparatus filters out of each candidate sentence the words that do not comply with the preset rules, obtaining each filtered candidate sentence.
It will be appreciated that, when performing step S108, the apparatus is also used to parse each filtered candidate sentence into its corresponding syntax tree with a syntactic parser. In step S107, the parser can, through semantic analysis of each candidate sentence of the multiple candidate documents, build each sentence's syntax tree so as to decompose the sentence into multiple phrases, each decomposed phrase being called a branch of the syntax tree.
The syntactic parser of the embodiments may be an internal device of the multi-document summary generation apparatus, i.e. the apparatus itself includes a parser; of course, the parser may also be an external device, for example the apparatus may obtain each sentence's syntax tree through a parser requested over the network; the embodiments do not limit this.
After the parser parses each filtered candidate sentence into a syntax tree, the apparatus can obtain each sentence's phrase set from all phrases contained in its syntax tree; that phrase set may include noun part-of-speech phrases, verb part-of-speech phrases, numeral part-of-speech phrases, adjective part-of-speech phrases and so on, the particular parts of speech depending on the phrases each candidate sentence contains; the embodiments do not limit this. After obtaining each sentence's phrase set, the apparatus can take at least one first part-of-speech phrase and at least one second part-of-speech phrase from it.
It should be noted that other parsing tools may also be used in practice to parse each candidate sentence into a syntax tree and obtain each sentence's phrase set.
Optionally, the preset rules of the embodiments can be set empirically or according to actual needs; the embodiments do not limit this.
It will be appreciated that filtering out of each candidate sentence the words that do not comply with the preset rules means filtering out the obvious noise of each candidate sentence, for example "newspaper X reported that ...", "TV station Y reported that ...", "... he said", and the like.
Optionally, with reference to FIG. 2 and FIG. 3, as shown in FIG. 4, step S107 of the embodiments may be implemented by steps S107a and S107b:
S107a. The multi-document summary generation apparatus filters the noise out of each candidate sentence, obtaining each sentence's candidate word set; each candidate sentence contains multiple words, each of which corresponds to an importance.
It will be appreciated that, in the embodiments, the apparatus filters the noise out of each candidate sentence according to empirical rules.
S107b. The multi-document summary generation apparatus filters out of each candidate word set, according to the importance of each word, the words whose importance is below the preset threshold, obtaining each filtered candidate sentence.
The embodiments do not limit the preset threshold, which can be set as needed in practice; however, to keep noise out of the summary phrase set that finally forms the summary, the preset threshold can be set relatively high.
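Both filtering passes can be sketched in a few lines; the noise patterns below are examples only, standing in for the empirical rules the text mentions, and the threshold value is an assumption to be tuned per corpus:

```python
import re

NOISE_PATTERNS = [r"\breported that\b.*?,", r"\bhe said\b"]  # example rules only

def strip_noise(sentence):
    """Coarse-grained filter: remove obvious noise by empirical rules."""
    for p in NOISE_PATTERNS:
        sentence = re.sub(p, "", sentence)
    return sentence.strip()

def filter_words(words, word_importance, threshold=0.2):
    """Fine-grained filter: drop words whose estimated importance (the
    column norms of the candidate matrix Y) falls below the threshold."""
    return [w for w in words if word_importance.get(w, 0.0) >= threshold]
```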
S108. The multi-document summary generation apparatus extracts at least one first part-of-speech phrase and at least one second part-of-speech phrase from each filtered candidate sentence's syntax tree to form the phrase set.
For example, FIG. 5 shows the syntax tree structure of one candidate sentence; as can be seen in FIG. 5, after being parsed into a syntax tree the sentence includes a noun phrase (NP) and verb phrases (VP). As shown in FIG. 5, the NP is "An armed man" and a VP is "walked into an Amish school".
It will be appreciated that a noun phrase can include an article, an adjective (JJ) and a noun (NN), for example the indefinite article "An" and the noun "man" shown in FIG. 5.
As shown in FIG. 5, VPs within one candidate sentence's syntax tree can also be linked by a connective (CC), for example the connective "and" in FIG. 5.
A verb phrase (VP) may be formed by a verb plus a preposition (PP) or by a verb plus a noun phrase, for example "walked into an Amish school" in FIG. 5; NNS in FIG. 5 denotes a plural noun. The particular verb-phrase types are not repeated here.
Specifically, as shown in FIG. 5, after the candidate sentence is parsed into a syntax tree, the verb phrases obtained also include "sent the boys outside", "tied up and shot the girls" and "killing three of them".
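Given a bracketed constituency parse such as the one in FIG. 5, NP and VP extraction reduces to collecting labelled subtrees; the sketch below uses NLTK's Tree as one possible representation (an assumption, as the text does not name a parser):

```python
from nltk import Tree

def extract_phrases(parse_str):
    """Collect NP and VP subtrees from a constituency parse string;
    assumes a parser has already produced the bracketed parse."""
    tree = Tree.fromstring(parse_str)
    nps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "NP")]
    vps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "VP")]
    return nps, vps

# usage example on a simplified parse of the FIG. 5 sentence
nps, vps = extract_phrases(
    "(S (NP (DT An) (JJ armed) (NN man)) (VP (VBD walked) (PP (IN into) "
    "(NP (DT an) (JJ Amish) (NN school)))))")
```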
S109. The multi-document summary generation apparatus computes, from the respective importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and at least one second part-of-speech phrase extracted from that sentence; those phrases belong to the phrase set.
Optionally, step S109 may be implemented by steps S109a and S109b:
S109a. The multi-document summary generation apparatus obtains the term frequency, in the multiple candidate documents, of each phrase among the at least one first part-of-speech phrase and at least one second part-of-speech phrase.
Here "term frequency" in the embodiments means the sum of the frequencies with which a word occurs in each of the multiple candidate documents.
S109b. The multi-document summary generation apparatus computes the importance of the extracted phrases from the term frequency of each phrase and the importance of the candidate sentence in which each phrase is located.
When computing phrase importance, a phrase inherits the importance of its candidate sentence, i.e. the sentence's attention value; the importance of each part-of-speech phrase can be determined by the formula
$S_i = a_i \cdot \dfrac{\sum_{t \in P_i} tf(t)}{\sum_{t \in \mathrm{Topic}} tf(t)}$
where $i$ is the phrase index, $S_i$ is the importance of phrase $i$, $a_i$ is the importance of the candidate sentence containing phrase $i$, $tf(t)$ is the term frequency, Topic denotes all the words of the multiple candidate documents about the same event, and $P_i$ denotes phrase $i$.
The importance of a candidate sentence measures how important the information or content the sentence represents is in expressing the semantics of its candidate document.
Phrase importance measures how important the concept or information a phrase represents is in expressing document semantics.
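The phrase score $S_i$ then follows directly from the formula above; a minimal sketch, assuming the term frequencies have already been summed over the candidate documents:

```python
def phrase_importance(phrase_tokens, sentence_attention, tf, topic_tf_total):
    """S_i = a_i * (sum of tf over the phrase) / (sum of tf over the topic):
    the phrase inherits the attention value a_i of its host sentence."""
    return sentence_attention * sum(tf.get(t, 0) for t in phrase_tokens) / topic_tf_total
```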
Optionally, the first part-of-speech phrases of the embodiments may be noun part-of-speech phrases (noun phrases for short) and the second part-of-speech phrases verb part-of-speech phrases (verb phrases for short).
Of course, this application may also include phrases of other parts of speech, for example adjective phrases, numeral phrases and so on, depending on the phrases contained in the multiple candidate documents, without limitation here. It will be appreciated that in natural language processing noun phrases in fact include pronouns, a pronoun being regarded as a kind of noun.
Noun phrase (NP) selection: the subject of each candidate sentence consists of a noun phrase; such noun phrases are selected as candidate subjects of the generated new sentence. For example, in FIG. 5, "An armed man" can be selected as the noun phrase.
Verb phrase (VP) selection: the verb-object structure of a sentence consists of verb phrases; such verb phrases are selected as candidate verb-object structures of the generated new sentence. For example, in FIG. 5 the selections are "walked into an Amish school, sent the boys outside and tied up and shot the girls, killing three of them", "walked into an Amish school", "sent the boys outside", and "tied up and shot the girls, killing three of them".
S110. The multi-document summary generation apparatus selects from the phrase set, according to the importance of the at least one first and at least one second part-of-speech phrases corresponding to each candidate sentence, the first and second part-of-speech phrases satisfying the preset condition as the summary phrase set.
Optionally, step S110 may be implemented as follows:
S110a. The apparatus inputs the importance of each of the at least one first and at least one second part-of-speech phrases, together with the similarities between phrases, into the integer linear programming function; when the function takes its extremum, it determines the candidate weight of each phrase and the link weights of the inter-phrase similarities; a phrase's candidate weight is used to determine whether that phrase satisfies the preset condition.
S110b. The apparatus determines the phrases satisfying the preset condition from the candidate weight of each phrase and the link weights of the inter-phrase similarities.
It will be appreciated that the preset condition constrains the features of the phrases in the phrase set and the similarities between phrases; phrases that do not meet the preset condition are removed until the first and second part-of-speech phrases that do meet it remain as the summary phrase set. A candidate weight of 1 means the phrase satisfies the preset condition when the integer linear programming function takes its extremum; a candidate weight of 0 means it does not.
The similarity between two phrases expresses their redundancy within the multiple candidate documents; through its constraints on features and inter-phrase similarities, the preset condition screens the phrases for importance and redundancy.
Optionally, step S110 may also be implemented as follows: the at least one first and at least one second part-of-speech phrases and their respective importance values are input into the integer linear programming function $\max\left\{\sum_i \alpha_i S_i - \sum_{i<j} \alpha_{ij} (S_i + S_j) R_{ij}\right\}$ to optimise it, so that while the objective is kept maximal, selecting similar phrases into the summary is avoided as far as possible. By solving this optimisation problem, the qualifying first and second part-of-speech phrases are retained to form the summary phrase set from which the final multi-document summary is generated.
Here $P_i$ denotes the phrase numbered $i$, $P_j$ the phrase numbered $j$, $S_i$ and $S_j$ their importance values, $R_{ij}$ the similarity between phrases $P_i$ and $P_j$, $\alpha_{ij}$ the weight of that similarity, and $\alpha_i$ the weight of phrase $i$. A phrase's candidate weight determines whether it satisfies the preset condition; the link weights determine whether similar phrases are selected together. Phrase similarity measures how semantically similar phrases are.
It will be appreciated that the above is only one instance of the integer linear programming function; other forms of integer linear programming function may be used in practice to obtain the candidate weights or link weights of the phrases.
The similarity between two phrases, i.e. the pairwise similarity between verb phrases and between noun phrases, can be computed by cosine similarity or by the Jaccard index.
Objective definition: maximise the sum of the importances of the selected phrases while minimising the sum of the redundancy importances among them; the first part is the sum of the selected noun- and verb-phrase weights, and if a selected pair of noun phrases or verb phrases is redundant, the objective penalises it.
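One way to pose this selection problem as an ILP is sketched below with the PuLP modelling library (an assumption; any ILP solver works). The auxiliary binary variables ab_ij linearise the product of two selections so that a similar pair is penalised only when both phrases are chosen:

```python
import pulp

def select_phrases(S, R, sim_threshold=0.5):
    """Maximise total phrase importance minus a redundancy penalty for
    selecting similar phrase pairs together. S: importance list;
    R: pairwise similarity matrix; sim_threshold is an assumption."""
    n = len(S)
    prob = pulp.LpProblem("phrase_selection", pulp.LpMaximize)
    a = [pulp.LpVariable(f"a_{i}", cat="Binary") for i in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if R[i][j] > sim_threshold]
    ab = {(i, j): pulp.LpVariable(f"ab_{i}_{j}", cat="Binary") for i, j in pairs}

    # objective: sum_i a_i S_i - sum_{i<j} ab_ij (S_i + S_j) R_ij
    prob += (pulp.lpSum(S[i] * a[i] for i in range(n))
             - pulp.lpSum((S[i] + S[j]) * R[i][j] * ab[i, j] for i, j in pairs))
    for i, j in pairs:
        # ab_ij = 1 exactly when both phrases are selected
        prob += ab[i, j] >= a[i] + a[j] - 1
        prob += ab[i, j] <= a[i]
        prob += ab[i, j] <= a[j]
    # a summary length budget constraint could be added here as well
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if a[i].value() == 1]
```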
Optionally, before step S107, the method provided by the embodiments further includes:
S111. The multi-document summary generation apparatus optimises the reconstruction error function during unsupervised learning according to each candidate sentence, the m vectors describing the event and the candidate matrix; when the reconstruction error function takes its minimum value, the norm of each column vector of the candidate matrix is taken as the importance of one word, giving each word's importance.
Optionally, step S111 may be implemented with a sparsity-regularised form of the objective, for example
$J = \min \sum_{j=1}^{N} \left\| x_j - \hat{x}_j \right\|_2^2 + \lambda \left\| Y \right\|_1$
where the role of the candidate matrix Y is to keep the output vectors y as sparse as possible.
Optionally, step S104 of the embodiments may be implemented as follows: the summary phrase set is combined in a preset combination manner to obtain the summary of the multiple candidate documents.
It should be noted that the preset combination manner of the embodiments may be an existing combination manner or another one; the embodiments do not limit this.
For example, step S104 may be implemented specifically by steps S112-S113:
S112. The multi-document summary generation apparatus orders the part-of-speech phrases of the summary phrase set according to the order of each phrase within its candidate sentence in the multiple candidate documents, obtaining summary sentences.
S113. The multi-document summary generation apparatus arranges the summary sentences according to the earliest position at which their verb part-of-speech phrases appear in the multiple candidate documents, obtaining the summary of the multiple candidate documents.
Optionally, before step S113 the method further includes:
S114. For a summary sentence containing multiple verb part-of-speech phrases, the multi-document summary generation apparatus adds conjunctions between the verb phrases of that summary sentence.
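Steps S112-S114 can be sketched as a small assembly routine; the data layout (one noun phrase paired with its selected verb phrases, plus a map from verb phrase to its earliest position in the corpus) is an illustrative assumption:

```python
def assemble_summary(selected, earliest_vp_pos):
    """S112-S114 sketch: keep phrases in their source-sentence order,
    join multiple VPs with a conjunction (S114), then order summary
    sentences by the earliest corpus position of their first VP (S113)."""
    sentences = []
    for np_phrase, vps in selected:            # one NP plus its chosen VPs
        body = ", and ".join(vps) if len(vps) > 1 else vps[0]
        sentences.append((earliest_vp_pos[vps[0]], f"{np_phrase} {body}."))
    return " ".join(s for _, s in sorted(sentences))
```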
Standard English validation datasets exist for multi-document summarization, for example the DUC 2007 dataset and the TAC 2011 dataset. The multi-document summary generation method provided by the embodiments of the present invention was applied to the DUC 2007 and TAC 2011 datasets to determine the quality of the extracted multi-document summaries, as described below.
The present technique was first validated on DUC 2007 and TAC 2011. DUC 2007 has 45 topics, each with 20 news articles and 4 human-annotated reference summaries, with summaries limited to 250 words; TAC 2011 has 44 topics, each with 10 news articles and 4 human-annotated reference summaries, with summaries limited to 100 words. The evaluation metric is the F-measure of ROUGE. To evaluate the precision of the multi-document summaries extracted by the embodiments, the evaluation results are shown in Tables 1 and 2: Table 1 shows the summarization results of the method on the DUC 2007 dataset, and Table 2 its summarization results on the TAC 2011 dataset:
Table 1. Summarization results of the method of this embodiment on the DUC 2007 dataset

System       R-1     R-2     R-3
Random       0.302   0.046   0.088
Lead         0.312   0.058   0.102
MDS-Sparse   0.353   0.055   0.112
DSDR         0.398   0.087   0.137
RA-MDS       0.406   0.095   0.148
Ours         0.423*  0.107*  0.161*
Table 2. Summarization results of the method of this embodiment on the TAC 2011 dataset

System       R-1     R-2     R-3
Random       0.303   0.045   0.090
Lead         0.315   0.071   0.103
PKUTM        0.396   0.113   0.148
ABS-Phrase   0.393   0.117   0.148
RA-MDS       0.400   0.117   0.151
Ours         0.400*  0.121*  0.153*
Tables 1 and 2 compare the summaries generated by the present technique on the DUC 2007 dataset and the TAC 2011 dataset respectively with the best other unsupervised multi-document summarization models; the results show that the multi-document summary generation method provided by this application achieves the best result on every metric, improving the quality of multi-document summarization. The dataset statistics and the ROUGE F-measure metric are as described above.
As stated above, the multi-document summary generation method provided by this application can estimate the importance of the words contained in the multiple candidate documents. To verify the quality of the estimated word importance, four topics were selected from the TAC 2011 dataset, namely "Finland Shooting", "Heart Disease", "HIV Infection Africa" and "Pet Food Recall". For each topic, the 10 words with the largest corresponding values in the dictionary dimensions of the output vector were selected, as shown in Table 3 below:
Table 3. Word importance estimated by the method of this application on four TAC 2011 topics (the table is rendered as an image in the original publication; it lists the ten highest-weighted dictionary words for each of the four topics)
As Table 3 shows, the top 10 words of each topic already accurately reflect the main content of that topic, so the method provided by the embodiments of the present invention predicts word importance well.
In this experimental design, this application selected several typical topics from the TAC 2011 dataset (for example, the topic "VTech Shooting" and the topic "Oil Spill South Korea"; the article content of each topic is available from the TAC 2011 dataset and is not repeated here). For each selected typical topic, the multi-document summary generated by the method provided by the embodiments of the present invention is compared with the manually annotated multi-document summary, as in Tables 4 and 5:
Table 4: topic "VTech Shooting" (the table is rendered as an image in the original publication)
Table 5: topic "Oil Spill South Korea" (the table is rendered as an image in the original publication)
Comparing the contents of Tables 4 and 5 shows that, when applied to the same topic, the content of the multi-document summary generated by the method provided by this application is essentially consistent with that of the manually annotated summary, covers the central idea of the original topic, and its sentences are well-formed and comply with correct grammatical rules.
The above introduces the solutions provided by this application mainly from the perspective of the multi-document summary generation apparatus. It will be appreciated that, to implement the above functions, the apparatus includes hardware structures and/or software modules corresponding to each function. Those skilled in the art will readily appreciate that, combined with the apparatus and method steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in hardware or in a combination of hardware and computer software; whether a function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution, and implementations of the described functions for particular applications should not be considered beyond the scope of this application.
The embodiments of the present invention may divide the multi-document summary generation apparatus and the like into function modules according to the above method examples: for example, each function may correspond to one function module, or two or more functions may be integrated into one processing module; the integrated module may be implemented in the form of hardware or of a software function module. It should be noted that the division of modules in the embodiments is schematic, merely a logical function division, and other divisions are possible in actual implementations.
When each function corresponds to one function module, FIG. 6 shows a possible structural diagram of the multi-document summary generation apparatus involved in the above embodiments. As shown in FIG. 6, it includes an acquiring unit 601, an estimating unit 602, a selecting unit 603 and a generating unit 604: the acquiring unit 601 supports the apparatus in performing steps S101 and S105 of the above embodiments; the estimating unit 602 supports steps S102, S106, S109 (specifically, for example, S109a and S109b) and S111; the selecting unit 603 supports steps S103 and S110 (S110a, S110b); the generating unit 604 supports step S104 (specifically, S112, S113 and S114). It may of course further include a filtering unit 605 supporting step S107 (specifically, for example, S107a and S107b) and an extracting unit 606 supporting step S108.
It will be appreciated that the generating unit 604 of the embodiments corresponds to the summary generation module 103 of the apparatus shown in FIG. 1, and the acquiring unit 601, estimating unit 602 and selecting unit 603 correspond to the importance estimation module 102 of the apparatus shown in FIG. 1.
With integrated units, FIG. 7 shows a possible logical structure diagram of the multi-document summary generation apparatus involved in the above embodiments. The apparatus includes a processing module 512 and a communication module 513. The processing module 512 controls and manages the actions of the apparatus: for example, the processing module 512 performs steps S101, S105, S102, S106, S109 (specifically, for example, S109a and S109b), S111, S103, S110 (S110a, S110b), S104 (specifically, S112, S113 and S114), S107 (specifically, for example, S107a and S107b) and S108 of the above embodiments, and/or other processes of the techniques described herein. The communication module 513 supports communication between the apparatus and other devices. The apparatus may further include a storage module 511 for storing the apparatus's program code and data.
The processing module 512 may be a processor or controller, for example a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component or any combination thereof, and can implement or execute the various illustrative logical blocks, modules and circuits described in connection with the disclosure of the present invention. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or of a digital signal processor and a microprocessor. The communication module 513 may be a communication interface or the like; the storage module 511 may be a memory.
When the processing module 512 is a processor, the communication module 513 a communication interface and the storage module 511 a memory, the multi-document summary generation apparatus involved in the embodiments of the present invention may be the terminal shown in FIG. 8.
FIG. 8 presents a schematic structural diagram of a terminal according to an embodiment of the present invention. As can be seen from FIG. 8, the terminal includes a processor 301, a communication interface 302, a memory 304 and a bus 303; the communication interface 302, the processor 301 and the memory 304 are interconnected through the bus 303. The bus 303 may be a PCI bus or an EISA bus, and may be divided into an address bus, a data bus, a control bus and so on; for ease of representation, only one thick line is shown in FIG. 8, which does not mean there is only one bus or one type of bus. The memory 304 stores the terminal's program code and data; the communication interface 302 supports communication between the terminal and other devices; the processor 301 supports the terminal in executing the program code stored in the memory 304, using the stored data, to implement the multi-document summary generation method provided by the embodiments of the present invention.
In one aspect, an embodiment of the present invention provides a computer-readable storage medium storing instructions; when the instructions are run on a terminal, they cause the multi-document summary generation apparatus to perform steps S101, S105, S102, S106, S109 (specifically, for example, S109a and S109b), S111, S103, S110 (S110a, S110b), S104 (specifically, S112, S113 and S114), S107 (specifically, for example, S107a and S107b) and S108 of the above embodiments.
From the above description of the implementations, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above function modules is used as an example; in practical applications the above functions can be allocated to different function modules as needed, i.e. the internal structure of the device can be divided into different function modules to complete all or part of the functions described above. For the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes of the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical function division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.
Units described as separate components may or may not be physically separate; components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the function units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or of a software function unit.
If implemented in the form of a software function unit and sold or used as a standalone product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application may, in essence, or in the part contributing to the prior art, or in whole or in part, be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited to them; any variation or replacement within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (18)

  1. A multi-document summary generation method, comprising:
    acquiring a candidate sentence set, the candidate sentence set comprising the candidate sentences of each of a plurality of candidate documents about a same event;
    training each candidate sentence in the candidate sentence set by using a cascaded attention mechanism and an unsupervised learning model in a preset network model to obtain the importance of each candidate sentence, wherein the importance of one candidate sentence corresponds to the norm of one row vector of a cascaded attention matrix, the cascaded attention matrix is output by the preset network model in the process of optimising a reconstruction error function by using the unsupervised learning model, and the importance of a candidate sentence indicates how important the meaning expressed by the candidate sentence is in the plurality of candidate documents;
    selecting, according to the importance of each candidate sentence, phrases meeting a preset condition from the candidate sentence set as a summary phrase set; and
    obtaining a summary of the plurality of candidate documents according to the summary phrase set.
  2. The method according to claim 1, wherein the training of each candidate sentence in the candidate sentence set to obtain its importance comprises:
    obtaining, according to the preset network model, m vectors used to describe the event; and
    optimising the reconstruction error function in the unsupervised learning process according to each candidate sentence, the m vectors used to describe the event and a candidate matrix, and, when the reconstruction error function takes its minimum value, taking the norm of each row vector of the cascaded attention matrix output by the preset network model as the importance of one candidate sentence, to obtain the importance of each candidate sentence, wherein the reconstruction error function comprises: the relation between each candidate sentence and the m vectors used to describe the event, the candidate matrix, and the weight corresponding to the candidate matrix; the candidate matrix is a matrix of m rows by n columns, m and n are positive integers, and n is the number of words comprised in the plurality of candidate documents.
  3. The method according to claim 1 or 2, wherein selecting phrases meeting the preset condition from the candidate sentence set as the summary phrase set according to the importance of each candidate sentence comprises:
    filtering out of each candidate sentence the words that do not comply with preset rules, to obtain each filtered candidate sentence;
    extracting at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set;
    computing, according to the respective importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from each candidate sentence; and
    selecting from the phrase set, according to the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase corresponding to each candidate sentence, the first part-of-speech phrases and second part-of-speech phrases satisfying the preset condition as the summary phrase set.
  4. The method according to claim 3, wherein the filtering comprises:
    filtering out the noise in each candidate sentence to obtain a candidate word set corresponding to each candidate sentence, wherein each candidate sentence comprises a plurality of words and each word corresponds to an importance; and
    filtering out of each candidate word set, according to the importance of each word, the words whose importance is below a preset threshold, to obtain each filtered candidate sentence.
  5. The method according to claim 4, wherein before the filtering out of the words whose importance is below the preset threshold, the method further comprises:
    training each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and unsupervised learning model in the preset network model to obtain the importance of each of the plurality of different words comprised in the plurality of candidate documents.
  6. The method according to claim 5, wherein obtaining the importance of each word comprises:
    optimising the reconstruction error function in the unsupervised learning process according to each candidate sentence, the m vectors used to describe the event and the candidate matrix, and, when the reconstruction error function takes its minimum value, taking the norm of each column vector of the candidate matrix as the importance of one word, to obtain the importance of each word.
  7. The method according to any one of claims 3-6, wherein computing the phrase importances according to the importance of each candidate sentence comprises:
    obtaining the term frequency of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase; and
    computing the importance of the extracted phrases according to the term frequency of each phrase and the importance of the candidate sentence in which each phrase is located.
  8. The method according to any one of claims 3-7, wherein selecting the phrases satisfying the preset condition as the summary phrase set comprises:
    inputting the importance of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase, together with the similarities between phrases, into an integer linear programming function, and, when the integer linear programming function takes its extremum, determining the candidate weight of each phrase and the link weights of the inter-phrase similarities, wherein the candidate weight of one phrase is used to determine whether that phrase satisfies the preset condition, and the link weights are used to determine whether similar phrases are selected together; and
    determining, according to the candidate weight of each phrase and the link weights of the inter-phrase similarities, the phrases satisfying the preset condition.
  9. A multi-document summary generation apparatus, comprising:
    an acquiring unit configured to acquire a candidate sentence set, the candidate sentence set comprising the candidate sentences of each of a plurality of candidate documents about a same event;
    an estimating unit configured to train each candidate sentence in the candidate sentence set by using a cascaded attention mechanism and an unsupervised learning model in a preset network model to obtain the importance of each candidate sentence, wherein the importance of one candidate sentence corresponds to the norm of one row vector of the cascaded attention matrix output by the preset network model in the process of optimising the reconstruction error function by using the unsupervised learning model, and the importance of a candidate sentence indicates how important the meaning expressed by the candidate sentence is in the plurality of candidate documents;
    a selecting unit configured to select, according to the importance of each candidate sentence, phrases meeting a preset condition from the candidate sentence set as a summary phrase set; and
    a generating unit configured to obtain a summary of the plurality of candidate documents according to the summary phrase set.
  10. The apparatus according to claim 9, wherein the acquiring unit is further configured to: optimise the reconstruction error function in the unsupervised learning process according to each candidate sentence, m vectors used to describe the event and a candidate matrix, and, when the reconstruction error function takes its minimum value, take the norm of each row vector of the cascaded attention matrix output by the preset network model as the importance of one candidate sentence, to obtain the importance of each candidate sentence, wherein the reconstruction error function comprises: the relation between each candidate sentence and the m vectors used to describe the event, the candidate matrix, and the weight corresponding to the candidate matrix; the candidate matrix is a matrix of m rows by n columns, m and n are positive integers, and n is the number of words comprised in the plurality of candidate documents.
  11. The apparatus according to claim 9 or 10, further comprising:
    a filtering unit configured to filter out of each candidate sentence the words that do not comply with preset rules, to obtain each filtered candidate sentence; and
    an extracting unit configured to extract at least one first part-of-speech phrase and at least one second part-of-speech phrase from the syntax tree of each filtered candidate sentence to form a phrase set;
    wherein the estimating unit is further configured to compute, according to the respective importance of each candidate sentence, the importance of the at least one first part-of-speech phrase and the at least one second part-of-speech phrase extracted from each candidate sentence; and
    the selecting unit is specifically configured to select from the phrase set, according to those phrase importances, the first part-of-speech phrases and second part-of-speech phrases satisfying the preset condition as the summary phrase set.
  12. The apparatus according to claim 11, wherein the filtering unit is specifically configured to: filter out the noise in each candidate sentence to obtain a candidate word set corresponding to each candidate sentence, wherein each candidate sentence comprises a plurality of words and each word corresponds to an importance; and filter out of each candidate word set, according to the importance of each word, the words whose importance is below a preset threshold, to obtain each filtered candidate sentence.
  13. The apparatus according to claim 12, wherein the estimating unit is further configured to train each candidate sentence in the candidate sentence set by using the cascaded attention mechanism and unsupervised learning model in the preset network model to obtain the importance of each of the plurality of different words comprised in the plurality of candidate documents.
  14. The apparatus according to claim 13, wherein the estimating unit is further specifically configured to: optimise the reconstruction error function in the unsupervised learning process according to each candidate sentence, the m vectors used to describe the event and the candidate matrix, and, when the reconstruction error function takes its minimum value, take the norm of each column vector of the candidate matrix as the importance of one word, to obtain the importance of each word.
  15. The apparatus according to any one of claims 11-14, wherein the acquiring unit is further configured to obtain the term frequency of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase; and
    the estimating unit is further configured to compute the importance of the extracted phrases according to the term frequency of each phrase and the importance of the candidate sentence in which each phrase is located.
  16. The apparatus according to any one of claims 10-15, wherein the acquiring unit is specifically configured to input the importance of each phrase among the at least one first part-of-speech phrase and the at least one second part-of-speech phrase, together with the similarities between phrases, into the integer linear programming function and, when the integer linear programming function takes its extremum, determine the candidate weight of each phrase and the link weights of the inter-phrase similarities, wherein the candidate weight of one phrase is used to determine whether that phrase satisfies the preset condition and the link weights are used to determine whether similar phrases are selected together; and
    the selecting unit is specifically configured to determine, according to the candidate weight of each phrase and the link weights of the inter-phrase similarities, the phrases satisfying the preset condition.
  17. A terminal, comprising a processor, a memory, a system bus and a communication interface, wherein the memory is configured to store computer-executable instructions, the processor is connected to the memory through the system bus, and, when the terminal runs, the processor executes the computer-executable instructions stored in the memory to cause the terminal to perform the multi-document summary generation method according to any one of claims 1-8.
  18. A computer-readable storage medium comprising instructions which, when run on a terminal, cause the terminal to perform the multi-document summary generation method according to any one of claims 1-8.
PCT/CN2017/116658 2017-05-23 2017-12-15 一种多文档摘要生成的方法、装置和终端 WO2018214486A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/688,090 US10929452B2 (en) 2017-05-23 2019-11-19 Multi-document summary generation method and apparatus, and terminal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710369694.XA CN108959312B (zh) 2017-05-23 2017-05-23 一种多文档摘要生成的方法、装置和终端
CN201710369694.X 2017-05-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/688,090 Continuation US10929452B2 (en) 2017-05-23 2019-11-19 Multi-document summary generation method and apparatus, and terminal

Publications (1)

Publication Number Publication Date
WO2018214486A1 true WO2018214486A1 (zh) 2018-11-29

Family

ID=64396188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/116658 WO2018214486A1 (zh) 2017-05-23 2017-12-15 一种多文档摘要生成的方法、装置和终端

Country Status (3)

Country Link
US (1) US10929452B2 (zh)
CN (1) CN108959312B (zh)
WO (1) WO2018214486A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347242A (zh) * 2020-11-11 2021-02-09 北京沃东天骏信息技术有限公司 摘要生成方法、装置、设备及介质
FR3102276A1 (fr) * 2019-10-17 2021-04-23 Amadeus Procedes et systemes pour résumer des document multiples en utilisant une approche d’apprentissage automatique
CN113221967A (zh) * 2021-04-23 2021-08-06 中国农业大学 特征抽取方法、装置、电子设备及存储介质
US20220215177A1 (en) * 2018-07-27 2022-07-07 Beijing Jingdong Shangke Information Technology Co., Ltd. Method and system for processing sentence, and electronic device
US12032905B2 (en) 2019-10-17 2024-07-09 Amadeus S.A.S. Methods and systems for summarization of multiple documents using a machine learning approach

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368476B (zh) * 2017-07-25 2020-11-03 深圳市腾讯计算机系统有限公司 一种翻译的方法、目标信息确定的方法及相关装置
US11106872B2 (en) * 2018-01-09 2021-08-31 Jyu-Fang Yu System and method for improving sentence diagram construction and analysis by enabling a user positioning sentence construction components and words on a diagramming interface
CN108628833B (zh) * 2018-05-11 2021-01-22 北京三快在线科技有限公司 原创内容摘要确定方法及装置,原创内容推荐方法及装置
US11023682B2 (en) 2018-09-30 2021-06-01 International Business Machines Corporation Vector representation based on context
CN109919174A (zh) * 2019-01-16 2019-06-21 北京大学 一种基于门控级联注意力机制的文字识别方法
CN111597791A (zh) * 2019-02-19 2020-08-28 北大方正集团有限公司 评论短语的提取方法及设备
CN110162618B (zh) * 2019-02-22 2021-09-17 北京捷风数据技术有限公司 一种非对照语料的文本概要生成方法及装置
CN110287491B (zh) * 2019-06-25 2024-01-12 北京百度网讯科技有限公司 事件名生成方法及装置
CN110363000B (zh) * 2019-07-10 2023-11-17 深圳市腾讯网域计算机网络有限公司 识别恶意文件的方法、装置、电子设备及存储介质
CN110442866A (zh) * 2019-07-28 2019-11-12 广东工业大学 一种融合语法信息的句子压缩方法
KR102098734B1 (ko) * 2019-08-06 2020-04-08 전자부품연구원 대화 상대의 외형을 반영한 수어 영상 제공 방법, 장치 및 단말
US11281854B2 (en) * 2019-08-21 2022-03-22 Primer Technologies, Inc. Limiting a dictionary used by a natural language model to summarize a document
CN110825870B (zh) * 2019-10-31 2023-07-14 腾讯科技(深圳)有限公司 文档摘要的获取方法和装置、存储介质及电子装置
US20210192813A1 (en) * 2019-12-18 2021-06-24 Catachi Co. DBA Compliance.ai Methods and systems for facilitating generation of navigable visualizations of documents
EP4127967A4 (en) 2020-03-23 2024-05-01 Sorcero, Inc. FEATURE ENGINEERING WITH QUESTION GENERATION
CN111597327B (zh) * 2020-04-22 2023-04-07 哈尔滨工业大学 一种面向舆情分析的无监督式多文档文摘生成方法
US11640295B2 (en) * 2020-06-26 2023-05-02 Intel Corporation System to analyze and enhance software based on graph attention networks
CN111797226B (zh) * 2020-06-30 2024-04-05 北京百度网讯科技有限公司 会议纪要的生成方法、装置、电子设备以及可读存储介质
CN112016296B (zh) * 2020-09-07 2023-08-25 平安科技(深圳)有限公司 句子向量生成方法、装置、设备及存储介质
CN112069309B (zh) * 2020-09-14 2024-03-15 腾讯科技(深圳)有限公司 信息获取方法、装置、计算机设备及存储介质
CN114600112A (zh) * 2020-09-29 2022-06-07 谷歌有限责任公司 使用自然语言处理的文档标记和导航
CN112560479B (zh) * 2020-12-24 2024-01-12 北京百度网讯科技有限公司 摘要抽取模型训练方法、摘要抽取方法、装置和电子设备
CN112784585A (zh) * 2021-02-07 2021-05-11 新华智云科技有限公司 金融公告的摘要提取方法与摘要提取终端
CN112711662A (zh) * 2021-03-29 2021-04-27 贝壳找房(北京)科技有限公司 文本获取方法和装置、可读存储介质、电子设备
CN113221559B (zh) * 2021-05-31 2023-11-03 浙江大学 利用语义特征的科技创新领域中文关键短语抽取方法及系统
KR20230046086A (ko) * 2021-09-29 2023-04-05 한국전자통신연구원 중요 문장 기반 검색 서비스 제공 장치 및 방법
CN114239587A (zh) * 2021-11-24 2022-03-25 北京三快在线科技有限公司 一种摘要生成方法、装置、电子设备及存储介质
CN114706972B (zh) * 2022-03-21 2024-06-18 北京理工大学 一种基于多句压缩的无监督科技情报摘要自动生成方法
CN117668213B (zh) * 2024-01-29 2024-04-09 南京争锋信息科技有限公司 一种基于级联抽取和图对比模型的混沌工程摘要生成方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020052901A1 (en) * 2000-09-07 2002-05-02 Guo Zhi Li Automatic correlation method for generating summaries for text documents
CN102411621A (zh) * 2011-11-22 2012-04-11 华中师范大学 一种基于云模型的中文面向查询的多文档自动文摘方法
CN104156452A (zh) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 一种网页文本摘要生成方法和装置
CN105005563A (zh) * 2014-04-15 2015-10-28 腾讯科技(深圳)有限公司 一种摘要生成方法及装置
US20170060826A1 (en) * 2015-08-26 2017-03-02 Subrata Das Automatic Sentence And Clause Level Topic Extraction And Text Summarization

Family Cites Families (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6546379B1 (en) * 1999-10-26 2003-04-08 International Business Machines Corporation Cascade boosting of predictive models
WO2004061702A1 (en) * 2002-12-26 2004-07-22 The Trustees Of Columbia University In The City Of New York Ordered data compression system and methods
US7783135B2 (en) * 2005-05-09 2010-08-24 Like.Com System and method for providing objectified image renderings using recognition information from images
WO2008023280A2 (en) * 2006-06-12 2008-02-28 Fotonation Vision Limited Advances in extending the aam techniques from grayscale to color images
CN101008941A (zh) * 2007-01-10 2007-08-01 复旦大学 多文档自动摘要的逐次主轴筛选法
CN101398814B (zh) * 2007-09-26 2010-08-25 北京大学 一种同时抽取文档摘要和关键词的方法及系统
US20100299303A1 (en) * 2009-05-21 2010-11-25 Yahoo! Inc. Automatically Ranking Multimedia Objects Identified in Response to Search Queries
US8473430B2 (en) * 2010-01-29 2013-06-25 Microsoft Corporation Deep-structured conditional random fields for sequential labeling and classification
CN102385574B (zh) * 2010-09-01 2014-08-20 株式会社理光 从文档抽取句子的方法和装置
CN102043851A (zh) * 2010-12-22 2011-05-04 四川大学 一种基于频繁项集的多文档自动摘要方法
US8856050B2 (en) * 2011-01-13 2014-10-07 International Business Machines Corporation System and method for domain adaption with partial observation
US8909643B2 (en) * 2011-12-09 2014-12-09 International Business Machines Corporation Inferring emerging and evolving topics in streaming text
US9256617B2 (en) * 2012-07-06 2016-02-09 Samsung Electronics Co., Ltd. Apparatus and method for performing visual search
US9436911B2 (en) * 2012-10-19 2016-09-06 Pearson Education, Inc. Neural networking system and methods
US9129148B1 (en) * 2012-11-09 2015-09-08 Orbeus Inc. System, method and apparatus for scene recognition
US20140236577A1 (en) * 2013-02-15 2014-08-21 Nec Laboratories America, Inc. Semantic Representations of Rare Words in a Neural Probabilistic Language Model
WO2014203042A1 (en) * 2013-06-21 2014-12-24 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Method for pseudo-recurrent processing of data using a feedforward neural network architecture
US9730643B2 (en) * 2013-10-17 2017-08-15 Siemens Healthcare Gmbh Method and system for anatomical object detection using marginal space deep neural networks
US9471886B2 (en) * 2013-10-29 2016-10-18 Raytheon Bbn Technologies Corp. Class discriminative feature transformation
CN103593703A (zh) * 2013-11-26 2014-02-19 上海电机学院 基于遗传算法的神经网络优化系统及方法
CN103885935B (zh) 2014-03-12 2016-06-29 浙江大学 基于图书阅读行为的图书章节摘要生成方法
CN103853834B (zh) 2014-03-12 2017-02-08 华东师范大学 基于文本结构分析的Web文档摘要的生成方法
US9996976B2 (en) * 2014-05-05 2018-06-12 Avigilon Fortress Corporation System and method for real-time overlay of map features onto a video feed
US20180107660A1 (en) * 2014-06-27 2018-04-19 Amazon Technologies, Inc. System, method and apparatus for organizing photographs stored on a mobile computing device
CN105320642B (zh) 2014-06-30 2018-08-07 中国科学院声学研究所 一种基于概念语义基元的文摘自动生成方法
US9767385B2 (en) * 2014-08-12 2017-09-19 Siemens Healthcare Gmbh Multi-layer aggregation for object detection
US10806374B2 (en) * 2014-08-25 2020-10-20 Georgia Tech Research Corporation Noninvasive systems and methods for monitoring health characteristics
CN105488021B (zh) 2014-09-15 2018-09-28 华为技术有限公司 一种生成多文档摘要的方法和装置
CN105530554B (zh) * 2014-10-23 2020-08-07 南京中兴新软件有限责任公司 一种视频摘要生成方法及装置
CN104503958B (zh) 2014-11-19 2017-09-26 百度在线网络技术(北京)有限公司 文档摘要的生成方法及装置
WO2016090376A1 (en) * 2014-12-05 2016-06-09 Texas State University Eye tracking via patterned contact lenses
WO2016132150A1 (en) * 2015-02-19 2016-08-25 Magic Pony Technology Limited Enhancing visual data using and augmenting model libraries
CN104778157A (zh) 2015-03-02 2015-07-15 华南理工大学 一种多文档摘要句的生成方法
US9842105B2 (en) * 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
CN104834735B (zh) 2015-05-18 2018-01-23 大连理工大学 一种基于词向量的文档摘要自动提取方法
CN105183710A (zh) 2015-06-23 2015-12-23 武汉传神信息技术有限公司 一种文档摘要自动生成的方法
US20170083623A1 (en) * 2015-09-21 2017-03-23 Qualcomm Incorporated Semantic multisensory embeddings for video search by text
US10296846B2 (en) * 2015-11-24 2019-05-21 Xerox Corporation Adapted domain specific class means classifier
US10354199B2 (en) * 2015-12-07 2019-07-16 Xerox Corporation Transductive adaptation of classifiers without source data
US10424072B2 (en) * 2016-03-01 2019-09-24 Samsung Electronics Co., Ltd. Leveraging multi cues for fine-grained object classification
CN105930314B (zh) 2016-04-14 2019-02-05 清华大学 基于编码-解码深度神经网络的文本摘要生成系统及方法
US20170351786A1 (en) * 2016-06-02 2017-12-07 Xerox Corporation Scalable spectral modeling of sparse sequence functions via a best matching algorithm
CN106054606B (zh) * 2016-06-12 2019-01-11 金陵科技学院 基于级联观测器的无模型控制方法
US10223612B2 (en) * 2016-09-01 2019-03-05 Microsoft Technology Licensing, Llc Frame aggregation network for scalable video face recognition
TWI612488B (zh) * 2016-12-05 2018-01-21 財團法人資訊工業策進會 用於預測商品的市場需求的計算機裝置與方法
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10460727B2 (en) * 2017-03-03 2019-10-29 Microsoft Technology Licensing, Llc Multi-talker speech recognizer

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020052901A1 (en) * 2000-09-07 2002-05-02 Guo Zhi Li Automatic correlation method for generating summaries for text documents
CN102411621A (zh) * 2011-11-22 2012-04-11 华中师范大学 一种基于云模型的中文面向查询的多文档自动文摘方法
CN105005563A (zh) * 2014-04-15 2015-10-28 腾讯科技(深圳)有限公司 一种摘要生成方法及装置
CN104156452A (zh) * 2014-08-18 2014-11-19 中国人民解放军国防科学技术大学 一种网页文本摘要生成方法和装置
US20170060826A1 (en) * 2015-08-26 2017-03-02 Subrata Das Automatic Sentence And Clause Level Topic Extraction And Text Summarization

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220215177A1 (en) * 2018-07-27 2022-07-07 Beijing Jingdong Shangke Information Technology Co., Ltd. Method and system for processing sentence, and electronic device
US12039281B2 (en) * 2018-07-27 2024-07-16 Beijing Jingdong Shangke Information Technology Co., Ltd. Method and system for processing sentence, and electronic device
FR3102276A1 (fr) * 2019-10-17 2021-04-23 Amadeus Procedes et systemes pour résumer des document multiples en utilisant une approche d’apprentissage automatique
US12032905B2 (en) 2019-10-17 2024-07-09 Amadeus S.A.S. Methods and systems for summarization of multiple documents using a machine learning approach
CN112347242A (zh) * 2020-11-11 2021-02-09 北京沃东天骏信息技术有限公司 摘要生成方法、装置、设备及介质
CN113221967A (zh) * 2021-04-23 2021-08-06 中国农业大学 特征抽取方法、装置、电子设备及存储介质
CN113221967B (zh) * 2021-04-23 2023-11-24 中国农业大学 特征抽取方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN108959312A (zh) 2018-12-07
CN108959312B (zh) 2021-01-29
US20200081909A1 (en) 2020-03-12
US10929452B2 (en) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2018214486A1 (zh) 一种多文档摘要生成的方法、装置和终端
WO2022227207A1 (zh) 文本分类方法、装置、计算机设备和存储介质
KR102342066B1 (ko) 뉴럴 네트워크 모델을 이용한 기계 번역 방법, 장치 및 그 장치를 학습시키기 위한 방법
CN108733682B (zh) 一种生成多文档摘要的方法及装置
EP3616083A2 (en) Multi-lingual semantic parser based on transferred learning
US20170193086A1 (en) Methods, devices, and systems for constructing intelligent knowledge base
US20130007020A1 (en) Method and system of extracting concepts and relationships from texts
US20150227505A1 (en) Word meaning relationship extraction device
Mills et al. Graph-based methods for natural language processing and understanding—A survey and analysis
CN108491389B (zh) 点击诱饵标题语料识别模型训练方法和装置
CN110110332B (zh) 文本摘要生成方法及设备
Hasan et al. Neural clinical paraphrase generation with attention
KR101717230B1 (ko) 재귀 오토인코더 기반 문장 벡터 모델링을 이용하는 문서 요약 방법 및 문서 요약 시스템
Schwartz et al. Neural polysynthetic language modelling
CN116628186B (zh) 文本摘要生成方法及系统
Maučec et al. Slavic languages in phrase-based statistical machine translation: a survey
CN114625866A (zh) 训练摘要生成模型的方法、装置、设备及介质
Luz et al. Semantic parsing natural language into SPARQL: improving target language representation with neural attention
Cao et al. Inference time style control for summarization
CN114330335A (zh) 关键词抽取方法、装置、设备及存储介质
CN114218921A (zh) 一种优化bert的问题语义匹配方法
US11281855B1 (en) Reinforcement learning approach to decode sentence ambiguity
Long [Retracted] The Construction of Machine Translation Model and Its Application in English Grammar Error Detection
Mahmoud et al. Arabic semantic textual similarity identification based on convolutional gated recurrent units
Niu et al. Faithful target attribute prediction in neural machine translation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17910852

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17910852

Country of ref document: EP

Kind code of ref document: A1