CN108733682A - Method and device for generating a multi-document summary - Google Patents
- Publication number: CN108733682A (application CN201710245997.0A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The embodiment of the present application discloses a method and device for generating a multi-document summary, relates to the field of data processing, and solves the problem that the summaries generated by existing automatic multi-document summarization technology perform poorly. The concrete scheme is: divide multiple documents into n sentences and generate an input bag-of-words vector for each sentence; perform unsupervised training on each sentence represented by its input bag-of-words vector to obtain the coding hidden layer vector and the potential semantic vector of each sentence; collect m potential semantic vectors; obtain m decoding hidden layer vectors and m output bag-of-words vectors from the m potential semantic vectors and update them; estimate the importance of each sentence; obtain the importance and redundancy of the verb phrases of each sentence and the importance and redundancy of the noun phrases of each sentence; and generate the summary of the multiple documents according to the importance and redundancy of all noun phrases and the importance and redundancy of all verb phrases. The embodiment of the present application is used in the process of generating a multi-document summary.
Description
Technical Field
The embodiment of the application relates to the field of data processing, in particular to a method and a device for generating a multi-document abstract.
Background
In the era of information explosion, people face massive amounts of information, and fast and effective means of information processing are increasingly and urgently needed. As one of the channels for acquiring information, news reading occupies a considerable part of people's lives, but the immediacy and redundancy of news bring great inconvenience to reading. Multi-Document Summarization (MDS) technology automatically generates, for the multiple documents of a topic, a brief summary subject to a word-count limit that describes the main content of the topic as completely as possible for the user to read, thereby improving the efficiency of reading and acquiring information.
Summary generation methods can be classified into the following three categories. Generation-based methods rely on technologies such as natural language understanding and natural language generation; while neither the understanding nor the generation technology is mature, this approach remains difficult. Extraction-based methods form the summary by directly selecting the most important sentences from the original text, with some mechanism to ensure that the extracted sentences do not repeat each other at the semantic level, so importance and coverage are guaranteed, but the result contains more noise. Compression-based methods delete noisy or redundant information within the extracted sentences under constraints such as sentence integrity, retaining only the important information to form the summary, but the resulting sentences may not read smoothly. Therefore, the existing automatic multi-document summarization technology cannot satisfy users' requirements well, and the generated summaries perform poorly.
Disclosure of Invention
The embodiment of the application provides a method and a device for generating a multi-document summary, which solve the problem of the poor performance of the summaries generated by existing automatic multi-document summarization technology.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
in a first aspect of an embodiment of the present application, a method for generating a multiple-document summary is provided, including:
firstly, dividing a plurality of documents into n sentences and generating an input bag-of-words vector for each sentence, the input bag-of-words vectors of the n sentences forming an input bag-of-words vector space; then carrying out unsupervised training on each sentence represented by its input bag-of-words vector based on a Variational Auto-Encoder (VAE) model to obtain a coding hidden layer vector and a potential semantic vector of each sentence, the coding hidden layer vectors of the n sentences forming a coding hidden layer vector space and the potential semantic vectors of the n sentences forming a potential semantic vector space; then collecting m potential semantic vectors from the potential semantic vector space, obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors, and updating the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism; estimating the importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the m updated decoding hidden layer vectors and the m updated output bag-of-words vectors; finally, obtaining the verb phrases and noun phrases of each sentence, obtaining the importance of each noun phrase according to the importance of the sentence in which the noun phrase is located and the importance of each verb phrase according to the importance of the sentence in which the verb phrase is located, obtaining the redundancy of each verb phrase and the redundancy of each noun phrase, and generating the summary of the plurality of documents according to the importance and redundancy of all noun phrases and the importance and redundancy of all verb phrases based on an integer linear programming model, wherein n is an integer greater than or equal to 1 and m is an integer greater than or equal to 1 and smaller than n.
In the method for generating a multi-document summary provided by the embodiment of the application, each sentence represented by an input bag-of-words vector is subjected to unsupervised training based on a variational self-coding model to generate the potential semantic vector of the sentence, which improves the measurement effect of the sentence; meanwhile, decoding hidden layer vectors and output bag-of-words vectors are obtained according to the potential semantic vectors, and then the potential semantic vector space, the coding hidden layer vector space and the input bag-of-words vector space are respectively reconstructed from the potential semantic vectors, the decoding hidden layer vectors and the output bag-of-words vectors to estimate the importance of each sentence; that is, multiple semantic spaces are jointly considered to estimate sentence importance and generate the summary of the multiple documents. Therefore, the summary of the multiple documents is generated through the variational self-coding model and the sentence importance estimation model of the joint multi-semantic space, and the performance of the summary is greatly improved.
With reference to the first aspect, in a first implementation manner, the performing unsupervised training on each sentence represented by an input bag of words vector based on a variational self-coding model to obtain a coding hidden layer vector of each sentence and a latent semantic vector of each sentence includes: step 1, mapping a sentence x represented by an input bag-of-words vector to a first coding hidden layer to obtain a coding hidden layer vector of the sentence x, wherein the sentence x is any one of n sentences; step 2, mapping the coding hidden layer vector of the sentence x to a second coding hidden layer to obtain a mean vector and a variance vector, wherein the mean vector and the variance vector are used for representing a potential semantic vector to be determined of the sentence x; step 3, obtaining a to-be-determined potential semantic vector of the sentence x according to the mean vector and the variance vector; step 4, mapping the potential semantic vector to be determined of the sentence x to a decoding hidden layer to obtain a decoding hidden layer vector of the sentence x; step 5, mapping the decoding hidden layer vector of the sentence x to an output layer to obtain an output bag-of-words vector of the sentence x, namely regenerating the sentence x to obtain a sentence x'; repeating the steps 1 to 5, and obtaining the value of the objective function of the first optimization problem according to the input bag-of-words vector, the output bag-of-words vector, the mean vector and the variance vector; and when the value of the objective function of the first optimization problem is an extreme value, determining the potential semantic vector to be determined as the potential semantic vector of the sentence x, namely when the value of the objective function of the first optimization problem is an extreme value, the regenerated sentence x' is most similar to the sentence x.
The method for generating the multi-document abstract provided by the embodiment of the application is used for carrying out unsupervised training on each sentence represented by the input bag-of-words vector based on the variational self-coding model, and specifically determining the potential semantic vector of the sentence by acquiring the extreme value of the objective function of the first optimization problem, so that the measurement effect of the sentence is improved.
In order to jointly consider the importance of a multi-semantic-space estimation sentence to generate abstracts of a plurality of documents and improve the performance of the abstracts, in combination with the first implementation manner, in the second implementation manner, obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to m potential semantic vectors includes: mapping the m potential semantic vectors to the decoding hidden layer to obtain m decoding hidden layer vectors; and mapping the m decoding hidden layer vectors to the output layer to obtain m output bag-of-word vectors.
In order to generate abstracts of a plurality of documents by jointly considering the importance of a multi-semantic-space estimation sentence and improve the performance of the abstracts, in combination with the first aspect, in a third implementation manner, obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to m potential semantic vectors includes: mapping the m potential semantic vectors to decoding hidden layers to obtain m decoding hidden layer vectors; and mapping the m decoding hidden layer vectors to an output layer to obtain m output bag-of-word vectors.
With reference to the first aspect or any one of the first to third implementable manners, in a fourth implementable manner, updating the m decoding hidden layer vectors according to the alignment mechanism includes: acquiring the relationship between each decoding hidden layer vector in the m decoding hidden layer vectors and the encoding hidden layer vectors of the n sentences to obtain a first alignment value; weighting and summing the first alignment value and the coding hidden layer vectors of the n sentences to obtain a first context vector; updating the m decoded hidden layer vectors according to the first context vector. When m decoding hidden layer vectors are generated, some low-frequency detail information may be lost, and in order to complement the information, the m decoding hidden layer vectors are updated through an alignment mechanism.
With reference to the first aspect or any one of the first to fourth implementable manners, in a fifth implementable manner, updating the m output bag-of-words vectors according to the alignment mechanism includes: obtaining the relation between each output bag-of-words vector in the m output bag-of-words vectors and the input bag-of-words vectors of the n sentences to obtain a second alignment value; weighting and summing the second alignment value and the input word bag vectors of the n sentences to obtain a second context vector; and updating the m output bag-of-words vectors according to the second context vector. When m output bag-of-word vectors are generated, some low-frequency detail information may be lost, and in order to complement the information, the m output bag-of-word vectors are updated through an alignment mechanism.
With reference to the first aspect or any one of the first to fifth implementable manners, in a sixth implementable manner, estimating the importance of each sentence according to the input bag of words vector space, the coding hidden layer vector space, the latent semantic vector space, the m latent semantic vectors, the updated m decoding hidden layer vectors, and the updated m output bag of words vector includes: reconstructing a potential semantic vector space according to the m potential semantic vectors, reconstructing a coding hidden vector space according to the updated m decoding hidden vectors, and reconstructing an input bag-of-words vector space according to the updated m output bag-of-words vectors, so as to obtain a value of a target function of a second optimization problem; when the value of the objective function of the second optimization problem is an extreme value, a reconstruction coefficient matrix is obtained; and taking a module of the vector corresponding to each sentence in the reconstruction coefficient matrix, and determining the module of the vector corresponding to the sentence as the importance of the sentence. Therefore, the importance of the sentence is estimated by jointly considering the multiple semantic spaces to generate the abstract, and the performance of the abstract can be greatly improved.
In order to generate abstracts of a plurality of documents and improve the performance of the abstracts, in combination with any one of the first aspect or the first implementable manner to the sixth implementable manner, in the seventh implementable manner, obtaining a verb phrase of each sentence and a noun phrase of each sentence includes: parsing each sentence into a syntax tree; and acquiring a noun phrase of each sentence and a verb phrase of each sentence from the grammar tree of each sentence.
In a second aspect of the embodiments of the present application, an apparatus for generating a multiple document summary is provided, including:
the system comprises a dividing unit, a searching unit and a searching unit, wherein the dividing unit is used for dividing a plurality of documents into n sentences, and n is an integer which is greater than or equal to 1; the first generation unit is used for generating input word bag vectors of sentences for each sentence, and the input word bag vectors of n sentences form an input word bag vector space; the training unit is used for carrying out unsupervised training on each sentence represented by the input bag-of-words vector based on the variational self-coding model to obtain a coding hidden layer vector of each sentence and a potential semantic vector of each sentence, the coding hidden layer vectors of n sentences form a coding hidden layer vector space, and the potential semantic vectors of n sentences form a potential semantic vector space; the acquisition unit is used for acquiring m potential semantic vectors from the potential semantic vector space, wherein m is an integer which is greater than or equal to 1 and less than n; the mapping unit is used for obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors; the updating unit is used for updating the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism; the estimation unit is used for estimating the importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden layer vectors and the updated m output bag-of-words vectors; a first obtaining unit configured to obtain a verb phrase of each sentence and a noun phrase of each sentence; the second acquisition unit is used for acquiring the importance of the noun phrase according to the importance of the sentence in which the noun phrase is positioned and acquiring the importance of the verb phrase according to the importance of the sentence in which the verb phrase is positioned; a third obtaining unit, configured to obtain redundancy of each verb phrase and redundancy of each noun phrase; and the second generating unit is used for generating the abstracts of the plurality of documents according to the importance and the redundancy of all noun phrases and the importance and the redundancy of all verb phrases based on the integer linear programming model.
In a third aspect of the embodiments of the present application, an apparatus for generating a multiple document summary is provided, where the apparatus for generating a multiple document summary may include: at least one processor, a memory, a communication interface, a communication bus;
the at least one processor is connected with the memory and the communication interface through a communication bus, the memory is used for storing computer-executable instructions, and when the apparatus for generating the multi-document summary is running, the processor executes the computer-executable instructions stored in the memory, so that the apparatus for generating the multi-document summary executes the method for generating the multi-document summary according to the first aspect or any one of the possible implementation manners of the first aspect.
In a fourth aspect of the embodiments of the present application, a computer storage medium is provided for storing computer software instructions for the apparatus for generating a multiple document summary, where the computer software instructions include a program designed to execute the method for generating a multiple document summary.
In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of the above aspects.
In addition, the technical effects brought by any one of the design manners of the second aspect to the fifth aspect can be referred to the technical effects brought by different design manners of the first aspect, and are not described herein again.
In the embodiment of the present application, the names of the apparatuses for generating a multi-document summary do not limit the apparatuses themselves; in actual implementation, the apparatuses may appear under other names. As long as the function of each apparatus is similar to that in the embodiments of the present application, it falls within the scope of the claims of the present application and their equivalents.
These and other aspects of the embodiments of the present application will be more readily apparent from the following description of the embodiments.
Drawings
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a multi-document summary according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a multi-document summarization model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus for generating a multiple document summary according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method for generating a multi-document summary, whose basic principle is as follows: firstly, a plurality of documents are divided into n sentences and each sentence is represented by an input bag-of-words vector, the input bag-of-words vectors of the n sentences forming an input bag-of-words vector space; then each sentence represented by its input bag-of-words vector is input into a variational self-coding model for unsupervised training to obtain the coding hidden layer vector and the potential semantic vector of each sentence, the coding hidden layer vectors of the n sentences forming a coding hidden layer vector space and the potential semantic vectors of the n sentences forming a potential semantic vector space; m potential semantic vectors are collected from the potential semantic vector space, m decoding hidden layer vectors and m output bag-of-words vectors are obtained according to the m potential semantic vectors, and the m decoding hidden layer vectors and the m output bag-of-words vectors are updated according to an alignment mechanism; the importance of each sentence is estimated according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the m updated decoding hidden layer vectors and the m updated output bag-of-words vectors; finally, the verb phrases and noun phrases of each sentence are obtained, the importance of each noun phrase is obtained according to the importance of the sentence in which it is located, the importance of each verb phrase is obtained according to the importance of the sentence in which it is located, and the redundancy of each verb phrase and each noun phrase is obtained; the summary of the plurality of documents is then generated according to the importance and redundancy of all noun phrases and all verb phrases based on an integer linear programming model. In this method, each sentence represented by an input bag-of-words vector is subjected to unsupervised training based on the variational self-coding model to generate its potential semantic vector, which improves the measurement effect of the sentence; meanwhile, decoding hidden layer vectors and output bag-of-words vectors are obtained according to the potential semantic vectors, and the potential semantic vector space, coding hidden layer vector space and input bag-of-words vector space are respectively reconstructed from them to estimate the importance of each sentence, that is, multiple semantic spaces are jointly considered to estimate sentence importance and generate the multi-document summary. Therefore, the summary of the multiple documents is generated through the variational self-coding model and the sentence importance estimation model of the joint multi-semantic space, and the performance of the summary is greatly improved.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure, and as shown in fig. 1, the computer device may include at least one processor 11, a memory 12, a communication interface 13, and a communication bus 14.
The following describes the components of the computer device in detail with reference to fig. 1:
the processor 11 is a control center of a computer device, and may be a single processor or a collective term for a plurality of processing elements. For example, the processor 11 is a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement the embodiments of the present Application, such as: one or more microprocessors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs).
The processor 11 may perform various functions of the computer device by running or executing software programs stored in the memory 12 and calling data stored in the memory 12.
In particular implementations, processor 11 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 1, for example, as one embodiment.
In a specific implementation, as one embodiment, the computer device may include multiple processors, such as the processor 11 and the processor 15 shown in FIG. 1. Each of these processors may be a single-core processor or a multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The memory 12 may be a Read-Only Memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 12 may be self-contained and coupled to the processor 11 via the communication bus 14, or may be integrated with the processor 11.
The memory 12 is used for storing software programs for executing the scheme of the application, and is controlled by the processor 11 to execute.
The communication interface 13 is any device such as a transceiver for communicating with other devices or communication networks, such as Ethernet, a Radio Access Network (RAN), or a Wireless Local Area Network (WLAN). The communication interface 13 may include a receiving unit implementing the receiving function and a sending unit implementing the sending function.
The communication bus 14 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 1, but it is not intended that there be only one bus or one type of bus.
The device architecture shown in fig. 1 does not constitute a limitation of computer devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
In a specific implementation, as one embodiment, the computer device shown in FIG. 1 may be an apparatus for generating a multi-document summary.
A processor 11 for dividing a plurality of documents into n sentences, n being an integer greater than or equal to 1;
the processor 11 is further configured to generate an input bag of words vector for each sentence, where the input bag of words vectors for n sentences form an input bag of words vector space;
the processor 11 is further configured to perform unsupervised training on each sentence represented by the input bag-of-words vector based on a variational self-coding model to obtain a coding hidden vector of each sentence and a potential semantic vector of each sentence, where the coding hidden vectors of n sentences form a coding hidden vector space, and the potential semantic vectors of n sentences form a potential semantic vector space;
the processor 11 is further configured to collect m potential semantic vectors from the potential semantic vector space, where m is an integer greater than or equal to 1 and less than n;
the processor 11 is further configured to obtain m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors;
the processor 11 is further configured to update the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism;
the processor 11 is further configured to estimate importance of each sentence according to the input bag-of-words vector space, the coding hidden-layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden-layer vectors, and the updated m output bag-of-words vectors;
the processor 11 is further configured to obtain a verb phrase of each sentence and a noun phrase of each sentence;
the processor 11 is further configured to obtain the importance of the noun phrase according to the importance of the sentence in which the noun phrase is located, and obtain the importance of the verb phrase according to the importance of the sentence in which the verb phrase is located;
the processor 11 is further used for obtaining the redundancy of each verb phrase and the redundancy of each noun phrase;
and the processor 11 is further used for generating the abstracts of the plurality of documents according to the importance and the redundancy of all noun phrases and the importance and the redundancy of all verb phrases based on the integer linear programming model.
And the memory 12 is used for storing an input bag of words vector space, a coding hidden vector space, a potential semantic vector space and the like.
Fig. 2 is a flowchart of a method for generating a multiple document summary according to an embodiment of the present application, where as shown in fig. 2, the method may include:
201. a plurality of documents is divided into n sentences.
Documents are composed of sentences, and a document can generally be divided into sentences according to the punctuation marks in it, for example at periods. n is an integer greater than or equal to 1.
202. An input bag of words vector for the sentence is generated for each sentence.
Each sentence is divided into words; for example, "Chinese patent" can be divided into "Chinese" and "patent". A dictionary of size V is then generated, and each sentence can be represented as a V-dimensional vector, i.e., an input bag-of-words vector. The input bag-of-words vectors of all sentences constitute the input bag-of-words vector space.
The bag-of-words vector is a proper noun in the fields of natural language processing and information retrieval. The text is regarded as a collection of words contained in a bag (hence the name), and the grammar and word order of the text are ignored, each word occurring in the document independently of the others. Assuming that there are N words in the dictionary, a sentence or document may be represented by a vector of length N, and each dimension of the vector may represent the weight of the corresponding word, such as its word frequency.
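As an illustration of steps 201 and 202, the following Python sketch builds the dictionary of size V and represents each sentence as an input bag-of-words vector. It is not part of the patent; the punctuation-based splitter, the whitespace tokenizer and the use of raw word frequency as the weight are assumptions made only for the example.

```python
import re
from collections import Counter

def split_into_sentences(documents):
    # Split each document at sentence-ending punctuation (a simplifying assumption;
    # the patent only says documents can be divided at punctuation such as periods).
    sentences = []
    for doc in documents:
        sentences.extend(s.strip() for s in re.split(r"[.!?]", doc) if s.strip())
    return sentences

def build_bag_of_words(sentences):
    # Build a dictionary of size V over all words, then represent every sentence
    # as a V-dimensional vector whose entries are word frequencies.
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0.0] * len(vocab)
        for w, c in Counter(s.split()).items():
            vec[index[w]] = float(c)
        vectors.append(vec)
    return vocab, vectors  # vectors: the input bag-of-words vector space (n x V)

docs = ["The first document. It has two sentences.", "The second document has one sentence."]
vocab, X = build_bag_of_words(split_into_sentences(docs))
```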
203. And carrying out unsupervised training on each sentence represented by the input bag-of-words vector based on a variational self-coding model to obtain a coding hidden layer vector of each sentence and a potential semantic vector of each sentence.
The coding hidden vector of n sentences forms a coding hidden vector space, and the potential semantic vectors of n sentences form a potential semantic vector space.
Variational self-coding model: a generative model that combines neural network models with the variational inference of traditional probabilistic generative models, mapping input variables to lower-dimensional hidden variables. New samples can be generated by sampling from the continuous random hidden variable distribution. Carrying out unsupervised training on each sentence represented by an input bag-of-words vector based on the variational self-coding model means training the sentences one by one without supervision, each sentence being represented by its input bag-of-words vector.
Specifically, step 1, mapping a sentence x represented by an input bag-of-words vector to a first coding hidden layer according to a neural network model to obtain a coding hidden layer vector of the sentence x, where the sentence x is any one of n sentences, and the mapping process is shown in formula 1:
h_enc = relu(W_xh · x + b_xh)  (1)

where h_enc denotes the coding hidden layer vector, relu is the activation function in the neural network model, and W and b denote the weight and bias variables.
Assuming that the prior and the posterior of the potential semantic vector are both Gaussian distributed, the mean and the variance can be represented as linear transformations of h_enc.
Step 2, mapping the coding hidden layer vector of the sentence x to a second coding hidden layer according to a probability generation model to obtain a mean vector and a variance vector, wherein the mean vector and the variance vector are used for representing a potential semantic vector to be determined of the sentence x, and the mapping process is shown as a formula 2:
μ = W_hμ · h_enc + b_hμ,   log(σ²) = W_hσ · h_enc + b_hσ  (2)
Step 3, obtaining the to-be-determined potential semantic vector z of the sentence x from the mean vector and the variance vector obtained in step 2, as shown in formula 3:

z = μ + σ ⊙ ε  (3)

where ⊙ denotes element-wise multiplication and ε is noise. Formula 1, formula 2 and formula 3 constitute the encoding process for a sentence.
Step 4, mapping the potential semantic vector to be determined of the sentence x to a decoding hidden layer to obtain a decoding hidden layer vector of the sentence x, wherein the mapping process is shown as a formula 4:
h_dec = relu(W_zh · z + b_zh)  (4)

where h_dec denotes the decoding hidden layer vector, relu is the activation function in the probability generation model, and W and b denote the weight and bias variables.
Step 5, mapping the decoding hidden layer vector of the sentence x to an output layer to obtain an output bag-of-words vector of the sentence x, namely regenerating the sentence x to obtain a sentence x', wherein the mapping process is shown as a formula 5:
x' = sigmoid(W_hx · h_dec + b_hx)  (5)
wherein sigmoid is an activation function in the probability generation model, and W and b represent variables.
Equations 4 and 5 are the decoding process for the sentence.
Steps 1 to 5 are repeated, and the value of the objective function of the first optimization problem, namely the variational lower bound in the probability generation model, is obtained from the input bag-of-words vector, the output bag-of-words vector, the mean vector and the variance vector; when the value of the objective function of the first optimization problem reaches an extreme value, the obtained to-be-determined potential semantic vector is taken as the potential semantic vector of the sentence x, and at that point the regenerated sentence x' is most similar to the sentence x. The objective function of the first optimization problem is shown in formula 6:
the first term of formula 6 indicates that the input bag-of-words vector is most similar to the output bag-of-words vector, and the second term of formula 6 indicates the similarity of the mean vector and the variance vector.
The analytical expressions of two terms in the objective function of the first optimization problem can be derived by equations (2) and (3):
and carrying out unsupervised training on each sentence in the n sentences according to formulas 1 to 7 to obtain a potential semantic vector of each sentence.
204. M potential semantic vectors are collected from the potential semantic vector space.
m is an integer of 1 or more and less than n. The m potential semantic vectors are the m points that are most able to describe the potential semantic vector space.
205. And obtaining m decoding hidden layer vectors and m output bag-of-word vectors according to the m potential semantic vectors.
Suppose m potential semantic vectors have been collected from the potential semantic vector space. The m potential semantic vectors are mapped to the decoding hidden layer according to the decoding process of the variational self-coding model to obtain m decoding hidden layer vectors, and the m decoding hidden layer vectors are mapped to the output layer to obtain m output bag-of-words vectors. The mapping process is shown in formula 8:

s_h = relu(W_zh · s_z + b_zh),   s_x = sigmoid(W_hx · s_h + b_hx)  (8)

where s_z denotes a collected potential semantic vector, s_h denotes the corresponding decoding hidden layer vector, and s_x denotes the corresponding output bag-of-words vector.
It should be noted that although the variational self-coding model can map the sentence to a more ideal low-dimensional latent semantic vector space, some low-frequency detail information is lost in the process of sentence regeneration, and the sentence regeneration process in the variational self-coding model is improved in order to complement the information. Step 206 is performed.
206. And updating m decoding hidden layer vectors and m output bag-of-word vectors according to an alignment mechanism.
For each of the m decoding hidden layer vectors, the relation between it and the coding hidden layer vectors of the n sentences is calculated to obtain a first alignment value, as shown in formula 9:
and weighting and summing the first alignment value and the coding hidden layer vectors of the n sentences to obtain a first context vector, as shown in formula 10:
Each of the m decoding hidden layer vectors is then updated according to the first context vector, as shown in formula 11:
similarly, obtaining the relationship between each output bag-of-words vector in the m output bag-of-words vectors and the input bag-of-words vectors of the n sentences to obtain a second alignment value;
weighting and summing the second alignment value and the input word bag vectors of the n sentences to obtain a second context vector;
and updating the m output bag-of-words vectors according to the second context vector.
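A minimal sketch of the alignment update in step 206 is given below. The patent does not spell out the scoring function of formula 9 or the update of formula 11 here, so the dot-product score, the softmax normalization and the additive update are assumptions; only the overall pattern of alignment values, weighted-sum context vector and update follows the description above.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def align_and_update(decoded, encoded):
    """decoded: (m, H) decoding hidden layer vectors; encoded: (n, H) coding hidden
    layer vectors of the n sentences. Returns the m updated vectors."""
    updated = []
    for s_h in decoded:
        scores = softmax(encoded @ s_h)   # first alignment values (assumed dot-product score)
        context = scores @ encoded        # first context vector: weighted sum over the n sentences
        updated.append(s_h + context)     # assumed additive update of the decoding hidden layer vector
    return np.vstack(updated)

# The m output bag-of-words vectors are updated in the same way against the n input
# bag-of-words vectors, giving the second alignment values and second context vectors.
```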
207. And estimating the importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden layer vectors and the updated m output bag-of-words vectors.
The importance estimate of a sentence is an important reference variable for generating a summary of multiple documents.
The m potential semantic vectors, the m updated decoding hidden layer vectors and the m updated output bag-of-words vectors are the points most capable of reconstructing their respective original vector spaces. The potential semantic vector space is reconstructed from the m potential semantic vectors, the coding hidden layer vector space is reconstructed from the m updated decoding hidden layer vectors, and the input bag-of-words vector space is reconstructed from the m updated output bag-of-words vectors; unsupervised training is performed in a reconstruction process that considers all the vector spaces jointly to obtain the value of the objective function of the second optimization problem. When the value of the objective function of the second optimization problem reaches an extreme value, a reconstruction coefficient matrix is obtained; the modulus of the vector corresponding to each sentence in the reconstruction coefficient matrix is then taken and determined as the importance of that sentence. The objective function of the second optimization problem is shown in formula 13:
where λ_z, λ_h and λ_x can all be taken as 1, Z denotes the potential semantic vector space, H denotes the coding hidden layer vector space, X denotes the input bag-of-words vector space, s_z denotes the collected potential semantic vectors, s_h denotes the updated decoding hidden layer vectors, and s_x denotes the updated output bag-of-words vectors. The modulus of each sentence's vector in the finally obtained reconstruction coefficient matrix A can be used to calculate the importance of that sentence.
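The sentence importance estimation of step 207 can be sketched as follows. This is a simplification rather than the patent's exact second optimization problem: it solves a plain joint least-squares reconstruction in closed form, omits any regularization term that formula 13 may contain, and takes λ_z = λ_h = λ_x = 1.

```python
import numpy as np

def sentence_importance(Z, H, X, S_z, S_h, S_x, lam_h=1.0, lam_x=1.0):
    """Z (n, d_z), H (n, d_h), X (n, V): the three vector spaces of the n sentences.
    S_z (m, d_z), S_h (m, d_h), S_x (m, V): the m collected potential semantic vectors,
    updated decoding hidden layer vectors and updated output bag-of-words vectors.
    Finds A (n, m) minimizing
        ||Z - A S_z||^2 + lam_h ||H - A S_h||^2 + lam_x ||X - A S_x||^2
    and returns the norm of each sentence's row of A as that sentence's importance."""
    lhs = S_z @ S_z.T + lam_h * (S_h @ S_h.T) + lam_x * (S_x @ S_x.T)   # (m, m)
    rhs = Z @ S_z.T + lam_h * (H @ S_h.T) + lam_x * (X @ S_x.T)         # (n, m)
    A = rhs @ np.linalg.pinv(lhs)            # reconstruction coefficient matrix
    return np.linalg.norm(A, axis=1)         # importance of each of the n sentences
```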
It should be noted that the variational self-coding model training and the sentence importance estimation training can be integrated, as shown in equation 14:
where the first term is the objective of the variational self-coding model, and λ·L_A is the sentence importance estimation model of the joint multi-semantic space.
Fig. 3 is a schematic diagram of the multi-document summarization model according to an embodiment of the present application. As shown in Fig. 3, the left side is the variational self-coding model and the right side is the sentence importance estimation model of the joint multi-semantic space. The joint multi-semantic space comprehensively considers the potential semantic vector space, the coding hidden layer vector space and the input bag-of-words vector space for data reconstruction. The two parts can be trained simultaneously and without supervision in a multi-task learning manner. x_i is the input bag-of-words vector of sentence x, where sentence x is any one of the n sentences; x'_i is the output bag-of-words vector of sentence x; s_z denotes the potential semantic vectors; s_h denotes the updated decoding hidden layer vectors; s_x denotes the updated output bag-of-words vectors.
208. Obtaining a verb phrase of each sentence and a noun phrase of each sentence.
Each sentence is first parsed into a syntax tree using a syntactic analysis tool, and the noun phrases and verb phrases are obtained from the syntax tree of each sentence.
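A minimal sketch of the phrase extraction in step 208 is shown below, assuming the constituency parse of each sentence has already been produced by an external parser (the patent does not name a specific tool; the bracketed parse format and the use of NLTK for traversal are assumptions).

```python
from nltk import Tree

def extract_phrases(parse_str):
    """parse_str: a bracketed constituency parse of one sentence.
    Returns the noun phrases and verb phrases found in its syntax tree."""
    tree = Tree.fromstring(parse_str)
    nps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "NP")]
    vps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "VP")]
    return nps, vps

parse = "(S (NP (DT The) (NN committee)) (VP (VBD approved) (NP (DT the) (NN proposal))))"
noun_phrases, verb_phrases = extract_phrases(parse)
```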
209. And acquiring the importance of the noun phrase according to the importance of the sentence in which the noun phrase is positioned, and acquiring the importance of the verb phrase according to the importance of the sentence in which the verb phrase is positioned.
The importance is used to measure how important the concept or information represented by a phrase is for expressing the semantics of the documents. The importance of a noun phrase is obtained according to the importance of the sentence in which the noun phrase is located, and the importance of a verb phrase is obtained according to the importance of the sentence in which the verb phrase is located, as shown in formula 15:

In formula 15, one quantity represents the sum of the word frequencies of all words in the noun phrase or verb phrase, another represents the sum of the word frequencies over all documents of the multi-document set, and a_i represents the importance of the sentence in which the phrase is located.
It should be noted that two or more sentences may contain the same noun phrase. In that case, an importance value is obtained for the noun phrase from the importance of each sentence that contains it, these importance values are input into the integer linear programming model, and the integer linear programming model determines the importance of the noun phrase from them, selecting the largest value as the importance of the noun phrase. Similarly, two or more sentences may contain the same verb phrase, so several importance values may be obtained for the verb phrase from the importance of the sentences in which it appears; these values are input into the integer linear programming model, which usually selects the largest of them as the importance of the verb phrase.
210. And acquiring the redundancy of each verb phrase and the redundancy of each noun phrase.
The Jaccard index of the prior art may be used to obtain the similarity between phrases, which serves as the redundancy between phrases. Similarity is used to measure the degree of semantic similarity between phrases. For example, if Y noun phrases and Y verb phrases are obtained, the redundancy between each noun phrase and the other Y-1 noun phrases is obtained; similarly, the redundancy between each verb phrase and the other Y-1 verb phrases is obtained.
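Step 210 can be illustrated with the following sketch of the Jaccard index between two phrases; treating a phrase as the set of its unique words is an assumption made only for the example.

```python
def jaccard(phrase_a, phrase_b):
    """Jaccard index between two phrases, used here as their redundancy."""
    a, b = set(phrase_a.split()), set(phrase_b.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def pairwise_redundancy(phrases):
    # Redundancy of each phrase against the other Y-1 phrases of the same type.
    return {(i, j): jaccard(p, q)
            for i, p in enumerate(phrases)
            for j, q in enumerate(phrases) if i < j}
```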
211. And generating the abstracts of the plurality of documents according to the importance and the redundancy of all noun phrases and the importance and the redundancy of all verb phrases based on an integer linear programming model.
It can be understood that the importance of a phrase is related to the amount of information it carries, and the similarity between phrases is related to redundancy. All phrases, together with their importance and redundancy, are input into the integer linear programming model, and the value of the objective function of the third optimization problem is obtained. The objective function of the third optimization problem means that, while maximizing the score, similar phrases are prevented as far as possible from both entering the summary; the summary of the multiple documents is generated by solving this optimization problem and splicing the selected noun phrases and verb phrases. When the value of the objective function of the third optimization problem reaches an extreme value, the amount of information is the largest and the redundancy is the smallest, so the importance-related parameters need to be rewarded and the similarity-related parameters penalized in the objective function. It should be noted that the integer linear programming model selects the largest of the multiple importance values of the same noun phrase as the importance of that noun phrase, and similarly selects the largest of the multiple importance values of the same verb phrase as the importance of that verb phrase, to generate the summary of the multiple documents. The objective function of the third optimization problem is shown in formula 16:
the noun phrases and the verb phrases are numbered respectively, wherein S is an importance parameter of the phrases and is related to importance. Subscript i is the serial number i of the selected phrase, subscript j indicates the serial number j of the selected phrase, and superscript N indicates thatNoun phrase, superscript V indicates that verb phrase is selected, then Si NAn importance parameter, S, representing a noun phrase with a sequence number ii VImportance parameter, S, representing verb phrase with sequence number ij NAn importance parameter, S, representing noun phrases with sequence number jj VThe importance parameter of the verb phrase with sequence number j is represented. R represents the redundancy parameter of the phrase, is related to the similarity, and can acquire the redundancy of the phrase by the prior art. Since similarity is the relationship between phrases, the subscript of R is the sequence number of two noun phrases or two verb phrases, representing the redundancy before these two selected phrases, Rij NRepresenting the redundancy between noun phrases with sequence number i and sequence number j, Rij Vthe first term and the third term of the objective function of the third optimization problem reward the importance parameter of the phrase, the sum of the importance weights of the phrases is added to obtain the importance sum part of the objective function, the second term and the fourth term of the objective function of the third optimization problem punish the redundancy parameter of the phrase, and the sum of the redundancy parameter weights of the phrases is subtractedicandidate weights, β, representing noun phrases with sequence number iicandidate weight, α, representing verb phrase with sequence number iijdenoting the link weight, β, between noun phrases numbered i and jijIndicating the link weight between verb phrases with sequence numbers i and j. It should be understood that the objective function of the third optimization problem is only an example of the objective function, and there may be other objective functions in various forms, and candidate weights or link weights of each phrase may be obtained, as long as importance is rewarded in the objective function and redundancy is penalized, and the specific form of the objective function is not limited herein.
In the method for generating a multi-document summary provided by the embodiment of the application, each sentence represented by an input bag-of-words vector is subjected to unsupervised training based on the variational self-coding model to generate the potential semantic vector of the sentence, which improves the measurement effect of the sentence; meanwhile, decoding hidden layer vectors and output bag-of-words vectors are obtained according to the potential semantic vectors, and the potential semantic vector space, the coding hidden layer vector space and the input bag-of-words vector space are respectively reconstructed from the potential semantic vectors, the decoding hidden layer vectors and the output bag-of-words vectors to estimate the importance of each sentence; that is, multiple semantic spaces are jointly considered to estimate sentence importance and generate the summary of the multiple documents. Therefore, the summary of the multiple documents is generated through the variational self-coding model and the sentence importance estimation model of the joint multi-semantic space, and the performance of the summary is greatly improved.
Multi-document summarization has standard English evaluation datasets. The method for generating a multi-document summary was first verified on DUC 2006, DUC 2007 and TAC 2011. DUC 2006 and DUC 2007 have 50 and 45 topics respectively, each topic containing 20 news articles and 4 manually written reference summaries, with the summary length limited to 250 words. TAC 2011 has 44 topics, each topic containing 10 news articles and 4 manually written reference summaries, with the summary length limited to 100 words. The evaluation index is the F-measure of ROUGE.
Table 1. Summary results on DUC 2006

Table 2. Summary results on DUC 2007

Table 3. Summary results on TAC 2011
Tables 1, 2 and 3 show the results of the summaries generated by the method for generating a multi-document summary, compared with the best other unsupervised multi-document summarization models. The results show that the method obtains the best scores on all indexes and improves the effect of multi-document summarization; the results obtained are the best among all unsupervised models so far.
Those skilled in the art will readily appreciate that the various illustrative algorithm steps described in connection with the embodiments disclosed herein can be implemented by hardware or by a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the device for generating a multi-document summary may perform division of the functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
In the case of dividing each function module by corresponding functions, fig. 4 shows a possible composition diagram of the apparatus for generating a multiple document summary as described above and referred to in the embodiments, and as shown in fig. 4, the apparatus 30 for generating a multiple document summary may include the following detailed units:
a dividing unit 301 configured to divide a plurality of documents into n sentences, where n is an integer greater than or equal to 1;
a first generating unit 302, configured to generate an input bag of words vector for each sentence, where the input bag of words vectors for n sentences form an input bag of words vector space;
a training unit 303, configured to perform unsupervised training on each sentence represented by the input bag-of-words vector based on a variational self-coding model to obtain a coding hidden vector of each sentence and a potential semantic vector of each sentence, where the coding hidden vectors of n sentences form a coding hidden vector space, and the potential semantic vectors of n sentences form a potential semantic vector space;
an acquisition unit 304, configured to acquire m potential semantic vectors from a potential semantic vector space, where m is an integer greater than or equal to 1 and smaller than n;
a mapping unit 305, configured to obtain m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors;
an updating unit 306, configured to update the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism;
an estimating unit 307, configured to estimate the importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden layer vectors, and the updated m output bag-of-words vectors;
a first obtaining unit 308 for obtaining a verb phrase of each sentence and a noun phrase of each sentence;
a second obtaining unit 309, configured to obtain the importance of the noun phrase according to the importance of the sentence in which the noun phrase is located, and obtain the importance of the verb phrase according to the importance of the sentence in which the verb phrase is located;
a third obtaining unit 310, configured to obtain redundancy of each verb phrase and redundancy of each noun phrase;
and a second generating unit 311, configured to generate the abstracts of the multiple documents according to the importance and the redundancy of all noun phrases and the importance and the redundancy of all verb phrases based on the integer linear programming model.
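A schematic sketch of how the above units could be composed into one processing pipeline. The class and method names below are illustrative placeholders only, not the claimed implementation; several of the individual steps are sketched after the corresponding claims below:

```python
class MultiDocSummaryApparatus:
    """Structural outline of apparatus 30; each method stands in for one unit in fig. 4."""

    def summarize(self, documents, m):
        sentences = self.divide(documents)                          # dividing unit 301
        input_bows = [self.to_bow(s) for s in sentences]            # first generating unit 302
        enc_hidden, latent = self.train_vae(input_bows)             # training unit 303
        sampled = self.sample(latent, m)                            # acquisition unit 304
        dec_hidden, out_bows = self.map_to_output(sampled)          # mapping unit 305
        dec_hidden, out_bows = self.align(dec_hidden, out_bows,
                                          enc_hidden, input_bows)   # updating unit 306
        importance = self.estimate_importance(
            input_bows, enc_hidden, latent,
            sampled, dec_hidden, out_bows)                          # estimating unit 307
        noun_phrases, verb_phrases = self.extract_phrases(sentences)        # first obtaining unit 308
        scores = self.score_phrases(noun_phrases, verb_phrases, importance) # second obtaining unit 309
        redundancy = self.phrase_redundancy(noun_phrases, verb_phrases)     # third obtaining unit 310
        return self.ilp_select(scores, redundancy)                          # second generating unit 311
```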
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
The device for generating a multi-document summary provided in the embodiment of the present application is configured to execute the above method for generating a multi-document summary, and therefore can achieve the same effects as the method.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for enabling a device (which may be a single-chip microcomputer, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (14)
1. A method of generating a multi-document summary, comprising:
dividing a plurality of documents into n sentences, wherein n is an integer greater than or equal to 1;
generating an input bag-of-words vector for each of the sentences, wherein the input bag-of-words vectors of the n sentences form an input bag-of-words vector space;
carrying out unsupervised training on each sentence represented by the input bag-of-words vector based on a variational self-coding model to obtain a coding hidden layer vector of each sentence and a potential semantic vector of each sentence, wherein the coding hidden layer vectors of n sentences form a coding hidden layer vector space, and the potential semantic vectors of n sentences form a potential semantic vector space;
collecting m potential semantic vectors from the potential semantic vector space, wherein m is an integer which is greater than or equal to 1 and less than n;
obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors;
updating the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism;
estimating the importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden layer vectors and the updated m output bag-of-words vectors;
obtaining a verb phrase of each sentence and a noun phrase of each sentence;
acquiring the importance of the noun phrase according to the importance of the sentence in which the noun phrase is positioned, and acquiring the importance of the verb phrase according to the importance of the sentence in which the verb phrase is positioned;
obtaining the redundancy of each verb phrase and the redundancy of each noun phrase;
and generating the abstracts of the plurality of documents according to the importance and the redundancy of all the noun phrases and the importance and the redundancy of all the verb phrases based on an integer linear programming model.
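The final step of claim 1 formulates phrase selection as an integer linear program. A minimal sketch of such a selection using the PuLP solver, with illustrative importance, redundancy and length values and a simple word budget; the concrete objective and constraints of the embodiment may differ:

```python
import pulp

# Illustrative phrases with (importance, redundancy, length-in-words) values.
phrases = {
    "np1": (0.9, 0.1, 5), "np2": (0.6, 0.4, 4),
    "vp1": (0.8, 0.2, 6), "vp2": (0.5, 0.5, 3),
}
budget = 10  # summary length limit in words

prob = pulp.LpProblem("summary_selection", pulp.LpMaximize)
x = {p: pulp.LpVariable(f"x_{p}", cat="Binary") for p in phrases}

# Objective: reward importance and penalize redundancy of selected phrases.
prob += pulp.lpSum(x[p] * (imp - red) for p, (imp, red, _) in phrases.items())
# Constraint: total length of the selected phrases stays within the word budget.
prob += pulp.lpSum(x[p] * length for p, (_, _, length) in phrases.items()) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([p for p in phrases if x[p].value() == 1])
```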
2. The method of claim 1, wherein the unsupervised training of each sentence represented by the input bag of words vector based on the variational self-coding model to obtain a coding hidden layer vector of each sentence and a latent semantic vector of each sentence comprises:
step 1, mapping a sentence x represented by the input bag-of-words vector to a first coding hidden layer to obtain a coding hidden layer vector of the sentence x, wherein the sentence x is any one of the n sentences;
step 2, mapping the coding hidden layer vector of the sentence x to a second coding hidden layer to obtain a mean vector and a variance vector, wherein the mean vector and the variance vector are used for representing a potential semantic vector to be determined of the sentence x;
step 3, obtaining a potential semantic vector to be determined of the sentence x according to the mean vector and the variance vector;
step 4, mapping the potential semantic vector to be determined of the sentence x to a decoding hidden layer to obtain a decoding hidden layer vector of the sentence x;
step 5, mapping the decoding hidden layer vector of the sentence x to an output layer to obtain an output bag-of-words vector of the sentence x;
repeating the steps 1 to 5, and obtaining the value of the objective function of the first optimization problem according to the input bag-of-words vector, the output bag-of-words vector, the mean vector and the variance vector;
and when the value of the objective function of the first optimization problem is an extreme value, determining the potential semantic vector to be determined as the potential semantic vector of the sentence x.
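A minimal sketch of the variational self-coding steps of claim 2 in PyTorch; the layer sizes, activation functions and the exact form of the objective function of the first optimization problem are assumptions rather than the embodiment's concrete choices:

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    def __init__(self, vocab_size, hidden_dim=200, latent_dim=100):
        super().__init__()
        self.enc = nn.Linear(vocab_size, hidden_dim)     # step 1: first coding hidden layer
        self.mu = nn.Linear(hidden_dim, latent_dim)      # step 2: mean vector
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # step 2: (log-)variance vector
        self.dec = nn.Linear(latent_dim, hidden_dim)     # step 4: decoding hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)     # step 5: output layer

    def forward(self, bow):
        h_enc = torch.tanh(self.enc(bow))                         # coding hidden layer vector
        mu, logvar = self.mu(h_enc), self.logvar(h_enc)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # step 3: latent vector to be determined
        h_dec = torch.tanh(self.dec(z))                           # decoding hidden layer vector
        out_bow = torch.softmax(self.out(h_dec), dim=-1)          # output bag-of-words vector
        return out_bow, mu, logvar

def vae_loss(bow, out_bow, mu, logvar):
    # Objective of the first optimization problem: reconstruction term plus KL term.
    recon = -(bow * torch.log(out_bow + 1e-8)).sum(dim=-1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon + kl
```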
3. The method according to claim 1 or 2, wherein said obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors comprises:
mapping the m potential semantic vectors to decoding hidden layers to obtain m decoding hidden layer vectors;
and mapping the m decoding hidden layer vectors to an output layer to obtain the m output bag-of-word vectors.
4. The method according to any of claims 1-3, wherein said updating said m decoding hidden layer vectors according to an alignment mechanism comprises:
obtaining the relationship between each decoding hidden layer vector in the m decoding hidden layer vectors and the encoding hidden layer vectors of the n sentences to obtain a first alignment value;
weighting and summing the first alignment value and the coding hidden layer vectors of the n sentences to obtain a first context vector;
updating the m decoded hidden layer vectors according to the first context vector.
5. The method according to any of claims 1-4, wherein said updating the m output bag-of-words vectors according to an alignment mechanism comprises:
obtaining a relationship between each output bag-of-words vector in the m output bag-of-words vectors and the input bag-of-words vectors of the n sentences to obtain a second alignment value;
weighting and summing the second alignment value and the input bag-of-words vectors of the n sentences to obtain a second context vector;
updating the m output bag-of-words vectors according to the second context vector.
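A minimal sketch of the alignment mechanism of claims 4 and 5, assuming dot-product alignment scores and an additive update with NumPy; the same routine is applied once to the m decoding hidden layer vectors against the n coding hidden layer vectors, and once to the m output bag-of-words vectors against the n input bag-of-words vectors:

```python
import numpy as np

def align_and_update(queries, keys):
    """queries: (m, d) vectors to update; keys: (n, d) reference vectors of the n sentences."""
    scores = queries @ keys.T                                     # relationship between each pair
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                 # first/second alignment values
    context = weights @ keys                                      # weighted sum -> context vectors
    return np.tanh(queries + context)                             # update with the context vector

# Toy shapes: n = 6 sentences, m = 3 sampled vectors, dimension d = 4.
rng = np.random.default_rng(0)
enc_hidden, dec_hidden = rng.normal(size=(6, 4)), rng.normal(size=(3, 4))
input_bows, out_bows = rng.random(size=(6, 4)), rng.random(size=(3, 4))

dec_hidden = align_and_update(dec_hidden, enc_hidden)   # claim 4
out_bows = align_and_update(out_bows, input_bows)       # claim 5
```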
6. The method according to any of claims 1-5, wherein said estimating the importance of each of said sentences according to said input bag-of-words vector space, said coding hidden layer vector space, said potential semantic vector space, said m potential semantic vectors, said updated m decoding hidden layer vectors and said updated m output bag-of-words vectors comprises:
reconstructing the potential semantic vector space according to the m potential semantic vectors, reconstructing the coding hidden layer vector space according to the updated m decoding hidden layer vectors, and reconstructing the input bag-of-words vector space according to the updated m output bag-of-words vectors, thereby obtaining a value of an objective function of a second optimization problem;
when the value of the objective function of the second optimization problem is an extreme value, obtaining a reconstruction coefficient matrix;
and taking a module of the vector corresponding to each sentence in the reconstruction coefficient matrix, and determining the module of the vector corresponding to the sentence as the importance of the sentence.
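A minimal sketch of the reconstruction step of claim 6, assuming the three vector spaces are concatenated per sentence and the second optimization problem is approximated by an unregularized least-squares fit (the embodiment may add sparsity or other regularization terms):

```python
import numpy as np

def sentence_importance(input_bows, enc_hidden, latent,        # the three n-sentence spaces
                        out_bows, dec_hidden, sampled_latent): # the updated m-vector sets
    # Concatenate the three representations of each sentence and of each sampled vector.
    S = np.hstack([input_bows, enc_hidden, latent])         # shape (n, V + H + D)
    M = np.hstack([out_bows, dec_hidden, sampled_latent])   # shape (m, V + H + D)
    # Reconstruction coefficient matrix A of shape (n, m): minimize ||S - A @ M||_F^2.
    A_T, *_ = np.linalg.lstsq(M.T, S.T, rcond=None)
    A = A_T.T
    # Importance of each sentence = norm of its corresponding row in the coefficient matrix.
    return np.linalg.norm(A, axis=1)
```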
7. The method of any one of claims 1-6, wherein said obtaining a verb phrase for each of said sentences and a noun phrase for each of said sentences comprises:
parsing each sentence into a syntax tree;
and acquiring a noun phrase of each sentence and a verb phrase of each sentence from the syntax tree of each sentence.
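A minimal sketch of the phrase extraction of claim 7, assuming a constituency parse is already available in NLTK bracket notation (the embodiment does not prescribe a particular parser):

```python
from nltk import Tree

def extract_phrases(parse_str):
    """Return the noun phrases and verb phrases found in one sentence's syntax tree."""
    tree = Tree.fromstring(parse_str)
    nps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "NP")]
    vps = [" ".join(t.leaves()) for t in tree.subtrees(lambda t: t.label() == "VP")]
    return nps, vps

# Example parse of "The committee approved the new budget".
parse = "(S (NP (DT The) (NN committee)) (VP (VBD approved) (NP (DT the) (JJ new) (NN budget))))"
print(extract_phrases(parse))
```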
8. An apparatus for generating a multi-document summary, comprising:
a dividing unit, configured to divide a plurality of documents into n sentences, where n is an integer greater than or equal to 1;
a first generating unit, configured to generate an input bag-of-words vector for each sentence, where the input bag-of-words vectors for n sentences form an input bag-of-words vector space;
the training unit is used for carrying out unsupervised training on each sentence represented by the input bag-of-words vector based on a variational self-coding model to obtain a coding hidden layer vector of each sentence and a potential semantic vector of each sentence, the coding hidden layer vectors of n sentences form a coding hidden layer vector space, and the potential semantic vectors of n sentences form a potential semantic vector space;
the acquisition unit is used for acquiring m potential semantic vectors from the potential semantic vector space, wherein m is an integer which is greater than or equal to 1 and less than n;
the mapping unit is used for obtaining m decoding hidden layer vectors and m output bag-of-words vectors according to the m potential semantic vectors;
an updating unit, configured to update the m decoding hidden layer vectors and the m output bag-of-words vectors according to an alignment mechanism;
an estimation unit, configured to estimate importance of each sentence according to the input bag-of-words vector space, the coding hidden layer vector space, the potential semantic vector space, the m potential semantic vectors, the updated m decoding hidden layer vectors, and the updated m output bag-of-words vectors;
a first obtaining unit configured to obtain a verb phrase of each of the sentences and a noun phrase of each of the sentences;
the second obtaining unit is used for obtaining the importance of the noun phrase according to the importance of the sentence in which the noun phrase is positioned, and obtaining the importance of the verb phrase according to the importance of the sentence in which the verb phrase is positioned;
a third obtaining unit, configured to obtain redundancy of each verb phrase and redundancy of each noun phrase;
and a second generating unit, configured to generate the abstracts of the plurality of documents according to the importance and the redundancy of all the noun phrases and the importance and the redundancy of all the verb phrases based on an integer linear programming model.
9. The apparatus according to claim 8, wherein the training unit is specifically configured to:
step 1, mapping a sentence x represented by the input bag-of-words vector to a first coding hidden layer to obtain a coding hidden layer vector of the sentence x, wherein the sentence x is any one of the n sentences;
step 2, mapping the coding hidden layer vector of the sentence x to a second coding hidden layer to obtain a mean vector and a variance vector, wherein the mean vector and the variance vector are used for representing a potential semantic vector to be determined of the sentence x;
step 3, obtaining a potential semantic vector to be determined of the sentence x according to the mean vector and the variance vector;
step 4, mapping the potential semantic vector to be determined of the sentence x to a decoding hidden layer to obtain a decoding hidden layer vector of the sentence x;
step 5, mapping the decoding hidden layer vector of the sentence x to an output layer to obtain an output bag-of-words vector of the sentence x;
repeating the steps 1 to 5, and obtaining the value of the objective function of the first optimization problem according to the input bag-of-words vector, the output bag-of-words vector, the mean vector and the variance vector;
and when the value of the objective function of the first optimization problem is an extreme value, determining the potential semantic vector to be determined as the potential semantic vector of the sentence x.
10. The apparatus according to claim 8 or 9, wherein the mapping unit is specifically configured to:
mapping the m potential semantic vectors to decoding hidden layers to obtain m decoding hidden layer vectors;
and mapping the m decoding hidden layer vectors to an output layer to obtain the m output bag-of-word vectors.
11. The apparatus according to any of claims 8 to 10, wherein the updating unit is specifically configured to:
obtaining the relationship between each decoding hidden layer vector in the m decoding hidden layer vectors and the encoding hidden layer vectors of the n sentences to obtain a first alignment value;
weighting and summing the first alignment value and the coding hidden layer vectors of the n sentences to obtain a first context vector;
updating the m decoded hidden layer vectors according to the first context vector.
12. The apparatus according to any of claims 8 to 11, wherein the updating unit is specifically configured to:
obtaining a relationship between each output bag-of-words vector in the m output bag-of-words vectors and the input bag-of-words vectors of the n sentences to obtain a second alignment value;
weighting and summing the second alignment value and the input bag-of-words vectors of the n sentences to obtain a second context vector;
updating the m output bag-of-words vectors according to the second context vector.
13. The apparatus according to any of claims 8-12, wherein the estimation unit is specifically configured to:
reconstructing the potential semantic vector space according to the m potential semantic vectors, reconstructing the coding hidden layer vector space according to the updated m decoding hidden layer vectors, and reconstructing the input bag-of-words vector space according to the updated m output bag-of-words vectors, thereby obtaining a value of an objective function of a second optimization problem;
when the value of the objective function of the second optimization problem is an extreme value, obtaining a reconstruction coefficient matrix;
and taking a module of the vector corresponding to each sentence in the reconstruction coefficient matrix, and determining the module of the vector corresponding to the sentence as the importance of the sentence.
14. The apparatus according to any one of claims 8 to 13, wherein the first obtaining unit is specifically configured to:
parsing each sentence into a syntax tree;
and acquiring a noun phrase of each sentence and a verb phrase of each sentence from the syntax tree of each sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710245997.0A CN108733682B (en) | 2017-04-14 | 2017-04-14 | Method and device for generating multi-document abstract |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710245997.0A CN108733682B (en) | 2017-04-14 | 2017-04-14 | Method and device for generating multi-document abstract |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108733682A true CN108733682A (en) | 2018-11-02 |
CN108733682B CN108733682B (en) | 2021-06-22 |
Family
ID=63924733
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710245997.0A Active CN108733682B (en) | 2017-04-14 | 2017-04-14 | Method and device for generating multi-document abstract |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108733682B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284367A (en) * | 2018-11-30 | 2019-01-29 | 北京字节跳动网络技术有限公司 | Method and apparatus for handling text |
CN110032638A (en) * | 2019-04-19 | 2019-07-19 | 中山大学 | A kind of production abstract extraction method based on coder-decoder |
CN110033096A (en) * | 2019-03-07 | 2019-07-19 | 北京大学 | A kind of status data generation method and system for intensified learning |
CN110097094A (en) * | 2019-04-15 | 2019-08-06 | 天津大学 | It is a kind of towards personage interaction multiple semantic fusion lack sample classification method |
WO2020140632A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Hidden feature extraction method, apparatus, computer device and storage medium |
CN111506725A (en) * | 2020-04-17 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for generating abstract |
CN111723550A (en) * | 2020-06-17 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Statement rewriting method, device, electronic device, and computer storage medium |
CN111767697A (en) * | 2020-07-24 | 2020-10-13 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
CN112541343A (en) * | 2020-12-03 | 2021-03-23 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
CN113761142A (en) * | 2020-09-25 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method and device for generating answer abstract |
US11748567B2 (en) * | 2020-07-10 | 2023-09-05 | Baidu Usa Llc | Total correlation variational autoencoder strengthened with attentions for segmenting syntax and semantics |
US12039270B2 | 2020-08-05 | 2024-07-16 | Baidu USA LLC | Disentangle syntax and semantics in sentence representation with decomposable variational autoencoder
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
KR20130076684A (en) * | 2011-12-28 | 2013-07-08 | 한양대학교 에리카산학협력단 | Multi-document summarization method and system using semmantic analysis between tegs |
JP2015088064A (en) * | 2013-10-31 | 2015-05-07 | 日本電信電話株式会社 | Text summarization device, text summarization method, and program |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
CN105243053A (en) * | 2015-09-15 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting key sentence of document |
CN105488021A (en) * | 2014-09-15 | 2016-04-13 | 华为技术有限公司 | Method and device generating multi-file summary |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398814A (en) * | 2007-09-26 | 2009-04-01 | 北京大学 | Method and system for simultaneously abstracting document summarization and key words |
KR20130076684A (en) * | 2011-12-28 | 2013-07-08 | 한양대학교 에리카산학협력단 | Multi-document summarization method and system using semmantic analysis between tegs |
JP2015088064A (en) * | 2013-10-31 | 2015-05-07 | 日本電信電話株式会社 | Text summarization device, text summarization method, and program |
CN105488021A (en) * | 2014-09-15 | 2016-04-13 | 华为技术有限公司 | Method and device generating multi-file summary |
CN104834735A (en) * | 2015-05-18 | 2015-08-12 | 大连理工大学 | Automatic document summarization extraction method based on term vectors |
CN105243053A (en) * | 2015-09-15 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting key sentence of document |
Non-Patent Citations (2)
Title |
---|
SU YAN et al.: "SRRank: Leveraging semantic roles for extractive multi-document summarization", IEEE/ACM Transactions on Audio, Speech and Language Processing *
WANG Meng et al.: "Research on a Multi-Document Summarization Method Using Text Segmentation", Computer Applications and Software *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284367A (en) * | 2018-11-30 | 2019-01-29 | 北京字节跳动网络技术有限公司 | Method and apparatus for handling text |
CN109284367B (en) * | 2018-11-30 | 2021-05-18 | 北京字节跳动网络技术有限公司 | Method and device for processing text |
WO2020140632A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Hidden feature extraction method, apparatus, computer device and storage medium |
CN110033096B (en) * | 2019-03-07 | 2021-04-02 | 北京大学 | State data generation method and system for reinforcement learning |
CN110033096A (en) * | 2019-03-07 | 2019-07-19 | 北京大学 | A kind of status data generation method and system for intensified learning |
CN110097094A (en) * | 2019-04-15 | 2019-08-06 | 天津大学 | It is a kind of towards personage interaction multiple semantic fusion lack sample classification method |
CN110097094B (en) * | 2019-04-15 | 2023-06-13 | 天津大学 | Multiple semantic fusion few-sample classification method for character interaction |
CN110032638A (en) * | 2019-04-19 | 2019-07-19 | 中山大学 | A kind of production abstract extraction method based on coder-decoder |
CN110032638B (en) * | 2019-04-19 | 2021-04-13 | 中山大学 | Encoder-decoder-based generative abstract extraction method |
CN111506725B (en) * | 2020-04-17 | 2021-06-22 | 北京百度网讯科技有限公司 | Method and device for generating abstract |
CN111506725A (en) * | 2020-04-17 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for generating abstract |
CN111723550A (en) * | 2020-06-17 | 2020-09-29 | 腾讯科技(深圳)有限公司 | Statement rewriting method, device, electronic device, and computer storage medium |
US11748567B2 (en) * | 2020-07-10 | 2023-09-05 | Baidu Usa Llc | Total correlation variational autoencoder strengthened with attentions for segmenting syntax and semantics |
CN111767697A (en) * | 2020-07-24 | 2020-10-13 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
US12039270B2 | 2020-08-05 | 2024-07-16 | Baidu USA LLC | Disentangle syntax and semantics in sentence representation with decomposable variational autoencoder
CN113761142A (en) * | 2020-09-25 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method and device for generating answer abstract |
CN112541343A (en) * | 2020-12-03 | 2021-03-23 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
CN112541343B (en) * | 2020-12-03 | 2022-06-14 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
Also Published As
Publication number | Publication date |
---|---|
CN108733682B (en) | 2021-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108733682B (en) | Method and device for generating multi-document abstract | |
CN108959312B (en) | Method, device and terminal for generating multi-document abstract | |
CN109344236B (en) | Problem similarity calculation method based on multiple characteristics | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
US20110078205A1 (en) | Method and system for finding appropriate semantic web ontology terms from words | |
CN108549634A (en) | A kind of Chinese patent text similarity calculating method | |
Mills et al. | Graph-based methods for natural language processing and understanding—A survey and analysis | |
Yao et al. | A novel sentence similarity model with word embedding based on convolutional neural network | |
KR101717230B1 (en) | Document summarization method using recursive autoencoder based sentence vector modeling and document summarization system | |
Jain et al. | Fine-tuning textrank for legal document summarization: A Bayesian optimization based approach | |
Hao et al. | BertNet: Harvesting knowledge graphs with arbitrary relations from pretrained language models | |
JP2014056331A (en) | Document classification method, document classification program and document classification device | |
CN117494815A (en) | File-oriented credible large language model training and reasoning method and device | |
CN115114937A (en) | Text acquisition method and device, computer equipment and storage medium | |
CN118069815A (en) | Large language model feedback information generation method and device, electronic equipment and medium | |
Bender et al. | Unsupervised estimation of subjective content descriptions | |
CN111401070B (en) | Word meaning similarity determining method and device, electronic equipment and storage medium | |
Purpura et al. | Probabilistic word embeddings in neural ir: A promising model that does not work as expected (for now) | |
Vishnubhotla et al. | An evaluation of disentangled representation learning for texts | |
CN117151089A (en) | New word discovery method, device, equipment and medium | |
Madeja et al. | Unit under test identification using natural language processing techniques | |
CN118094019B (en) | Text associated content recommendation method and device and electronic equipment | |
JP4325938B2 (en) | Word arrangement device, word arrangement method, and program | |
Sakiyama et al. | Exploring text decoding methods for portuguese legal text generation | |
US12019659B2 (en) | Method and system to optimize a plurality of topics by evaluation metrics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |