CN113282745B - Automatic generation method and device for event encyclopedia document - Google Patents


Info

Publication number
CN113282745B
Authority
CN
China
Prior art keywords
event
topic
determining
document
topics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010104947.2A
Other languages
Chinese (zh)
Other versions
CN113282745A (en
Inventor
侯磊
祝方韦
史佳欣
李涓子
张鹏
唐杰
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202010104947.2A priority Critical patent/CN113282745B/en
Publication of CN113282745A publication Critical patent/CN113282745A/en
Application granted granted Critical
Publication of CN113282745B publication Critical patent/CN113282745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the invention provides a method and a device for automatically generating an event encyclopedia document. The method comprises: generating a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics. The method can automatically generate, for a new event, an encyclopedia document comprising a plurality of topics, so that the table of contents of the generated document is more complete and the key information on different aspects of the event is described at a finer granularity.

Description

Automatic generation method and device for event encyclopedia document
Technical Field
The invention relates to the field of encyclopedia generation, in particular to an event encyclopedia document automatic generation method and device.
Background
Encyclopedia documents are mostly written by people. Because the writing habits of different authors differ, the directory structures of the resulting documents also differ. Directly reusing the directory structure of one particular document cannot guarantee that the structure is general or complete, which undermines the soundness of the finally generated document.
In the prior art, encyclopedia documents are mostly written manually. To improve the editing efficiency and timeliness of encyclopedia documents, some researchers have developed encyclopedia document generation methods, but these methods can only generate the summary part of an encyclopedia document and cannot generate a complete document.
Disclosure of Invention
Embodiments of the present invention provide an event encyclopedia document automatic generation method, apparatus, electronic device and readable storage medium that overcome the above-mentioned problems or at least partially solve the above-mentioned problems.
In a first aspect, an embodiment of the present invention provides a method for automatically generating an event encyclopedia document, including: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
In some embodiments, generating the topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed comprises: acquiring existing encyclopedia documents of the same event category as the event to be processed; obtaining a title set from the encyclopedia documents, wherein the title set comprises a plurality of titles; determining a topic set based on the plurality of titles, the topic set including the plurality of topics; and determining the topic tree based on the topic set.
In some embodiments, determining the topic set based on the plurality of titles comprises: obtaining a plurality of clusters by clustering the plurality of titles; determining the plurality of topics based on the plurality of clusters, wherein the topics are in one-to-one correspondence with the clusters; and determining the topic set according to the plurality of topics.
In some embodiments, determining the topic tree based on the topic set comprises: determining parent-child relationship probabilities among the plurality of topics in the topic set; and determining the topic tree based on the parent-child relationship probabilities.
In some embodiments, determining the parent-child relationship probabilities among the plurality of topics in the topic set comprises: determining structural information between topic $t_i$ and topic $t_j$ based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, wherein the structural information characterizes the degree to which the directory structure of the existing encyclopedia documents supports topic $t_i$ being the parent topic of topic $t_j$; determining a text association feature between topic $t_i$ and topic $t_j$ based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, wherein the text association feature characterizes the degree to which the text distributions of topic $t_i$ and topic $t_j$ support topic $t_i$ being the parent topic of topic $t_j$; and determining the parent-child relationship probabilities among the plurality of topics based on the structural information and the text association features.
In some embodiments, determining the target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree includes: determining the contribution of each word in each text segment e to the corresponding topic, and determining the target text information of the corresponding topic; wherein the contribution of a word to the corresponding topic is determined by the formula:
Figure BDA0002388230420000031
where $W(w, t)$ is the contribution of the word w to the topic t, T is the set of topics, tt is any topic in the topic set T, $p_{w,t}$ is the probability of the word w occurring under the topic t, and $p_{w,tt}$ is the probability of the word w occurring under the topic tt.
In some embodiments, the determining, according to the target text information corresponding to a plurality of the topics, abstracts corresponding to a plurality of the topics, respectively, includes: dividing the target text information according to sentences to obtain a sentence set; vectorizing the sentence set to obtain a matrix representation of sentences; determining core sentences in the matrix representation of the sentences, and composing the core sentences into the abstract.
In a second aspect, an embodiment of the present invention provides a device for automatically generating an event encyclopedia document, including: a topic tree generating unit, configured to generate a topic tree of the event category based on encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; a document acquisition unit, configured to acquire a relevant document set of the event to be processed; a text screening unit, configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; a summary generating unit, configured to determine summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and a combination unit, configured to generate the encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
The method, the device, the electronic equipment and the readable storage medium for automatically generating the event encyclopedia document can automatically generate the encyclopedia document comprising a plurality of topics for a new event, so that the generated encyclopedia document directory is more complete, and the key information of different aspects of the event is more finely described.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for automatically generating an event encyclopedia document according to an embodiment of the invention;
FIG. 2 is a flow chart of a method for automatically generating an event encyclopedia document according to an embodiment of the invention;
FIG. 3 is a schematic structural diagram of an event encyclopedia document automatic generation apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
An event encyclopedia document automatic generation method according to an embodiment of the present invention is described below with reference to fig. 1 to 2.
As shown in fig. 1 and fig. 2, the method for automatically generating an event encyclopedia document according to the embodiment of the present invention includes steps S100 to S500.
Step S100: generate a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, where the topic tree includes a plurality of topics.
It should be noted that the event to be processed is a new event for which an encyclopedia document needs to be generated. In actual execution, the category of the event to be processed may be determined according to the encyclopedia document category list, and this step may be performed manually. A plurality of encyclopedia documents already exist under the corresponding category, and a topic tree common to the event category is generated from these existing documents. A topic tree may be understood as a topic template, and each topic tree includes a plurality of topics.
Step S200: acquire a relevant document set of the event to be processed.
The relevant document set is a set of documents related to the event to be processed; its sources include internet search and other literature.
Step S300: determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree.
This step screens the collected relevant document set and finds the target text information corresponding to each topic.
Step S400: determine summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics.
After the target text information corresponding to each topic is found in step S300, the summary of each topic can be obtained from the target text information corresponding to that topic.
Step S500: generate an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
The summary corresponding to each topic is filled in under that topic to generate the encyclopedia document of the event to be processed.
The method for automatically generating an event encyclopedia document provided by the embodiment of the invention can automatically generate, for a new event, an encyclopedia document comprising a plurality of topics, so that the table of contents of the generated document is more complete and the key information on different aspects of the event is described at a finer granularity.
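Step S500 can be pictured as a walk over the topic tree that writes each topic as a numbered heading followed by its summary. The following is a minimal sketch in Python; the function name `render_document`, the dictionary layout of the tree, and the numbering scheme are assumptions made for illustration and are not specified by the patent.

```python
def render_document(event_name, children, summaries):
    """Fill each topic of a topic tree with its summary (illustrative only).

    children maps a topic to its list of sub-topics; the key None holds the
    top-level topics. summaries maps a topic name to its summary text.
    """
    lines = [event_name]

    def visit(topic, prefix):
        for k, child in enumerate(children.get(topic, []), start=1):
            number = f"{prefix}{k}"
            lines.append(f"{number} {child}")
            if summaries.get(child):
                lines.append(summaries[child])
            visit(child, number + ".")

    visit(None, "")
    return "\n".join(lines)


# Hypothetical topic tree and summaries for a made-up event.
tree = {None: ["Overview", "Impact"], "Impact": ["Casualties"]}
text = render_document(
    "Example Earthquake",
    tree,
    {"Overview": "A quake struck.", "Casualties": "No deaths."},
)
print(text)
```

Topics without a summary (here "Impact") still appear as headings, which keeps the table of contents complete even when no suitable text was found.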
In some embodiments, step S100 of generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed includes sub-steps S110 to S140.
Step S110: acquire existing encyclopedia documents of the same event category as the event to be processed.
In actual implementation, the category of the event to be processed may be determined according to the encyclopedia document category list, and a plurality of existing encyclopedia documents are obtained under the corresponding category.
Step S120: obtain a title set from the encyclopedia documents, the title set including a plurality of titles.
It is understood that the encyclopedia documents in this step are the existing encyclopedia documents acquired in step S110. Each of these documents includes at least one title, and a considerable number of them include a plurality of titles.
In actual implementation, given an encyclopedia text collection D of a certain event category, a title set $T = \{t_1, \ldots, t_N\}$ can be obtained from it.
Step S130: determine a topic set based on the plurality of titles, the topic set including a plurality of topics.
Further, step S130 of determining a topic set based on the plurality of titles includes sub-steps S131 and S132.
Step S131: obtain a plurality of clusters by clustering the plurality of titles.
In actual implementation, for titles $t_i$ and $t_j$ in the title set T, the similarity between the titles can be measured by the TF-IDF similarity between the texts corresponding to $t_i$ and $t_j$.
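As an illustration of a TF-IDF similarity between the texts under two titles, the sketch below builds sparse TF-IDF vectors with the standard library and compares them by cosine similarity. The smoothed IDF variant, the whitespace tokenization, and the function names are assumptions; the patent does not state which TF-IDF weighting or similarity measure is used.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict word -> weight) per text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    df = Counter(word for doc in docs for word in doc)  # document frequency
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}
    return [{word: tf * idf[word] for word, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up texts standing in for the content under three titles.
texts = [
    "the earthquake damaged many buildings",
    "buildings were damaged by the earthquake",
    "voters elected a new president",
]
vecs = tfidf_vectors(texts)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

With these toy texts the first two score well above the third, which shares no words with the first.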
Two thresholds $\lambda_1 > \lambda_2$ are defined. In the first round of clustering, for each title $t \in T$, each current cluster $c \in C$ is traversed: if the similarity between t and the titles in some cluster c exceeds $\lambda_1$, t is added to cluster c; if the similarity between t and the titles in some cluster c exceeds $\lambda_2$ but does not exceed $\lambda_1$, t is put into the candidate sequence P; if the similarity between t and the clusters c does not exceed $\lambda_2$, a new cluster c' is created and t is put into the new cluster c'.
In the second round of clustering, for each title $t \in P$, each current cluster $c \in C$ is traversed: if the similarity between t and the titles in some cluster c exceeds $\lambda_1$, t is added to cluster c; if the similarity between t and the titles in the clusters does not reach $\lambda_1$, a new cluster c' is created and t is put into the new cluster c'.
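The two-round procedure can be sketched as follows. This is an illustrative reading rather than the patented implementation: the similarity between a title and a cluster is taken to be the maximum similarity to any title in the cluster (an assumption, since the text does not say how it is aggregated), and a toy Jaccard word overlap stands in for the TF-IDF similarity.

```python
def jaccard(a, b):
    """Toy similarity: word overlap between two titles (stand-in for TF-IDF)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def two_round_cluster(titles, sim, lam1, lam2):
    assert lam1 > lam2
    clusters, pending = [], []

    def best_cluster(title):
        # Score of a cluster = best similarity to any of its titles (assumption).
        scored = [(max(sim(title, other) for other in c), i)
                  for i, c in enumerate(clusters)]
        return max(scored) if scored else (float("-inf"), None)

    for title in titles:  # first round
        score, idx = best_cluster(title)
        if score > lam1:
            clusters[idx].append(title)
        elif score > lam2:
            pending.append(title)  # undecided: revisit in the second round
        else:
            clusters.append([title])
    for title in pending:  # second round
        score, idx = best_cluster(title)
        if score > lam1:
            clusters[idx].append(title)
        else:
            clusters.append([title])
    return clusters


titles = ["aftermath", "rescue and relief", "relief", "rescue relief"]
clusters = two_round_cluster(titles, jaccard, lam1=0.45, lam2=0.3)
print(clusters)
```

In this toy run "relief" is deferred in the first round (similarity 1/3 to "rescue and relief") and joins that cluster in the second round, once "rescue relief" has been added and raises its best similarity to 1/2.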
Step S132: determine a plurality of topics based on the plurality of clusters, the topics corresponding one-to-one to the clusters, and determine the topic set from the plurality of topics.
Finally, each cluster is taken as a topic, the title that occurs most frequently in the cluster is taken as the topic name, and the topic set $T_c$ of the event category is constructed.
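Naming each cluster after its most frequent title is a one-line counting operation. The sketch below assumes each cluster is a list of title strings in which repeated titles from different documents are kept; the function name `topic_names` is hypothetical.

```python
from collections import Counter


def topic_names(clusters):
    """Return one topic name per cluster: its most frequent title."""
    return [Counter(cluster).most_common(1)[0][0] for cluster in clusters]


# Made-up clusters of section titles collected from several documents.
names = topic_names([["Casualties", "Deaths", "Casualties"], ["Rescue"]])
print(names)
```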
Step S140: determine the topic tree based on the topic set.
After the topic set is obtained, the topic tree of the event category can be determined from the topic set.
Further, step S140 of determining the topic tree based on the topic set includes sub-steps S141 and S142.
Step S141: determine parent-child relationship probabilities among the plurality of topics in the topic set.
It should be noted that some topics are parallel, that is, siblings, while other topics may have a parent-child relationship, that is, one topic is the level directly above another; such a pair of topics is in a parent-child relationship.
Step S141 of determining parent-child relationship probabilities among the plurality of topics in the topic set includes sub-steps S141a to S141c.
Step S141a: based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, determine the structural information between topic $t_i$ and topic $t_j$, which characterizes the degree to which the directory structure of the existing encyclopedia documents supports topic $t_i$ being the parent topic of topic $t_j$.
In actual execution, the structural information between topic $t_i$ and topic $t_j$ can be quantified as:
Figure BDA0002388230420000071
where $P_{struc}(t_i \mid t_j)$ is the structural information between topic $t_i$ and topic $t_j$, $n(t_i, t_j)$ is the number of times topic $t_i$ occurs as the parent topic of topic $t_j$, $n(t_j)$ is the total number of occurrences of topic $t_j$, $T_d$ is the number of possible sub-topics of topic $t_i$, and $\alpha$ is the Laplace smoothing factor.
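The exact formula is given only as an image in the source and is not reproduced here. As a hedged illustration of how the listed quantities could combine under Laplace (additive) smoothing, the sketch below uses the textbook form (count + α) / (total + α · number of candidates). This particular form, the function name `p_struc`, and the example numbers are assumptions and may differ from the patented formula.

```python
def p_struc(n_pair, n_child, num_candidates, alpha=1.0):
    """Additively smoothed estimate that t_i is the parent of t_j.

    n_pair:         times t_i was observed as the parent of t_j, n(t_i, t_j)
    n_child:        total occurrences of t_j, n(t_j)
    num_candidates: size of the candidate set used for smoothing (T_d)
    alpha:          Laplace smoothing factor
    NOTE: assumed textbook form, not taken from the patent's formula image.
    """
    return (n_pair + alpha) / (n_child + alpha * num_candidates)


# Hypothetical counts: t_j appears 10 times, 8 of them directly under t_i.
seen = p_struc(n_pair=8, n_child=10, num_candidates=5)
unseen = p_struc(n_pair=0, n_child=10, num_candidates=5)
print(seen, unseen)
```

The smoothing keeps pairs that were never observed together from receiving probability zero, which matters later because the scores are combined through logarithms.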
Step S141b: based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, determine the text association feature between topic $t_i$ and topic $t_j$, which characterizes the degree to which the text distributions of topic $t_i$ and topic $t_j$ support topic $t_i$ being the parent topic of topic $t_j$.
In actual execution, the text association feature between topic $t_i$ and topic $t_j$ can be quantified using a hierarchical Dirichlet model as:
Figure BDA0002388230420000081
where $P_{text}(t_i \mid t_j)$ is the text association feature between topic $t_i$ and topic $t_j$, Z is a normalization factor, $p_{w,t_i}$ is the probability of the word w occurring under topic $t_i$, $p_{w,t_j}$ is the probability of the word w occurring under topic $t_j$, and $\beta$ is a manually set parameter that controls the degree of probability concentration, for example $\beta = 5$.
Step S141c: determine the parent-child relationship probabilities among the plurality of topics based on the structural information and the text association features.
In actual implementation, after the two kinds of association are quantified, the likelihood that a parent-child relationship exists between topics is calculated as a weighted average:
$w(t_i, t_j) = \lambda \cdot \log P_{struc}(t_i \mid t_j) + (1 - \lambda) \cdot \log P_{text}(t_i \mid t_j)$
where $w(t_i, t_j)$ is the likelihood that topic $t_i$ is the parent topic of topic $t_j$, and $\lambda$ is a manually set parameter that controls the relative weight of the two terms, for example $\lambda = 0.8$.
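The weighted combination translates directly into code. A minimal sketch follows; the function name `parent_child_score` and the sample probabilities are illustrative assumptions, while the formula and the default λ = 0.8 follow the text.

```python
import math


def parent_child_score(p_struc, p_text, lam=0.8):
    """w(t_i, t_j) = lam * log(P_struc) + (1 - lam) * log(P_text)."""
    return lam * math.log(p_struc) + (1 - lam) * math.log(p_text)


# Hypothetical probabilities for two candidate parents of the same topic.
strong = parent_child_score(p_struc=0.6, p_text=0.4)
weak = parent_child_score(p_struc=0.1, p_text=0.4)
print(strong, weak)
```

Because both inputs are probabilities in (0, 1], the score is never positive, and a larger (less negative) value indicates stronger support for the parent-child relationship. Both probabilities must be strictly positive for the logarithm to be defined.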
In the above embodiments, given the topic set $T_c$ of the event category, the likelihood of a parent-child relationship between different topics is calculated by combining the structural information and the text association features.
Step S142: determine the topic tree based on the parent-child relationship probabilities.
In other words, a topic tree common to the event category is generated from the likelihoods of parent-child relationships between different topics, using a maximum spanning tree algorithm for directed graphs.
In actual implementation, a probability model can be used to evaluate the probability that the whole topic tree holds, based on the likelihoods of parent-child relationships between topics; the optimization goal is to maximize the overall likelihood of the parent-child relationships in the topic tree. Regarding the whole topic hierarchy as a Bayesian network, the optimization target $H^*$ can be expressed as:
Figure BDA0002388230420000091
where $par_H(n)$ denotes the parent node of node n in the topic tree H, $P(par_H(n) \mid n)$ denotes the probability that the parent node of node n in the generated topic tree is the correct parent node, and $\arg\max_H$ denotes the H that maximizes the expression.
Thus, the whole topic tree can be obtained simply by selecting, from the set of likelihood values $\{w(t_i, t_j)\}$, the edges that form a tree with the largest total weight. The objective is thereby converted into the problem of finding a maximum spanning tree in a directed graph; for example, the Chu-Liu/Edmonds algorithm can be used to generate the general topic tree of the event category.
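For illustration, the following is a compact recursive sketch of the Chu-Liu/Edmonds algorithm for a maximum-weight spanning arborescence. The virtual `ROOT` node, the edge scores, and the function name are assumptions made for the example; the patent names the algorithm but does not give an implementation, and the sketch assumes every topic is reachable from the root.

```python
def max_arborescence(weights, root):
    """Return {child: parent} maximizing total edge weight (Chu-Liu/Edmonds).

    weights maps (parent, child) -> score.
    """
    # Step 1: every non-root node greedily picks its best incoming edge.
    best = {}
    for (u, v), w in weights.items():
        if v == root or u == v:
            continue
        if v not in best or w > weights[(best[v], v)]:
            best[v] = u
    # Step 2: look for a cycle among the chosen edges.
    cycle = None
    for start in best:
        path, node = [], start
        while node in best and node not in path:
            path.append(node)
            node = best[node]
        if node in path:
            cycle = path[path.index(node):]
            break
    if cycle is None:
        return best
    # Step 3: contract the cycle into a super-node; edges entering the cycle
    # are re-weighted by the cost of dropping the cycle edge they replace.
    members, super_node = set(cycle), ("cycle",) + tuple(cycle)
    contracted, origin = {}, {}
    for (u, v), w in weights.items():
        if u in members and v in members:
            continue
        if v in members:
            key, score = (u, super_node), w - weights[(best[v], v)]
        elif u in members:
            key, score = (super_node, v), w
        else:
            key, score = (u, v), w
        if key not in contracted or score > contracted[key]:
            contracted[key], origin[key] = score, (u, v)
    # Step 4: solve the smaller problem, then expand the super-node.
    parents = {}
    for v, u in max_arborescence(contracted, root).items():
        orig_u, orig_v = origin[(u, v)]
        parents[orig_v] = orig_u
    for v in cycle:
        parents.setdefault(v, best[v])  # keep the remaining cycle edges
    return parents


# Toy scores (hypothetical); "ROOT" is a virtual root for top-level topics.
scores = {
    ("ROOT", "Impact"): 5, ("ROOT", "Casualties"): 1, ("ROOT", "Damage"): 1,
    ("Impact", "Casualties"): 11, ("Casualties", "Impact"): 10,
    ("Casualties", "Damage"): 3, ("Impact", "Damage"): 4,
}
tree = max_arborescence(scores, "ROOT")
print(tree)
```

In the toy graph the greedy choices form a cycle between "Impact" and "Casualties"; contraction resolves it by attaching "Impact" to the root, which yields the tree with the largest total score (5 + 11 + 4 = 20).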
Thus, the method for automatically generating the event encyclopedia document in the embodiment replaces a single-layer structure with a hierarchical structure, so that a directory of the generated document is more complete, and key information of different aspects of an event can be more finely described; meanwhile, the improved topic model and the neural network are comprehensively used, the text word distribution characteristics under each topic are explicitly mined, and better performance is achieved compared with the related technology.
In some embodiments, for step S200 of acquiring the relevant document set of the event to be processed, in actual implementation, given the name N of a new event for which a document is to be generated, two approaches, document reference and web search, may be used to retrieve relevant web pages from the internet.
In the document reference approach, if N is an event that already exists in the encyclopedia, the set of web pages $W_g$ referenced in its encyclopedia document is taken as the result. In the web search approach, the event name N is submitted as a query to the Bing search engine, the top-ranked web pages (for example, the top 20) are selected, web pages from encyclopedias are removed, and the remaining web page set $W_s$ is taken as the result. For $W_g$ and $W_s$, the BeautifulSoup library in Python is used to filter out irrelevant content, yielding the relevant document set D of the new event.
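The content-filtering step can be illustrated with a small text extractor. To keep the example dependency-free, the sketch below uses Python's standard `html.parser` module as a stand-in for BeautifulSoup, which is the tool the text names; the set of skipped tags and the class name `TextExtractor` are assumptions about what counts as irrelevant content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    SKIP = {"script", "style", "nav", "footer"}  # assumed irrelevant tags

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Made-up page standing in for a retrieved web page.
page = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><nav>Home</nav><p>A magnitude 6 quake struck.</p>"
    "<footer>Contact</footer></body></html>"
)
clean = extract_text(page)
print(clean)
```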
In some embodiments, the step S300 of determining target text information respectively corresponding to a plurality of topics based on the relevant document set and the topic tree includes:
determining the contribution of each word in each text segment e to the corresponding topic, determining from these the contribution of the text segment to the topic t, and determining the target text information of the corresponding topic;
wherein the contribution of a word to the corresponding topic is determined by the formula
Figure BDA0002388230420000101
where $W(w, t)$ is the contribution of the word w to the topic t, T is the set of topics, tt is any topic in the topic set T, $p_{w,t}$ is the probability of the word w occurring under the topic t, and $p_{w,tt}$ is the probability of the word w occurring under the topic tt.
According to the relevant document set and the topic tree, the word probability distribution under each topic is determined as $F_t = \{(w, p_{w,t})\}$, where w denotes a word and $p_{w,t}$ denotes the probability of the word w occurring under the topic t.
Frequent occurrence of a word under a topic indicates a close connection with that topic; if the word also occurs frequently under other topics, the reliability of this connection is reduced accordingly.
For each text segment e in a candidate document, the average contribution of the words in e is used to quantify the contribution of the text segment to a specific topic t. All text segments are sorted from high to low by contribution value, and the first 1000 words are taken, yielding the valuable text information $C_t$ under each topic t.
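The screening step can be sketched as below. Since the contribution formula is only available as an image in the source, the word contribution used here, $p_{w,t}$ divided by the sum of $p_{w,tt}$ over all topics, is an assumed stand-in chosen to match the stated intuition (high under this topic, discounted when frequent under other topics). Reading "the first 1000 words" as a word budget over the ranked segments is also an assumption, as are all function names.

```python
def word_contribution(word, topic, probs):
    """Assumed stand-in: share of the word's probability mass under `topic`."""
    total = sum(dist.get(word, 0.0) for dist in probs.values())
    return probs[topic].get(word, 0.0) / total if total else 0.0


def segment_contribution(segment, topic, probs):
    """Average word contribution over the words of a text segment."""
    words = segment.lower().split()
    if not words:
        return 0.0
    return sum(word_contribution(w, topic, probs) for w in words) / len(words)


def select_segments(segments, topic, probs, word_budget=1000):
    """Rank segments by contribution and keep them until the budget is used."""
    ranked = sorted(segments,
                    key=lambda s: segment_contribution(s, topic, probs),
                    reverse=True)
    chosen, used = [], 0
    for segment in ranked:
        length = len(segment.split())
        if used + length > word_budget:
            break
        chosen.append(segment)
        used += length
    return chosen


# Hypothetical word distributions F_t for two topics.
probs = {
    "casualties": {"deaths": 0.3, "injured": 0.3, "the": 0.1},
    "rescue": {"teams": 0.4, "the": 0.1, "deaths": 0.1},
}
picked = select_segments(["teams arrived the", "deaths injured"],
                         "casualties", probs, word_budget=2)
print(picked)
```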
In some embodiments, step S400, determining abstracts corresponding to a plurality of topics according to target text information corresponding to the plurality of topics, respectively, includes substeps S410 to substep S430.
Step S410, dividing target text information according to sentences to obtain a sentence set;
step S420, vectorizing the sentence set to obtain a matrix representation of the sentences;
and step S430, determining core sentences in the matrix representation of the sentences, and combining the core sentences into a summary.
In actual implementation, the valuable text information $C_t$ under each topic obtained in the preceding step is first divided into sentences to obtain a sentence set $D = [d_1, d_2, \ldots, d_n]$; the sentences are then vectorized using the pre-trained GloVe word vector representations to obtain the matrix representation of the sentences
Figure BDA0002388230420000111
The matrix representation of the sentences is input into the DeepChannel model, the core sentences carrying the most information are extracted, and the core sentences are combined as the summary of the topic. This step generates a summary for each topic in the topic tree, and the summaries are combined to obtain the final encyclopedia document.
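Sub-steps S410 to S430 can be illustrated end to end with simplified stand-ins. In the sketch below, a tiny hand-written embedding table replaces the pre-trained GloVe vectors, and a centroid-similarity ranking replaces the DeepChannel model; both substitutions, the regular-expression sentence splitter, and all names are assumptions made so that the example runs without external resources.

```python
import math
import re


def split_sentences(text):
    """S410: naive split on sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_vector(sentence, embeddings, dim):
    """S420: average the vectors of the known words in a sentence."""
    vecs = [embeddings[w] for w in re.findall(r"\w+", sentence.lower())
            if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_core_sentences(text, embeddings, dim, k=2):
    """S430 stand-in: keep the k sentences closest to the centroid."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    matrix = [sentence_vector(s, embeddings, dim) for s in sentences]
    centroid = [sum(col) / len(matrix) for col in zip(*matrix)]
    top = sorted(range(len(sentences)),
                 key=lambda i: cosine(matrix[i], centroid),
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore original order


# Toy 2-dimensional "embeddings" (hypothetical values, not GloVe).
toy = {"quake": [1.0, 0.0], "struck": [1.0, 0.0], "damage": [0.9, 0.1],
       "weather": [0.0, 1.0], "sunny": [0.0, 1.0]}
summary = select_core_sentences(
    "The quake struck. Damage was severe. Weather was sunny.", toy, dim=2)
print(summary)
```

A learned salience model would replace the centroid ranking in practice; the sketch only shows the data flow from text, to sentence matrix, to selected sentences.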
Experimental verification by the inventors shows that the method for automatically generating an event encyclopedia document outperforms the related art.
For example, the performance of the present invention and the related art on topic tree construction (measured by F1 value) is shown in Table 1.
TABLE 1
Method                                   Earthquake  Election  Hurricane  Overall
Prior art                                0.434       0.780     0.586      0.586
The invention (single-round clustering)  0.854       0.937     0.949      0.914
The invention (complete model)           0.844       0.953     0.957      0.918
The effect of the present invention and the related art (measured in F1 value index) on overall document creation is shown in table 2.
TABLE 2
Figure BDA0002388230420000112
Figure BDA0002388230420000121
ROUGE-1 is a metric for evaluating text summaries that reflects the similarity between the generated result and the reference answer. ROUGE-1 comprises two parts: precision (the ratio of the number of correct results in the generated result to the total number of generated results) and recall (the ratio of the number of correct results in the generated result to the number of results in the reference answer). F1 combines precision and recall; for example, F1 can be obtained as the harmonic mean of precision and recall.
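As a worked illustration of these definitions, the sketch below computes unigram-overlap precision, recall, and F1 between a generated text and a reference. Whitespace tokenization and lowercasing are simplifying assumptions; published ROUGE implementations typically add stemming and other normalization.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each word counted at most min(c, r)
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = rouge_1("the quake struck the city", "the quake hit the city hard")
print(p, r, f1)
```

Here 4 of the 5 generated words match the reference (precision 4/5) and 4 of the 6 reference words are covered (recall 4/6), giving a harmonic mean of 8/11.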
The experimental results show that the invention performs well, outperforms the related art in several respects, and meets application requirements.
The following describes an event encyclopedia document automatic generation device provided by an embodiment of the present invention, and the event encyclopedia document automatic generation device described below and the event encyclopedia document automatic generation method described above may be referred to in correspondence with each other.
As shown in Fig. 3, the device for automatically generating an event encyclopedia document according to the embodiment of the present invention includes: a topic tree generating unit 610, a document acquiring unit 620, a text screening unit 630, an abstract generating unit 640, and a combining unit 650.
The topic tree generating unit 610 is configured to generate a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, where the topic tree includes a plurality of topics; the document acquiring unit 620 is configured to acquire a relevant document set of the event to be processed; the text screening unit 630 is configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; the abstract generating unit 640 is configured to determine, according to the target text information respectively corresponding to the plurality of topics, abstracts respectively corresponding to the plurality of topics; and the combining unit 650 is configured to generate an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
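The cooperation of the five units can be sketched as follows. This is only an illustrative arrangement, not the device's actual implementation: the class name, field names, and function signatures are assumptions, and the topic tree is simplified to a flat list of topics.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EventEncyclopediaGenerator:
    # Each field stands in for one unit; all names are hypothetical.
    build_topic_tree: Callable[[List[str]], List[str]]             # unit 610
    fetch_documents: Callable[[str], List[str]]                    # unit 620
    select_text: Callable[[List[str], List[str]], Dict[str, str]]  # unit 630
    summarize: Callable[[str], str]                                # unit 640
    combine: Callable[[Dict[str, str]], str]                       # unit 650

    def generate(self, event: str, category_docs: List[str]) -> str:
        # 610: topics derived from existing documents of the same category
        # (simplification: a flat list instead of a tree).
        topics = self.build_topic_tree(category_docs)
        # 620: documents related to the event to be processed.
        related = self.fetch_documents(event)
        # 630: target text for each topic.
        target = self.select_text(related, topics)
        # 640: one abstract per topic.
        abstracts = {t: self.summarize(target[t]) for t in topics}
        # 650: assemble the encyclopedia document.
        return self.combine(abstracts)
```

Passing the units in as callables keeps each stage replaceable, which mirrors the statement that the units may or may not be physically separate.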
Fig. 4 illustrates the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include: a processor 810, a communication interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the method for automatically generating an event encyclopedia document, the method comprising: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 810, the communication interface 820, the memory 830, and the communication bus 840 shown in fig. 4, where the processor 810, the communication interface 820, and the memory 830 complete mutual communication through the communication bus 840, and the processor 810 may call the logic instructions in the memory 830 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method for automatically generating an event encyclopedia document provided by the above method embodiments, the method comprising: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the method for automatically generating an event encyclopedia document provided in the foregoing embodiments, the method comprising: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for automatically generating an event encyclopedia document, characterized by comprising:
generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics;
wherein the generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed comprises:
acquiring the existing encyclopedia documents of the same event category as the event to be processed;
obtaining a title set from the encyclopedia documents, wherein the title set comprises a plurality of titles;
determining a set of topics based on the plurality of titles, the set of topics including the plurality of topics;
determining the topic tree based on the set of topics;
acquiring a relevant document set of the event to be processed;
determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree;
determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics;
generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
2. The method for automatically generating an event encyclopedia document according to claim 1, wherein the determining a topic set based on the plurality of titles comprises:
clustering the plurality of titles to obtain a plurality of clusters;
determining the plurality of topics based on the plurality of clusters, wherein the plurality of topics are in one-to-one correspondence with the plurality of clusters, and determining the topic set according to the plurality of topics.
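The clustering of titles into topics can be illustrated with a minimal sketch. The claim does not fix a clustering algorithm here, so the greedy token-overlap (Jaccard) grouping, the threshold value, and the choice of the most frequent title as the topic name are all assumptions made for illustration.

```python
from collections import Counter


def _tokens(title: str) -> set:
    return set(title.lower().split())


def _jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_titles(titles, threshold=0.5):
    """Greedy single-pass clustering (assumed, not the claimed algorithm).

    A title joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []
    for title in titles:
        for cluster in clusters:
            if _jaccard(_tokens(title), _tokens(cluster[0])) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


def topics_from_clusters(clusters):
    # One topic per cluster (one-to-one correspondence); the topic is
    # named after the most frequent title in the cluster.
    return [Counter(cluster).most_common(1)[0][0] for cluster in clusters]
```

Because each cluster yields exactly one topic, the resulting topic set has the same size as the number of clusters.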
3. The method of claim 1, wherein the determining the topic tree based on the topic collection comprises:
determining parent-child relationship probabilities between a plurality of the topics in the set of topics;
and determining the topic tree based on the parent-child relationship probabilities.
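One way to turn pairwise parent-child relationship probabilities into a tree is sketched below. The greedy edge selection, the probability threshold, and the artificial "ROOT" node are assumptions; the claim only states that the tree is determined based on the probabilities.

```python
def build_topic_tree(topics, parent_prob, threshold=0.5, root="ROOT"):
    """Return a dict mapping each topic to its parent.

    parent_prob maps (parent, child) pairs to a probability.
    Assumed strategy: accept edges from most to least probable,
    skipping any edge that would give a topic a second parent or
    create a cycle; topics left without a parent hang off the root.
    """
    parent = {}

    def is_ancestor(a, b):
        # True if a is b itself or lies on b's path toward the root.
        while b is not None:
            if b == a:
                return True
            b = parent.get(b)
        return False

    edges = sorted(parent_prob.items(), key=lambda kv: kv[1], reverse=True)
    for (p, c), prob in edges:
        if prob < threshold:
            break  # remaining edges are even less probable
        if c in parent or is_ancestor(c, p):
            continue
        parent[c] = p
    for t in topics:
        parent.setdefault(t, root)
    return parent
```

With this strategy a high-probability edge always takes precedence over a conflicting lower-probability edge, so the result is a forest joined under a single root.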
4. The method of claim 3, wherein the determining parent-child relationship probabilities between the topics in the topic set comprises:
determining structural information between topic t_i and topic t_j based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, wherein the structural information between topic t_i and topic t_j characterizes the degree to which the directory structure of the existing encyclopedia documents supports topic t_i being the parent topic of topic t_j;
determining a text association feature between topic t_i and topic t_j based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, wherein the text association feature between topic t_i and topic t_j characterizes the degree to which the distribution of text between topic t_i and topic t_j supports topic t_i being the parent topic of topic t_j;
determining parent-child relationship probabilities among the plurality of topics based on the structural information and the text association features.
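The combination of the two kinds of evidence can be sketched as follows. The claim does not give the concrete definitions in this section, so everything here is an assumption: the outline representation (a dict mapping each topic to its parent heading), the ratio used as structural support, and the weighted average used to merge it with a text association score supplied by the caller.

```python
def structural_support(outlines, t_i, t_j):
    """Share of documents whose directory places t_j directly under t_i.

    outlines: one dict per existing encyclopedia document, mapping a
    topic to its parent topic (None for a top-level heading). Only
    documents containing both topics are counted. Hypothetical measure.
    """
    both = [o for o in outlines if t_i in o and t_j in o]
    if not both:
        return 0.0
    return sum(1 for o in both if o[t_j] == t_i) / len(both)


def parent_child_probability(structure, text_assoc, weight=0.5):
    """Weighted average of the two support scores, both in [0, 1].

    weight is an assumed mixing parameter, not taken from the patent.
    """
    return weight * structure + (1.0 - weight) * text_assoc
```

The resulting value can be used directly as the probability that topic t_i is the parent of topic t_j when the topic tree is determined.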
5. The method for automatically generating event encyclopedia documents according to any one of claims 1-3, wherein the determining target text information respectively corresponding to a plurality of topics based on the relevant document set and the topic tree comprises:
determining the contribution of each text segment e to a topic t based on the contribution of each word in the text segment e to the topic t, and determining the target text information of the corresponding topic based on the contributions of the text segments;
wherein the contribution of a word to the corresponding topic is determined by the formula:
[Formula FDA0003919674900000021, provided as an image in the original publication]
wherein W(w, t) is the contribution of word w to topic t, T is the set of topics, tt is any topic in the set of topics T, p_{w,t} represents the probability of word w occurring under topic t, and p_{w,tt} represents the probability of word w occurring under topic tt.
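The selection of target text by word contribution can be illustrated with the sketch below. The formula of this claim is available only as an image and is not reproduced, so the normalization used here, p_{w,t} divided by the sum of p_{w,tt} over all topics tt in T, is a stand-in assumption and should not be read as the claimed formula. Summing word contributions per segment and keeping the top-ranked segments are likewise illustrative choices.

```python
def word_contribution(word, topic, p):
    """Stand-in for W(w, t): share of the word's probability mass on topic.

    p maps topic -> {word: probability of the word under that topic}.
    ASSUMPTION: the patent's actual formula is not recoverable here.
    """
    total = sum(p[tt].get(word, 0.0) for tt in p)
    return p[topic].get(word, 0.0) / total if total else 0.0


def segment_contribution(segment, topic, p):
    # Assumed aggregation: sum over the words of the segment.
    return sum(word_contribution(w, topic, p) for w in segment.lower().split())


def select_target_text(segments, topic, p, top_k=1):
    # Keep the top_k segments that contribute most to the topic.
    ranked = sorted(segments,
                    key=lambda s: segment_contribution(s, topic, p),
                    reverse=True)
    return ranked[:top_k]
```

Under this stand-in, a word that occurs only under one topic contributes 1 to it, while a word spread evenly over two topics contributes 0.5 to each.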
6. The method for automatically generating an event encyclopedia document according to any one of claims 1-3, wherein the determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics comprises:
dividing the target text information according to sentences to obtain a sentence set;
vectorizing the sentence set to obtain a matrix representation of sentences;
determining core sentences in the matrix representation of the sentences, and composing the core sentences into the abstract.
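The three steps of this claim (sentence splitting, vectorization, core sentence selection) can be sketched as follows. The bag-of-words vectors and the choice of core sentences as those closest to the centroid by cosine similarity are assumptions; the claim itself does not specify the vectorization or the selection criterion in this section.

```python
import math
import re


def split_sentences(text):
    # Split after sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def vectorize(sentences):
    """Bag-of-words matrix: one row per sentence, one column per word."""
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    vocab = sorted({w for ws in words for w in ws})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for ws in words:
        row = [0.0] * len(vocab)
        for w in ws:
            row[index[w]] += 1.0
        matrix.append(row)
    return matrix


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def summarize(text, k=2):
    sentences = split_sentences(text)
    matrix = vectorize(sentences)
    if not matrix:
        return ""
    # Assumed criterion: core sentences are those nearest the centroid.
    centroid = [sum(col) / len(matrix) for col in zip(*matrix)]
    scores = [_cosine(row, centroid) for row in matrix]
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore original order so the abstract reads naturally.
    return " ".join(sentences[i] for i in sorted(best))
```

Sentences that share vocabulary with most of the target text score highest, so off-topic sentences tend to be left out of the abstract.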
7. A device for automatically generating an event encyclopedia document, characterized by comprising:
a topic tree generating unit, configured to generate a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics;
wherein the topic tree generating unit is specifically configured to:
acquire the existing encyclopedia documents of the same event category as the event to be processed;
obtain a title set from the encyclopedia documents, wherein the title set comprises a plurality of titles;
determine a set of topics based on the plurality of titles, the set of topics including the plurality of topics;
determine the topic tree based on the set of topics;
a document acquiring unit, configured to acquire a relevant document set of the event to be processed;
a text screening unit, configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree;
an abstract generating unit, configured to determine abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics;
and a combining unit, configured to generate an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the event encyclopedia document automatic generation method according to any one of claims 1 to 6 when executing said program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for automatic generation of an event encyclopedia document according to any one of claims 1 to 6.
CN202010104947.2A 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document Active CN113282745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104947.2A CN113282745B (en) 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010104947.2A CN113282745B (en) 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document

Publications (2)

Publication Number Publication Date
CN113282745A CN113282745A (en) 2021-08-20
CN113282745B (en) 2023-04-18

Family

ID=77275185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010104947.2A Active CN113282745B (en) 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document

Country Status (1)

Country Link
CN (1) CN113282745B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant