CN113282745A - Automatic generation method and device for event encyclopedia document - Google Patents

Automatic generation method and device for event encyclopedia document

Info

Publication number
CN113282745A
Authority
CN
China
Prior art keywords
event
topic
determining
document
topics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010104947.2A
Other languages
Chinese (zh)
Other versions
CN113282745B (en)
Inventor
侯磊
祝方韦
史佳欣
李涓子
张鹏
唐杰
许斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010104947.2A
Publication of CN113282745A
Application granted
Publication of CN113282745B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a method and a device for automatically generating an event encyclopedia document. The method comprises: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics. The method can automatically generate an encyclopedia document comprising a plurality of topics for a new event, so that the table of contents of the generated document is more complete and the key information on different aspects of the event is described in finer detail.

Description

Automatic generation method and device for event encyclopedia document
Technical Field
The invention relates to the field of encyclopedia generation, in particular to an event encyclopedia document automatic generation method and device.
Background
Encyclopedia documents are mostly written by hand. Because different authors have different writing habits, the directory structures of the resulting documents also differ. Directly reusing the directory structure of one particular document cannot guarantee that the structure is general and complete, which undermines the soundness of the finally generated document.
In the prior art, encyclopedia documents are mostly written manually. To improve the editing efficiency and timeliness of encyclopedia documents, some researchers have developed encyclopedia document generation methods, but these methods can only generate the summary part of an encyclopedia document and cannot generate a complete document.
Disclosure of Invention
Embodiments of the present invention provide an event encyclopedia document automatic generation method, apparatus, electronic device and readable storage medium that overcome the above-mentioned problems or at least partially solve the above-mentioned problems.
In a first aspect, an embodiment of the present invention provides an event encyclopedia document automatic generation method, including: generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
In some embodiments, the generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed comprises: acquiring existing encyclopedia documents of the same event category as the event to be processed; obtaining a title set from the encyclopedia documents, wherein the title set comprises a plurality of titles; determining a topic set based on the plurality of titles, the topic set including the plurality of topics; and determining the topic tree based on the topic set.
In some embodiments, the determining a topic set based on the plurality of titles comprises: obtaining a plurality of clusters by clustering the plurality of titles; determining the plurality of topics based on the plurality of clusters, wherein the plurality of topics are in one-to-one correspondence with the plurality of clusters; and determining the topic set according to the plurality of topics.
In some embodiments, the determining the topic tree based on the topic set comprises: determining parent-child relationship probabilities between a plurality of the topics in the topic set; and determining the topic tree based on the parent-child relationship probabilities.
In some embodiments, the determining parent-child relationship probabilities between the plurality of topics in the topic set comprises: determining structural information between a topic $t_i$ and a topic $t_j$ based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, wherein the structural information between topic $t_i$ and topic $t_j$ characterizes the degree to which the directory structure of the existing encyclopedia documents supports topic $t_i$ being the parent topic of topic $t_j$; determining a text association feature between topic $t_i$ and topic $t_j$ based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, wherein the text association feature characterizes the degree to which the text distribution between topic $t_i$ and topic $t_j$ supports topic $t_i$ being the parent topic of topic $t_j$; and determining the parent-child relationship probabilities between the plurality of topics based on the structural information and the text association features.
In some embodiments, the determining, based on the relevant document set and the topic tree, target text information respectively corresponding to the plurality of topics comprises: determining the contribution of each text segment e to a topic t based on the contribution of each word in the text segment to that topic, and determining the target text information of the topic accordingly; wherein the contribution of a word to a topic is determined by the formula:
[Formula image: BDA0002388230420000031]
wherein $W(w, t)$ is the contribution of the word $w$ to the topic $t$, $T$ is the set of topics, $tt$ is any topic in the set of topics $T$, $p_{w,t}$ represents the probability that the word $w$ occurs under the topic $t$, and $p_{w,tt}$ represents the probability that the word $w$ occurs under the topic $tt$.
In some embodiments, the determining the abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics comprises: dividing the target text information into sentences to obtain a sentence set; vectorizing the sentence set to obtain a matrix representation of the sentences; and determining core sentences in the matrix representation of the sentences and composing the core sentences into the abstract.
In a second aspect, an embodiment of the present invention provides an event encyclopedia document automatic generation apparatus, including: a topic tree generating unit, configured to generate a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; a document acquisition unit, configured to acquire a relevant document set of the event to be processed; a text screening unit, configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; an abstract generating unit, configured to determine the abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and a combination unit, configured to generate the encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first aspect.
The method, the device, the electronic equipment and the readable storage medium for automatically generating the event encyclopedia document can automatically generate the encyclopedia document comprising a plurality of topics for a new event, so that the generated encyclopedia document directory is more complete, and the key information of different aspects of the event is more finely described.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for automatically generating an event encyclopedia document according to an embodiment of the invention;
FIG. 2 is a flow chart of a method for automatically generating an event encyclopedia document according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an event encyclopedia document automatic generation device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An event encyclopedia document automatic generation method according to an embodiment of the present invention is described below with reference to fig. 1 to 2.
As shown in fig. 1 and fig. 2, the method for automatically generating an event encyclopedia document according to the embodiment of the present invention includes steps S100 to S500.
Step S100, generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics.
It should be noted that the event to be processed is a new event for which an encyclopedia document needs to be generated. In actual execution, the category of the event to be processed may be determined according to the encyclopedia document category list, and this step may be performed manually. A plurality of encyclopedia documents already exist under the corresponding category, and the topic tree generated from these existing documents is a topic tree common to the event category. A topic tree may be understood as a topic template, and each topic tree comprises a plurality of topics.
Step S200, acquiring a relevant document set of the event to be processed.
The relevant document set is a set of documents related to the event to be processed; its sources include internet search and other literature.
Step S300, determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree.
The target text information is the text information that is valuable to the corresponding topic. This step screens the collected relevant document set to find the target text information corresponding to each topic.
Step S400, determining abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics.
After the target text information corresponding to each topic is found in step S300, the abstract of each topic may be obtained from that topic's target text information.
Step S500, generating an encyclopedia document of the event to be processed based on the abstracts respectively corresponding to the plurality of topics.
The abstract corresponding to each topic is filled in under that topic, thereby generating the encyclopedia document of the event to be processed.
The method for automatically generating an event encyclopedia document provided by the embodiment of the invention can automatically generate an encyclopedia document comprising a plurality of topics for a new event, so that the table of contents of the generated document is more complete and the key information on different aspects of the event is described in finer detail.
In some embodiments, step S100, generating a topic tree of the event category based on existing encyclopedia documents of the same event category as the event to be processed, includes sub-steps S110 to S140.
Step S110, acquiring existing encyclopedia documents of the same event category as the event to be processed.
In actual implementation, the category of the event to be processed may be determined according to the encyclopedia document category list, and under the corresponding category, a plurality of existing encyclopedia documents are obtained.
Step S120, a title set is obtained from the encyclopedia document, and the title set comprises a plurality of titles.
It is understood that the encyclopedia documents in this step are existing encyclopedia documents acquired in step S110, and these documents include at least one title, and a considerable number of documents include a plurality of titles.
In an actual implementation, given an encyclopedia text set $D$ of a certain event category, a title set $T = \{t_1, \ldots, t_N\}$ can be obtained from it.
Step S130, determining a topic set based on the plurality of titles, wherein the topic set comprises a plurality of topics.
Further, step S130, based on the plurality of titles, determines a topic set, including sub-step S131 to sub-step S132.
Step S131, obtaining a plurality of clusters by clustering the plurality of titles.
In actual implementation, for titles $t_i$ and $t_j$ in the title set $T$, the similarity between the titles can be measured by the TF-IDF similarity between the texts corresponding to $t_i$ and $t_j$.
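As an illustration of the title similarity described above, the following is a minimal sketch rather than the patented implementation. It assumes whitespace tokenization, a smoothed IDF term, and cosine similarity between TF-IDF vectors; the function names and the sample section texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n_docs = len(docs)
    # Document frequency: in how many texts each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical titles mapped to the text found under each title.
section_text = {
    "Casualties": "deaths injured people casualties toll",
    "Death toll": "deaths toll casualties victims",
    "Aftermath": "rebuilding aid donations recovery",
}
vecs = dict(zip(section_text, tfidf_vectors(list(section_text.values()))))
sim_close = cosine(vecs["Casualties"], vecs["Death toll"])
sim_far = cosine(vecs["Casualties"], vecs["Aftermath"])
```

Titles whose section texts share vocabulary score higher than titles whose texts have no words in common, which is the property the clustering step relies on.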
Two thresholds $\lambda_1 > \lambda_2$ are defined. In the first round of clustering, for each title $t \in T$, each current cluster $c \in C$ is traversed: if the similarity between $t$ and the titles in some cluster $c$ exceeds $\lambda_1$, $t$ is added to that cluster $c$; if the similarity between $t$ and the titles in some cluster $c$ exceeds $\lambda_2$ but does not exceed $\lambda_1$, $t$ is put into the candidate sequence $P$; if the similarity between $t$ and $c$ does not exceed $\lambda_2$, a new cluster $c'$ is created and $t$ is put into the new cluster $c'$.
In the second round of clustering, for each title $t \in P$, each current cluster $c \in C$ is traversed: if the similarity between $t$ and the titles in some cluster $c$ exceeds $\lambda_1$, $t$ is added to that cluster $c$; if the similarity between $t$ and the titles in the clusters does not exceed $\lambda_1$, a new cluster $c'$ is created and $t$ is put into the new cluster $c'$.
Step S132, determining a plurality of topics based on the plurality of clusters, wherein the plurality of topics correspond one-to-one to the plurality of clusters, and determining the topic set according to the plurality of topics.
Finally, each cluster is taken as a topic, the title that occurs most frequently in the cluster is used as the topic name, and the topic set $T_c$ of the event category is constructed.
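The two-round clustering and topic naming can be sketched as follows. This is an illustrative sketch under stated assumptions: the similarity between a title and a cluster is taken to be the maximum similarity to any member of the cluster, and a simple Jaccard word overlap stands in for the TF-IDF similarity. The names `cluster_titles`, `topic_name` and `jaccard`, as well as the thresholds and sample titles, are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Stand-in similarity: word overlap between two titles."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def cluster_titles(titles, sim, lam1, lam2):
    """Two-round threshold clustering with thresholds lam1 > lam2."""
    assert lam1 > lam2
    clusters, pending = [], []

    def best_match(title):
        # Assumption: similarity to a cluster = max similarity to any member.
        scored = [(max(sim(title, m) for m in c), i) for i, c in enumerate(clusters)]
        return max(scored) if scored else (0.0, -1)

    for title in titles:                  # first round
        score, idx = best_match(title)
        if score > lam1:
            clusters[idx].append(title)
        elif score > lam2:
            pending.append(title)         # ambiguous, decided in round two
        else:
            clusters.append([title])
    for title in pending:                 # second round
        score, idx = best_match(title)
        if score > lam1:
            clusters[idx].append(title)
        else:
            clusters.append([title])
    return clusters

def topic_name(cluster):
    """The most frequent title in a cluster names the topic."""
    return Counter(cluster).most_common(1)[0][0]

titles = ["Casualties", "Casualties", "Rescue and relief",
          "Rescue", "Rescue", "Aftermath"]
clusters = cluster_titles(titles, jaccard, lam1=0.6, lam2=0.3)
names = [topic_name(c) for c in clusters]
```

In this example both occurrences of "Rescue" are deferred in the first round because they only partly match "Rescue and relief"; in the second round the first one opens a new cluster and the second one joins it.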
Step S140, determining the topic tree based on the topic set.
After the topic set is obtained, the topic tree of the event category can be determined from it.
Further, step S140, determining the topic tree based on the topic set, includes sub-steps S141 to S142.
Step S141, determining parent-child relationship probabilities among a plurality of topics in the topic set.
It should be noted that some topics are in a parallel (sibling) relationship, while others may be in a parent-child relationship, in which one topic sits one level above another topic.
Step S141, determining parent-child relationship probabilities among a plurality of topics in the topic set, includes sub-steps S141a to S141c.
Step S141a, determining structural information between topic $t_i$ and topic $t_j$ based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, wherein the structural information between topic $t_i$ and topic $t_j$ characterizes the degree to which the directory structure of the existing encyclopedia documents supports topic $t_i$ being the parent topic of topic $t_j$.
In actual execution, the structural information between topic $t_i$ and topic $t_j$ can be quantized as:
[Formula image: BDA0002388230420000071]
wherein $P_{struc}(t_i \mid t_j)$ is the structural information between topic $t_i$ and topic $t_j$, $n(t_i, t_j)$ is the number of times topic $t_i$ occurs as the parent topic of topic $t_j$, $n(t_j)$ is the total number of occurrences of topic $t_j$, $T_d$ represents the number of possible sub-topics of topic $t_i$, and $\alpha$ is the Laplacian smoothing factor.
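The formula itself is only available as an image, so the following sketch assumes the standard additive (Laplace) smoothing form built from the symbols defined above, namely $(n(t_i, t_j) + \alpha) / (n(t_j) + \alpha \cdot T_d)$. This form is an assumption and may differ from the patented formula; the function name and the counts are hypothetical.

```python
from collections import Counter

def structural_prob(pair_counts, topic_counts, parent, child, t_d, alpha=1.0):
    """Assumed Laplace-smoothed estimate of P_struc(parent | child)."""
    n_pair = pair_counts[(parent, child)]   # n(t_i, t_j); 0 if never observed
    n_child = topic_counts[child]           # n(t_j)
    return (n_pair + alpha) / (n_child + alpha * t_d)

# Hypothetical counts collected from the directory structures of existing documents:
# "Casualties" appears 4 times, 3 times under "Impact" and once under "Response".
pair_counts = Counter({("Impact", "Casualties"): 3, ("Response", "Casualties"): 1})
topic_counts = Counter({"Casualties": 4})

p_impact = structural_prob(pair_counts, topic_counts, "Impact", "Casualties", t_d=5)
p_response = structural_prob(pair_counts, topic_counts, "Response", "Casualties", t_d=5)
p_unseen = structural_prob(pair_counts, topic_counts, "Background", "Casualties", t_d=5)
```

The smoothing term keeps a parent-child pair that was never observed at a small non-zero probability, so its logarithm remains finite when the structural and text scores are later combined.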
Step S141b, determining a text association feature between topic $t_i$ and topic $t_j$ based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, wherein the text association feature characterizes the degree to which the text distribution between topic $t_i$ and topic $t_j$ supports topic $t_i$ being the parent topic of topic $t_j$.
In actual execution, the text association feature between topic $t_i$ and topic $t_j$ can be quantized using a hierarchical Dirichlet model as:
[Formula image: BDA0002388230420000081]
wherein $P_{text}(t_i \mid t_j)$ is the text association feature between topic $t_i$ and topic $t_j$, $Z$ is a normalization factor, the term in [formula image: BDA0002388230420000082] represents the probability that the word $w$ occurs in topic $t_i$, the term in [formula image: BDA0002388230420000083] represents the probability that the word $w$ occurs in topic $t_j$, and $\beta$ is a manually set parameter that controls the degree of probability concentration; for example, $\beta$ may be 5.
Step S141c, determining the parent-child relationship probabilities among the plurality of topics based on the structural information and the text association features.
In actual implementation, after the two kinds of association are quantified, the likelihood that a parent-child relationship exists between topics is calculated as a weighted average:
$w(t_i, t_j) = \lambda \cdot \log P_{struc}(t_i \mid t_j) + (1 - \lambda) \cdot \log P_{text}(t_i \mid t_j)$
wherein $w(t_i, t_j)$ is the likelihood that topic $t_i$ is the parent topic of topic $t_j$, and $\lambda$ is a manually set parameter that controls the relative weight of the two terms; for example, $\lambda$ may be 0.8.
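The weighted combination can be checked with a few lines of code. This is a small sketch of the stated formula only; the function name `edge_score` and the probability values are hypothetical.

```python
import math

def edge_score(p_struc, p_text, lam=0.8):
    """w(t_i, t_j) = lam * log(P_struc) + (1 - lam) * log(P_text)."""
    return lam * math.log(p_struc) + (1 - lam) * math.log(p_text)

# With lam = 0.8 the structural evidence dominates the score.
strong_structure = edge_score(p_struc=0.5, p_text=0.1)
strong_text = edge_score(p_struc=0.1, p_text=0.5)
```

Because the score is a weighted sum of logarithms, exponentiating it gives the weighted geometric mean of the two probabilities, $P_{struc}^{\lambda} \cdot P_{text}^{1-\lambda}$.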
In the above embodiment, given the topic set $T_c$ of the event category, the likelihood of a parent-child relationship between different topics is calculated by combining the structural information and the text association features.
Step S142, determining the topic tree based on the parent-child relationship probabilities.
In other words, based on the likelihood that a parent-child relationship exists between different topics, a topic tree common to the event category is generated using a maximum spanning tree algorithm for directed graphs.
In an actual implementation, a probabilistic model can be used to evaluate the probability that a whole topic tree holds, based on the likelihood of the parent-child relationships between topics; the optimization goal is to maximize the overall likelihood of the parent-child relationships in the topic tree. Regarding the hierarchical topic model as a Bayesian network, the optimization objective $H^*$ can be expressed as:
[Formula image: BDA0002388230420000091]
wherein $par_H(n)$ represents the parent node of node $n$ in the topic tree $H$, $P(par_H(n) \mid n)$ represents the probability that the parent node assigned to node $n$ in the generated topic tree is its correct parent node, and $\arg\max_H$ denotes taking the $H$ that maximizes the expression.
Thus, the whole topic tree can be obtained simply by selecting, from the set of likelihood values $\{w(t_i, t_j)\}$, the edges with the largest total weight that together form a tree. The objective is thereby converted into the maximum spanning tree problem in a directed graph; for example, the Chu-Liu/Edmonds algorithm can be used to generate a topic tree common to the event category.
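To make the tree construction concrete, the following is a compact recursive sketch of the Chu-Liu/Edmonds idea, not the patented implementation. It assumes that the edge weights are the log-likelihood scores $w(t_i, t_j)$, that a virtual root node (here the hypothetical name "ROOT", standing for the event itself) has an edge to every topic, and that every topic is therefore reachable from the root. Function names and weights are illustrative.

```python
def find_cycle(parent):
    """Return the nodes of one cycle in a child -> parent map, or None."""
    for start in parent:
        path, node = [], start
        while node in parent and node not in path:
            path.append(node)
            node = parent[node]
        if node in path:
            return path[path.index(node):]
    return None

def max_arborescence(weights, root):
    """weights maps (parent, child) -> score; returns a child -> parent map."""
    # Greedy step: keep the best incoming edge of every non-root node.
    best = {}
    for (u, v), w in weights.items():
        if v != root and u != v and (v not in best or w > weights[(best[v], v)]):
            best[v] = u
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # Contract the cycle into one super-node and re-weight the entering edges.
    in_cycle, super_node = set(cycle), ("cycle", tuple(cycle))
    contracted, origin = {}, {}
    for (u, v), w in weights.items():
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:
            key, score = (u, super_node), w - weights[(best[v], v)]
        elif u in in_cycle:
            key, score = (super_node, v), w
        else:
            key, score = (u, v), w
        if key not in contracted or score > contracted[key]:
            contracted[key], origin[key] = score, (u, v)
    sub_tree = max_arborescence(contracted, root)
    # Expand: map contracted edges back, then keep the remaining cycle edges.
    result = {}
    for child, par in sub_tree.items():
        u, v = origin[(par, child)]
        result[v] = u
    for v in cycle:
        result.setdefault(v, best[v])
    return result

# Hypothetical log-likelihood scores w(parent, child).
scores = {
    ("ROOT", "A"): -1.0, ("ROOT", "B"): -3.0, ("ROOT", "C"): -3.0,
    ("A", "B"): -0.5, ("B", "A"): -0.2, ("A", "C"): -2.0, ("B", "C"): -0.4,
}
tree = max_arborescence(scores, "ROOT")
```

In this example the greedy choice alone would pair A and B as each other's parent; contracting that cycle and re-weighting the entering edges yields the chain ROOT, A, B, C, which has the highest total score among all trees rooted at ROOT.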
Thus, the method for automatically generating an event encyclopedia document in this embodiment replaces a single-layer structure with a hierarchical structure, so that the table of contents of the generated document is more complete and the key information on different aspects of the event is described in finer detail. In addition, an improved topic model and a neural network are used together to explicitly mine the word distribution characteristics of the text under each topic, which achieves better performance than the related art.
In some embodiments, for step S200, acquiring the relevant document set of the event to be processed, in an actual implementation, given the name N of a new event for which a document needs to be generated, two approaches, document references and web search, may be used to find relevant web pages on the internet.
In the document reference approach, if N is an event that already exists in the encyclopedia, the set of web pages $W_g$ cited in its encyclopedia document is taken as the result. In the web search approach, the event name N is submitted as a query to the Bing search engine, the top-ranked web pages up to a predetermined number (for example, the top 20) are selected, the web pages that come from the encyclopedia are removed, and the remaining web page set $W_s$ is taken as the result. For $W_g$ and $W_s$, the BeautifulSoup library in Python is used to filter out irrelevant content, yielding the relevant document set $D$ of the new event.
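The content-filtering step can be illustrated with a small sketch. To keep the example free of third-party dependencies, it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, which is the library the text names. The class name, the list of skipped tags and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while skipping markup that is usually boilerplate."""
    SKIP = {"script", "style", "nav", "footer", "header"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><nav>Home | News</nav><p>A magnitude 6.0 earthquake struck.</p>"
    "<footer>Copyright</footer></body></html>"
)
text = extract_text(page)
```

A real pipeline would likely need site-specific rules for what counts as irrelevant content; the tag list here is only a starting point.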
In some embodiments, step S300, determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree, includes:
determining the contribution of each text segment e to a topic t based on the contribution of each word in the text segment to that topic, and determining the target text information of the topic accordingly;
wherein the contribution of a word to a topic is determined by the formula
[Formula image: BDA0002388230420000101]
wherein $W(w, t)$ is the contribution of the word $w$ to the topic $t$, $T$ is the set of topics, $tt$ is any topic in the set of topics $T$, $p_{w,t}$ represents the probability that the word $w$ occurs under the topic $t$, and $p_{w,tt}$ represents the probability that the word $w$ occurs under the topic $tt$.
The word probability distribution $F_t = \{(w, p_{w,t})\}$ under each topic is determined from the relevant document set and the topic tree, where $w$ represents a word and $p_{w,t}$ represents the probability that the word $w$ occurs under the topic $t$.
The frequent occurrence of a word under a topic indicates a close connection with that topic; however, if the word also occurs frequently under other topics, the reliability of that connection is reduced accordingly.
For each text segment e in a candidate document, the average contribution of the words in e is used to quantify the contribution of the text segment to a specific topic t. All text segments are sorted by contribution from high to low, and the first 1000 words are taken, yielding the valuable text information $C_t$ under each topic t.
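The screening step can be sketched as follows. Because the contribution formula is only available as an image, the sketch assumes the form $W(w, t) = p_{w,t} / \sum_{tt \in T} p_{w,tt}$, which matches the stated intuition that a word shared with other topics is weaker evidence; this form is an assumption. The sketch also interprets "the first 1000 words" as keeping whole top-ranked segments until a word budget is reached. All names and probabilities are hypothetical.

```python
def word_contribution(word, topic, dist):
    """Assumed form: p(w | t) divided by the sum of p(w | tt) over all topics."""
    total = sum(dist[tt].get(word, 0.0) for tt in dist)
    return dist[topic].get(word, 0.0) / total if total else 0.0

def segment_contribution(segment, topic, dist):
    """Average word contribution of a whitespace-tokenized text segment."""
    words = segment.lower().split()
    if not words:
        return 0.0
    return sum(word_contribution(w, topic, dist) for w in words) / len(words)

def select_text(segments, topic, dist, word_budget=1000):
    """Keep the highest-scoring segments until the word budget is used up."""
    ranked = sorted(segments,
                    key=lambda s: segment_contribution(s, topic, dist),
                    reverse=True)
    chosen, used = [], 0
    for seg in ranked:
        n_words = len(seg.split())
        if used + n_words > word_budget:
            break
        chosen.append(seg)
        used += n_words
    return chosen

# Hypothetical per-topic word distributions F_t.
dist = {
    "Casualties": {"deaths": 0.4, "injured": 0.3, "people": 0.1},
    "Response": {"rescue": 0.5, "aid": 0.3, "people": 0.1},
}
segments = ["rescue aid people", "deaths injured people"]
selected = select_text(segments, "Casualties", dist, word_budget=3)
```

Here "people" contributes only half as much as "deaths" to the Casualties topic because it is equally likely under Response.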
In some embodiments, step S400, determining the abstracts respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics, includes sub-steps S410 to S430.
Step S410, dividing the target text information into sentences to obtain a sentence set;
Step S420, vectorizing the sentence set to obtain a matrix representation of the sentences;
Step S430, determining core sentences in the matrix representation of the sentences, and combining the core sentences into an abstract.
In actual implementation, the valuable text information $C_t$ obtained for each topic in the preceding step is first divided into sentences to obtain the sentence set $D = [d_1, d_2, \ldots, d_n]$, which is then vectorized using pre-trained GloVe word vectors to obtain the matrix representation of the sentences
[Formula image: BDA0002388230420000111]
The matrix representation of the sentences is input into the DeepChannel model, the most informative core sentences are extracted, and these core sentences are combined as the abstract of the topic. This step generates an abstract for each topic in the topic tree, and the abstracts are combined to obtain the final encyclopedia document.
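The split, vectorize and extract flow of sub-steps S410 to S430 can be illustrated with a toy sketch. It is not the model described above: a tiny hand-made dictionary of 2-dimensional vectors stands in for pre-trained GloVe vectors, sentence vectors are assumed to be averages of word vectors, and cosine similarity to the centroid of all sentence vectors replaces the neural salience model as a simple swapped-in scoring rule. All names and values are hypothetical.

```python
import math
import re

def split_sentences(text):
    """S410: naive sentence splitting on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(sentence, vectors, dim):
    """S420: average the vectors of the known words in a sentence."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def summarize(text, vectors, dim, k=1):
    """S430 stand-in: keep the k sentences closest to the centroid."""
    sentences = split_sentences(text)
    matrix = [embed(s, vectors, dim) for s in sentences]   # n x dim matrix
    centroid = [sum(row[j] for row in matrix) / len(matrix) for j in range(dim)]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(matrix[i], centroid), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))

# Toy 2-dimensional "word vectors" (hypothetical values).
toy_vectors = {"quake": [1.0, 0.0], "killed": [0.8, 0.6], "people": [0.6, 0.8],
               "weather": [0.0, 1.0], "sunny": [0.0, 1.0]}
text = ("The quake killed many people. The quake struck at night. "
        "The weather was sunny.")
summary = summarize(text, toy_vectors, dim=2, k=1)
```

Selected sentences are emitted in their original order so that a multi-sentence abstract stays readable.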
Experimental verification by the inventors shows that the above method for automatically generating event encyclopedia documents outperforms the related art.
For example, Table 1 compares the present invention with the related art on topic tree construction (measured by the F1 value).
TABLE 1
Method | Earthquake | Election | Hurricane | Overall
Prior art | 0.434 | 0.780 | 0.586 | 0.586
The invention (single-round clustering) | 0.854 | 0.937 | 0.949 | 0.914
The invention (complete model) | 0.844 | 0.953 | 0.957 | 0.918
Table 2 compares the present invention with the related art on overall document generation (measured by the F1 value).
TABLE 2
[Table images: BDA0002388230420000112, BDA0002388230420000121]
ROUGE-1 is a metric for evaluating text summaries that reflects the similarity between a generated result and the reference answer. It comprises two parts: precision (the ratio of the number of correct results in the generated result to the total number of generated results) and recall (the ratio of the number of correct results in the generated result to the number of results in the reference answer). F1 combines precision and recall; for example, F1 can be obtained as their harmonic mean.
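As a worked illustration of the metric, the following minimal sketch computes ROUGE-1 precision, recall and F1 from clipped unigram overlap. It assumes whitespace tokenization and lowercasing; the function name and the two example sentences are hypothetical.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # each word counted at most min(c, r) times
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge1(
    "the earthquake killed ten people",
    "the earthquake killed twelve people in total",
)
```

Four of the five generated words appear in the seven-word reference, so precision is 4/5, recall is 4/7, and their harmonic mean is 2/3.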
The experimental results show that the invention performs well, surpasses the related art in several respects, and meets application requirements.
The following describes an event encyclopedia document automatic generation device provided by an embodiment of the present invention, and the event encyclopedia document automatic generation device described below and the event encyclopedia document automatic generation method described above may be referred to in correspondence with each other.
As shown in fig. 3, an event encyclopedia document automatic generation apparatus according to an embodiment of the present invention includes: a subject tree generating unit 610, a document acquiring unit 620, a text filtering unit 630, a summary generating unit 640, and a combining unit 650.
The topic tree generation unit 610 is configured to generate a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed, where the topic tree includes a plurality of topics. The document acquisition unit 620 is configured to acquire a relevant document set of the event to be processed. The text screening unit 630 is configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree. The summary generation unit 640 is configured to determine summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics. The combining unit 650 is configured to generate an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
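The way the five units hand data to one another can be sketched as a small pipeline. Everything in this sketch is hypothetical scaffolding: the class `EncyclopediaGenerator`, the `"ROOT"` key convention for the tree, the helper `flatten`, and the callable signatures are assumptions for illustration, and each unit's actual logic is left as a pluggable function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: parent topic -> ordered child topics,
# with the top-level topics stored under the key "ROOT".
TopicTree = Dict[str, List[str]]


def flatten(tree: TopicTree, root: str = "ROOT") -> List[str]:
    """List the topics of the tree in pre-order (parents before children)."""
    order: List[str] = []
    stack = list(reversed(tree.get(root, [])))
    while stack:
        topic = stack.pop()
        order.append(topic)
        stack.extend(reversed(tree.get(topic, [])))
    return order


@dataclass
class EncyclopediaGenerator:
    build_topic_tree: Callable[[List[str]], TopicTree]              # unit 610
    fetch_documents: Callable[[str], List[str]]                     # unit 620
    screen_text: Callable[[List[str], List[str]], Dict[str, str]]   # unit 630
    summarize: Callable[[str], str]                                 # unit 640

    def run(self, event: str, category_docs: List[str]) -> str:
        tree = self.build_topic_tree(category_docs)
        topics = flatten(tree)
        docs = self.fetch_documents(event)
        target_text = self.screen_text(docs, topics)
        summaries = {t: self.summarize(target_text.get(t, "")) for t in topics}
        # Combining step (unit 650): topic title followed by its summary,
        # skipping topics for which no summary text was produced.
        return "\n\n".join(f"{t}\n{summaries[t]}" for t in topics if summaries[t])
```

Passing the units in as callables keeps the sketch independent of how any single step is implemented, which matches the description's point that the units need not be physically separate components.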
Fig. 4 illustrates the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include a processor 810, a communication interface 820, a memory 830, and a communication bus 840, where the processor 810, the communication interface 820, and the memory 830 communicate with one another via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the automatic generation method for event encyclopedia documents, the method comprising: generating a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
It should be noted that, in a specific implementation, the electronic device in this embodiment may be a server, a PC, or another device, provided that its structure includes the processor 810, the communication interface 820, the memory 830, and the communication bus 840 shown in Fig. 4, where the processor 810, the communication interface 820, and the memory 830 communicate with one another through the communication bus 840, and the processor 810 can call the logic instructions in the memory 830 to execute the above method. This embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
Further, an embodiment of the present invention discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the automatic generation method for event encyclopedia documents provided by the above method embodiments, the method comprising: generating a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the automatic generation method for event encyclopedia documents provided in the foregoing embodiments, the method comprising: generating a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed, wherein the topic tree comprises a plurality of topics; acquiring a relevant document set of the event to be processed; determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree; determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics; and generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
The apparatus embodiments described above are merely illustrative. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An automatic generation method for an event encyclopedia document, characterized by comprising:
generating a topic tree of an event category based on existing encyclopedia documents of the same event category as an event to be processed, wherein the topic tree comprises a plurality of topics;
acquiring a relevant document set of the event to be processed;
determining target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree;
determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics;
generating an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
2. The method for automatically generating an event encyclopedia document according to claim 1, wherein the generating a topic tree of an event category based on existing encyclopedia documents of the same event category as the event to be processed comprises:
acquiring existing encyclopedia documents of the same event category as the event to be processed;
obtaining a title set from the encyclopedia documents, wherein the title set comprises a plurality of titles;
determining a set of topics based on the plurality of titles, the set of topics including the plurality of topics;
determining the topic tree based on the topic collection.
3. The method for automatically generating an event encyclopedia document according to claim 2, wherein the determining a set of topics based on the plurality of titles comprises:
obtaining a plurality of clusters by clustering the plurality of titles;
determining the plurality of topics based on the plurality of clusters, wherein the plurality of topics are in one-to-one correspondence with the plurality of clusters, and determining the topic set according to the plurality of topics.
4. The method of claim 2, wherein the determining the topic tree based on the topic collection comprises:
determining parent-child relationship probabilities between a plurality of the topics in the set of topics;
and determining the topic tree based on the parent-child relationship probabilities.
5. The method of claim 4, wherein the determining parent-child relationship probabilities between the topics in the topic set comprises:
determining structural information between a topic t_i and a topic t_j based on the directory structure of existing encyclopedia documents of the same event category as the event to be processed, wherein the structural information between the topic t_i and the topic t_j characterizes the degree to which the directory structure of the existing encyclopedia documents supports the topic t_i being the parent topic of the topic t_j;
determining a text association feature between the topic t_i and the topic t_j based on the text distribution of existing encyclopedia documents of the same event category as the event to be processed, wherein the text association feature between the topic t_i and the topic t_j characterizes the degree to which the text distribution between the topic t_i and the topic t_j supports the topic t_i being the parent topic of the topic t_j;
determining parent-child relationship probabilities among the plurality of topics based on the structural information and the text association features.
6. The method for automatically generating event encyclopedia documents according to any one of claims 1-4, wherein the determining target text information respectively corresponding to a plurality of topics based on the relevant document set and the topic tree comprises:
determining the contribution of each text segment e to a topic t based on the contribution of each word in the text segment e to the topic t, and determining the target text information of the corresponding topic accordingly;
wherein the contribution of a word to the corresponding topic is determined by the formula:
[Formula appears as an image in the original publication and is not reproduced here.]
where W(w, t) is the contribution of the word w to the topic t, T is the set of topics, tt is any topic in the set of topics T, p_{w,t} represents the probability of the word w appearing under the topic t, and p_{w,tt} represents the probability of the word w appearing under the topic tt.
7. The method for automatically generating an event encyclopedia document according to any one of claims 1-4, wherein the determining summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics comprises:
dividing the target text information according to sentences to obtain a sentence set;
vectorizing the sentence set to obtain a matrix representation of sentences;
determining core sentences in the matrix representation of the sentences, and composing the core sentences into the summary.
8. An automatic generation apparatus for an event encyclopedia document, characterized by comprising:
a topic tree generation unit, configured to generate a topic tree of an event category based on encyclopedia documents of the same event category as an event to be processed, wherein the topic tree comprises a plurality of topics;
a document acquisition unit, configured to acquire a relevant document set of the event to be processed;
a text screening unit, configured to determine target text information respectively corresponding to the plurality of topics based on the relevant document set and the topic tree;
a summary generation unit, configured to determine summaries respectively corresponding to the plurality of topics according to the target text information respectively corresponding to the plurality of topics;
and a combining unit, configured to generate an encyclopedia document of the event to be processed based on the summaries respectively corresponding to the plurality of topics.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method for automatically generating an event encyclopedia document according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for automatic generation of an event encyclopedia document according to any one of claims 1 to 7.
CN202010104947.2A 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document Active CN113282745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104947.2A CN113282745B (en) 2020-02-20 2020-02-20 Automatic generation method and device for event encyclopedia document

Publications (2)

Publication Number Publication Date
CN113282745A true CN113282745A (en) 2021-08-20
CN113282745B CN113282745B (en) 2023-04-18

Family

ID=77275185

Country Status (1)

Country Link
CN (1) CN113282745B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050203970A1 (en) * 2002-09-16 2005-09-15 Mckeown Kathleen R. System and method for document collection, grouping and summarization
CN102637173A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Content forming method for network encyclopedias, network server and client
CN104252487A (en) * 2013-06-28 2014-12-31 百度在线网络技术(北京)有限公司 Method and device for generating entry information
CN104484374A (en) * 2014-12-08 2015-04-01 百度在线网络技术(北京)有限公司 Method and device for creating Internet encyclopedia entry
CN109657054A (en) * 2018-12-13 2019-04-19 北京百度网讯科技有限公司 Abstraction generating method, device, server and storage medium

Also Published As

Publication number Publication date
CN113282745B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant