CN116991967A - Method and device for generating event evolution relation tree - Google Patents

Method and device for generating event evolution relation tree

Info

Publication number
CN116991967A
Authority
CN
China
Prior art keywords
text data
events
event
sub
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311121228.1A
Other languages
Chinese (zh)
Inventor
屠隽弢
徐轶
刘欣
金康荣
龙浪
葛雅金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Aerospace Information Research Institute
Original Assignee
Suzhou Aerospace Information Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Aerospace Information Research Institute
Priority to CN202311121228.1A
Publication of CN116991967A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a method, an apparatus, a device and a storage medium for generating an event evolution relationship tree, which can be applied to the technical field of natural language processing. The method comprises: processing a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix; obtaining a density boundary threshold according to the Euclidean distance between each element and the other elements in the first text feature matrix; clustering the elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events; for the target text data set corresponding to each topic event, processing the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events; and generating, from the plurality of associated sub-events and based on their occurrence times, an event evolution relationship tree corresponding to each topic event.

Description

Method and device for generating event evolution relation tree
Technical Field
The present disclosure relates to the field of natural language processing, and more particularly, to a method and apparatus for generating an event evolution relationship tree.
Background
Rapidly growing information is redundant and fragmented, which makes it difficult for users to quickly and intuitively grasp the dynamic progression of an event from large-scale text.
In the process of implementing the disclosed concept, the inventors found at least the following problems in the related art: existing event story tree generation methods have difficulty reflecting the hierarchical characteristics of the dynamic development of events and the influence of external events on topic events, and, because they lack analysis of the co-occurrence, evolution and correlation of preceding and succeeding event nodes, they cannot fully mine the association relationships among the nodes.
Disclosure of Invention
In view of the above, the present disclosure provides a method and apparatus for generating an event evolution relationship tree.
According to a first aspect of the present disclosure, there is provided a method for generating an event evolution relationship tree, including: processing a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix, wherein each element in the first text feature matrix represents the word frequency of a keyword used to describe an event;
obtaining a density boundary threshold according to the Euclidean distance between each element and the other elements in the first text feature matrix;
clustering the elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events;
for the target text data set corresponding to each topic event, processing the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events; and
generating, from the plurality of associated sub-events and based on their occurrence times, an event evolution relationship tree corresponding to each topic event.
According to an embodiment of the present disclosure, the first text feature matrix includes I elements, I being an integer greater than 1, and obtaining the density boundary threshold according to the Euclidean distance between each element and the other elements in the first text feature matrix includes:
for the i-th element, obtaining a j-th element corresponding to the i-th element from the other I-1 elements according to the Euclidean distances between the i-th element and the other I-1 elements, wherein i is an integer greater than or equal to 1 and less than or equal to I;
in a case where i is determined to be less than I, incrementing i and returning to the operation of obtaining the j-th element corresponding to the i-th element from the other I-1 elements;
in a case where i is determined to be equal to I, obtaining J elements, wherein J is an integer equal to I, and j is an integer greater than or equal to 1 and less than J; and
obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the corresponding J elements.
According to an embodiment of the present disclosure, obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the corresponding J elements includes:
obtaining a Euclidean distance average value from the plurality of Euclidean distances between the I elements and the corresponding J elements;
arranging the plurality of Euclidean distances in descending order to obtain a Euclidean distance sequence;
performing residual processing on the Euclidean distance sequence to obtain the rate of change of the differences between adjacent Euclidean distance values;
obtaining a target Euclidean distance value from the Euclidean distance sequence based on the rate of change of the differences; and
obtaining the density boundary threshold according to the Euclidean distance average value and the target Euclidean distance value.
According to an embodiment of the present disclosure, clustering the elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events includes:
performing density analysis on the elements in the first text feature matrix based on a density clustering algorithm to obtain a plurality of candidate text data sets and the number of the candidate text data sets, wherein each candidate text data set is a text data set in which the density distribution of the elements is less than or equal to the density boundary threshold; and
clustering the plurality of candidate text data sets, with the number of the candidate text data sets used as the number of target text data sets, to obtain the plurality of target text data sets corresponding to the plurality of topic events.
According to an embodiment of the present disclosure, clustering the plurality of candidate text data sets, with the number of the candidate text data sets used as the number of target text data sets, to obtain the plurality of target text data sets corresponding to the plurality of topic events includes:
clustering the plurality of candidate text data sets based on a K-means clustering algorithm, with the number of the candidate text data sets used as the number of target text data sets, to obtain a plurality of clustered text data sets corresponding to the plurality of topic events, and non-clustered text data; and
merging the non-clustered text data into the clustered text data set of the corresponding topic event, based on the relevance between the non-clustered text data and the plurality of candidate topic events, to obtain the target text data sets.
According to an embodiment of the present disclosure, processing the target text data set corresponding to each topic event based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events includes:
extracting keywords from the target text data set corresponding to each topic event to obtain a keyword set;
processing the keyword set based on the LDA algorithm according to predetermined weights of the keywords to obtain the number of correlated associated sub-events in the target text data set;
processing the target text data set based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a second text feature matrix, wherein the elements in the second text feature matrix represent the word frequencies of keywords used to describe the topic event corresponding to the target text data set; and
clustering the second text feature matrix based on a K-medoids (K center point) clustering algorithm, using the number of associated sub-events as the number of clusters, to obtain the plurality of correlated associated sub-events.
According to an embodiment of the present disclosure, processing the keyword set based on the latent Dirichlet allocation (LDA) algorithm according to the predetermined weights of the keywords to obtain the number of correlated associated sub-events in the target text data set includes:
processing the keyword set based on the LDA algorithm according to the predetermined weights of the keywords to obtain text consistency coefficients of the target text data set corresponding to predetermined numbers of associated events; and
obtaining the number of correlated associated sub-events in the target text data set based on the text consistency coefficients.
According to an embodiment of the present disclosure, the plurality of associated sub-events includes a plurality of sub-events in a first cluster and a plurality of sub-events in a second cluster, and generating the event evolution relationship tree corresponding to each topic event from the plurality of associated sub-events based on their occurrence times includes:
in a case where the earliest sub-event in the first cluster is determined to occur earlier than the earliest sub-event in the second cluster, taking the plurality of sub-events in the first cluster and the earliest sub-event in the second cluster as trunk nodes; and
generating the event evolution relationship tree corresponding to each topic event from the trunk nodes and the other sub-events in the second cluster.
According to an embodiment of the present disclosure, generating the event evolution relationship tree corresponding to each topic event from the trunk nodes and the other sub-events in the second cluster includes:
associating the trunk nodes in sequence, in chronological order of the occurrence times of their corresponding sub-events, to generate the trunk of the event evolution relationship tree;
associating the other sub-events in the second cluster in sequence as branch nodes, in chronological order of their occurrence times, to generate a branch of the event evolution relationship tree; and
associating the trunk with the branch based on the correlation between the trunk nodes and the branch nodes to generate the event evolution relationship tree.
A second aspect of the present disclosure provides an apparatus for generating an event evolution relationship tree, including: a first processing module configured to process a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix, wherein each element in the first text feature matrix represents the word frequency of a keyword used to describe an event;
an obtaining module configured to obtain a density boundary threshold according to the Euclidean distance between each element and the other elements in the first text feature matrix;
a clustering module configured to cluster the elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events;
a second processing module configured to process, for the target text data set corresponding to each topic event, the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events; and
a generation module configured to generate, based on the occurrence times of the plurality of associated sub-events, an event evolution relationship tree corresponding to each topic event.
According to the method and apparatus for generating an event evolution relationship tree provided by the present disclosure, a text data set is processed to obtain a first text feature matrix; a density boundary threshold is obtained according to the Euclidean distance between each element and the other elements in the first text feature matrix; the elements in the first text feature matrix are clustered based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events; the target text data sets are processed to obtain a plurality of correlated associated sub-events; and an event evolution relationship tree corresponding to each topic event is generated from the plurality of associated sub-events based on their occurrence times. Because the density boundary threshold is computed from the Euclidean distances among the elements of the text data, rather than from manual labeling as in the prior art, it is adaptive to a certain degree. This at least partially solves the following problems: existing event context generation methods based on historical rules and time factors cannot perceive event evolution from the sub-event perspective or mine the event development context from multiple perspectives, and existing algorithms cannot fully realize adaptive clustering, so their generalization capability and robustness are low.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a method, apparatus, device, medium and program product for generating an event evolutionary relationship tree in accordance with an embodiment of the disclosure;
FIG. 2 schematically illustrates a flow chart of a method of generating an event evolution relationship tree in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a subject matter cluster analysis in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of sub-event cluster analysis according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a schematic diagram of an event evolution relationship tree building process according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a method of generating an event evolution relationship tree according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a generation apparatus of an event evolution relationship tree according to an embodiment of the present disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device adapted to implement a method of generating an event evolution relationship tree according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C, etc." are used, they should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical solution of the present disclosure, the user information involved (including, but not limited to, user personal information, user image information, and user equipment information such as location information) and the data involved (including, but not limited to, data for analysis, stored data, and displayed data) are information and data authorized by the user or fully authorized by all parties. The collection, storage, use, processing, transmission, provision, disclosure and application of the related data all comply with the relevant laws, regulations and standards of the relevant countries and regions; necessary security measures are taken; public order and good customs are not violated; and corresponding operation entries are provided for the user to grant or refuse authorization.
Events are an important component of human society, and the rapid development of social networks gives people a convenient way to obtain hot news efficiently. However, rapidly growing information is redundant and fragmented, and news stories in particular tend to have a hierarchical structure. Information screened by spatio-temporal correlation alone cannot accurately capture the clues and development context of an event, while constructing story trees too heavily around news elements inverts the primary and the secondary.
Therefore, how to highlight the importance of news elements during feature mining and reasonably construct story trees from multiple perspectives by technical means is one of the important subjects in the current field of event graph research. In recent years, there has been a great deal of research on the related problems and core technologies of story tree generation, in which the number of topics and the range of events are determined by community detection and cluster analysis of different data in combination with machine learning algorithms.
However, most topic clustering algorithms rely on manual labeling results and cannot fully realize an unsupervised adaptive clustering process. In addition, existing event story tree generation methods mainly extract news events with a single structure from a large number of texts and mechanically connect them in chronological order to construct a story tree with a single-chain structure, so that the hierarchical characteristics of the dynamic development of events and the influence of external events on topic events are difficult to reflect.
Furthermore, most methods based on historical rules use timelines to sort out the context of events at coarse granularity and lack analysis of the co-occurrence, evolution and correlation of preceding and succeeding event nodes, so the association relationships among the nodes cannot be fully mined.
In order to at least partially solve the technical problems existing in the related art, an embodiment of the present disclosure provides a method for generating an event evolution relationship tree, which may be applied to the technical field of natural language processing. The method comprises: processing a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix; obtaining a density boundary threshold according to the Euclidean distance between each element and the other elements in the first text feature matrix; clustering the elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events; for the target text data set corresponding to each topic event, processing the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events; and generating, from the plurality of associated sub-events and based on their occurrence times, an event evolution relationship tree corresponding to each topic event.
Fig. 1 schematically illustrates an application scenario diagram of an event evolution relationship tree according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the event evolution relationship tree generating method provided by the embodiments of the present disclosure may be generally executed by the server 105. Accordingly, the event evolution relationship tree generating apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The event evolution relationship tree generation method provided by the embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the event evolution relationship tree generating apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The event evolution relationship tree generation method of the embodiments of the present disclosure will be described in detail below with reference to FIG. 2 to FIG. 6, based on the scenario described in FIG. 1.
Fig. 2 schematically illustrates a flowchart of an event evolution relationship tree generation method according to an embodiment of the present disclosure.
As shown in fig. 2, the event evolution relationship tree generating method of this embodiment includes operations S210 to S250.
In operation S210, a text data set to be processed is processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix.
For example, word frequency statistics are first computed on the text data set W to obtain a word frequency matrix TF. The inverse document frequency is then calculated according to equation (1):
where Y represents the total number of texts and Y_ω is the number of texts containing the term ω. The calculation results are assembled into an inverse document frequency matrix IDF, and the TF-IDF matrix of the keywords in the text data set, namely the first text feature matrix, is finally obtained as the Hadamard product of TF and IDF.
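The TF and IDF steps above can be illustrated with a minimal Python sketch. The source text does not reproduce equation (1), so the form `log(Y / Y_w)` used here is an assumption (a common unsmoothed definition), and the function name `tf_idf` and the sample texts are hypothetical.

```python
import math

def tf_idf(tokenized_texts):
    """Return (vocabulary, TF-IDF matrix) for a list of token lists."""
    vocab = sorted({tok for text in tokenized_texts for tok in text})
    total = len(tokenized_texts)  # Y: total number of texts
    # TF: raw count of each term in each text (rows = texts, columns = terms).
    tf = [[text.count(term) for term in vocab] for text in tokenized_texts]
    # Y_w: number of texts that contain each term.
    doc_freq = [sum(1 for text in tokenized_texts if term in text) for term in vocab]
    # Assumed IDF form log(Y / Y_w); the patent's equation (1) may differ.
    idf_row = [math.log(total / df) for df in doc_freq]
    # Repeat the per-term row so the IDF matrix has the same shape as TF.
    idf = [idf_row[:] for _ in tokenized_texts]
    # Hadamard (element-wise) product of TF and IDF.
    matrix = [[t * w for t, w in zip(tf_r, idf_r)] for tf_r, idf_r in zip(tf, idf)]
    return vocab, matrix

texts = [["earthquake", "rescue"], ["earthquake", "tsunami"], ["launch", "spacecraft"]]
vocab, matrix = tf_idf(texts)
```

Terms that appear in fewer texts receive a larger weight, and a term absent from a text contributes zero to that text's row.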
In operation S220, a density boundary threshold is obtained according to the Euclidean distance between each element and the other elements in the first text feature matrix.
For example, for a first text feature matrix [a, b, c, d, e], the Euclidean distances between each element and the other elements include, for element a, the Euclidean distances between a and each of b, c, d and e.
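As a minimal sketch of this distance computation, the following Python function builds the full pairwise Euclidean distance table. It assumes that each "element" is a numeric feature vector (for instance one row of the feature matrix); that reading, the function name `pairwise_distances` and the sample vectors are illustrative assumptions, not details from the source.

```python
import math

def pairwise_distances(rows):
    """Return an n x n table of Euclidean distances between feature vectors."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d = math.dist(rows[a], rows[b])  # Euclidean distance (Python 3.8+)
            dist[a][b] = dist[b][a] = d      # the table is symmetric
    return dist

rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = pairwise_distances(rows)
```

Row `a` of the table then holds the distances from element `a` to every other element, which is the input needed for the threshold calculation.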
In operation S230, the elements in the first text feature matrix are clustered based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events.
According to the embodiment of the disclosure, a plurality of target text data sets are obtained after clustering, each corresponding to its own event topic. The text data within each target text data set are correlated to a certain degree and belong to the same event topic.
For example, given a target text data set A = [earthquake, tsunami, flood] and a target text data set B = [fire, explosion], the data in data set A all belong to the natural disaster topic, and the data in data set B all belong to the man-made disaster topic.
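To show how a distance threshold can yield both candidate groups and their number, here is a deliberately simplified Python sketch. It is not DBSCAN or any specific density clustering algorithm from the source: it only groups vectors that are chained together by distances no larger than `eps` (single-linkage grouping), and the resulting count could then serve as the cluster number for a later K-means step. The name `count_density_groups` and the sample data are hypothetical.

```python
import math
from collections import deque

def count_density_groups(rows, eps):
    """Label vectors so that any two within eps of each other share a group."""
    n = len(rows)
    labels = [-1] * n  # -1 means "not yet assigned"
    group = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = group
        queue = deque([start])
        while queue:  # breadth-first expansion of the current group
            a = queue.popleft()
            for b in range(n):
                if labels[b] == -1 and math.dist(rows[a], rows[b]) <= eps:
                    labels[b] = group
                    queue.append(b)
        group += 1
    return labels, group

rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
labels, n_groups = count_density_groups(rows, eps=1.5)
```

In this toy data the first three vectors form one group and the last two form another, so the number of candidate groups is 2.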
In operation S240, for the target text data set corresponding to each topic event, the target text data set is processed based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of correlated associated sub-events.
According to the embodiment of the disclosure, the associated sub-events obtained in this way for each topic event are correlated with one another.
For example, sub-event A: a major earthquake occurs in a first place; sub-event B: a second experiment module is successfully launched; sub-event C: the manned spacecraft XX is successfully launched. Sub-event B and sub-event C both belong to Chinese aerospace events, so they are correlated to a certain degree and are associated sub-events. Sub-event A belongs to a natural disaster event and has no correlation with sub-event B or sub-event C.
In operation S250, an event evolution relationship tree corresponding to each topic event is generated from the plurality of associated sub-events based on their occurrence times.
According to an embodiment of the present disclosure, the generation of the event evolution relationship tree depends on the occurrence times of the associated sub-events. First, the occurrence time of each associated sub-event is determined; the trunk and branches of the story tree are then divided; an event start node, trunk nodes, branch nodes and key nodes are established; and all the nodes are connected to generate the event evolution relationship tree.
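The trunk and branch division can be sketched in Python as follows, using the two-cluster rule stated earlier in the disclosure: the cluster with the earlier first sub-event supplies the trunk, together with the earliest sub-event of the other cluster, and the remaining sub-events of the other cluster form a branch. Several details are assumptions made only for illustration: the `SubEvent` class, integer occurrence times, swapping the clusters when the second one starts earlier, and attaching the branch at the trunk node taken from the same cluster (the source attaches it by correlation, without giving the rule).

```python
from dataclasses import dataclass, field

@dataclass
class SubEvent:
    name: str
    time: int  # occurrence time, e.g. a day index (assumed representation)
    children: list = field(default_factory=list)

def build_tree(first_cluster, second_cluster):
    """Link sub-events into a trunk plus one branch; return the start node."""
    first = sorted(first_cluster, key=lambda e: e.time)
    second = sorted(second_cluster, key=lambda e: e.time)
    if second[0].time < first[0].time:  # assumption: treat the roles symmetrically
        first, second = second, first
    # Trunk: all sub-events of the earlier cluster plus the other cluster's earliest.
    trunk = sorted(first + [second[0]], key=lambda e: e.time)
    for parent, child in zip(trunk, trunk[1:]):
        parent.children.append(child)
    # Branch: remaining sub-events of the other cluster, chained in time order.
    attach = second[0]  # assumed attachment point on the trunk
    for node in second[1:]:
        attach.children.append(node)
        attach = node
    return trunk[0]

a1, a2, a3 = SubEvent("A1", 1), SubEvent("A2", 3), SubEvent("A3", 6)
b1, b2, b3 = SubEvent("B1", 2), SubEvent("B2", 4), SubEvent("B3", 5)
root = build_tree([a1, a2, a3], [b1, b2, b3])
```

Here the trunk runs A1, B1, A2, A3, and the branch B2, B3 hangs off B1, so B1 has one trunk successor and one branch successor.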
According to the embodiment of the disclosure, a feature matrix is obtained based on the term frequency-inverse document frequency (TF-IDF) algorithm; the Euclidean distances among the elements in the feature matrix are calculated to obtain a density boundary threshold; the elements in the feature matrix are clustered to obtain a plurality of target text data sets corresponding to a plurality of topic events; correlated associated sub-events are obtained according to the latent Dirichlet allocation (LDA) algorithm; and a corresponding event evolution relationship tree is generated based on the occurrence time of each associated sub-event. Obtaining the density boundary threshold from the Euclidean distances among the elements in each text data set realizes adaptive clustering. Because the density boundary threshold is calculated from the text data rather than preset, the text data to be processed is not restricted, which overcomes the prior-art limitation that traditional algorithms rely on prior knowledge when tuning parameters. Adaptively selecting the optimal parameters from the text content also addresses the tendency of the algorithm to fall into a local optimum during optimization. Meanwhile, the tree structure intuitively describes the evolution of the events at each stage, and each event is mapped to a corresponding node, so that the clues and development context of the events can be grasped accurately.
According to an embodiment of the present disclosure, the first text feature matrix includes I elements, where I is an integer greater than 1, and obtaining the density boundary threshold according to the Euclidean distances between each element and the other elements in the first text feature matrix includes: for the i-th element, obtaining the j-th element corresponding to the i-th element from the other I-1 elements according to the Euclidean distances between the i-th element and the other I-1 elements, where i is an integer greater than or equal to 1 and less than or equal to I; when i is determined to be smaller than I, incrementing i and returning to the operation of obtaining the j-th element corresponding to the i-th element from the other I-1 elements; when i is determined to be equal to I, obtaining J elements, where J is an integer equal to I and j is an integer greater than or equal to 1 and less than J; and obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the corresponding J elements.
According to an embodiment of the disclosure, obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the corresponding J elements includes the following steps: obtaining a Euclidean distance average from the plurality of Euclidean distances between the I elements and the corresponding J elements; arranging the plurality of Euclidean distances in descending order to obtain a Euclidean distance sequence; performing residual processing on the Euclidean distance sequence to obtain the difference change rate of adjacent Euclidean distance values; obtaining a target Euclidean distance value from the Euclidean distance sequence based on the difference change rate; and obtaining the density boundary threshold according to the Euclidean distance average and the target Euclidean distance value.
For example, the first text feature matrix [m, n, o, p, q] includes 5 elements. The Euclidean distances between each of the elements m, n, o, p and q and the remaining 4 elements are calculated and sorted; the j-th element corresponding to each element is the one ranked n-th by Euclidean distance, and j is one of the elements other than the element itself. The Euclidean distances between the 5 elements and their 5 corresponding elements may be calculated according to equation (2):
The set of Euclidean distances between all the elements m, n, o, p, q and their corresponding j-th elements is $D = \{d_1, d_2, \ldots, d_n\}$. The Euclidean distance average may be calculated according to equation (3):
Then, the Euclidean distances in the distance set D are arranged in descending order, a K-distance curve is drawn, and residuals are calculated between adjacent values; the residual set is expressed as $E = \{e_1, e_2, \ldots, e_{n-1}\}$. A second residual calculation is performed on the residual result to obtain the residual point $e_{max}$ corresponding to the maximum difference change rate. The Euclidean distance value corresponding to this residual point is the target Euclidean distance value, denoted $d_{max}$. Finally, based on the Euclidean distance average $d_{aver}$ and the target Euclidean distance value $d_{max}$, the density boundary threshold is calculated according to equation (4):
According to the embodiment of the disclosure, the density boundary threshold is obtained from the Euclidean distances between each element and the other elements in the first text feature matrix, using a new adaptive eps parameter selection algorithm. This overcomes the reliance on prior knowledge when parameters are tuned for conventional algorithms: the optimal parameters are selected adaptively from the text content, which mitigates the tendency of the optimization process to fall into a local optimum, improves the robustness of the overall model, and makes the model more interpretable.
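The threshold procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names and the neighbour rank `k` are hypothetical, the rows of the feature matrix are treated as points, and the choice of which distance "corresponds" to the maximum second residual is only one possible reading. Equation (4) is not reproduced in this text, so the sketch stops at the two quantities it combines, the average d_aver and the target value d_max.

```python
import math

def kth_neighbor_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(dists[k - 1])
    return out

def threshold_ingredients(points, k=1):
    """Return (d_aver, d_max); needs at least 3 points."""
    d = kth_neighbor_distances(points, k)
    d_aver = sum(d) / len(d)                      # Euclidean distance average
    seq = sorted(d, reverse=True)                 # descending K-distance sequence
    # First residuals between adjacent distances: e_1 .. e_{n-1}.
    first = [seq[i] - seq[i + 1] for i in range(len(seq) - 1)]
    # Second residuals: change of the first residuals (difference change rate).
    second = [abs(first[i] - first[i + 1]) for i in range(len(first) - 1)]
    idx = max(range(len(second)), key=second.__getitem__)
    # Assumption: the elbow distance is the middle of the three values
    # seq[idx], seq[idx + 1], seq[idx + 2] that produce the largest change.
    d_max = seq[idx + 1]
    return d_aver, d_max

d_aver, d_max = threshold_ingredients([(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)])
```

For the five one-dimensional points in the usage line, the nearest-neighbour distances are 1, 1, 1, 1 and 7, so the average is 2.2 and the elbow value selected by this reading is 1.0.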
According to an embodiment of the present disclosure, clustering elements in a first text feature matrix based on a density boundary threshold to obtain a plurality of target text datasets corresponding to a plurality of subject events includes: performing density analysis on elements in the first text feature matrix based on a density clustering algorithm to obtain a plurality of candidate text data sets and the number of the candidate text data sets, wherein the candidate text data sets represent text data sets with element density distribution less than or equal to a density boundary threshold value; and clustering the plurality of candidate text data sets by taking the number of the plurality of candidate text data sets as the number of target text data sets to obtain a plurality of target text data sets corresponding to a plurality of subject events.
According to embodiments of the present disclosure, density analysis of elements in a first text feature matrix may utilize a density-based DBSCAN algorithm without pre-specifying the number of clusters.
For example, if the density boundary threshold is 0.75 and the density distributions of the elements a, b, c and d in the first text feature matrix W are 3, 0.6, 0.75 and 0.92 respectively, then the density distributions of elements b and c are less than or equal to the density boundary threshold of 0.75, so [b, c] constitutes a candidate text data set.
For example, after the density clustering algorithm, 300 candidate text data sets are obtained, and then the number of target text data sets is 300.
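To illustrate how a density clustering pass can yield the number of candidate sets without that number being specified in advance, the following is a compact textbook-style DBSCAN written with the Python standard library only. It is a sketch under assumptions: `eps` plays the role of the density boundary threshold, while `min_pts` (the minimum neighbourhood size, counting the point itself) is a parameter of standard DBSCAN that the text above does not specify.

```python
import math

def dbscan(points, eps, min_pts):
    """Return (labels, n_clusters); label -1 marks noise points."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point joins the cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels, cluster

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels, n_candidates = dbscan(pts, eps=1.5, min_pts=3)
```

Here `n_candidates` is the cluster count that would be passed on as the number of target text data sets, and the point labelled -1 is an isolated point left unclustered.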
According to an embodiment of the present disclosure, clustering a plurality of candidate text data sets with a number of the candidate text data sets as a number of target text data sets to obtain a plurality of target text data sets corresponding to a plurality of subject events, includes: taking the number of the candidate text data sets as the number of the target text data sets, and clustering the candidate text data sets based on a K-means clustering algorithm to obtain clustered text data sets and non-clustered text data corresponding to the topic events; and merging the non-clustered text data with the clustered text data set corresponding to the corresponding subject event based on the relevance between the non-clustered text data and the plurality of candidate subject events to obtain a target text data set.
According to the embodiment of the disclosure, the number of candidate text data sets is taken as the number of target text data sets and denoted n; n is input to the K-means clustering algorithm as the K value to perform a second round of clustering.
According to the embodiment of the disclosure, merging the non-clustered text data with the clustered text data set corresponding to the relevant topic event means merging the isolated points left unclustered during the clustering process into the existing topic clusters.
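The second clustering stage and the merging of isolated points can be sketched as below. The sketch makes several assumptions that are not stated in the text: K-means is seeded with caller-supplied initial centroids (one per candidate set, so K equals their number), the non-clustered data are taken to be points set aside before K-means runs, and "relevance" is approximated by Euclidean distance to a cluster centroid. All function names are hypothetical.

```python
import math

def centroid(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def nearest(p, cents):
    """Index of the centroid closest to point p."""
    return min(range(len(cents)), key=lambda c: math.dist(p, cents[c]))

def kmeans(points, init_centroids, iters=20):
    """Lloyd's algorithm; K is the number of initial centroids (iters >= 1)."""
    cents = list(init_centroids)
    for _ in range(iters):
        groups = [[] for _ in cents]
        for p in points:
            groups[nearest(p, cents)].append(p)
        # Keep the old centroid if a group happens to be empty.
        new = [centroid(g) if g else cents[i] for i, g in enumerate(groups)]
        if new == cents:
            break
        cents = new
    return cents, groups

def merge_orphans(cents, groups, orphans):
    """Attach each non-clustered point to the cluster with the nearest centroid."""
    merged = [list(g) for g in groups]
    for o in orphans:
        merged[nearest(o, cents)].append(o)
    return merged

clustered = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(clustered, init_centroids=[(0, 0), (10, 10)])
targets = merge_orphans(cents, groups, orphans=[(50, 50)])
```

In this usage the isolated point (50, 50) ends up in the second cluster, so no input text is dropped from the target sets.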
According to the embodiment of the disclosure, the first text feature matrix is clustered in two successive stages, by a density clustering algorithm and then by a K-means algorithm, to obtain a plurality of target text data sets corresponding to a plurality of subject events. Clustering in multiple passes makes the correlation of the data within each cluster as high as possible. Both the density clustering algorithm and the K-means algorithm are unsupervised learning algorithms: they need no training and no preset categories and can perform cluster analysis adaptively, which addresses the problem that the prior art does not use adaptive clustering to generate an event evolution relationship tree. Merging the non-clustered text data into the clustered text data sets of the corresponding subject events also preserves the integrity and comprehensiveness of the text data, so that the text extraction precision is as high as possible.
FIG. 3 schematically illustrates a flow chart of subject matter cluster analysis, according to an embodiment of the disclosure.
As shown in FIG. 3, the method 300 includes operations S310-S380.
In operation S310, input: the text data set is preprocessed to obtain a word segmentation set and a keyword set.
According to an embodiment of the present disclosure, the preprocessing operation includes: inputting a text data set $\Phi = \{T_1, T_2, \ldots, T_n\}$ for generating an event story tree. First, each text is segmented with a Chinese word segmentation tool to obtain the word segmentation set, recorded as $\Omega = \{S_1, S_2, \ldots, S_n\}$, and a stop word list is loaded to filter the stop words out of the texts. Then, keywords are extracted with the TextRank algorithm, and the keyword results of all the texts are represented as $\psi = \{K_1, K_2, \ldots, K_n\}$, which is the keyword set. The keywords contained in the i-th text can be denoted $K_i = \{k_1, k_2, \ldots, k_{topK}\}$, where topK is the number of keywords per text.
According to the embodiment of the disclosure, stop words are characters or words that, in information retrieval, are automatically filtered out before the text is processed in order to save storage space and improve search efficiency.
For example, if the stop word list is [one, that], every "one" and "that" contained in each text is filtered out.
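The stop-word filtering step can be illustrated with a few lines of Python. This is only a sketch: whitespace splitting stands in for the Chinese word segmentation tool, the TextRank keyword extraction is omitted, and the function name `preprocess` and its parameters are hypothetical.

```python
def preprocess(texts, stop_words, tokenize=str.split):
    """Return one token list per text with stop words removed."""
    stop = set(stop_words)
    return [[tok for tok in tokenize(t) if tok not in stop] for t in texts]

segments = preprocess(["one flood hit that town"], stop_words=["one", "that"])
# segments == [["flood", "hit", "town"]]
```

A real segmentation function can be supplied through the `tokenize` argument without changing the filtering logic.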
In operation S320, a first text feature matrix is calculated and normalized.
According to the embodiment of the present disclosure, word frequency statistics are first computed using the keyword set ψ obtained in S310, and the result is converted into a word frequency matrix TF. Word frequency statistics count how often each keyword in the keyword set ψ appears among the keywords. For the process of obtaining the first text feature matrix, refer to S210.
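As an illustration of how per-text keyword lists can be turned into a feature matrix, the sketch below computes a plain TF-IDF matrix. It assumes one common variant of the weighting (relative term frequency multiplied by log(N / df)); the text does not say which variant or normalization is used, and the function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(keyword_lists):
    """Return (vocabulary, matrix) with one row per text, one column per keyword."""
    vocab = sorted({k for ks in keyword_lists for k in ks})
    n_docs = len(keyword_lists)
    # Document frequency: number of texts in which each keyword occurs.
    df = {w: sum(1 for ks in keyword_lists if w in ks) for w in vocab}
    matrix = []
    for ks in keyword_lists:
        row = []
        for w in vocab:
            tf = ks.count(w) / len(ks) if ks else 0.0
            row.append(tf * math.log(n_docs / df[w]))
        matrix.append(row)
    return vocab, matrix

vocab, matrix = tf_idf([["flood", "rescue", "flood"], ["rescue", "election"]])
```

In this example "rescue" occurs in both texts, so its weight is zero in every row, while "flood" and "election" each distinguish one text.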
In operation S330, it is determined whether the PCA dimension reduction threshold is a positive integer.
In operation S331, when the dimension threshold is a positive integer, the first text feature matrix is reduced to a feature matrix of that dimension.
In operation S332, when the dimension threshold is not a positive integer, the first text feature matrix is adaptively reduced in dimension by a percentage.
According to the embodiment of the disclosure, the first text feature matrix is reduced in dimension by the PCA (principal component analysis) algorithm, and a dimension threshold is set to determine how far to reduce. When the threshold is not a positive integer, the matrix is reduced adaptively to the optimal dimension according to the proportion of information that must be retained; when the threshold is a positive integer, the matrix is reduced directly to that dimension. Setting the dimension threshold in this way can improve algorithm performance while taking data integrity into account.
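The branch between operations S331 and S332 can be sketched as a small decision function. The sketch assumes that the explained-variance ratios of the principal components are already available in descending order, and that a non-integer threshold is a fraction in (0, 1) giving the share of variance to retain; both points are assumptions, as is the function name.

```python
def choose_n_components(threshold, explained_variance_ratio):
    """Pick how many principal components to keep."""
    is_pos_int = (isinstance(threshold, int)
                  and not isinstance(threshold, bool)
                  and threshold > 0)
    if is_pos_int:
        # S331: reduce directly to the requested dimension.
        return min(threshold, len(explained_variance_ratio))
    if not 0 < threshold < 1:
        raise ValueError("threshold must be a positive integer or a fraction in (0, 1)")
    # S332: keep the fewest components whose cumulative share reaches the threshold.
    cumulative = 0.0
    for n, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return n
    return len(explained_variance_ratio)

ratios = [0.6, 0.25, 0.1, 0.05]
fixed = choose_n_components(2, ratios)       # 2
adaptive = choose_n_components(0.9, ratios)  # 3
```

This mirrors a convention found in common PCA libraries, where the component-count argument may be either an integer or a retained-variance fraction.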
In operation S340, euclidean distances for each element in the first text feature matrix are calculated.
In operation S350, a density boundary threshold is calculated.
In operation S360, the target text quantity based on the density clustering algorithm is confirmed.
In operation S370, topic events based on the K-means clustering algorithm are clustered.
In operation S380, output: and clustering results of the theme events.
According to an embodiment of the present disclosure, for a target text data set corresponding to each subject event, processing the target text data set based on an implicit dirichlet allocation algorithm to obtain a plurality of associated sub-events with relevance, including: extracting keywords of a target text data set corresponding to each subject event to obtain a keyword set; processing the keyword set based on an implicit dirichlet allocation algorithm according to the preset weight of the keyword to obtain the number of related sub-events with relevance in the target text data set; processing the target text data set based on a word frequency-reverse file frequency algorithm to obtain a second text feature matrix, wherein elements in the second text feature matrix represent word frequencies of keywords used for describing topic events corresponding to the target text data set; and clustering the second text feature matrix based on the number of the associated sub-events based on a K center point clustering algorithm to obtain a plurality of associated sub-events with relevance.
According to an embodiment of the present disclosure, processing a keyword set based on an implicit dirichlet allocation algorithm according to a predetermined weight of a keyword to obtain the number of related sub-events with relevance in a target text data set includes: processing the keyword set based on an implicit dirichlet allocation algorithm according to the preset weight of the keyword to obtain a text consistency coefficient in a target text data set corresponding to the preset associated event number; and obtaining the number of related sub-events with relevance in the target text data set based on the text consistency coefficient.
According to embodiments of the present disclosure, a keyword set $\psi_C$ and a word segmentation set $\Omega_C$ are extracted from the target text data set corresponding to each subject event. Then, according to the predetermined weight of each keyword in $\psi_C$, a corresponding bag-of-words (BoW) model and keyword corpus are generated, and document consistency coefficients are calculated by traversal under different numbers of sub-event clusters based on implicit Dirichlet allocation. The result is recorded as $H = \{H_1, H_2, \ldots, H_m\}$, where m is a positive integer representing the number of sub-event clusters, whose maximum value does not exceed the number of keywords in the corresponding keyword set $\psi_C$. The number λ of associated sub-events having relevance that corresponds to the maximum value in the set H is selected. Next, $\Omega_C$ is standardized into matrix form, the corresponding $\text{TF-IDF}_C$ matrix is calculated, and its dimension is reduced using the PCA algorithm. Finally, the number λ and the dimension-reduced $\text{TF-IDF}_C$ matrix are used as the input of the K center point clustering algorithm, and sub-event cluster analysis is performed on the events in the topic cluster to obtain a plurality of associated sub-events having relevance. The clustering result is denoted $V = \{V_1, V_2, \ldots, V_j\}$, and each cluster contains the sub-events with the highest cross-correlation.
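The selection of λ from the set H amounts to a traversal followed by an argmax, which can be sketched as follows. The callable `score` is a hypothetical stand-in for fitting an implicit Dirichlet allocation model with m topics and measuring its consistency coefficient; the numeric values in the usage lines are invented placeholders, not results from the disclosure.

```python
def select_sub_event_count(score, max_clusters):
    """Evaluate score(m) for m = 1..max_clusters and return the best m with all scores."""
    coefficients = {m: score(m) for m in range(1, max_clusters + 1)}
    best = max(coefficients, key=coefficients.get)
    return best, coefficients

# Placeholder coefficients standing in for H_1..H_4.
fake_h = {1: 0.31, 2: 0.42, 3: 0.55, 4: 0.47}
lam, h = select_sub_event_count(fake_h.get, max_clusters=4)
```

With these placeholder values the largest coefficient belongs to m = 3, so λ = 3 would be passed to the sub-event clustering step.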
According to the embodiment of the disclosure, the target text data set corresponding to each topic event is processed using implicit Dirichlet allocation and the K center point clustering algorithm to obtain the associated sub-events having relevance for the plurality of topic events. Combined with method 300, topic event clustering and sub-event clustering are each completed by a multi-stage hierarchical clustering algorithm. This remedies the failure of the prior art to perceive, analyze and organize event evolution from the perspective of the sub-events of a topic event, while achieving high extraction precision and preserving the integrity of the text.
FIG. 4 schematically illustrates a flow chart of sub-event cluster analysis according to an embodiment of the disclosure.
In operation S410, a keyword set is obtained for a target text data set corresponding to each subject event.
In operation S420, the keyword set is processed based on the implicit dirichlet allocation algorithm according to the predetermined weight of the keyword, so as to obtain the text consistency coefficient in the target text data set corresponding to the predetermined number of related events.
According to the embodiment of the disclosure, the processed keyword set is extracted from the target text set corresponding to the subject event.
In operation S430, the number of related sub-events having relevance in the target text data set is obtained based on the text consistency coefficient.
For example, the document consistency factor may be determined from both semantic and logical aspects.
In operation S440, the target text data set is processed to obtain a second text feature matrix.
According to the embodiment of the disclosure, the second text feature matrix is calculated by normalizing a word segmentation set extracted from the target text set corresponding to the subject event.
In operation S450, the second text feature matrix is clustered using the K center point clustering algorithm, with the number of associated sub-events as the number of clusters, to obtain a plurality of associated sub-events having relevance.
According to an embodiment of the present disclosure, each sub-event cluster includes a plurality of sub-events with the highest degree of cross-correlation.
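A K center point (K-medoids) pass of the kind used in operation S450 can be sketched with the standard library as below. This is a simple alternating variant, not necessarily the one used in the disclosure: it assumes distinct points, seeds the medoids with the first k points, and uses Euclidean distance; the function names are hypothetical.

```python
import math

def assign(points, medoids):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        closest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[closest].append(i)
    return clusters

def k_medoids(points, k, iters=50):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))  # assumption: seed with the first k points
    for _ in range(iters):
        clusters = assign(points, medoids)
        # New medoid of a cluster: the member with the smallest total distance
        # to the other members.
        updated = [
            min(members, key=lambda c: sum(math.dist(points[c], points[o]) for o in members))
            for members in clusters.values()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(points, medoids)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sub_event_clusters = k_medoids(pts, k=2)
```

Unlike K-means, each cluster centre here is an actual data point, so every sub-event cluster is represented by one of its own texts.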
According to an embodiment of the present disclosure, a plurality of associated sub-events includes a plurality of sub-events in a first cluster and a plurality of sub-events in a second cluster, and generating an event evolution relationship tree corresponding to each topic event according to the plurality of associated sub-events based on occurrence moments of the plurality of associated sub-events includes: under the condition that the occurrence time of the sub-event with the earliest occurrence time in the first cluster is determined to be earlier than the occurrence time of the sub-event with the earliest occurrence time in the second cluster, taking the plurality of sub-events in the first cluster and the sub-event with the earliest occurrence time in the second cluster as a trunk node; and generating an event evolution relation tree corresponding to each subject event according to the trunk node and other sub-events in the second cluster.
According to an embodiment of the present disclosure, generating an event evolution relationship tree corresponding to each topic event according to a backbone node and other sub-events in a second cluster, includes: sequentially associating the trunk nodes according to the forward sequence of occurrence moments of the sub-events corresponding to the trunk nodes to generate trunks of the event evolution relation tree; according to the forward sequence of the occurrence time of other sub-events in the second cluster, sequentially associating the other sub-events as branch nodes to generate branches of an event evolution relation tree; and associating the trunk with the branch based on the correlation of the trunk node and the branch node, and generating an event evolution relation tree.
According to the embodiment of the disclosure, the correlation among the nodes is calculated by utilizing the space-time correlation and the event importance degree, the isolated nodes which are not clustered are classified into sub-event clusters with highest correlation, and the corresponding starting nodes, trunk nodes, branch nodes and key nodes are divided according to the time mapping model of the event nodes.
For example, in the first cluster, sub-event 1 occurs on Monday and sub-event 2 occurs on Wednesday; in the second cluster, sub-event 3 occurs on Tuesday, sub-event 4 occurs on Tuesday, and sub-event 5 occurs on Friday. In the first cluster, the earliest sub-event (sub-event 1) occurs earlier than the earliest sub-event in the second cluster (sub-event 3), so all the sub-events in the first cluster are trunk sub-events. Sub-events 1, 2 and 3, which occur on Monday, Wednesday and Tuesday respectively, are therefore taken as trunk sub-events and ordered by occurrence time as sub-event 1, sub-event 3, sub-event 2. The remaining sub-events 4 and 5 are taken as branch sub-events, ordered as sub-event 4, sub-event 5. Because sub-event 3 and the branch sub-events all come from the second cluster, their correlation is highest; branch sub-event 4 is therefore associated with trunk sub-event 3, and the event evolution relationship tree is generated.
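The worked example can be reproduced with a short sketch that builds the trunk, the branch and the connecting edges for two clusters. Several points are assumptions made for illustration: days are encoded as integers (Monday = 1), ties in occurrence time are broken by input order, the branch is attached to the trunk node taken from the same cluster (standing in for "highest correlation"), and the clusters are swapped when the second one starts earlier, a case the text does not address. The function name is hypothetical.

```python
def build_tree(first, second):
    """first/second: lists of (name, time) pairs for two sub-event clusters."""
    # Assumption: if the second cluster starts earlier, swap the roles.
    if min(t for _, t in first) > min(t for _, t in second):
        first, second = second, first
    by_time = sorted(second, key=lambda e: e[1])      # stable sort keeps tie order
    anchor, rest = by_time[0], by_time[1:]
    # Trunk: the whole first cluster plus the earliest sub-event of the second.
    trunk = sorted(first + [anchor], key=lambda e: e[1])
    edges = [(a[0], b[0]) for a, b in zip(trunk, trunk[1:])]
    # Branch: remaining sub-events of the second cluster, hung off the anchor.
    chain = [anchor] + rest
    edges += [(a[0], b[0]) for a, b in zip(chain, chain[1:])]
    return [n for n, _ in trunk], [n for n, _ in rest], edges

first = [("sub-event 1", 1), ("sub-event 2", 3)]
second = [("sub-event 3", 2), ("sub-event 4", 2), ("sub-event 5", 5)]
trunk, branch, edges = build_tree(first, second)
```

The resulting trunk is sub-event 1, sub-event 3, sub-event 2, and the branch sub-event 4, sub-event 5 is attached at sub-event 3, matching the ordering in the example.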
According to the embodiment of the disclosure, each sub-event is mapped to a corresponding node, and the nodes are classified by combining time sequence information, event cross-correlation and event importance. The node categories comprise a start node, trunk nodes, key nodes and branch nodes. The nodes are connected by directed edges according to the classification result to generate a complete event story tree, and the tree structure of the event evolution relationship tree intuitively describes how the events evolve at each stage.
Fig. 5 schematically illustrates a schematic diagram of an event evolution relationship tree construction process according to an embodiment of the present disclosure.
The topic cluster 510 is a target text data set; processing it according to method 400 yields 3 sub-event clusters, in each of which the sub-events are most highly correlated with one another.
The time information of each sub-event in the topic cluster 510 is extracted to obtain its occurrence time, and trunk nodes and branch nodes are determined accordingly; the determination result is shown at 520.
According to the embodiment of the disclosure, according to the classification result of the nodes, each clustered event node is connected in a directed manner according to the time sequence direction through a mapping model. Each trunk and branch is merged and embedded based on similarity principles and time information of the event, thereby generating a complete event evolution relationship tree structure 530.
According to embodiments of the present disclosure, each topic event may generate a corresponding evolutionary relationship tree.
Fig. 6 schematically illustrates a schematic diagram of a method of generating an event evolution relationship tree according to an embodiment of the present disclosure.
In operation S610, news story text is input, and a text data set is preprocessed.
In operation S620, processing a text data set to be processed based on a word frequency-reverse file frequency algorithm to obtain a first text feature matrix; obtaining a density boundary threshold according to Euclidean distance between each element and other elements in the first text feature matrix; and clustering elements in the first text feature matrix based on the density boundary threshold value to obtain a plurality of target text data sets corresponding to a plurality of theme events. Specific operations may refer to embodiment 300, and are not described herein.
In operation S630, the keyword set is processed based on the implicit dirichlet distribution algorithm, and according to the predetermined weight of the keyword, the text consistency coefficient in the target text data set corresponding to the predetermined number of related events is obtained.
In operation S640, the text consistency coefficient and the second text matrix after the dimension reduction are used as input, and the K-center clustering algorithm is used to divide and cluster each topic cluster by sub-events, and each sub-event cluster contains a plurality of sub-events with the highest cross-correlation degree.
In operation S650, the time information of each sub-event is extracted, so as to obtain the occurrence time of each sub-event in the topic cluster, and the node is determined, so as to generate an event evolution relationship tree.
Based on the method for generating the event evolution relationship tree, the invention further provides a device for generating the event evolution relationship tree. The device will be described in detail below in connection with fig. 7.
Fig. 7 schematically shows a block diagram of a structure of a generation apparatus of an event evolution relationship tree according to an embodiment of the present disclosure.
As shown in fig. 7, the generating apparatus 700 of the event evolution relationship tree of this embodiment includes a first processing module 710, an obtaining module 720, a clustering module 730, a second processing module 740, and a generating module 750.
The first processing module 710 is configured to process a text data set to be processed based on a word frequency-reverse file frequency algorithm to obtain a first text feature matrix, where each element in the first text feature matrix characterizes a word frequency of a keyword used to describe an event. In an embodiment, the first processing module 710 may be configured to perform the operation S210 described above, which is not described herein.
The obtaining module 720 is configured to obtain a density boundary threshold according to euclidean distances between each element and other elements in the first text feature matrix. In an embodiment, the obtaining module 720 may be configured to perform the operation S220 described above, which is not described herein.
According to an embodiment of the present disclosure, the obtaining module 720 includes a first obtaining sub-module, a second obtaining sub-module, a third obtaining sub-module, and a fourth obtaining sub-module.
The first obtaining submodule is used for obtaining, for the i-th element, the j-th element corresponding to the i-th element from the other I-1 elements according to the Euclidean distances between the i-th element and the other I-1 elements.
The second obtaining submodule is used for incrementing i and returning to the operation of obtaining the j-th element corresponding to the i-th element from the other I-1 elements when i is determined to be smaller than I.
The third obtaining submodule is used for obtaining J elements when i is determined to be equal to I.
And a fourth obtaining sub-module, configured to obtain a density boundary threshold according to the multiple euclidean distances between the I elements and the corresponding J elements.
According to an embodiment of the present disclosure, the fourth obtaining sub-module includes a first obtaining unit, a second obtaining unit, a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit.
The first obtaining unit is used for obtaining the Euclidean distance average value according to the multiple Euclidean distances between the I elements and the corresponding J elements.
And the second obtaining unit is used for carrying out descending order arrangement on the plurality of Euclidean distances to obtain the Euclidean distance sequence.
And the third obtaining unit is used for carrying out residual error processing on the Euclidean distance sequence to obtain the difference change rate of the adjacent Euclidean distance values.
And a fourth obtaining unit, configured to obtain a target euclidean distance value from the euclidean distance sequence based on the difference change rate.
And a fifth obtaining unit, configured to obtain a density boundary threshold according to the euclidean distance average value and the target euclidean distance value.
The clustering module 730 is configured to cluster the elements in the first text feature matrix based on the density boundary threshold, to obtain a plurality of target text data sets corresponding to a plurality of subject events. In an embodiment, the clustering module 730 may be configured to perform the operation S230 described above, which is not described herein.
According to an embodiment of the present disclosure, the clustering module 730 includes a first obtaining sub-module and a second obtaining sub-module.
The first obtaining submodule is used for carrying out density analysis on elements in the first text feature matrix based on a density clustering algorithm to obtain a plurality of candidate text data sets and the number of the candidate text data sets.
And the second obtaining submodule is used for clustering the plurality of candidate text data sets by taking the number of the plurality of candidate text data sets as the number of target text data sets to obtain a plurality of target text data sets corresponding to a plurality of theme events.
According to an embodiment of the present disclosure, the second obtaining sub-module includes a first obtaining unit, a second obtaining unit.
The first obtaining unit is used for taking the number of the plurality of candidate text data sets as the number of target text data sets, clustering the plurality of candidate text data sets based on a K-means clustering algorithm, and obtaining a plurality of clustered text data sets and non-clustered text data corresponding to a plurality of subject events.
The second obtaining unit is used for merging the non-clustered text data with the clustered text data set corresponding to the corresponding topic event based on the relevance between the non-clustered text data and the candidate topic events to obtain a target text data set.
The second processing module 740 is configured to process, for a target text data set corresponding to each topic event, the target text data set based on an implicit dirichlet distribution algorithm, to obtain a plurality of associated sub-events with relevance. In an embodiment, the second processing module 740 may be configured to perform the operation S240 described above, which is not described herein.
According to an embodiment of the present disclosure, the second processing module 740 includes a first obtaining sub-module, a second obtaining sub-module, a third obtaining sub-module, and a fourth obtaining sub-module.
The first obtaining sub-module is used for extracting keywords of the target text data set aiming at the target text data set corresponding to each theme event to obtain a keyword set.
And the second obtaining sub-module is used for processing the keyword set based on the implicit dirichlet distribution algorithm according to the preset weight of the keyword to obtain the number of related sub-events with relevance in the target text data set.
And the third obtaining submodule is used for processing the target text data set based on the word frequency-reverse file frequency algorithm to obtain a second text feature matrix.
And the fourth obtaining sub-module is used for clustering the second text feature matrix based on the number of the associated sub-events based on a K center point clustering algorithm to obtain a plurality of associated sub-events with correlation.
The second obtaining submodule comprises a first obtaining unit and a second obtaining unit.
The first obtaining unit is used for processing the keyword set based on the implicit dirichlet distribution algorithm according to the preset weight of the keyword to obtain the text consistency coefficient in the target text data set corresponding to the preset association event number.
And the second obtaining unit is used for obtaining the number of related sub-events with relevance in the target text data set based on the text consistency coefficient.
The generating module 750 is configured to generate an event evolution relationship tree corresponding to each topic event based on occurrence moments of the plurality of associated sub-events. In an embodiment, the generating module 750 may be configured to perform the operation S250 described above, which is not described herein.
According to an embodiment of the present disclosure, the generation module 750 includes a determination sub-module and a generation sub-module.
The determining sub-module is configured to, when the earliest sub-event in the first cluster is determined to occur before the earliest sub-event in the second cluster, take the plurality of sub-events in the first cluster and the earliest sub-event in the second cluster as trunk nodes.
The generation sub-module is configured to generate the event evolution relation tree corresponding to each topic event according to the trunk nodes and the other sub-events in the second cluster.
According to an embodiment of the present disclosure, the generation sub-module includes a first generation unit, a second generation unit, and a third generation unit.
The first generation unit is used for sequentially associating the trunk nodes according to the forward sequence of the occurrence time of the sub-event corresponding to the trunk nodes to generate the trunk of the event evolution relation tree.
The second generating unit is used for sequentially associating other sub-events serving as branch nodes according to the forward sequence of the occurrence time of the other sub-events in the second cluster to generate branches of the event evolution relation tree.
The third generating unit is configured to associate the trunk with the branch based on the correlation between the trunk nodes and the branch nodes, to generate the event evolution relation tree.
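The trunk-and-branch construction described for the generation module can be sketched as below. This is a minimal illustration under stated assumptions: sub-events are `(name, time)` pairs, the clusters are swapped if the second one starts earlier, and the branch is attached to the trunk node that came from the second cluster (treating shared cluster membership as the "correlation" between trunk node and branch node). The names `Node` and `build_tree` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    time: int  # occurrence time, e.g. a day index or Unix timestamp
    children: list = field(default_factory=list)

def build_tree(first_cluster, second_cluster):
    """Build an event evolution tree from two clusters of (name, time) sub-events."""
    first = sorted(first_cluster, key=lambda e: e[1])
    second = sorted(second_cluster, key=lambda e: e[1])
    # Assumption: if the second cluster starts earlier, swap the roles.
    if first[0][1] > second[0][1]:
        first, second = second, first
    # Trunk: every sub-event of the first cluster plus the earliest
    # sub-event of the second cluster, linked in chronological order.
    trunk = [Node(n, t) for n, t in sorted(first + second[:1], key=lambda e: e[1])]
    for parent, child in zip(trunk, trunk[1:]):
        parent.children.append(child)
    # Branch: the remaining sub-events of the second cluster, also chronological.
    branch = [Node(n, t) for n, t in second[1:]]
    for parent, child in zip(branch, branch[1:]):
        parent.children.append(child)
    # Assumption: the branch hangs off the trunk node taken from the second cluster.
    if branch:
        head_name, head_time = second[0]
        anchor = next(n for n in trunk if n.name == head_name and n.time == head_time)
        anchor.children.append(branch[0])
    return trunk[0]

root = build_tree([("A", 1), ("B", 4)], [("C", 2), ("D", 3), ("E", 5)])
```

In this toy input the trunk is A, C, B, and the branch D, E is attached to C, the trunk node shared with the second cluster.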
According to an embodiment of the present disclosure, any of the plurality of modules of the first processing module 710, the obtaining module 720, the clustering module 730, the second processing module 740, and the generating module 750 may be combined in one module to be implemented, or any of the plurality of modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the first processing module 710, the obtaining module 720, the clustering module 730, the second processing module 740, and the generating module 750 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging the circuitry, or in any one of or a suitable combination of any of the three implementations of software, hardware, and firmware. Alternatively, at least one of the first processing module 710, the obtaining module 720, the clustering module 730, the second processing module 740, and the generating module 750 may be at least partially implemented as a computer program module, which when executed, may perform the respective functions.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement a method of generating an event evolution relationship tree according to an embodiment of the present disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, the input/output (I/O) interface 805 also being connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the method for generating an event evolution relation tree provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be based on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed, and downloaded and installed in the form of a signal on a network medium, and/or from a removable medium 811 via a communication portion 809. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages. In particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Such programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in a variety of ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or integrated without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A method for generating an event evolution relation tree, comprising:
processing a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix, wherein each element in the first text feature matrix represents the term frequency of a keyword used for describing an event;
obtaining a density boundary threshold according to the Euclidean distance between each element and other elements in the first text feature matrix;
clustering elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events;
for the target text data set corresponding to each topic event, processing the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of mutually correlated associated sub-events; and
generating, from the plurality of associated sub-events and based on their occurrence times, an event evolution relation tree corresponding to each topic event.
2. The method of claim 1, wherein the first text feature matrix includes I elements, I being an integer greater than 1, and the obtaining a density boundary threshold according to the Euclidean distance between each element and other elements in the first text feature matrix comprises:
for an i-th element, obtaining a j-th element corresponding to the i-th element from the other I-1 elements according to the Euclidean distances between the i-th element and the other I-1 elements, wherein i is an integer greater than or equal to 1 and less than or equal to I;
in a case where i is determined to be less than I, incrementing i and returning to the operation of obtaining the j-th element corresponding to the i-th element from the other I-1 elements;
in a case where i is determined to be equal to I, obtaining J elements, wherein J is an integer equal to I, and j is an integer greater than or equal to 1 and less than J; and
obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the J elements.
3. The method of claim 2, wherein the obtaining the density boundary threshold according to the plurality of Euclidean distances between the I elements and the J elements comprises:
obtaining a Euclidean distance average value according to the plurality of Euclidean distances between the I elements and the J elements;
arranging the plurality of Euclidean distances in descending order to obtain a Euclidean distance sequence;
performing residual processing on the Euclidean distance sequence to obtain the rate of change of the differences between adjacent Euclidean distance values;
obtaining a target Euclidean distance value from the Euclidean distance sequence based on the rate of change of the differences; and
obtaining the density boundary threshold according to the Euclidean distance average value and the target Euclidean distance value.
4. The method of claim 1, wherein the clustering elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events comprises:
performing density analysis on the elements in the first text feature matrix based on a density clustering algorithm to obtain a plurality of candidate text data sets and the number of the candidate text data sets, wherein the candidate text data sets represent text data sets whose element density distribution is less than or equal to the density boundary threshold; and
clustering the plurality of candidate text data sets, taking the number of the candidate text data sets as the number of target text data sets, to obtain the plurality of target text data sets corresponding to the plurality of topic events.
5. The method of claim 4, wherein the clustering the plurality of candidate text data sets, taking the number of the candidate text data sets as the number of target text data sets, to obtain the plurality of target text data sets corresponding to the plurality of topic events comprises:
clustering the plurality of candidate text data sets based on a K-means clustering algorithm, taking the number of the candidate text data sets as the number of target text data sets, to obtain a plurality of clustered text data sets corresponding to the plurality of topic events and non-clustered text data; and
merging the non-clustered text data into the clustered text data set of the corresponding topic event, based on the relevance between the non-clustered text data and the plurality of candidate topic events, to obtain the target text data sets.
6. The method of claim 1, wherein the processing, for the target text data set corresponding to each topic event, the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of mutually correlated associated sub-events comprises:
extracting keywords from the target text data set corresponding to each topic event to obtain a keyword set;
processing the keyword set based on the LDA algorithm, according to the predetermined weights of the keywords, to obtain the number of mutually correlated associated sub-events in the target text data set;
processing the target text data set based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a second text feature matrix, wherein elements in the second text feature matrix represent the term frequencies of keywords used for describing the topic event corresponding to the target text data set; and
clustering the second text feature matrix with a K-medoids clustering algorithm, using the number of associated sub-events as the number of clusters, to obtain the plurality of mutually correlated associated sub-events.
7. The method of claim 6, wherein the processing the keyword set based on the LDA algorithm, according to the predetermined weights of the keywords, to obtain the number of mutually correlated associated sub-events in the target text data set comprises:
processing the keyword set based on the LDA algorithm, according to the predetermined weights of the keywords, to obtain a text consistency (coherence) coefficient of the target text data set corresponding to a preset number of associated events; and
obtaining the number of mutually correlated associated sub-events in the target text data set based on the text consistency coefficient.
8. The method of claim 1, wherein the plurality of associated sub-events includes a plurality of sub-events in a first cluster and a plurality of sub-events in a second cluster, and the generating, from the plurality of associated sub-events and based on their occurrence times, an event evolution relation tree corresponding to each topic event comprises:
in a case where the earliest sub-event in the first cluster is determined to occur before the earliest sub-event in the second cluster, taking the plurality of sub-events in the first cluster and the earliest sub-event in the second cluster as trunk nodes; and
generating the event evolution relation tree corresponding to each topic event according to the trunk nodes and the other sub-events in the second cluster.
9. The method of claim 8, wherein the generating the event evolution relation tree corresponding to each topic event according to the trunk nodes and the other sub-events in the second cluster comprises:
sequentially associating the trunk nodes in chronological order of the occurrence times of the sub-events corresponding to the trunk nodes, to generate a trunk of the event evolution relation tree;
sequentially associating the other sub-events in the second cluster, serving as branch nodes, in chronological order of their occurrence times, to generate a branch of the event evolution relation tree; and
associating the trunk with the branch based on the correlation between the trunk nodes and the branch nodes, to generate the event evolution relation tree.
10. A device for generating an event evolution relation tree, comprising:
a first processing module configured to process a text data set to be processed based on a term frequency-inverse document frequency (TF-IDF) algorithm to obtain a first text feature matrix, wherein each element in the first text feature matrix represents the term frequency of a keyword used for describing an event;
an obtaining module configured to obtain a density boundary threshold according to the Euclidean distance between each element and other elements in the first text feature matrix;
a clustering module configured to cluster elements in the first text feature matrix based on the density boundary threshold to obtain a plurality of target text data sets corresponding to a plurality of topic events;
a second processing module configured to process, for the target text data set corresponding to each topic event, the target text data set based on a latent Dirichlet allocation (LDA) algorithm to obtain a plurality of mutually correlated associated sub-events; and
a generation module configured to generate an event evolution relation tree corresponding to each topic event based on the occurrence times of the plurality of associated sub-events.
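The density boundary threshold of claims 2 and 3 can be illustrated with the following sketch. Several points are assumptions, because the claims leave them open: the "corresponding" j-th element is taken to be the nearest neighbor of the i-th element, the "residual processing" is simplified to first-order differences of the descending distance sequence, the target value is the one just after the steepest drop, and the final threshold is the mean of the average distance and the target distance. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_distances(rows):
    """For each row i, the distance to its closest other row j (assumed pairing)."""
    return [
        min(euclidean(row, other) for j, other in enumerate(rows) if j != i)
        for i, row in enumerate(rows)
    ]

def density_boundary_threshold(rows):
    """Combine the mean neighbor distance with an elbow value of the sorted distances."""
    dists = nearest_neighbor_distances(rows)
    mean_dist = sum(dists) / len(dists)
    ordered = sorted(dists, reverse=True)  # descending Euclidean distance sequence
    # Differences between adjacent values of the descending sequence.
    drops = [ordered[k] - ordered[k + 1] for k in range(len(ordered) - 1)]
    # Assumption: the target distance is the value right after the steepest drop.
    elbow = max(range(len(drops)), key=lambda k: drops[k])
    target = ordered[elbow + 1]
    # Assumption: the two quantities are combined by simple averaging.
    return (mean_dist + target) / 2

# Hypothetical rows of a first text feature matrix: three close rows and one outlier.
rows = [[0, 0], [0, 1], [1, 0], [10, 10]]
threshold = density_boundary_threshold(rows)
```

A threshold obtained this way could play the role of the neighborhood radius in a density clustering algorithm, so that the outlier row is not pulled into the dense group.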
CN202311121228.1A 2023-09-01 2023-09-01 Method and device for generating event evolution relation tree Pending CN116991967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311121228.1A CN116991967A (en) 2023-09-01 2023-09-01 Method and device for generating event evolution relation tree

Publications (1)

Publication Number Publication Date
CN116991967A true CN116991967A (en) 2023-11-03

Family

ID=88524929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311121228.1A Pending CN116991967A (en) 2023-09-01 2023-09-01 Method and device for generating event evolution relation tree

Country Status (1)

Country Link
CN (1) CN116991967A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination