CN113722478B - Multi-dimensional feature fusion similar event calculation method and system and electronic equipment


Info

Publication number
CN113722478B
Authority
CN
China
Prior art keywords
event
historical
current
historical event
current event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110906530.2A
Other languages
Chinese (zh)
Other versions
CN113722478A (en
Inventor
韩勇
李青龙
骆飞
赵冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202110906530.2A
Publication of CN113722478A
Application granted
Publication of CN113722478B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses a method, a system and electronic equipment for calculating multi-dimensional feature fusion similar events. The method comprises the following steps: according to the subject words, abstracts, titles, arguments, trigger words, life cycles and text categories of the current event and the historical events, respectively obtaining a subject word semantic similarity value, an abstract sentence semantic similarity value, a syntax similarity value, an argument similarity value, a trigger word similarity value, a time window similarity value and a text category similarity value of the current event and each historical event; weighting and fusing these similarity values to obtain a similarity score value of the current event and each historical event; when the similarity score value is greater than a preset threshold, weighting it according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value, which serves as the final similarity score value; and sorting the historical events according to the final similarity score value to obtain a sorting result, and determining the historical events most similar to the current event according to the sorting result. Comparing events across multiple dimensions improves the accuracy of similar event lookup.

Description

Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
Technical Field
The invention relates to the field of text analysis, in particular to a method, a system, electronic equipment and a storage medium for calculating a multi-dimensional feature fusion similar event.
Background
Existing public opinion analysis platforms analyze the popularity and propagation of ongoing events, but omit comparative analysis against similar historical events.
At present, historical events are usually retrieved by keyword comparison or keyword search, returning those that contain the keywords. This approach is coarse and cannot find historical events that are highly similar to the current event.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a method, a system, an electronic device, and a storage medium for calculating multi-dimensional feature fusion similar events, so as to overcome the inability of the prior art to find historical events that are highly similar to the current event.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides a method for calculating multi-dimensional feature fusion similar events, including:
acquiring a current event subject word, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
acquiring a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
obtaining a subject word semantic similarity value of the current event and each historical event according to the current event subject word and the historical event subject word of each historical event;
obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
obtaining a syntax similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
performing weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event;
judging, for each historical event, whether its similarity score value is greater than a preset threshold;
if the similarity score value of a historical event is less than or equal to the preset threshold, keeping the similarity score value unchanged and using it as the final similarity score value of the current event and that historical event;
if the similarity score value of a historical event is greater than the preset threshold, performing named entity and region weighting on the similarity score value according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and using the weighted similarity score value as the final similarity score value of the current event and that historical event;
and sorting the historical events according to the final similarity score value to obtain a similar event sorting result for the current event.
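The fusion, threshold check and sorting steps of the first aspect can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the weight values, the dictionary keys and the `reweight` callback (which stands in for the named entity and region weighting) are all hypothetical, since the text does not give concrete weights or a threshold.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fusion weights for the seven similarity values;
# the source text does not state the actual weights.
WEIGHTS: Dict[str, float] = {
    "subject": 0.2, "summary": 0.2, "syntax": 0.1, "argument": 0.15,
    "trigger": 0.15, "time_window": 0.1, "category": 0.1,
}

def fuse(sims: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the per-dimension similarity values of one historical event."""
    return sum(w * sims[name] for name, w in weights.items())

def rank_events(
    per_event_sims: Dict[str, Dict[str, float]],
    threshold: float,
    reweight: Callable[[str, float], float],
    weights: Dict[str, float] = WEIGHTS,
) -> List[Tuple[str, float]]:
    """Fuse, apply extra weighting only above the threshold, then sort descending."""
    final: List[Tuple[str, float]] = []
    for event_id, sims in per_event_sims.items():
        score = fuse(sims, weights)
        if score > threshold:
            # Only scores above the threshold receive entity/region weighting.
            score = reweight(event_id, score)
        final.append((event_id, score))
    return sorted(final, key=lambda pair: pair[1], reverse=True)
```

Passing the extra weighting in as a callback keeps the threshold gate separate from whatever weighting rule is used for events that pass it.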
Optionally, the step of performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain the weighted similarity score value includes:
judging whether the historical event occurrence place and the current event occurrence place belong to the same region, and judging whether the historical event named entity and the current event named entity have an entity in common;
if the two occurrence places do not belong to the same region and the named entities have no entity in common, keeping the similarity score value of the historical event unchanged;
if the two occurrence places belong to the same region and the named entities have no entity in common, performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value;
if the two occurrence places do not belong to the same region and the named entities have an entity in common, performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value;
if the two occurrence places belong to the same region and the named entities have an entity in common, performing both region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
Optionally, the step of performing regional weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes: calculating the regional distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; and carrying out regional weighting on the similarity score value according to the regional distance to obtain the weighted similarity score value.
Optionally, the step of performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes: comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and carrying out named entity weighting on the similarity score according to the number of the same entities to obtain the weighted similarity score.
Optionally, the step of performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes: calculating the regional distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
Optionally, the weighted similarity score value is calculated as follows:
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is a region weighting constant; d(k) is the geographical distance between the current event and the kth historical event; η is a named entity weighting constant; n(k) is the number of named entities shared by the current event and the kth historical event.
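The four-case weighting described above can be illustrated with a short Python sketch. The functional form used here (an additive bonus of δ/(1 + d) for a shared region and η·n for shared entities) is an assumption made purely for illustration; the patent's own formula is not reproduced in this text, and the default values of `delta` and `eta` are likewise hypothetical.

```python
def weighted_score(
    score: float,
    same_region: bool,
    distance: float,
    shared_entities: int,
    delta: float = 0.1,   # hypothetical region weighting constant (delta)
    eta: float = 0.05,    # hypothetical named entity weighting constant (eta)
) -> float:
    """Apply region and/or named entity weighting to a similarity score.

    ASSUMED form, not taken from the source:
      region bonus = delta / (1 + distance)   when both places share a region
      entity bonus = eta * shared_entities    when at least one entity is shared
    """
    bonus = 0.0
    if same_region:
        bonus += delta / (1.0 + distance)
    if shared_entities > 0:
        bonus += eta * shared_entities
    # With neither condition met, the score is returned unchanged.
    return score + bonus
```

The two `if` branches cover all four cases in the text: neither, region only, entity only, or both.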
Optionally, the step of obtaining the subject word semantic similarity value of the current event and each historical event according to the current event subject word and the historical event subject word of each historical event includes: respectively inputting the current event subject words and the historical event subject words of each historical event into a pre-trained word vector model to obtain the current event subject word vectors contained in the current event and the historical event subject word vectors contained in each historical event; adding the current event subject word vectors contained in the current event to obtain a current event subject word semantic vector corresponding to the current event; adding the historical event subject word vectors contained in each historical event respectively to obtain a historical event subject word semantic vector corresponding to each historical event; and respectively carrying out cosine similarity calculation on the current event subject word semantic vector and each historical event subject word semantic vector to obtain the subject word semantic similarity value of the current event and each historical event.
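The sum-then-cosine computation in this step can be sketched as follows. The word vectors are assumed to have already been produced by some pre-trained model; the function names are hypothetical, and returning 0.0 for a zero-length vector is a design choice of this sketch rather than something the text specifies.

```python
import math
from typing import List, Sequence

def sum_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of equally sized word vectors."""
    return [sum(components) for components in zip(*vectors)]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def subject_word_similarity(
    current_vecs: Sequence[Sequence[float]],
    historical_vecs: Sequence[Sequence[float]],
) -> float:
    """Sum each event's subject word vectors, then compare the two sums by cosine."""
    return cosine(sum_vectors(current_vecs), sum_vectors(historical_vecs))
```

The same two helpers would apply unchanged to abstract sentence vectors, since that step also sums vectors and compares the sums by cosine similarity.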
Optionally, the step of obtaining the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event includes: respectively inputting the current event abstract corresponding to the current event and the historical event abstract corresponding to each historical event into a BERT pre-trained sentence vector model to obtain the current abstract sentence vectors corresponding to the current event abstract and the historical abstract sentence vectors corresponding to each historical event abstract; adding the current abstract sentence vectors corresponding to the current event abstract to obtain the sentence semantic vector of the current event abstract; respectively adding the historical abstract sentence vectors corresponding to each historical event abstract to obtain the sentence semantic vector of each historical event abstract; and respectively carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Optionally, the step of obtaining the syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event includes: respectively obtaining the title editing distance of the current event title and each historical event title according to the current event title and the historical event title of each historical event; and respectively carrying out normalization processing on the title editing distance of the current event title and each historical event title to obtain a syntax similarity value of the current event and each historical event.
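A title comparison of this kind can be sketched with a standard Levenshtein edit distance. The normalization used below, 1 − ed / max(len), is an assumption chosen for illustration; the text only says that the edit distance is normalized and does not state how.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def syntax_similarity(current_title: str, historical_title: str) -> float:
    """ASSUMED normalization: 1 - ed / max(len); identical titles give 1.0."""
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0  # two empty titles are treated as identical
    return 1.0 - edit_distance(current_title, historical_title) / longest
```

Dividing by the longer title's length bounds the result to the range 0 to 1, which keeps it comparable with the cosine-based similarity values before fusion.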
Optionally, the step of obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event includes: respectively inputting the current event argument and the historical event argument of each historical event into a pre-training word vector model to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument; respectively obtaining the argument edit distance of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event; and obtaining the argument similarity value of the current event and each historical event according to the argument word vector of the current event, the argument word vector of the historical event and the argument editing distance.
Optionally, the step of obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event includes: respectively inputting the current event trigger word and the historical event trigger word of each historical event into a pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; respectively obtaining trigger word editing distances of the current event trigger word and each historical event trigger word according to the current event trigger word and the historical event trigger word of each historical event; and obtaining the trigger word similarity value of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the trigger word editing distance.
Optionally, the step of obtaining the time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event includes: calculating the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and reducing the event co-occurrence distance by a preset multiple to obtain the time window similarity value of the current event and each historical event.
Optionally, the step of obtaining the text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event includes: judging whether the current event text category is the same as the historical event text category of each historical event or not respectively; if the current event text category is the same as the historical event text category, the text category similarity value is a first preset value; if the current event text category is different from the historical event text category, the text category similarity value is a second preset value, and the second preset value is smaller than the first preset value.
Optionally, the calculation formula of the semantically similar value of the subject term of the current event and each historical event is as follows:
$$s_{1k} = \cos\left(\vec{V}_c, \vec{V}_{hk}\right) = \frac{\vec{V}_c \cdot \vec{V}_{hk}}{\lVert \vec{V}_c \rVert \, \lVert \vec{V}_{hk} \rVert}, \qquad \vec{V}_c = \sum_{p=1}^{P} \vec{v}_{c,p}, \qquad \vec{V}_{hk} = \sum_{q=1}^{Q} \vec{v}_{hk,q}$$
wherein $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event; $\vec{V}_c$ is the current event subject word semantic vector; $P$ is the number of current event subject word vectors; $\vec{v}_{c,p}$ is the current event subject word vector corresponding to the p-th subject word in the current event; $\vec{V}_{hk}$ is the historical event subject word semantic vector corresponding to the kth historical event; $Q$ is the number of historical event subject word vectors corresponding to the kth historical event; and $\vec{v}_{hk,q}$ is the historical event subject word vector corresponding to the q-th subject word in the kth historical event.
Optionally, the calculation formula of the semantic similarity value of the abstract sentence of the current event and each historical event is as follows:
$$s_{2k} = \cos\left(\vec{S}_c, \vec{S}_{hk}\right) = \frac{\vec{S}_c \cdot \vec{S}_{hk}}{\lVert \vec{S}_c \rVert \, \lVert \vec{S}_{hk} \rVert}, \qquad \vec{S}_c = \sum_{l=1}^{L} \vec{s}_{c,l}, \qquad \vec{S}_{hk} = \sum_{m=1}^{M} \vec{s}_{hk,m}$$
wherein $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\vec{S}_c$ is the sentence semantic vector of the current event abstract; $L$ is the number of current abstract sentence vectors in the current event abstract; $\vec{s}_{c,l}$ is the current abstract sentence vector corresponding to the l-th sentence in the current event abstract; $\vec{S}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $M$ is the number of historical abstract sentence vectors corresponding to the kth historical event; and $\vec{s}_{hk,m}$ is the historical abstract sentence vector corresponding to the m-th abstract sentence in the kth historical event.
Optionally, the syntactic similarity calculation formula for the current event and each historical event is as follows:
wherein $s_{3k}$ is the syntactic similarity value of the current event and the kth historical event; $t_1$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_1, t_{2k})$ is the title edit distance between the current event title and the historical event title of the kth historical event.
Optionally, the calculation formula of the argument similarity value of the current event and each of the historical events is as follows:
wherein $s_{4k}$ is the argument similarity value of the current event and the kth historical event; the two vector terms are the current event argument word vector corresponding to the current event and the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument of the kth historical event.
Optionally, the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
wherein $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; the two vector terms are the current event trigger word vector corresponding to the current event and the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word of the kth historical event.
Optionally, the time window similarity value calculation formula of the current event and each historical event is as follows:
$$s_{6k} = \frac{td(t_c, t_{hk})}{T}$$
wherein $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_c$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_c, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle of the kth historical event; and $T$ is the preset multiple.
Optionally, the text category similarity value calculation formula of the current event and each historical event is as follows:
$$s_{7k} = \begin{cases} \text{first preset value}, & Label_{ec} = Label_{eh}(k) \\ \text{second preset value}, & Label_{ec} \neq Label_{eh}(k) \end{cases}$$
wherein $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $Label_{ec}$ is the current event text category corresponding to the current event; and $Label_{eh}(k)$ is the historical event text category corresponding to the kth historical event.
Optionally, after the step of sorting the historical events according to the final similarity score value to obtain the similar event sorting result of the current event, the method further includes: obtaining a similar event match count; and selecting, from the similar event sorting result, that number of historical events with the largest final similarity score values as the similar historical events.
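Selecting the top matches can be sketched with the standard library. The function name and the use of a dictionary keyed by event identifier are assumptions for illustration only.

```python
import heapq
from typing import Dict, List, Tuple

def top_similar(final_scores: Dict[str, float], match_count: int) -> List[Tuple[str, float]]:
    """Return the match_count historical events with the largest final scores,
    ordered from most to least similar."""
    return heapq.nlargest(match_count, final_scores.items(), key=lambda pair: pair[1])
```

When the full sorted list is already available, slicing its first `match_count` entries gives the same result; `heapq.nlargest` simply avoids a full sort when only a few matches are needed.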
According to a second aspect, an embodiment of the present invention provides a multi-dimensional feature fusion similarity event computing system, including:
the first acquisition module is used for acquiring a current event subject word, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
the second acquisition module is used for acquiring a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
the first processing module is used for obtaining the subject word semantic similarity value of the current event and each historical event according to the current event subject word and the historical event subject word of each historical event;
the second processing module is used for obtaining the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
the third processing module is used for obtaining a syntax similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
the fourth processing module is used for obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
the fifth processing module is used for obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
the sixth processing module is used for obtaining a time window similarity value of the current event and each historical event according to the life cycle of the current event and the life cycle of the historical event of each historical event;
the seventh processing module is used for obtaining a text category similarity value of the current event and each historical event according to the text category of the current event and the text category of each historical event;
the eighth processing module is used for respectively carrying out weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event;
the judging module is used for judging whether the similarity score value of each historical event is larger than a preset threshold value or not respectively;
the ninth processing module is configured to, if the similarity score value of the historical event is less than or equal to a preset threshold, keep the similarity score value of the historical event unchanged, and use the similarity score value as a final similarity score value of the current event and the historical event;
the tenth processing module is configured to, if the similarity score value of the historical event is greater than the preset threshold, perform named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place, obtain a weighted similarity score value, and use the weighted similarity score value as the final similarity score value of the current event and the historical event;
and the eleventh processing module is used for sequencing the historical events according to the final similarity score value to obtain a similar event sequencing result of the current event.
Optionally, the tenth processing module includes:
the first judging sub-module, used for judging whether the historical event occurrence place and the current event occurrence place belong to the same region, and judging whether the historical event named entity and the current event named entity have an entity in common;
the first processing sub-module, used for keeping the similarity score value of the historical event unchanged if the two occurrence places do not belong to the same region and the named entities have no entity in common;
the second processing sub-module, used for performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the two occurrence places belong to the same region and the named entities have no entity in common;
the third processing sub-module, used for performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the two occurrence places do not belong to the same region and the named entities have an entity in common;
and the fourth processing sub-module, used for performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the two occurrence places belong to the same region and the named entities have an entity in common.
Optionally, the second processing sub-module includes: the first processing unit is used for calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; and the second processing unit is used for carrying out regional weighting on the similarity score value according to the regional distance to obtain the weighted similarity score value.
Optionally, the third processing sub-module includes: the third processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the fourth processing unit is used for carrying out named entity weighting on the similar score values according to the number of the same entities to obtain weighted similar score values.
Optionally, the fourth processing sub-module includes: the fifth processing unit is used for calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; the sixth processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the seventh processing unit is used for carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
Optionally, the weighted similarity score value is calculated as follows:
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is a region weighting constant; d(k) is the geographical distance between the current event and the kth historical event; η is a named entity weighting constant; and n(k) is the number of named entities that the current event and the kth historical event have in common.
Optionally, the first processing module includes: a fifth processing sub-module, configured to input a current event subject word and a history event subject word of each history event into the pre-training word vector model respectively, to obtain a current event subject word vector contained in the current event and a history event subject word vector contained in each history event; a sixth processing sub-module, configured to add the current event subject word vectors included in the current event to obtain a current event subject word semantic vector corresponding to the current event; the seventh processing sub-module is used for respectively adding the historical event subject word vectors contained in each historical event to obtain a historical event subject word semantic vector corresponding to each historical event; and the eighth processing sub-module is used for respectively carrying out cosine similarity calculation on the current event subject word semantic vector and each history event subject word semantic vector to obtain a subject word semantic similarity value of the current event and each history event.
Optionally, the second processing module includes: a ninth processing sub-module, configured to input a current event summary corresponding to the current event and a historical event summary corresponding to each historical event into the bert pre-training sentence vector model, respectively, to obtain a current summary sentence vector corresponding to the current event summary and a historical summary sentence vector corresponding to each historical event summary; a tenth processing sub-module, configured to add the current abstract sentence vector corresponding to the current event abstract to obtain a sentence semantic vector of the current event abstract; an eleventh processing sub-module, configured to add the historical summary sentence vectors corresponding to each historical event summary to obtain a sentence semantic vector of each historical event summary; and the twelfth processing sub-module is used for respectively carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Optionally, the third processing module includes: a thirteenth processing sub-module, configured to obtain the title edit distance between the current event title and each historical event title according to the current event title and the historical event title of each historical event; and a fourteenth processing sub-module, configured to normalize the title edit distance between the current event title and each historical event title to obtain the syntax similarity value of the current event and each historical event.
Optionally, the fourth processing module includes: a fifteenth processing sub-module, configured to input a current event argument and a history event argument of each history event into the pre-training word vector model, to obtain a current event argument word vector corresponding to the current event argument and a history event argument word vector corresponding to each history event argument; a sixteenth processing sub-module, configured to obtain an argument edit distance of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event respectively; seventeenth processing sub-module, configured to obtain an argument similarity value between the current event and each historical event according to the argument word vector of the current event, the argument word vector of the historical event, and the argument editing distance.
Optionally, the fifth processing module includes: the eighteenth processing sub-module is used for respectively inputting the current event trigger word and the historical event trigger word of each historical event into the pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; a nineteenth processing sub-module, configured to obtain trigger word edit distances of the current event trigger word and each historical event trigger word according to the current event trigger word and the historical event trigger word of each historical event respectively; and the twentieth processing sub-module is used for obtaining the trigger word similarity value of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the trigger word editing distance.
Optionally, the sixth processing module includes: a twenty-first processing sub-module, configured to calculate an event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and a twenty-second processing sub-module, configured to reduce the event co-occurrence distance by a preset multiple to obtain a time window similarity value of the current event and each historical event.
Optionally, the seventh processing module includes: a second judging sub-module, configured to judge whether the current event text category is the same as the historical event text category of each historical event; a twenty-third processing sub-module, configured to set the text category similarity value to a first preset value if the current event text category is the same as the historical event text category; and a twenty-fourth processing sub-module, configured to set the text category similarity value to a second preset value if the current event text category and the historical event text category are different, where the second preset value is smaller than the first preset value.
Optionally, the calculation formula of the subject word semantic similarity value of the current event and each historical event is as follows:
$$s_{1k} = \cos\left(\vec{V}_{c}, \vec{V}_{hk}\right) = \frac{\vec{V}_{c} \cdot \vec{V}_{hk}}{\lVert \vec{V}_{c} \rVert \, \lVert \vec{V}_{hk} \rVert}, \qquad \vec{V}_{c} = \sum_{p=1}^{P} \vec{v}_{c,p}, \qquad \vec{V}_{hk} = \sum_{q=1}^{Q} \vec{v}_{hk,q}$$
wherein $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event; $\vec{V}_{c}$ is the subject word semantic vector of the current event; $P$ is the number of current event subject word vectors; $\vec{v}_{c,p}$ is the current event subject word vector corresponding to the p-th subject word in the current event; $\vec{V}_{hk}$ is the historical event subject word semantic vector corresponding to the kth historical event; $Q$ is the number of historical event subject word vectors corresponding to the kth historical event; and $\vec{v}_{hk,q}$ is the historical event subject word vector corresponding to the q-th subject word in the kth historical event.
Optionally, the calculation formula of the abstract sentence semantic similarity value of the current event and each historical event is as follows:
$$s_{2k} = \cos\left(\vec{U}_{c}, \vec{U}_{hk}\right) = \frac{\vec{U}_{c} \cdot \vec{U}_{hk}}{\lVert \vec{U}_{c} \rVert \, \lVert \vec{U}_{hk} \rVert}, \qquad \vec{U}_{c} = \sum_{l=1}^{L} \vec{u}_{c,l}, \qquad \vec{U}_{hk} = \sum_{m=1}^{M} \vec{u}_{hk,m}$$
wherein $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\vec{U}_{c}$ is the sentence semantic vector of the current event abstract; $L$ is the number of current abstract sentence vectors in the current event abstract; $\vec{u}_{c,l}$ is the current abstract sentence vector corresponding to the l-th sentence in the current event abstract; $\vec{U}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $M$ is the number of historical abstract sentence vectors corresponding to the kth historical event; and $\vec{u}_{hk,m}$ is the historical abstract sentence vector corresponding to the m-th abstract sentence in the kth historical event.
Optionally, the syntactic similarity calculation formula for the current event and each historical event is as follows:
wherein $s_{3k}$ is the syntax similarity value of the current event and the kth historical event; $t_{1}$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_{1}, t_{2k})$ is the title edit distance between the current event title and the historical event title corresponding to the kth historical event.
Optionally, the calculation formula of the argument similarity value of the current event and each of the historical events is as follows:
wherein $s_{4k}$ is the argument similarity value of the current event and the kth historical event; the two word vectors in the formula are the current event argument word vector corresponding to the current event and the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument corresponding to the kth historical event.
Optionally, the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
wherein $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; the two word vectors in the formula are the current event trigger word vector corresponding to the current event and the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word corresponding to the kth historical event.
Optionally, the time window similarity value calculation formula of the current event and each historical event is as follows:
wherein $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_{c}$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_{c}, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle corresponding to the kth historical event; and $T$ is the preset multiple.
Optionally, the text category similarity value calculation formula of the current event and each historical event is as follows:
wherein $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $Label_{ec}$ is the current event text category corresponding to the current event; and $Label_{eh}(k)$ is the historical event text category corresponding to the kth historical event.
Optionally, the system further includes: a third acquisition module, configured to acquire the number of similar events to be matched; and a twelfth processing module, configured to select, from the similar event sequencing result, the historical events with the largest final similarity score values, up to that number, to obtain the matched similar historical events.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor to cause the at least one processor to perform the multi-dimensional feature fusion similarity event calculation method described in any one of the first aspects above.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the multi-dimensional feature fusion similarity event calculation method described in any one of the above first aspects.
The technical scheme of the embodiment of the invention has the following advantages:
The embodiment of the invention provides a multi-dimensional feature fusion similar event calculation method, a system, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a current event subject word, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event; acquiring a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event; obtaining a subject word semantic similarity value of the current event and each historical event according to the current event subject word and the historical event subject word of each historical event; obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event; obtaining a syntax similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event; obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event; obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event; obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event; obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event; carrying out weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event; judging whether the similarity score value of each historical event is larger than a preset threshold value; if the similarity score value of the historical event is smaller than or equal to the preset threshold value, keeping the similarity score value of the historical event unchanged and taking it as the final similarity score value of the current event and the historical event; if the similarity score value of the historical event is larger than the preset threshold value, carrying out named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event; and sorting the historical events according to the final similarity score value to obtain a similar event sequencing result of the current event.
First, according to the subject words, abstracts, titles, arguments, trigger words, life cycles and text categories of the current event and the historical events, a subject word semantic similarity value, an abstract sentence semantic similarity value, a syntax similarity value, an argument similarity value, a trigger word similarity value, a time window similarity value and a text category similarity value of the current event and each historical event are obtained; second, these similarity values are weighted and fused to obtain a similarity score value of the current event and each historical event; when the similarity score value is not greater than a preset threshold value, it is kept unchanged; when the similarity score value is greater than the preset threshold value, named entity weighting and region weighting are applied according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value; finally, the historical events are sorted according to the final similarity score value to obtain a similar event sequencing result of the current event, and the historical events most similar to the current event are determined from that result. Because the current event is compared with each historical event on several features, event similarity is judged from several angles rather than one, which improves the accuracy of the similarity judgment and raises the similarity of the historical events that are retrieved.
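The threshold gating and sorting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `rank_similar_events`, the `reweight` callable (standing in for the named entity and region weighting, whose formula is not reproduced here) and the `top_n` parameter are all hypothetical.

```python
def rank_similar_events(scores, threshold, reweight, top_n=None):
    """Gate, reweight, and sort fused similarity scores.

    scores:   {event_id: fused similarity score}
    reweight: hypothetical callable(event_id, score) -> weighted score,
              standing in for named entity and region weighting
    top_n:    optional number of similar events to return
    """
    final = {}
    for event_id, score in scores.items():
        # Only scores strictly above the threshold are reweighted;
        # all others are kept unchanged as the final score.
        final[event_id] = reweight(event_id, score) if score > threshold else score
    ranked = sorted(final.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```

Passing the weighting step in as a callable keeps the gating and sorting logic independent of whichever weighting formula is actually used.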
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a specific example of a method for computing a multi-dimensional feature fusion similarity event according to an embodiment of the present invention;
FIG. 2 is a flowchart of another specific example of a multi-dimensional feature fusion similarity event calculation method according to an embodiment of the present invention;
FIG. 3 is a block diagram of one specific example of a multi-dimensional feature fusion similarity event computing system in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
The embodiment of the invention provides a multi-dimensional feature fusion similar event calculation method, which comprises the steps S1-S14 as shown in FIG. 1.
Step S1: acquire a current event subject word, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of the current event.
In this embodiment, subject words are extracted from the current event to obtain the current event subject words. One possible method is to train idf (inverse document frequency) on a large data set, count the term frequency tf of each word in the current event, and multiply tf by the trained idf of that word to obtain its weight, that is, tf × idf (tf-idf). The words are then ranked by weight, and the top N keywords are taken as the subject words of the text.
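The tf-idf ranking described above can be sketched as follows. This is a minimal illustration under stated assumptions: the texts are already tokenized, the idf uses a common smoothed form that the source does not specify, and the names `train_idf`, `top_subject_words` and `default_idf` are hypothetical.

```python
import math
from collections import Counter


def train_idf(corpus):
    """Compute a smoothed idf for every word in a tokenized corpus (list of token lists)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per document
    # Assumed smoothing: log((1 + N) / (1 + df)) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def top_subject_words(tokens, idf, n=5, default_idf=1.0):
    """Rank the words of one event text by tf * idf and return the top n."""
    tf = Counter(tokens)
    total = len(tokens)
    # Words unseen during idf training fall back to default_idf (an assumption).
    weights = {w: (c / total) * idf.get(w, default_idf) for w, c in tf.items()}
    # Sort by weight descending; break ties alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]
```

For Chinese text, a word segmentation step would be needed before this function is called; that step is outside the sketch.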
Abstract extraction is performed on the current event to obtain the current event abstract. Specifically, the abstract may be formed from the three longest sentences of the text.
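A minimal sketch of the longest-three-sentences abstract follows. The sentence splitting rule (splitting after Western or Chinese sentence-ending punctuation) and the choice to keep the selected sentences in document order are assumptions, and `extract_abstract` is a hypothetical name.

```python
import re


def extract_abstract(text, k=3):
    """Return the k longest sentences of text, kept in their original order."""
    # Split after sentence-ending punctuation; this naive rule also splits
    # inside decimals such as "3.5", which a real splitter would avoid.
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [s.strip() for s in parts if s.strip()]
    # Pick the indices of the k longest sentences, then restore document order.
    longest = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))[:k]
    return [sentences[i] for i in sorted(longest)]
```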
Title extraction is performed on the current event to obtain the current event title. One possible method is to take a sentence that is repeated in the event texts, that is, a common sentence, as the title of the event; if several sentences are repeated, the one that occurs most often is taken as the title.
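The repeated-sentence rule can be sketched as follows, assuming the event texts have already been split into sentences. The name `extract_title`, the tie-breaking rule (earliest sentence wins) and the `None` return when nothing repeats are assumptions not stated in the source.

```python
from collections import Counter


def extract_title(sentences):
    """Return the sentence that repeats most often, or None if nothing repeats."""
    counts = Counter(s.strip() for s in sentences if s.strip())
    repeated = [(s, c) for s, c in counts.items() if c > 1]
    if not repeated:
        return None
    # max() keeps the first maximal item, so ties go to the earliest sentence.
    return max(repeated, key=lambda sc: sc[1])[0]
```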
Unsupervised training is performed with word2vec to extract the contextual semantic features of the words in the text set. Unsupervised pre-training is performed with a BERT model to extract the semantic features of the sentences in the text set. The tf-idf of a text is computed from the trained idf to obtain the keywords of the text, and the abstract of the text is formed from its three longest sentences. Based on the extracted information and the unsupervised pre-training, semantic similarity is calculated at both word level and sentence level, which addresses similarity at the semantic level.
Argument extraction is performed on the current event to obtain the current event argument. Specifically, word segmentation and part-of-speech tagging are performed, word frequencies are counted over the first 100 words of the title and the article, and the noun with the highest frequency is taken as the argument. An argument is the subject of an event, specifically the noun that names the main participant of the event. For example, in the "free from renting events" case, "free" is the subject mentioned most often across the whole text and can be regarded as the argument of the event.
Trigger word extraction is performed on the current event to obtain the current event trigger word. Specifically, the verb that follows the subject word is taken as the trigger word. A trigger word is the verb that signals the occurrence of an event. For example, in "a person fell down and got his trousers dirty", "fell down" is the trigger word.
Event attributes are divided into event trigger words and event arguments. Adding attribute similarity to the calculation makes it possible to check whether the main arguments of two events and the actions that occurred are consistent, so that similarity is also judged from the perspective of the event itself.
The current event life cycle is determined according to the start time and the end time of the current event.
The current event text category is the text category of the current text. Texts are classified according to the channel category from which the data was collected, with category names unified so that, for example, all education channels map to education and all medical channels map to medical treatment. The categories may include education, medical treatment, sports and the like; this is described only schematically in this embodiment and is not limiting. Named entity recognition is performed on the current event to obtain the current event named entity; specifically, named entities may be recognized with the Stanford CoreNLP tool.
Place name extraction is performed on the current event to obtain the current event occurrence place. Specifically, regions may be recognized with the Baidu LDA tool.
Step S2: acquire a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event.
In this embodiment, the historical event subject word, historical event abstract, historical event title, historical event argument, historical event trigger word, historical event life cycle, historical event text category, historical event named entity and historical event occurrence place of each historical event are determined in the same way as described in step S1, with the historical event taking the place of the current event; the specific process is not repeated here.
Step S3: obtain the subject word semantic similarity value of the current event and each historical event according to the current event subject word and the historical event subject word of each historical event.
In this embodiment, the current event subject word is mapped to a current event subject word vector by a pre-training word vector model; the historical event subject word of each historical event is mapped to a historical event subject word vector by the same model; cosine similarity is then calculated between the current event subject word vector and the historical event subject word vector of each historical event to obtain the subject word semantic similarity value of the current event and each historical event.
Specifically, the pre-training word vector model may be a word2vec model; in other embodiments, it may be another word vector model in the prior art. This is described only schematically in this embodiment and is not limiting.
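Step S3 can be sketched as follows, using the variant in which the subject word vectors of each event are summed into one semantic vector before the cosine comparison. The sketch assumes the word vectors are already available as a plain dictionary (for example, exported from a trained word2vec model); `word_vectors` and the function names are hypothetical, and skipping words without a vector is an assumption.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equal-length word vectors; [] if there are none."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def subject_word_similarity(current_words, history_words, word_vectors):
    """Sum the vectors of each word list, then compare the sums by cosine similarity."""
    # Words missing from the vector table are skipped (an assumption).
    cur = sum_vectors([word_vectors[w] for w in current_words if w in word_vectors])
    his = sum_vectors([word_vectors[w] for w in history_words if w in word_vectors])
    if not cur or not his:
        return 0.0
    return cosine_similarity(cur, his)
```

The same two helpers would apply unchanged to the abstract sentence vectors of step S4, with sentence vectors in place of word vectors.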
Step S4: obtain the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event.
In this embodiment, the current event abstract is mapped to a current event abstract sentence vector by a pre-training sentence vector model; the historical event abstract of each historical event is mapped to a historical event abstract sentence vector by the same model; cosine similarity is then calculated between the current event abstract sentence vector and the historical event abstract sentence vector of each historical event to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Specifically, the pre-training sentence vector model may be a BERT model; in other embodiments, it may be another sentence vector model in the prior art. This is described only schematically in this embodiment and is not limiting.
Step S5: obtain the syntax similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event.
In this embodiment, the title edit distance between the current event title and the historical event title of each historical event is calculated, and the title edit distance corresponding to each historical event is normalized to obtain the syntax similarity value of the current event and each historical event.
Specifically, the edit distance (also called the Levenshtein distance) between two strings is the minimum number of editing operations required to change one string into the other. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.
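The edit distance defined above can be computed with the standard dynamic programming recurrence, sketched below. The normalization in `title_similarity` (one minus the distance divided by the longer title length) is only one plausible choice and is an assumption; the source states that the distance is normalized but this sketch does not claim to reproduce its exact formula.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and replace."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace, or match at no cost
        prev = curr
    return prev[-1]


def title_similarity(t1, t2):
    """Map edit distance into [0, 1]; 1.0 means identical titles (assumed normalization)."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(t1, t2) / longest
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.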
Step S6: obtain the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event.
In this embodiment, the current event argument is mapped to a current event argument word vector by the pre-training word vector model, and the historical event argument of each historical event is mapped to a historical event argument word vector by the same model. The argument edit distance between the current event argument and the historical event argument of each historical event is obtained by distance calculation.
Similarity is then calculated from the current event argument word vector, the historical event argument word vector of each historical event and the argument edit distance to obtain the argument similarity value of the current event and each historical event.
Step S7: and obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event.
In the embodiment, mapping the current event trigger word into a current event trigger word vector through a pre-training word vector model; and mapping the historical event trigger words corresponding to each historical event into the historical event trigger word vectors corresponding to each historical event through a pre-training word vector model. And obtaining trigger word editing distances of the current event trigger word and the historical event trigger word of each historical event through distance calculation.
And carrying out similarity calculation according to the word vector of the trigger word of the current event, the word vector of the trigger word of the historical event corresponding to each historical event and the editing distance of the trigger word, and obtaining the similarity value of the trigger word of the current event and each historical event.
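Steps S6 and S7 combine a word vector comparison with an edit distance, but the text does not state how the two are combined. The following Python sketch is therefore only one possible reading: it assumes cosine similarity between the word vectors, converts the edit distance into a surface similarity by dividing by the longer string length, and mixes the two with a hypothetical weight `alpha`. The names `word_similarity` and `alpha`, and the mixing rule itself, are assumptions and are not taken from the original disclosure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def word_similarity(cur_word, hist_word, cur_vec, hist_vec, alpha=0.5):
    """ASSUMED combination: alpha * cosine + (1 - alpha) * surface similarity."""
    longest = max(len(cur_word), len(hist_word)) or 1  # avoid division by zero
    surface = 1.0 - edit_distance(cur_word, hist_word) / longest
    return alpha * cosine(cur_vec, hist_vec) + (1 - alpha) * surface
```

The same function could serve for both arguments and trigger words, since the two steps are described in parallel terms.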
Step S8: obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event.
In this embodiment, the event co-occurrence distance between the life cycle of the current event and the life cycle of the historical event corresponding to each historical event is calculated, and the event co-occurrence distance is then scaled to obtain the time window similarity value of the current event and each historical event.
Step S9: obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event.
In this embodiment, the current event text category of the current event is compared with the historical event text category corresponding to each historical event to determine whether they are consistent. If the text categories are consistent, the text category similarity value is 1 and participates in the subsequent fusion weighting; if the text categories are inconsistent, the text category similarity value is 0 and does not participate in the subsequent fusion weighting.
Step S10: performing weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event respectively, to obtain a similarity score value of the current event and each historical event.
In this embodiment, the linear parameters of the fusion weighting are preset through statistical analysis. The subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value obtained through similarity calculation are linearly fused to obtain the fusion weighting value of the current event and each historical event. Finally, global index normalization is performed to obtain the similarity score value of the current event and each historical event, that is, the event similarity score of the current event and each historical event, so that weighted rank boosting can be performed on the basis of this score in the next step.
The calculation formula of the similarity score value of the current event and the historical event is as follows.
$$\mathrm{sim}(k)=\lambda_1 s_{1k}+\lambda_2 s_{2k}+\lambda_3 s_{3k}+\lambda_4 s_{4k}+\lambda_5 s_{5k}+\lambda_6 s_{6k}+\lambda_7 s_{7k}$$
Wherein score(k) is the similarity score value of the current event and the kth historical event; Q is the total number of historical events; sim(k) is the fusion weighting value of the current event and the kth historical event; $\lambda_1$ is the linear parameter of the subject word semantic similarity value; $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event; $\lambda_2$ is the linear parameter of the abstract sentence semantic similarity value; $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\lambda_3$ is the linear parameter of the syntax similarity value; $s_{3k}$ is the syntax similarity value of the current event and the kth historical event; $\lambda_4$ is the linear parameter of the argument similarity value; $s_{4k}$ is the argument similarity value of the current event and the kth historical event; $\lambda_5$ is the linear parameter of the trigger word similarity value; $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; $\lambda_6$ is the linear parameter of the time window similarity value; $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $\lambda_7$ is the linear parameter of the text category similarity value; $s_{7k}$ is the text category similarity value of the current event and the kth historical event.
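The linear fusion of the seven similarity values may be sketched in Python as follows. The sketch covers only the fusion weighting value sim(k); the subsequent normalization into score(k) is not reproduced in the text and is deliberately left out. The name `fuse` and the example values in `LAMBDAS` are hypothetical and are not parameters given in the original disclosure.

```python
# Hypothetical linear parameters lambda_1..lambda_7 (illustrative values only).
LAMBDAS = [0.25, 0.25, 0.1, 0.1, 0.1, 0.1, 0.1]


def fuse(similarities, weights=LAMBDAS):
    """Return sim(k) = sum_i lambda_i * s_ik for one historical event.

    `similarities` lists s_1k..s_7k in the order: subject word, abstract
    sentence, syntax, argument, trigger word, time window, text category.
    """
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity value")
    return sum(w * s for w, s in zip(weights, similarities))


# Example: the text category value is 1 (same category); if it were 0, the
# seventh term would simply contribute nothing to the sum.
sim_k = fuse([0.8, 0.6, 0.5, 0.4, 0.9, 1.0, 1.0])
```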
Step S11: judging, for each historical event, whether the similarity score value of the historical event is greater than a preset threshold. If the similarity score value is not greater than the preset threshold, step S12 is executed; if the similarity score value is greater than the preset threshold, step S13 is executed.
In this embodiment, the preset threshold is 0.7. This is only an illustrative example and is not limiting; in other embodiments, the threshold may be set reasonably according to actual needs.
The weighted rank-boosting operation is performed only when the score is greater than the preset threshold. A score not greater than the preset threshold indicates that the similarity between the current event and the historical event is low and that the historical event is unlikely to be a similar event of the current event, so no further similarity judgment is needed. Only when the semantic similarity is already high are the events regarded as similar events, and closer matches are then sought among these similar events, that is, the relevance among the similar events is examined. A score below the threshold means that the events are dissimilar, and there is no need to examine regional relevance for dissimilar events.
Step S12: if the similarity score value of the historical event is less than or equal to the preset threshold, keeping the similarity score value of the historical event unchanged and taking it as the final similarity score value of the current event and the historical event.
In this embodiment, when the similarity score value of the historical event is less than or equal to the preset threshold, the similarity between the current event and the historical event is low. No operation is performed, the fused similarity weight of the historical event is kept unchanged, and the similarity score value is taken as the final similarity score value of the current event and the historical event.
Step S13: if the similarity score value of the historical event is greater than the preset threshold, performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event.
In this embodiment, when the similarity score value of the historical event is greater than the preset threshold, the similarity between the current event and the historical event is high. In order to find more similar historical events, the similarity score value of the historical event is weighted according to the current event named entity, the historical event named entity, the current event occurrence place and the historical event occurrence place; that is, the historical events are further screened by named entity similarity and occurrence place similarity so as to find the historical events with higher similarity. The weighted similarity score value of the current event and the historical event is thus obtained and is taken as the final similarity score value of the current event and the historical event.
Step S14: sorting the historical events according to the final similarity score value to obtain a similar event sorting result of the current event.
In this embodiment, the final similarity score value represents the similarity between the current event and the historical event. The larger the final similarity score value, the higher the similarity between the current event and the historical event; conversely, the smaller the final similarity score value, the lower the similarity between the current event and the historical event.
In this embodiment, the final similarity score values are sorted in descending order, that is, arranged from large to small, to obtain the similar event sorting result of the current event. Of course, in other embodiments, the values may be arranged in ascending order, and the order may be set as needed.
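The descending sort of step S14 may be sketched in Python as follows. The function name `rank_similar_events` and the representation of the scores as a dictionary keyed by event identifier are assumptions made for illustration.

```python
def rank_similar_events(final_scores, descending=True):
    """Sort (event_id, final_score) pairs by final similarity score value.

    `final_scores` maps a historical event identifier to its final score.
    Descending order puts the most similar historical event first.
    """
    return sorted(final_scores.items(), key=lambda item: item[1],
                  reverse=descending)


ranking = rank_similar_events({"event_a": 0.2, "event_b": 0.9, "event_c": 0.5})
```

Passing `descending=False` gives the ascending arrangement mentioned as an alternative.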
In the above method, a subject word semantic similarity value, an abstract sentence semantic similarity value, a syntax similarity value, an argument similarity value, a trigger word similarity value, a time window similarity value and a text category similarity value of the current event and each historical event are first obtained according to the subject words, abstracts, titles, arguments, trigger words, life cycles and text categories of the current event and the historical events. Secondly, these similarity values are weighted and fused to obtain the similarity score value of the current event and each historical event. When the similarity score value is not greater than the preset threshold, it is kept unchanged; when the similarity score value is greater than the preset threshold, named entity weighting and region weighting are performed according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value. The historical events are then sorted according to the final similarity score value to obtain a historical similar event sorting result of the current event, and the historical events with higher similarity to the current event are determined from the sorting result. Because similarity values of a plurality of features are obtained by comparing the current event with the historical events, the similarity comparison is enriched, event similarity is judged from multiple angles, the accuracy of the event similarity is improved, and the retrieved historical events are more similar to the current event.
As an exemplary embodiment, in step S13, the step of performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain the weighted similarity score value includes steps S131-S135.
Step S131: judging whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entity of the historical event and the current event named entity have the same entity.
In this embodiment, whether two places belong to the same region is determined according to a division into regions; if different event occurrence places fall within one region, they are considered to belong to the same region.
Named entities mentioned in an event, such as regions, person names and institutions, are extracted. Events that involve the same named entity are regarded as having intersecting subjects and are therefore likely to be similar events. For the extracted regions, if semantically similar events occur in the same region, they can be regarded as important similar events, since events that unfold in the same way are likely to occur there.
Specifically, the same region may be determined according to provincial administrative regions. For example, if the current event occurrence place is F City, X Province and the historical event occurrence place is J City, X Province, both occurrence places are within X Province and therefore belong to the same region. As another example, if the current event occurrence place is F City, X Province and the historical event occurrence place is O, I City, Y Province, the two occurrence places are in X Province and Y Province respectively, belong to different provinces, and therefore do not belong to the same region. Of course, in other embodiments, the same region may be determined according to municipal administrative regions; for example, events occurring in F City, X Province all belong to the same region, whereas events occurring in different cities belong to different regions. This is only an illustrative example and is not limiting.
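The province-level and city-level region checks in the examples above may be sketched in Python as follows. The `Place` structure, the `level` parameter and the function name `same_region` are hypothetical and are introduced only for illustration; the original text does not specify how occurrence places are represented.

```python
from typing import NamedTuple


class Place(NamedTuple):
    """Hypothetical occurrence place with administrative levels."""
    province: str
    city: str


def same_region(a: Place, b: Place, level: str = "province") -> bool:
    """Return True if both places fall in the same region at the given level."""
    if level == "province":
        return a.province == b.province
    if level == "city":
        # Cities in different provinces are different even if names coincide.
        return a.province == b.province and a.city == b.city
    raise ValueError(f"unsupported level: {level}")


current = Place("X", "F")
print(same_region(current, Place("X", "J")))                # True
print(same_region(current, Place("Y", "I")))                # False
print(same_region(current, Place("X", "J"), level="city"))  # False
```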
The historical event named entities of the historical event are matched against the current event named entities to determine whether any identical named entity exists; an identical named entity is an intersecting named entity.
For example, the historical event named entity of a certain historical event comprises three named entities A, B and C, the current event named entity of the current event comprises four named entities A, B, D and E, after comparison of the named entities, two identical named entities, namely A and B, are obtained, and then the historical event named entity and the current event named entity have identical entities. For another example, the historical event named entity of a certain historical event comprises A, B and C named entities, the current event named entity of the current event comprises D and E named entities, and after comparison of the named entities, the two events are obtained to have no same named entity, so that the historical event named entity and the current event named entity have no same entity.
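The named entity comparison in the two examples above amounts to a set intersection, which may be sketched in Python as follows. The function name `shared_entities` is an assumption, and entities are assumed to be compared by exact string equality.

```python
def shared_entities(current_entities, historical_entities):
    """Return the set of named entities that appear in both events."""
    return set(current_entities) & set(historical_entities)


# First example: A and B are shared, so the events have the same entity.
both = shared_entities(["A", "B", "D", "E"], ["A", "B", "C"])
# Second example: nothing is shared, so the events have no same entity.
none = shared_entities(["D", "E"], ["A", "B", "C"])
```

The size of the returned set is the number of identical entities used later for named entity weighting.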
Step S132: if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entity of the historical event and the current event named entity do not have the same entity, keeping the similarity score value of the historical event unchanged.
In this embodiment, when the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity do not have the same entity, the current event and the historical event neither occurred in the same region nor share a named entity. Neither region weighting nor named entity weighting is performed, and the similarity score value of the current event and the historical event remains unchanged.
Step S133: if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entity of the historical event and the current event named entity do not have the same entity, performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity do not have the same entity, the current event and the historical event occurred in the same region but do not share a named entity. Region weighting is performed on the similarity score value of the current event and the historical event, named entity weighting is not performed, and the weighted similarity score value is obtained after the region weighting.
Step S134: if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entity of the historical event and the current event named entity have the same entity, performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity have the same entity, the current event and the historical event did not occur in the same region but share a named entity. Named entity weighting is performed on the similarity score value of the current event and the historical event, region weighting is not performed, and the weighted similarity score value is obtained after the named entity weighting.
Step S135: if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entity of the historical event and the current event named entity have the same entity, performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity have the same entity, the current event and the historical event occurred in the same region and share a named entity. Both region weighting and named entity weighting are performed on the similarity score value of the current event and the historical event, and the weighted similarity score value is obtained after the two weightings are applied together.
In the above steps, region comparison is performed on the occurrence places of the current event and the historical event, and entity comparison is performed on their named entities. The similarity score value is then further weighted according to the region comparison result and the entity comparison result, so that historical events occurring in the same region and historical events with intersecting subjects are boosted in the ranking, their similarity score values are increased, and the event similarity is improved.
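The threshold check of steps S11-S13 together with the four cases of steps S132-S135 may be summarized by the following Python sketch. It only selects which weighting applies and does not compute the weighting itself. The function name `select_weighting` and the returned labels are hypothetical; the default threshold of 0.7 is the value given for this embodiment.

```python
def select_weighting(score, same_region, has_shared_entity, threshold=0.7):
    """Return the weighting that applies to one historical event."""
    if score <= threshold:
        return "none"            # step S12: score kept unchanged
    if same_region and has_shared_entity:
        return "region+entity"   # step S135: both weightings
    if same_region:
        return "region"          # step S133: region weighting only
    if has_shared_entity:
        return "entity"          # step S134: named entity weighting only
    return "none"                # step S132: score kept unchanged
```

Checking the threshold first mirrors the order of the method: region and entity comparison is only consulted for historical events that are already semantically close to the current event.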
As an exemplary embodiment, in step S133, the step of performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value includes steps S1331-S1332.
Step S1331: calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place corresponding to the historical event and the current event occurrence place.
In this embodiment, distance calculation is performed on the historical event occurrence place corresponding to the historical event and the current event occurrence place to obtain the region distance between them. Specifically, the distance calculation may compute the straight-line distance between the two event occurrence places on a map and use it as the region distance between the two places. Of course, in other embodiments, the distance calculation may instead compute the actual distance between the two event occurrence places from their longitude and latitude coordinates and use that actual distance as the region distance. This is only an illustrative example and is not limiting; in other embodiments, any distance calculation method in the prior art may be used to obtain the region distance.
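One common way to obtain a distance from longitude and latitude coordinates is the haversine great-circle formula, sketched below in Python. The text does not name a specific formula, so the choice of haversine, the function name `haversine_km` and the use of a mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

The result could then serve as the region distance d(k) between the current event and the kth historical event.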
Step S1332: performing region weighting on the similarity score value according to the region distance to obtain the weighted similarity score value.
In this embodiment, the region weighting may consist of determining a regional weighting factor according to the region distance and weighting the similarity score value with the regional weighting factor. The smaller the region distance, the greater the weighted similarity score value finally obtained.
Specifically, the regional weighting factor may be determined from δ and d(k), wherein δ is a region weighting constant with a value range of 0 to 1, and d(k) is the region distance between the current event and the kth historical event.
The similarity score value after region weighting is shown below.
Weighted similarity score = pre-weighted similarity score × (1 + regional weighting factor)
In the above steps, the rank of historical events whose occurrence place is in the same region is boosted by weighting according to the calculated region distance: the closer the occurrence places, the greater the weighted similarity score value and the higher the similarity with the current event.
As an exemplary embodiment, in step S134, the step of performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value includes steps S1341-S1342.
Step S1341: comparing the historical event named entities of the historical event with the current event named entities to obtain the number of identical entities contained in both.
In this embodiment, the historical event named entities of the historical event are compared with the current event named entities to find the identical named entities contained in the two events, and these are counted to obtain the number of identical entities contained in the historical event named entities and the current event named entities.
Step S1342: performing named entity weighting on the similarity score value according to the number of identical entities to obtain the weighted similarity score value.
In this embodiment, the named entity weighting may consist of determining a named entity weighting factor according to the number of identical entities and weighting the similarity score value with the named entity weighting factor. The greater the number of identical entities, the greater the weighted similarity score value finally obtained.
Specifically, the named entity weighting factor may be η×n(k), wherein η is a named entity weighting constant with a value range of 0 to 1, and n(k) is the number of identical named entities shared by the current event and the kth historical event.
The similarity score value after named entity weighting is shown below.
Weighted similarity score = pre-weighted similarity score × (1 + named entity weighting factor)
In the above steps, the rank of historical events that share entities with the current event is boosted by weighting according to the calculated number of identical named entities: the greater the number of identical named entities, the greater the weighted similarity score value and the higher the similarity with the current event.
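The named entity weighting of steps S1341-S1342 may be sketched in Python as follows, using the factor η×n(k) and the relation weighted score = pre-weighted score × (1 + factor) stated above. The function name `entity_weighted_score` and the example value of `eta` are hypothetical.

```python
def entity_weighted_score(score, shared_count, eta=0.1):
    """Apply named entity weighting: score * (1 + eta * n(k)).

    `shared_count` is n(k), the number of identical named entities, and
    `eta` is the named entity weighting constant in the range 0 to 1.
    """
    if not 0 <= eta <= 1:
        raise ValueError("eta must lie between 0 and 1")
    if shared_count < 0:
        raise ValueError("shared_count must not be negative")
    return score * (1 + eta * shared_count)


# Two shared entities raise a pre-weighted score of 0.8 by 20 percent.
boosted = entity_weighted_score(0.8, shared_count=2, eta=0.1)
```

With no shared entities the factor is zero and the score is returned unchanged, consistent with the case in which no named entity weighting is applied.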
As an exemplary embodiment, in step S135, the step of performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value includes steps S1351-S1353.
Step S1351: calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place corresponding to the historical event and the current event occurrence place.
In this embodiment, distance calculation is performed on the historical event occurrence place corresponding to the historical event and the current event occurrence place to obtain the region distance between them. Specifically, the distance calculation may compute the straight-line distance between the two event occurrence places on a map and use it as the region distance between the two places. Of course, in other embodiments, the distance calculation may instead compute the actual distance between the two event occurrence places from their longitude and latitude coordinates and use that actual distance as the region distance. This is only an illustrative example and is not limiting; in other embodiments, any distance calculation method in the prior art may be used to obtain the region distance.
Step S1352: comparing the historical event named entities of the historical event with the current event named entities to obtain the number of identical entities contained in both.
In this embodiment, the historical event named entities of the historical event are compared with the current event named entities to find the identical named entities contained in the two events, and these are counted to obtain the number of identical entities contained in the historical event named entities and the current event named entities.
Step S1353: performing region weighting and named entity weighting on the similarity score value according to the region distance and the number of identical entities to obtain the weighted similarity score value.
In this embodiment, the region weighting and named entity weighting may consist of determining a region named entity weighting factor according to the region distance and the number of identical named entities, and applying a dual weighting to the similarity score value with the region named entity weighting factor. The smaller the region distance and the greater the number of identical entities, the greater the weighted similarity score value finally obtained.
Specifically, the region named entity weighting factor may be 1+δ+η (k), wherein η is a named entity weighting constant with a value range of 0 to 1, and n(k) is the number of identical named entities shared by the current event and the kth historical event.
The similarity score value after region weighting and named entity weighting are applied together is shown below.
Weighted similarity score = pre-weighted similarity score × (1 + region named entity weighting factor)
In the above steps, the rank of historical events that occur in the same region and share entities with the current event is boosted by weighting according to the calculated region distance and the number of identical named entities: the smaller the region distance and the greater the number of identical named entities, the greater the weighted similarity score value and the higher the similarity with the current event.
As an exemplary embodiment, the weighted similarity score value is calculated as follows:
Wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the region distance between the current event and the kth historical event; η is the named entity weighting constant; and n(k) is the number of identical named entities shared by the current event and the kth historical event.
As an exemplary embodiment, in step S3, the step of obtaining the subject word semantic similarity value of the current event and each historical event according to the current event subject words and the historical event subject words of each historical event includes steps S301-S304.
Step S301: inputting the current event subject words and the historical event subject words of each historical event into a pre-trained word vector model respectively to obtain the current event subject word vectors contained in the current event and the historical event subject word vectors contained in each historical event.
In this embodiment, the subject words of the current event are input into a word2vector pre-trained word vector model to obtain the subject word vectors corresponding to the current event, and the subject words of each historical event are input into the word2vector pre-trained word vector model respectively to obtain the subject word vectors corresponding to each historical event.
Step S302: adding the current event subject word vectors contained in the current event to obtain the current event subject word semantic vector corresponding to the current event.
In this embodiment, the current event subject word vectors contained in the current event are added, that is, the elements in corresponding dimensions are added, to obtain the subject word semantic vector of the current event.
The calculation formula of the current event subject word semantic vector corresponding to the current event is shown as follows.
Wherein the result of the summation is the subject word semantic vector of the current event; P is the number of current event subject word vectors; and each term of the summation is the current event subject word vector corresponding to the p-th subject word in the current event.
Step S303: adding the historical event subject word vectors contained in each historical event respectively to obtain the historical event subject word semantic vector corresponding to each historical event.
In this embodiment, the historical event subject word vectors contained in each historical event are added to obtain the historical event subject word semantic vector of each historical event.
The calculation formula of the historical event subject word semantic vector corresponding to a historical event is shown as follows.
Wherein the result of the summation is the historical event subject word semantic vector corresponding to the kth historical event; Q is the number of historical event subject word vectors corresponding to the kth historical event; and each term of the summation is the historical event subject word vector corresponding to the q-th subject word in the kth historical event.
Step S304: performing cosine similarity calculation on the current event subject word semantic vector and each historical event subject word semantic vector respectively to obtain the subject word semantic similarity value of the current event and each historical event.
In this embodiment, the calculation formula of the subject word semantic similarity value of the current event and the historical event is as follows.
Wherein $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event.
In the above steps, all the subject word vectors of an event article are added, that is, the elements of corresponding dimensions are added, to obtain the subject word semantic vector of the current event and the subject word semantic vector of each historical event; cosine similarity is then calculated between the subject word semantic vector of the current event and that of each historical event to obtain the subject word semantic similarity value of the current event and each historical event.
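Steps S302-S304 (element-wise addition of the subject word vectors followed by cosine similarity) may be sketched in Python as follows. The word vectors are assumed to have already been produced by the pre-trained word vector model and are represented here as plain lists of numbers; the function names `sum_vectors`, `cosine_similarity` and `subject_word_similarity` are assumptions made for illustration.

```python
import math


def sum_vectors(vectors):
    """Add vectors element by element (corresponding dimensions)."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def subject_word_similarity(current_vectors, historical_vectors):
    """Return the subject word semantic similarity for one historical event."""
    return cosine_similarity(sum_vectors(current_vectors),
                             sum_vectors(historical_vectors))
```

Because cosine similarity ignores vector length, an event with many subject words is not favored over one with few, provided their summed vectors point in the same direction.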
As an exemplary embodiment, step S4 includes steps S401-S404 in the step of obtaining the semantic similarity value of the summary sentence of the current event and each history event according to the summary of the current event and the summary of the history event of each history event.
Step S401: the current event abstract corresponding to the current event and the historical event abstract corresponding to each historical event are respectively input into a pre-training sentence vector model to obtain a current abstract sentence vector corresponding to the current event abstract and a historical abstract sentence vector corresponding to each historical event abstract.
In this embodiment, the current event summary is mapped to current event summary sentence vectors by the BERT pre-training sentence vector model, and the historical event summary corresponding to each historical event is mapped to the historical event summary sentence vectors of that historical event by the same model.
Step S402: The current summary sentence vectors corresponding to the current event summary are added to obtain the sentence semantic vector of the current event summary.
In this embodiment, the sentence semantic vector of the current event summary is calculated as follows.
$$\vec{S}_{c} = \sum_{l=1}^{L} \vec{s}_{cl}$$
Wherein $\vec{S}_{c}$ is the sentence semantic vector of the current event summary; L is the number of current summary sentence vectors in the current event summary; and $\vec{s}_{cl}$ is the current event summary sentence vector corresponding to the l-th sentence in the current event summary.
Step S403: The historical summary sentence vectors corresponding to each historical event summary are added to obtain the sentence semantic vector of each historical event summary.
In this embodiment, the sentence semantic vector of the historical event summary corresponding to a historical event is calculated as follows.
$$\vec{S}_{hk} = \sum_{m=1}^{M} \vec{s}_{hkm}$$
Wherein $\vec{S}_{hk}$ is the sentence semantic vector of the historical event summary corresponding to the kth historical event; M is the number of historical event summary sentence vectors corresponding to the kth historical event; and $\vec{s}_{hkm}$ is the historical event summary sentence vector corresponding to the m-th summary sentence in the kth historical event.
Step S404: Cosine similarity is calculated between the sentence semantic vector of the current event summary and the sentence semantic vector of each historical event summary to obtain the summary sentence semantic similarity value of the current event and each historical event.
In this embodiment, the summary sentence semantic similarity value of the current event and a historical event is calculated as follows.
$$s_{2k} = \frac{\vec{S}_{c} \cdot \vec{S}_{hk}}{\lVert \vec{S}_{c} \rVert \, \lVert \vec{S}_{hk} \rVert}$$
Wherein $s_{2k}$ is the summary sentence semantic similarity value of the current event and the kth historical event; $\vec{S}_{c}$ is the sentence semantic vector of the current event summary; and $\vec{S}_{hk}$ is the sentence semantic vector of the historical event summary of the kth historical event.
In this step, the sentence vectors of the event summary sentences are added element-wise (elements of corresponding dimensions are summed) to obtain the sentence semantic vector of the current event and the sentence semantic vector of each historical event; the cosine of the current event vector and each historical event vector is then calculated to obtain the summary sentence semantic similarity value of the current event and each historical event.
As an exemplary embodiment, step S5, obtaining the syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event, includes steps S501-S502.
Step S501: The title edit distance between the current event title and each historical event title is obtained according to the current event title and the historical event title of each historical event.
In this embodiment, the title edit distance between the current event title and a historical event title is obtained by comparing the character-string differences between the two titles.
Step S502: The title edit distance between the current event title and each historical event title is normalized to obtain the syntactic similarity value of the current event and each historical event.
In this embodiment, the syntactic similarity calculation formula of the current event and the historical event is as follows.
Wherein $s_{3k}$ is the syntactic similarity value of the current event and the kth historical event; $t_1$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_1, t_{2k})$ is the title edit distance between the current event title and the historical event title corresponding to the kth historical event.
In this step, the syntactic similarity value is obtained by calculating the edit distance between the title of the current event and the title of the historical event and normalizing it.
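By way of non-limiting illustration, steps S501-S502 may be sketched as follows. The edit distance used here is the standard Levenshtein distance. The normalization shown, one minus the distance divided by the longer title length, is an assumption made for this sketch and is not necessarily the normalization of the embodiment; the function names are likewise hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def title_similarity(t1, t2):
    """Map the title edit distance into [0, 1]; 1 means identical titles.

    Assumed normalization (not taken from the source): 1 - ed / max(len).
    """
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 1.0  # two empty titles are treated as identical
    return 1.0 - edit_distance(t1, t2) / longest
```

Dividing by the longer title length keeps the value between 0 and 1, because the edit distance between two strings never exceeds the length of the longer one.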
As an exemplary embodiment, step S6, obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event, includes steps S601-S603.
Step S601: The current event argument and the historical event argument of each historical event are respectively input into a pre-training word vector model to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument.
In this embodiment, the current event argument is mapped to the current event argument word vector by the word2vector pre-training word vector model, and the historical event argument corresponding to each historical event is mapped to the historical event argument word vector of that historical event by the same model.
Step S602: The argument edit distance between the current event argument and each historical event argument is obtained according to the current event argument and the historical event argument of each historical event.
In this embodiment, the argument edit distance between the current event argument and the historical event argument of each historical event is obtained by edit distance calculation.
Step S603: The argument similarity value of the current event and each historical event is obtained according to the current event argument word vector, the historical event argument word vector and the argument edit distance.
In this embodiment, the calculation formula of the argument similarity values of the current event and the history event is as follows.
Wherein $s_{4k}$ is the argument similarity value of the current event and the kth historical event; the two vector terms are, respectively, the current event argument word vector corresponding to the current event and the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument corresponding to the kth historical event.
In this step, the argument similarity is calculated from the semantic similarity value and the syntactic distance value of the current event argument and the historical event argument.
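By way of non-limiting illustration, one possible way of combining a semantic value with a syntactic distance value is sketched below. The blending rule, the weight `alpha` and the function name are all assumptions made for this sketch and are not taken from the embodiment. The cosine similarity and the edit distance are assumed to have been computed beforehand.

```python
def combined_similarity(cosine, edit_dist, len_a, len_b, alpha=0.5):
    """Blend a semantic similarity with an edit-distance-based similarity.

    cosine    -- cosine similarity of the two word vectors
    edit_dist -- edit distance between the two strings
    len_a/b   -- lengths of the two strings
    alpha     -- hypothetical weight of the semantic part (assumption)
    """
    longest = max(len_a, len_b)
    # Assumed normalization: turn the distance into a similarity in [0, 1].
    syntactic = 1.0 if longest == 0 else 1.0 - edit_dist / longest
    return alpha * cosine + (1.0 - alpha) * syntactic
```

With `alpha=0.5`, two identical arguments (cosine 1, edit distance 0) score 1, while arguments that are semantically close but spelled differently still receive partial credit from the semantic term.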
As an exemplary embodiment, step S7, obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event, includes steps S701-S703.
Step S701: The current event trigger word and the historical event trigger word of each historical event are respectively input into a pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word.
In this embodiment, the trigger word of the current event is input into the word2vector pre-training word vector model to obtain the trigger word vector corresponding to the current event, and the trigger word of each historical event is input into the same model to obtain the trigger word vector corresponding to that historical event.
Step S702: The trigger word edit distance between the current event trigger word and each historical event trigger word is obtained according to the current event trigger word and the historical event trigger word of each historical event.
In this embodiment, the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event is obtained by edit distance calculation.
Step S703: The trigger word similarity value of the current event and each historical event is obtained according to the current event trigger word vector, the historical event trigger word vector and the trigger word edit distance.
In this embodiment, the calculation formula of the trigger word similarity value of the current event and the history event is as follows.
Wherein $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; the two vector terms are, respectively, the current event trigger word vector corresponding to the current event and the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word corresponding to the kth historical event.
In this step, the trigger word similarity is calculated from the semantic similarity value and the syntactic distance value of the current event trigger word and the historical event trigger word.
As an exemplary embodiment, step S8, obtaining the time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event, includes steps S801-S802.
Step S801: The event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event is calculated.
In this embodiment, the co-occurrence distance measures how much the time windows of two events overlap: the overlapping portion is the distance, so the more the time windows overlap, the greater the distance. For example, event 1 has the life cycle 2018.02.03-2018.04.01 and event 2 has the life cycle 2018.01.02-2018.04.01; the year is removed, and the overlapping months are the distance.
Step S802: The event co-occurrence distance is reduced by a preset multiple to obtain the time window similarity value of the current event and each historical event.
In this embodiment, the time window similarity value of the current event and a historical event is calculated as follows.
$$s_{6k} = \frac{td(t_c, t_{hk})}{T}$$
Wherein $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_c$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_c, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle corresponding to the kth historical event; and T is the preset multiple.
In this embodiment, the preset multiple ranges from 10 to 20; specifically, the preset multiple is 12. This is only an illustrative example and is not limiting.
In this step, the event co-occurrence distance td between the life cycle of the current event and the life cycle of the historical event is calculated, and td is scaled down to obtain the time window similarity value.
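By way of non-limiting illustration, steps S801-S802 may be sketched as follows for the example life cycles given above. The counting rule, namely that every calendar month touched by both life cycles counts as one unit of distance with the year ignored, is one reading of the embodiment and is an assumption of this sketch, as are the function names.

```python
from datetime import date


def months_covered(start, end):
    """Set of calendar months (1-12) touched by [start, end], year ignored."""
    months = set()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.add(month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def time_window_similarity(current, historical, multiple=12):
    """Co-occurrence distance (overlapping months) divided by the preset multiple."""
    overlap = months_covered(*current) & months_covered(*historical)
    return len(overlap) / multiple


event_1 = (date(2018, 2, 3), date(2018, 4, 1))
event_2 = (date(2018, 1, 2), date(2018, 4, 1))
s6 = time_window_similarity(event_1, event_2)
```

Under this counting rule the two example events share February, March and April, so the co-occurrence distance is 3 and, with a preset multiple of 12, the time window similarity value is 0.25.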
As an exemplary embodiment, step S9, obtaining the text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event, includes steps S901-S903.
Step S901: Whether the current event text category is the same as the historical event text category of each historical event is judged.
In this embodiment, the label corresponding to the current event text category is compared with the label corresponding to the historical event text category of each historical event; if the labels are consistent, step S902 is performed; if the labels are not consistent, step S903 is performed.
Step S902: If the current event text category is the same as the historical event text category, the text category similarity value is a first preset value.
In this embodiment, if the current event text category is the same as the historical event text category, the current event and the historical event belong to the same category, and the similarity of the two events is higher. Specifically, the first preset value is 1; when the text category similarity value is 1, the historical event participates in the subsequent weighted fusion of the similarity values.
Step S903: If the current event text category is different from the historical event text category, the text category similarity value is a second preset value, and the second preset value is smaller than the first preset value.
In this embodiment, if the current event text category is different from the historical event text category, the two events belong to different categories, and the similarity of the two events is lower. Specifically, the second preset value is 0; when the text category similarity value is 0, the historical event does not participate in the subsequent weighted fusion of the similarity values.
The text category similarity value of the current event and a historical event is calculated as follows.
$$s_{7k} = \begin{cases} 1, & Label_{ec} = Label_{eh}(k) \\ 0, & Label_{ec} \neq Label_{eh}(k) \end{cases}$$
Wherein $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $Label_{ec}$ is the current event text category corresponding to the current event; and $Label_{eh}(k)$ is the historical event text category corresponding to the kth historical event.
In this step, the event category of the current event is compared with that of the historical event; if they are consistent, the historical event participates in the subsequent weighted fusion, and otherwise it does not.
As an exemplary embodiment, after step S14 of sorting the historical events according to the final similarity score value to obtain the similar event sorting result of the current event, the method further includes steps S15-S16.
Step S15: The similar event matching number is obtained.
In this embodiment, the similar event matching number is determined according to user requirements. Specifically, the similar event matching number may be 5, 10, etc.; this is only an illustrative example and is not limiting.
Step S16: According to the similar event matching number, the historical events with the largest final similarity score values are selected from the similar event sorting result to obtain that number of similar historical events.
In this embodiment, according to the similar event matching number, the historical events with the larger similarity scores are taken as the historical similar events corresponding to the current event.
In this step, according to the final similarity score value, the historical events with the largest score values, up to the similar event matching number, are selected as the historical similar events corresponding to the current event.
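By way of non-limiting illustration, steps S15-S16 may be sketched as follows. The function name, the event identifiers and the score values are made up for this sketch.

```python
import heapq


def top_similar_events(final_scores, match_count):
    """Return the match_count historical events with the largest final scores.

    final_scores -- dict mapping a historical event id to its final score
    match_count  -- similar event matching number chosen by the user
    """
    return heapq.nlargest(match_count, final_scores.items(),
                          key=lambda item: item[1])


# Made-up final similarity score values for four historical events.
scores = {"h1": 0.42, "h2": 0.91, "h3": 0.67, "h4": 0.15}
top_two = top_similar_events(scores, 2)
```

The result is ordered from the largest score downwards, so the first entry is the historical event most similar to the current event.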
A specific example will be described in detail below, and a flowchart is shown in fig. 2.
First, a word2vector model is trained without supervision on a massive text data set. The historical data is segmented with the jieba segmenter, each word being treated as the smallest semantic unit. The semantic features of each word are learned from its context in the massive text data, and the model is saved.
A BERT model is likewise trained without supervision on the massive text data set. Word vectors and sentence vectors are obtained by learning from the context of the text.
Subject words are extracted and mapped to word vectors, and a similarity weight is obtained by calculating cosine similarity.
All subject word vectors of an event article are added element-wise (elements of corresponding dimensions are summed) to obtain the subject word semantic vector of the current event and the subject word semantic vector of the historical event, and the cosine of the two vectors is calculated to obtain the subject word semantic similarity value.
A text summary is extracted, the summary sentences are mapped to sentence vectors, and a similarity weight is obtained by calculating cosine similarity. The sentence vectors of all summary sentences of the event are added element-wise to obtain the sentence semantic vector of the current event and the sentence semantic vector of the historical event, and the cosine of the two vectors is calculated to obtain the summary sentence semantic similarity value.
The syntactic similarity value is obtained by calculating the edit distance between the title of the current event and the title of the historical event and normalizing it.
The argument similarity is calculated from the semantic similarity value and the syntactic distance value of the current event argument and the historical event argument.
The trigger word similarity is calculated from the semantic similarity value and the syntactic distance value of the current event trigger word and the historical event trigger word.
The event co-occurrence distance td between the life cycle of the current event and the life cycle of the historical event is calculated, and td is scaled down to obtain the time window similarity value.
The event category of the current event is compared with that of the historical event; if they are consistent, the historical event participates in the subsequent weighted fusion, and otherwise it does not.
The similarity values obtained by the above calculations are linearly fused using linear parameters preset through statistical analysis, and global exponential normalization is finally applied to obtain the event similarity score, which prepares for the score-based weighting and ranking in the next step.
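By way of non-limiting illustration, the linear fusion may be sketched as a weighted sum of the seven similarity values. The weights and similarity values below are hypothetical and are not the preset linear parameters of the embodiment; the normalization step is left out of this sketch because its exact form is not given here.

```python
def fuse(similarities, weights):
    """Weighted linear fusion of per-feature similarity values."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity value")
    return sum(w * s for w, s in zip(weights, similarities))


# Order: subject word, summary sentence, syntactic, argument,
# trigger word, time window, text category.
weights = [0.2, 0.2, 0.15, 0.15, 0.1, 0.1, 0.1]        # hypothetical weights
similarities = [0.8, 0.7, 0.5, 0.6, 0.9, 0.25, 1.0]    # made-up values
score = fuse(similarities, weights)
```

Keeping the weights summing to 1 means that a fused score stays within the range of its inputs, which makes a single preset threshold easier to interpret across events.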
The similarity score is then weighted upwards in one of four cases, and this weighting is applied only when score is greater than a preset threshold. (1) Same region but no shared entity: the score is weighted according to the region distance; the closer the distance, the larger the weighting and the larger the final score. (2) Shared entities but different regions: the score is weighted according to the number of co-occurring entities; the larger the number, the larger the final score. (3) Same region and shared entities: the weightings of (1) and (2) are combined. (4) Neither the same region nor a shared entity: the score is unchanged.
The final scores are ranked to obtain the final ranking. The top Z events are the Z most similar events.
In this method, similarity is calculated from the semantic features, syntactic features, entity features and event attribute features of the events; the similarity weights of all features are then linearly fused, and the results are weighted and ranked to obtain the several events with the greatest similarity. Fusing and comparing multi-dimensional information allows historical similar events to be acquired accurately and improves the accuracy of similar event search.
This embodiment also provides a multi-dimensional feature fusion similar event computing system for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The embodiment also provides a multi-dimensional feature fusion similar event computing system, as shown in fig. 3, including:
the first acquisition module 1 is configured to obtain a current event subject word, a current event summary, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
the second acquisition module 2 is configured to obtain a historical event subject word, a historical event summary, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
the first processing module 3 is configured to obtain a topic word semantic similarity value of the current event and each historical event according to the current event topic word and the historical event topic word of each historical event;
the second processing module 4 is configured to obtain a summary sentence semantic similarity value of the current event and each historical event according to the current event summary and the historical event summary of each historical event;
the third processing module 5 is configured to obtain a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
the fourth processing module 6 is configured to obtain an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
a fifth processing module 7, configured to obtain a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
a sixth processing module 8, configured to obtain a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
a seventh processing module 9, configured to obtain a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
the eighth processing module 10 is configured to perform weighted fusion on the topic word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event, so as to obtain a similarity score value of the current event and each historical event;
a judging module 11, configured to respectively judge whether a similarity score value of each historical event is greater than a preset threshold;
the ninth processing module 12 is configured to, if the similarity score of the historical event is less than or equal to a preset threshold, keep the similarity score of the historical event unchanged, and use the similarity score as a final similarity score of the current event and the historical event;
a tenth processing module 13, configured to, if the similarity score value of the historical event is greater than the preset threshold, perform named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and use the weighted similarity score value as the final similarity score value of the current event and the historical event;
the eleventh processing module 14 is configured to sort the historical events according to the final similarity score value to obtain a similar event sorting result of the current event.
Optionally, the tenth processing module includes: a first judging sub-module, configured to judge whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and whether the historical event named entity of the historical event and the current event named entity share an identical entity; a first processing sub-module, configured to keep the similarity score value of the historical event unchanged if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity share no identical entity; a second processing sub-module, configured to perform region weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity share no identical entity; a third processing sub-module, configured to perform named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity share an identical entity; and a fourth processing sub-module, configured to perform region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity share an identical entity.
Optionally, the second processing sub-module includes: the first processing unit is used for calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; and the second processing unit is used for carrying out regional weighting on the similarity score value according to the regional distance to obtain the weighted similarity score value.
Optionally, the third processing sub-module includes: the third processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the fourth processing unit is used for carrying out named entity weighting on the similar score values according to the number of the same entities to obtain weighted similar score values.
Optionally, the fourth processing sub-module includes: the fifth processing unit is used for calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event; the sixth processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the seventh processing unit is used for carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
Optionally, the weighted similarity score value is calculated as follows:
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is a region weighting constant; d(k) is the region distance between the current event and the kth historical event; η is a named entity weighting constant; and n(k) is the number of identical named entities shared by the current event and the kth historical event.
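By way of non-limiting illustration, the thresholded region weighting and named entity weighting may be sketched as follows. The additive form used here, a region bonus δ/(1 + d(k)) that grows as the distance shrinks plus an entity bonus η·n(k), is an assumption made for this sketch and is not the formula of the embodiment; the function name and the default values of `delta` and `eta` are likewise hypothetical.

```python
def weight_score(score, threshold, same_region, distance, shared_entities,
                 delta=0.1, eta=0.05):
    """Apply region and named entity weighting to one similarity score value.

    score           -- score(k), the similarity score value before weighting
    threshold       -- preset threshold; scores at or below it stay unchanged
    same_region     -- True if both occurrence places belong to the same region
    distance        -- d(k), region distance (non-negative)
    shared_entities -- n(k), number of identical named entities
    delta, eta      -- region / named entity weighting constants (hypothetical)
    """
    if score <= threshold:
        return score
    weighted = score
    if same_region:
        weighted += delta / (1.0 + distance)  # assumed: closer means larger bonus
    if shared_entities > 0:
        weighted += eta * shared_entities     # assumed: more entities, larger bonus
    return weighted
```

The two `if` branches cover the four cases of the sub-modules: neither condition holds (score unchanged), only the region matches, only entities are shared, or both apply and the two bonuses are added together.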
Optionally, the first processing module includes: a fifth processing sub-module, configured to input a current event subject word and a history event subject word of each history event into the pre-training word vector model respectively, to obtain a current event subject word vector contained in the current event and a history event subject word vector contained in each history event; a sixth processing sub-module, configured to add the current event subject word vectors included in the current event to obtain a current event subject word semantic vector corresponding to the current event; the seventh processing sub-module is used for respectively adding the historical event subject word vectors contained in each historical event to obtain a historical event subject word semantic vector corresponding to each historical event; and the eighth processing sub-module is used for respectively carrying out cosine similarity calculation on the current event subject word semantic vector and each history event subject word semantic vector to obtain a subject word semantic similarity value of the current event and each history event.
Optionally, the second processing module includes: a ninth processing sub-module, configured to input a current event summary corresponding to the current event and a historical event summary corresponding to each historical event into the bert pre-training sentence vector model, respectively, to obtain a current summary sentence vector corresponding to the current event summary and a historical summary sentence vector corresponding to each historical event summary; a tenth processing sub-module, configured to add the current abstract sentence vector corresponding to the current event abstract to obtain a sentence semantic vector of the current event abstract; an eleventh processing sub-module, configured to add the historical summary sentence vectors corresponding to each historical event summary to obtain a sentence semantic vector of each historical event summary; and the twelfth processing sub-module is used for respectively carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Optionally, the third processing module includes: a thirteenth processing sub-module, configured to obtain the title edit distance between the current event title and each historical event title according to the current event title and the historical event title of each historical event; and a fourteenth processing sub-module, configured to normalize the title edit distance between the current event title and each historical event title to obtain a syntactic similarity value of the current event and each historical event.
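As an illustration of the title comparison, the following Python sketch computes a Levenshtein edit distance and normalizes it. The normalization used here, one minus the distance divided by the longer title length, is an assumption chosen for illustration and is not taken from the patent; the names `edit_distance` and `title_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def title_similarity(current_title: str, historical_title: str) -> float:
    """Map the edit distance into [0, 1]; 1.0 means identical titles.

    Assumed normalization: 1 - distance / length of the longer title.
    """
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0  # two empty titles are treated as identical
    return 1.0 - edit_distance(current_title, historical_title) / longest
```

Dividing by the longer length keeps the value in the range 0 to 1, because the edit distance can never exceed that length.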
Optionally, the fourth processing module includes: a fifteenth processing sub-module, configured to input a current event argument and a history event argument of each history event into the pre-training word vector model, to obtain a current event argument word vector corresponding to the current event argument and a history event argument word vector corresponding to each history event argument; a sixteenth processing sub-module, configured to obtain an argument edit distance of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event respectively; seventeenth processing sub-module, configured to obtain an argument similarity value between the current event and each historical event according to the argument word vector of the current event, the argument word vector of the historical event, and the argument editing distance.
Optionally, the fifth processing module includes: the eighteenth processing sub-module is used for respectively inputting the current event trigger word and the historical event trigger word of each historical event into the pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; a nineteenth processing sub-module, configured to obtain trigger word edit distances of the current event trigger word and each historical event trigger word according to the current event trigger word and the historical event trigger word of each historical event respectively; and the twentieth processing sub-module is used for obtaining the trigger word similarity value of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the trigger word editing distance.
Optionally, the sixth processing module includes: a twenty-first processing sub-module, configured to calculate an event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and a twenty-second processing sub-module, configured to reduce the event co-occurrence distance by a preset multiple to obtain a time window similarity value of the current event and each historical event.
Optionally, the seventh processing module includes: a second judging sub-module, configured to judge whether the current event text category is the same as the historical event text category of each historical event; a twenty-third processing sub-module, configured to, if the current event text category is the same as the historical event text category, set the text category similarity value to a first preset value; and a twenty-fourth processing sub-module, configured to, if the current event text category and the historical event text category are different, set the text category similarity value to a second preset value, where the second preset value is smaller than the first preset value.
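The text category rule can be sketched in a few lines of Python. The default values 1.0 and 0.0 are assumptions for illustration; the text only requires that the second preset value be smaller than the first. The function name `text_category_similarity` is hypothetical.

```python
def text_category_similarity(current_category: str,
                             historical_category: str,
                             same_value: float = 1.0,
                             different_value: float = 0.0) -> float:
    """Return the first preset value for matching categories, else the second.

    same_value and different_value are illustrative defaults only.
    """
    if different_value >= same_value:
        raise ValueError("the second preset value must be smaller than the first")
    if current_category == historical_category:
        return same_value
    return different_value
```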
Optionally, the subject word semantic similarity value of the current event and each historical event is calculated as follows:

$$s_{1k} = \cos\left(\vec{V}_c, \vec{V}_{hk}\right) = \frac{\vec{V}_c \cdot \vec{V}_{hk}}{\lVert \vec{V}_c \rVert \, \lVert \vec{V}_{hk} \rVert}, \qquad \vec{V}_c = \sum_{p=1}^{P} \vec{v}_{c,p}, \qquad \vec{V}_{hk} = \sum_{q=1}^{Q} \vec{v}_{hk,q}$$

wherein $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event; $\vec{V}_c$ is the current event subject word semantic vector; $P$ is the number of current event subject word vectors; $\vec{v}_{c,p}$ is the current event subject word vector corresponding to the p-th subject word in the current event; $\vec{V}_{hk}$ is the historical event subject word semantic vector corresponding to the kth historical event; $Q$ is the number of historical event subject word vectors corresponding to the kth historical event; and $\vec{v}_{hk,q}$ is the historical event subject word vector corresponding to the q-th subject word in the kth historical event.
Optionally, the calculation formula of the semantic similarity value of the abstract sentence of the current event and each historical event is as follows:
$$s_{2k} = \cos\left(\vec{S}_c, \vec{S}_{hk}\right) = \frac{\vec{S}_c \cdot \vec{S}_{hk}}{\lVert \vec{S}_c \rVert \, \lVert \vec{S}_{hk} \rVert}, \qquad \vec{S}_c = \sum_{l=1}^{L} \vec{u}_{c,l}, \qquad \vec{S}_{hk} = \sum_{m=1}^{M} \vec{u}_{hk,m}$$

wherein $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\vec{S}_c$ is the sentence semantic vector of the current event abstract; $L$ is the number of current abstract sentence vectors in the current event abstract; $\vec{u}_{c,l}$ is the current abstract sentence vector corresponding to the l-th sentence in the current event abstract; $\vec{S}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $M$ is the number of historical abstract sentence vectors corresponding to the kth historical event; and $\vec{u}_{hk,m}$ is the historical abstract sentence vector corresponding to the m-th abstract sentence in the kth historical event.
Optionally, the syntactic similarity calculation formula for the current event and each historical event is as follows:
wherein $s_{3k}$ is the syntactic similarity value of the current event and the kth historical event; $t_1$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_1, t_{2k})$ is the title edit distance between the current event title and the historical event title of the kth historical event.
Optionally, the calculation formula of the argument similarity value of the current event and each of the historical events is as follows:
wherein $s_{4k}$ is the argument similarity value of the current event and the kth historical event; the two vector terms are the current event argument word vector corresponding to the current event and the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument of the kth historical event.
Optionally, the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
wherein $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; the two vector terms are the current event trigger word vector corresponding to the current event and the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word of the kth historical event.
Optionally, the time window similarity value calculation formula of the current event and each historical event is as follows:
$$s_{6k} = \frac{td(t_c, t_{hk})}{T}$$

wherein $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_c$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_c, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle of the kth historical event; and $T$ is the preset multiple.
Optionally, the text category similarity value calculation formula of the current event and each historical event is as follows:
$$s_{7k} = \begin{cases} \text{first preset value}, & Label_{ec} = Label_{eh}(k) \\ \text{second preset value}, & Label_{ec} \neq Label_{eh}(k) \end{cases}$$

wherein $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $Label_{ec}$ is the current event text category corresponding to the current event; and $Label_{eh}(k)$ is the historical event text category corresponding to the kth historical event.
Optionally, the system further comprises: a third acquisition module, configured to acquire the matching number of similar events; and a twelfth processing module, configured to select, from the similar event sorting result, the historical events with the highest final similarity score values according to the matching number of similar events, to obtain that number of similar historical events.
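Selecting the top matches from the sorted result can be sketched as follows. The sketch assumes the final similarity score values are available as a mapping from a historical event identifier to its score; the identifiers and the function name `top_similar_events` are hypothetical.

```python
from typing import Dict, List, Tuple


def top_similar_events(final_scores: Dict[str, float],
                       match_count: int) -> List[Tuple[str, float]]:
    """Sort historical events by final score, highest first, and keep the top N."""
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max(match_count, 0)]  # a non-positive count yields no events
```

For example, `top_similar_events({"e1": 0.2, "e2": 0.9, "e3": 0.5}, 2)` keeps `e2` and `e3`, in that order.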
The multi-dimensional feature fusion similar event computing system of this embodiment is presented in the form of functional units, where a unit refers to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, and/or other devices that can provide the functionality described above.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the invention also provides an electronic device which, as shown in fig. 4, includes one or more processors 71 and a memory 72; one processor 71 is taken as an example in fig. 4.
The electronic device may further include an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or in other ways; a bus connection is taken as an example in fig. 4.
The processor 71 may be a central processing unit (Central Processing Unit, CPU). The processor 71 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the multi-dimensional feature fusion similar event calculation method in the embodiments of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory 72, the processor 71 executes the various functional applications and data processing of the server, i.e., implements the multi-dimensional feature fusion similar event calculation method of the above method embodiment.
Memory 72 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created according to the use of a processing device operated by the server, or the like. In addition, memory 72 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 72 may optionally include memory located remotely from processor 71, such remote memory being connectable to the network connection device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 that, when executed by the one or more processors 71, perform the method shown in fig. 1.
It will be appreciated by those skilled in the art that all or part of the processes of the above method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer readable storage medium and, when executed, may include the processes of the embodiments of the multi-dimensional feature fusion similar event calculation method described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD), or a Solid-State Drive (SSD); the storage medium may also comprise a combination of the above kinds of memory.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (8)

1. A multi-dimensional feature fusion similar event calculation method, characterized by comprising the following steps:
acquiring a current event subject word, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event naming entity and a current event occurrence place of a current event;
acquiring a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event naming entity and a historical event occurrence place of each historical event;
obtaining a subject word semantic similarity value of the current event and each historical event according to the subject word of the current event and the subject word of each historical event;
the step of obtaining the semantic similarity value of the current event and the subject word of each historical event according to the current event subject word and the historical event subject word of each historical event comprises the following steps: respectively inputting the current event subject word and the history event subject word of each history event into a pre-training word vector model to obtain a current event subject word vector contained in the current event and a history event subject word vector contained in each history event; adding the current event subject word vectors contained in the current event to obtain a current event subject word semantic vector corresponding to the current event; adding the historical event subject word vectors contained in each historical event respectively to obtain a historical event subject word semantic vector corresponding to each historical event; cosine similarity calculation is carried out on the current event subject word semantic vector and each historical event subject word semantic vector respectively, so that a subject word semantic similarity value of the current event and each historical event is obtained;
Obtaining the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
the step of obtaining the semantic similarity value of the summary sentences of the current event and each historical event according to the current event summary and the historical event summary of each historical event comprises the following steps: respectively inputting a current event abstract corresponding to a current event and a historical event abstract corresponding to each historical event into a bert pre-training sentence vector model to obtain a current abstract sentence vector corresponding to the current event abstract and a historical abstract sentence vector corresponding to each historical event abstract; adding the current abstract sentence vector corresponding to the current event abstract to obtain the sentence semantic vector of the current event abstract; respectively adding the sentence vectors of the history abstracts corresponding to each history event abstract to obtain the sentence semantic vector of each history event abstract; carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract respectively to obtain abstract sentence semantic similarity values of the current event and each historical event;
Obtaining a syntax similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
the step of obtaining the syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event comprises the following steps: respectively obtaining the title editing distance of the current event title and each historical event title according to the current event title and the historical event title of each historical event; respectively carrying out normalization processing on the title editing distance of the current event title and each historical event title to obtain a syntax similarity value of the current event and each historical event;
obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
the step of obtaining the similar value of the current event and the argument of each historical event according to the current event argument and the historical event argument of each historical event comprises the following steps: respectively inputting the current event argument and the historical event argument of each historical event into a pre-training word vector model to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument; respectively obtaining the argument edit distance of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event; obtaining an argument similarity value of the current event and each historical event according to the argument word vector of the current event, the argument word vector of the historical event and the argument editing distance;
Obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
the step of obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event comprises the following steps: respectively inputting the current event trigger word and the historical event trigger word of each historical event into a pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; respectively obtaining trigger word editing distances of the current event trigger word and each historical event trigger word according to the current event trigger word and the historical event trigger word of each historical event; obtaining trigger word similarity values of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the trigger word editing distance;
obtaining a time window similarity value of the current event and each historical event according to the life cycle of the current event and the life cycle of the historical event of each historical event;
The step of obtaining the time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event comprises the following steps: calculating the event co-occurrence distance of the current event life cycle and the historical event life cycle of each historical event; reducing the co-occurrence distance of the events by a preset multiple to obtain a time window similarity value of the current event and each historical event;
obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
the step of obtaining the text category similarity value of the current event and each historical event according to the text category of the current event and the text category of each historical event comprises the following steps: judging whether the current event text category is the same as the historical event text category of each historical event or not respectively; if the current event text category is the same as the historical event text category, the text category similarity value is a first preset value; if the current event text category is different from the historical event text category, the text category similarity value is a second preset value, and the second preset value is smaller than the first preset value;
Respectively carrying out weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain similarity score values of the current event and each historical event;
judging whether the similarity score value of each historical event is larger than a preset threshold value or not respectively;
if the similarity score value of the historical event is smaller than or equal to a preset threshold value, the similarity score value of the historical event is kept unchanged, and the similarity score value is used as the final similarity score value of the current event and the historical event;
if the similarity score value of the historical event is larger than a preset threshold value, carrying out naming entity and region weighting on the similarity score value of the historical event according to the current event naming entity, the current event occurrence place, the historical event naming entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as a final similarity score value of the current event and the historical event;
the step of performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain the weighted similarity score value comprises the following steps: judging whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entity of the historical event and the current event named entity have a same entity; if the historical event occurrence place and the current event occurrence place do not belong to the same region, and the historical event named entity and the current event named entity do not have a same entity, keeping the similarity score value of the historical event unchanged; if the historical event occurrence place and the current event occurrence place belong to the same region, and the historical event named entity and the current event named entity do not have a same entity, performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value; if the historical event occurrence place and the current event occurrence place do not belong to the same region, and the historical event named entity and the current event named entity have a same entity, performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value; if the historical event occurrence place and the current event occurrence place belong to the same region, and the historical event named entity and the current event named entity have a same entity, performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value;
and sorting the historical events according to the final similarity score value to obtain a similar event sorting result of the current event.
2. The method for computing multi-dimensional feature fusion similarity events of claim 1,
the step of performing regional weighting on the similarity score value of the historical event to obtain the weighted similarity score value comprises the following steps:
calculating the regional distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event;
carrying out regional weighting on the similarity score value according to the regional distance to obtain a weighted similarity score value;
the step of weighting the similarity score value of the historical event by using the named entity to obtain the weighted similarity score value comprises the following steps:
comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity;
carrying out named entity weighting on the similarity score according to the number of the same entities to obtain a weighted similarity score;
the step of performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value comprises the following steps:
Calculating the regional distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place and the current event occurrence place corresponding to the historical event;
comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity;
and carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
3. The method for computing multi-dimensional feature fusion similarity events of claim 2,
the weighted similarity score value is calculated as follows:
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is a region weighting constant; d(k) is the geographical distance between the current event and the kth historical event; η is a named entity weighting constant; n(k) is the number of identical named entities shared by the current event and the kth historical event.
4. The method for computing multi-dimensional feature fusion similarity events of claim 1,
The calculation formula of the semantic similarity value of the subject terms of the current event and each historical event is as follows:
$$s_{1k} = \cos\left(\vec{V}_c, \vec{V}_{hk}\right) = \frac{\vec{V}_c \cdot \vec{V}_{hk}}{\lVert \vec{V}_c \rVert \, \lVert \vec{V}_{hk} \rVert}, \qquad \vec{V}_c = \sum_{p=1}^{P} \vec{v}_{c,p}, \qquad \vec{V}_{hk} = \sum_{q=1}^{Q} \vec{v}_{hk,q}$$

wherein $s_{1k}$ is the subject word semantic similarity value of the current event and the kth historical event; $\vec{V}_c$ is the current event subject word semantic vector; $P$ is the number of current event subject word vectors; $\vec{v}_{c,p}$ is the current event subject word vector corresponding to the p-th subject word in the current event; $\vec{V}_{hk}$ is the historical event subject word semantic vector corresponding to the kth historical event; $Q$ is the number of historical event subject word vectors corresponding to the kth historical event; $\vec{v}_{hk,q}$ is the historical event subject word vector corresponding to the q-th subject word in the kth historical event;
the calculation formula of the semantic similarity value of the abstract sentences of the current event and each historical event is as follows:
$$s_{2k} = \cos\left(\vec{S}_c, \vec{S}_{hk}\right) = \frac{\vec{S}_c \cdot \vec{S}_{hk}}{\lVert \vec{S}_c \rVert \, \lVert \vec{S}_{hk} \rVert}, \qquad \vec{S}_c = \sum_{l=1}^{L} \vec{u}_{c,l}, \qquad \vec{S}_{hk} = \sum_{m=1}^{M} \vec{u}_{hk,m}$$

wherein $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\vec{S}_c$ is the sentence semantic vector of the current event abstract; $L$ is the number of current abstract sentence vectors in the current event abstract; $\vec{u}_{c,l}$ is the current abstract sentence vector corresponding to the l-th sentence in the current event abstract; $\vec{S}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $M$ is the number of historical abstract sentence vectors corresponding to the kth historical event; $\vec{u}_{hk,m}$ is the historical abstract sentence vector corresponding to the m-th abstract sentence in the kth historical event;
the syntactic similarity calculation formula for the current event and each historical event is as follows:
wherein $s_{3k}$ is the syntactic similarity value of the current event and the kth historical event; $t_1$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; $ed(t_1, t_{2k})$ is the title edit distance between the current event title and the historical event title of the kth historical event;
the calculation formula of the argument similarity value of the current event and each historical event is as follows:
wherein $s_{4k}$ is the argument similarity value of the current event and the kth historical event; the two vector terms are the current event argument word vector corresponding to the current event and the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument of the kth historical event;
the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
wherein $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; the two vector terms are the current event trigger word vector corresponding to the current event and the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word of the kth historical event;
the time window similarity calculation formula of the current event and each historical event is as follows:
wherein s6k is a time window similarity value of the current event and the kth historical event; t is t c The life cycle of the current event corresponding to the current event; t is t hk The life cycle of the historical event corresponding to the kth historical event; td (t) c ,t hk ) The co-occurrence distance of the life cycle of the current event corresponding to the current event and the life cycle of the historical event corresponding to the kth historical event is set; t is a preset multiple;
the text category similarity calculation formula of the current event and each historical event is as follows:
wherein s7k is a text category similarity value of the current event and the kth historical event; LAble (LAble) ec The text category of the current event corresponding to the current event; LAble (LAble) eh (k) And the text category of the historical event corresponding to the kth historical event.
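As an illustration only, the following Python sketch gives one plausible reading of three of the definitions above (title edit distance, life cycle co-occurrence, text category match). It is not the claimed formula: the normalisation by the longer title's length, the reading of the co-occurrence distance as the number of overlapping days, the preset values 1.0 and 0.0, and every function and parameter name are assumptions made for the example.

```python
from datetime import date
from typing import Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def syntactic_similarity(current_title: str, historical_title: str) -> float:
    """ASSUMPTION: normalise the edit distance by the longer title's length."""
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0  # two empty titles are treated as identical
    return 1.0 - edit_distance(current_title, historical_title) / longest


Lifecycle = Tuple[date, date]  # hypothetical (start, end) representation


def co_occurrence_days(current: Lifecycle, historical: Lifecycle) -> int:
    """ASSUMPTION: co-occurrence distance = days on which both events are live."""
    start = max(current[0], historical[0])
    end = min(current[1], historical[1])
    return max(0, (end - start).days + 1)


def time_window_similarity(current: Lifecycle, historical: Lifecycle,
                           preset_multiple: float) -> float:
    """Scale the co-occurrence distance down by the preset multiple."""
    return co_occurrence_days(current, historical) / preset_multiple


def text_category_similarity(current_label: str, historical_label: str,
                             same_value: float = 1.0,
                             different_value: float = 0.0) -> float:
    """Return the first preset value on a match, the smaller second one otherwise."""
    return same_value if current_label == historical_label else different_value
```

With these assumptions, identical titles score 1.0, life cycles that never overlap score 0.0 on the time window dimension, and the category dimension is a simple two-valued indicator.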
5. The multi-dimensional feature fusion similar event calculation method according to claim 1, wherein after the step of sorting the historical events according to the final similarity score value to obtain the similar event ranking result of the current event, the method further comprises:
obtaining a similar event match count;
and selecting, from the similar event ranking result, the historical events with the highest final similarity score values according to the similar event match count, to obtain that number of similar historical events.
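The selection step of claim 5 can be sketched in Python as a top-N pick over the final scores. This is a minimal illustration, assuming the final similarity score values are held in a dictionary keyed by a hypothetical historical event identifier; the function name and data layout are not from the patent.

```python
import heapq
from typing import Dict, List, Tuple


def top_similar_events(final_scores: Dict[str, float],
                       match_count: int) -> List[Tuple[str, float]]:
    """Return the `match_count` historical events with the largest final scores.

    `final_scores` maps a (hypothetical) historical event id to its final
    similarity score value; the result is ordered from most to least similar.
    """
    return heapq.nlargest(match_count, final_scores.items(),
                          key=lambda item: item[1])
```

If fewer historical events exist than the requested match count, `heapq.nlargest` simply returns all of them in descending score order.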
6. A multi-dimensional feature fusion similar event calculation system, comprising:
a first acquisition module, configured to acquire the current event subject words, current event summary, current event title, current event argument, current event trigger word, current event life cycle, current event text category, current event named entity and current event occurrence place of a current event;
a second acquisition module, configured to acquire the historical event subject words, historical event summary, historical event title, historical event argument, historical event trigger word, historical event life cycle, historical event text category, historical event named entity and historical event occurrence place of each historical event;
a first processing module, configured to obtain a subject word semantic similarity value of the current event and each historical event according to the current event subject words and the historical event subject words of each historical event;
wherein the first processing module includes: a fifth processing sub-module, configured to input the current event subject words and the historical event subject words of each historical event into a pre-trained word vector model respectively, to obtain the current event subject word vectors contained in the current event and the historical event subject word vectors contained in each historical event; a sixth processing sub-module, configured to add the current event subject word vectors contained in the current event, to obtain a current event subject word semantic vector corresponding to the current event; a seventh processing sub-module, configured to add the historical event subject word vectors contained in each historical event respectively, to obtain a historical event subject word semantic vector corresponding to each historical event; and an eighth processing sub-module, configured to calculate the cosine similarity between the current event subject word semantic vector and each historical event subject word semantic vector, to obtain the subject word semantic similarity value of the current event and each historical event;
a second processing module, configured to obtain a summary sentence semantic similarity value of the current event and each historical event according to the current event summary and the historical event summary of each historical event;
wherein the second processing module includes: a ninth processing sub-module, configured to input the current event summary corresponding to the current event and the historical event summary corresponding to each historical event into a BERT pre-trained sentence vector model respectively, to obtain the current summary sentence vectors corresponding to the current event summary and the historical summary sentence vectors corresponding to each historical event summary; a tenth processing sub-module, configured to add the current summary sentence vectors corresponding to the current event summary, to obtain a sentence semantic vector of the current event summary; an eleventh processing sub-module, configured to add the historical summary sentence vectors corresponding to each historical event summary, to obtain a sentence semantic vector of each historical event summary; and a twelfth processing sub-module, configured to calculate the cosine similarity between the sentence semantic vector of the current event summary and the sentence semantic vector of each historical event summary, to obtain the summary sentence semantic similarity value of the current event and each historical event;
a third processing module, configured to obtain a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
wherein the third processing module includes: a thirteenth processing sub-module, configured to obtain the title edit distance between the current event title and each historical event title according to the current event title and the historical event title of each historical event respectively; and a fourteenth processing sub-module, configured to normalize the title edit distance between the current event title and each historical event title, to obtain the syntactic similarity value of the current event and each historical event;
a fourth processing module, configured to obtain an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
wherein the fourth processing module includes: a fifteenth processing sub-module, configured to input the current event argument and the historical event argument of each historical event into the pre-trained word vector model respectively, to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument; a sixteenth processing sub-module, configured to obtain the argument edit distance between the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event respectively; and a seventeenth processing sub-module, configured to obtain the argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector and the argument edit distance;
a fifth processing module, configured to obtain a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
wherein the fifth processing module includes: an eighteenth processing sub-module, configured to input the current event trigger word and the historical event trigger word of each historical event into the pre-trained word vector model respectively, to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; a nineteenth processing sub-module, configured to obtain the trigger word edit distance between the current event trigger word and each historical event trigger word according to the current event trigger word and the historical event trigger word of each historical event respectively; and a twentieth processing sub-module, configured to obtain the trigger word similarity value of the current event and each historical event according to the current event trigger word vector, the historical event trigger word vector and the trigger word edit distance;
a sixth processing module, configured to obtain a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
wherein the sixth processing module includes: a twenty-first processing sub-module, configured to calculate the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and a twenty-second processing sub-module, configured to reduce the event co-occurrence distance by a preset multiple, to obtain the time window similarity value of the current event and each historical event;
a seventh processing module, configured to obtain a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
wherein the seventh processing module includes: a second judging sub-module, configured to judge whether the current event text category is the same as the historical event text category of each historical event respectively; a twenty-third processing sub-module, configured to set the text category similarity value to a first preset value if the current event text category is the same as the historical event text category; and a twenty-fourth processing sub-module, configured to set the text category similarity value to a second preset value if the current event text category differs from the historical event text category, wherein the second preset value is smaller than the first preset value;
an eighth processing module, configured to perform, for each historical event, weighted fusion of the subject word semantic similarity value, the summary sentence semantic similarity value, the syntactic similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value, to obtain a similarity score value of the current event and each historical event;
a judging module, configured to judge whether the similarity score value of each historical event is greater than a preset threshold respectively;
a ninth processing module, configured to, if the similarity score value of a historical event is less than or equal to the preset threshold, keep that similarity score value unchanged and use it as the final similarity score value of the current event and the historical event;
a tenth processing module, configured to, if the similarity score value of a historical event is greater than the preset threshold, apply named entity weighting and region weighting to that similarity score value according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place, to obtain a weighted similarity score value, and use the weighted similarity score value as the final similarity score value of the current event and the historical event;
wherein the tenth processing module includes: a first judging sub-module, configured to judge whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and whether the historical event named entity of the historical event and the current event named entity share an entity; a first processing sub-module, configured to keep the similarity score value of the historical event unchanged if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity share no entity; a second processing sub-module, configured to apply region weighting to the similarity score value of the historical event, to obtain the weighted similarity score value, if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity share no entity; a third processing sub-module, configured to apply named entity weighting to the similarity score value of the historical event, to obtain the weighted similarity score value, if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entity and the current event named entity share an entity; and a fourth processing sub-module, configured to apply both region weighting and named entity weighting to the similarity score value of the historical event, to obtain the weighted similarity score value, if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entity and the current event named entity share an entity;
and an eleventh processing module, configured to sort the historical events according to the final similarity score value, to obtain a similar event ranking result of the current event.
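The scoring pipeline described by the modules of claim 6 can be sketched in Python as follows: sum word vectors and compare them by cosine similarity, fuse the seven per-dimension similarity values with weights, apply region and named entity weighting only above the threshold, then sort. This is a hedged sketch, not the patented implementation. The multiplicative boost factors (1.1 and 1.2), the plain-list vector representation, the tuple layout of a candidate and all function names are assumptions; the claim does not state how the weighting is applied or what the fusion weights are.

```python
import math
from typing import List, Sequence, Tuple


def sum_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Add word vectors element-wise to form one semantic vector."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse(similarities: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted fusion of the per-dimension similarity values."""
    if len(similarities) != len(weights):
        raise ValueError("one weight is required per similarity value")
    return sum(w * s for w, s in zip(weights, similarities))


def final_score(score: float, threshold: float,
                same_region: bool, shared_entity: bool,
                region_boost: float = 1.1,
                entity_boost: float = 1.2) -> float:
    """Leave low scores untouched; boost high scores by region and/or entity.

    ASSUMPTION: weighting is modelled as multiplication by a boost factor.
    """
    if score <= threshold:
        return score
    if same_region:
        score *= region_boost
    if shared_entity:
        score *= entity_boost
    return score


# Hypothetical candidate layout: (event id, seven similarity values,
# same-region flag, shared-named-entity flag).
Candidate = Tuple[str, Sequence[float], bool, bool]


def rank_events(candidates: Sequence[Candidate], weights: Sequence[float],
                threshold: float) -> List[Tuple[str, float]]:
    """Score every historical event and sort by final score, highest first."""
    scored = [
        (event_id, final_score(fuse(sims, weights), threshold,
                               same_region, shared_entity))
        for event_id, sims, same_region, shared_entity in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Gating the region and named entity weighting on the threshold means that only historical events already judged similar on the seven fused dimensions can be promoted further; a weakly similar event keeps its fused score even when it shares a place or an entity with the current event.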
7. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, to cause the at least one processor to perform the multi-dimensional feature fusion similar event calculation method according to any one of claims 1 to 5.
8. A computer-readable storage medium storing computer instructions for causing a computer to perform the multi-dimensional feature fusion similar event calculation method according to any one of claims 1 to 5.
CN202110906530.2A 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment Active CN113722478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110906530.2A CN113722478B (en) 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN113722478A (en) 2021-11-30
CN113722478B (en) 2023-09-19

Family

ID=78675183

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant