CN113722478A - Multi-dimensional feature fusion similar event calculation method and system and electronic equipment - Google Patents

Publication number
CN113722478A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110906530.2A
Other languages
Chinese (zh)
Other versions
CN113722478B (en)
Inventor
韩勇
李青龙
骆飞
赵冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202110906530.2A
Publication of CN113722478A
Application granted
Publication of CN113722478B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-dimensional feature fusion similar event calculation method, a system, and electronic equipment. The method comprises: obtaining, from the subject words, abstracts, titles, arguments, trigger words, life cycles, and text categories of a current event and of each historical event, a subject word semantic similarity value, an abstract sentence semantic similarity value, a syntactic similarity value, an argument similarity value, a trigger word similarity value, a time window similarity value, and a text category similarity value between the current event and each historical event; performing weighted fusion on these similarity values to obtain a similarity score value between the current event and each historical event; when the similarity score value is greater than a preset threshold value, weighting it according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value; and sorting the historical events by final similarity score value and determining, from the sorting result, the historical events most similar to the current event. Combining multiple dimensions improves the accuracy of similar event search.

Description

Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
Technical Field
The invention relates to the field of text analysis, in particular to a method and a system for calculating a multi-dimensional feature fusion similar event, electronic equipment and a storage medium.
Background
Existing public opinion analysis platforms analyze the popularity and propagation of events as they unfold, but lack comparative analysis against similar historical events.
At present, historical events are usually retrieved by keyword comparison or keyword search, which returns any historical event that contains the keywords. This retrieval is coarse, so historical events that are highly similar to the current event cannot be reliably obtained.
Disclosure of Invention
In view of this, embodiments of the present invention provide a multi-dimensional feature fusion similar event calculation method, a system, an electronic device, and a storage medium, to overcome the prior-art defect that historical event searches return events of low similarity.
To this end, embodiments of the invention provide the following technical solutions:
According to a first aspect, an embodiment of the present invention provides a multi-dimensional feature fusion similar event calculation method, including:
acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity, and a current event occurrence place of a current event;
acquiring a historical event subject term, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity, and a historical event occurrence place of each historical event;
obtaining a subject term semantic similarity value of the current event and each historical event according to the current event subject term and the historical event subject term of each historical event;
obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
performing, for each historical event, weighted fusion on the subject term semantic similarity value, the abstract sentence semantic similarity value, the syntactic similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value, and the text category similarity value to obtain a similarity score value of the current event and each historical event;
judging, for each historical event, whether its similarity score value is greater than a preset threshold value;
if the similarity score value of the historical event is smaller than or equal to the preset threshold value, keeping the similarity score value unchanged and taking it as the final similarity score value of the current event and the historical event;
if the similarity score value of the historical event is greater than the preset threshold value, performing named entity and region weighting on the similarity score value according to the current event named entity, the current event occurrence place, the historical event named entity, and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event;
and sorting the historical events according to the final similarity score value to obtain a similar event sorting result of the current event.
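The first-aspect steps above (weighted fusion, threshold test, conditional re-weighting, sorting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary layout, and the `reweight` callback are hypothetical, and the source does not specify the fusion weights or the threshold.

```python
from typing import Callable, Dict, List, Tuple

# The seven per-dimension similarity values, in the order the method lists them.
DIMENSIONS = ("subject", "abstract", "syntax", "argument",
              "trigger", "time_window", "text_category")


def fuse(similarities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted fusion of the seven similarity values into one score."""
    return sum(weights[d] * similarities[d] for d in DIMENSIONS)


def rank_similar_events(
    per_event_similarities: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    threshold: float,
    reweight: Callable[[str, float], float],
) -> List[Tuple[str, float]]:
    """Return (event_id, final_score) pairs sorted from most to least similar.

    `reweight` stands in for the named entity and region weighting step; it is
    a hypothetical callback, applied only to scores above the threshold.
    """
    ranked = []
    for event_id, sims in per_event_similarities.items():
        score = fuse(sims, weights)
        # Scores at or below the threshold are kept unchanged.
        final = reweight(event_id, score) if score > threshold else score
        ranked.append((event_id, final))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

Passing the re-weighting step as a callback keeps the fusion and sorting logic independent of how the named entity and region weighting is actually computed.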
Optionally, the step of performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity, and the historical event occurrence place to obtain the weighted similarity score value includes:
judging whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entity of the historical event and the current event named entity share an entity;
if the two occurrence places do not belong to the same region and the named entities share no entity, keeping the similarity score value of the historical event unchanged;
if the two occurrence places belong to the same region and the named entities share no entity, performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value;
if the two occurrence places do not belong to the same region and the named entities share an entity, performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value;
and if the two occurrence places belong to the same region and the named entities share an entity, performing both region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
Optionally, the step of performing regional weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes: calculating a region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; and carrying out region weighting on the similarity score value according to the region distance to obtain the weighted similarity score value.
Optionally, the step of weighting the similar score value of the historical event by the named entity to obtain a weighted similar score value includes: comparing a historical event named entity in a historical event with a current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and carrying out named entity weighting on the similarity score values according to the number of the same entities to obtain weighted similarity score values.
Optionally, the step of performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes: calculating a region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; comparing a historical event named entity in a historical event with a current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and carrying out region weighting and named entity weighting on the similarity score values according to the region distance and the number of the same entities to obtain weighted similarity score values.
Optionally, the weighted similarity score value is calculated as follows:
[Formula image not reproduced: weighted similarity score value score_new(k)]
where score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the regional distance between the current event and the kth historical event; η is the named entity weighting constant; and n(k) is the number of named entities shared by the current event and the kth historical event.
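The four weighting cases and the quantities score(k), δ, d(k), η, and n(k) can be illustrated with the sketch below. The functional form is an assumption made purely for illustration, because the formula itself is not available in this text: here the regional boost is taken as δ / (1 + d(k)) and the entity boost as η · n(k), both applied multiplicatively. The function name, parameter names, and default constants are hypothetical.

```python
def weighted_score(
    score: float,
    same_region: bool,
    distance: float,
    shared_entities: int,
    delta: float = 0.1,   # region weighting constant (placeholder value)
    eta: float = 0.05,    # named entity weighting constant (placeholder value)
) -> float:
    """Apply region and/or named entity weighting to a similarity score.

    Covers the four cases: no boost, region only, entity only, or both.
    """
    boost = 0.0
    if same_region:
        # ASSUMED form: a smaller regional distance earns a larger boost.
        boost += delta / (1.0 + distance)
    if shared_entities > 0:
        # ASSUMED form: the boost grows with the number of shared entities.
        boost += eta * shared_entities
    return score * (1.0 + boost)
```

With neither condition met the score is returned unchanged, which matches the first of the four cases.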
Optionally, the step of obtaining the subject term semantic similarity value of the current event and each historical event according to the current event subject term and the historical event subject term of each historical event includes: inputting the current event subject terms and the historical event subject terms of each historical event into a pre-trained word vector model to obtain the current event subject term vectors of the current event and the historical event subject term vectors of each historical event; adding the current event subject term vectors to obtain the current event subject term semantic vector; adding the historical event subject term vectors of each historical event to obtain the historical event subject term semantic vector of that historical event; and calculating the cosine similarity between the current event subject term semantic vector and each historical event subject term semantic vector to obtain the subject term semantic similarity value of the current event and each historical event.
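The sum-then-cosine procedure for subject terms can be sketched as below. It is a minimal example in plain Python: the `embeddings` dictionary is a hypothetical stand-in for the pre-trained word vector model, and the two-dimensional toy vectors in the usage note are invented for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def sum_vectors(vectors: Sequence[Vector]) -> Vector:
    """Add word vectors component-wise to form one semantic vector."""
    return [sum(components) for components in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def subject_similarity(
    current_terms: Sequence[str],
    historical_terms: Sequence[str],
    embeddings: Dict[str, Vector],
) -> float:
    """Subject term semantic similarity between two events."""
    current = sum_vectors([embeddings[t] for t in current_terms])
    historical = sum_vectors([embeddings[t] for t in historical_terms])
    return cosine(current, historical)
```

For example, with toy embeddings `{"flood": [1.0, 0.0], "rain": [0.0, 1.0], "storm": [1.0, 1.0]}`, the terms `["flood", "rain"]` sum to the same direction as `["storm"]`, so their similarity is close to 1, while `["flood"]` and `["rain"]` are orthogonal.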
Optionally, the step of obtaining the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event includes: inputting the current event abstract and the historical event abstract of each historical event into a BERT pre-trained sentence vector model to obtain the current abstract sentence vectors of the current event abstract and the historical abstract sentence vectors of each historical event abstract; adding the current abstract sentence vectors to obtain the sentence semantic vector of the current event abstract; adding the historical abstract sentence vectors of each historical event abstract to obtain the sentence semantic vector of that historical event abstract; and calculating the cosine similarity between the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Optionally, the step of obtaining the syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event includes: obtaining the title edit distance between the current event title and the historical event title of each historical event; and normalizing each title edit distance to obtain the syntactic similarity value of the current event and each historical event.
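A title edit distance followed by normalization might look like the sketch below. The edit distance is the standard Levenshtein distance; the normalization (one minus the distance divided by the longer title length) is an assumption, since the text says only that the distance is normalized. Function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def syntactic_similarity(title_a: str, title_b: str) -> float:
    """ASSUMED normalization: 1 - distance / length of the longer title."""
    longest = max(len(title_a), len(title_b))
    if longest == 0:
        return 1.0  # two empty titles are treated as identical
    return 1.0 - edit_distance(title_a, title_b) / longest
```

Under this normalization identical titles score 1.0 and titles with no characters in common at any aligned position score 0.0.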
Optionally, the step of obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event includes: inputting the current event argument and the historical event argument of each historical event into a pre-trained word vector model to obtain the current event argument word vector and the historical event argument word vector of each historical event; obtaining the argument edit distance between the current event argument and the historical event argument of each historical event; and obtaining the argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector, and the argument edit distance.
Optionally, the step of obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event includes: inputting the current event trigger word and the historical event trigger word of each historical event into a pre-trained word vector model to obtain the current event trigger word vector and the historical event trigger word vector of each historical event; obtaining the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event; and obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word vector, the historical event trigger word vector, and the trigger word edit distance.
Optionally, the step of obtaining the time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event includes: calculating the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and reducing the event co-occurrence distance by a preset multiple to obtain the time window similarity value of the current event and each historical event.
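One possible reading of the time window step is sketched below. Two points are assumptions, not statements of the source: a life cycle is modeled as a (start date, end date) pair, and the event co-occurrence distance is taken to be the number of days on which both life cycles overlap. Reducing by a preset multiple is interpreted as division by that multiple. All names and the default value are hypothetical.

```python
from datetime import date
from typing import Tuple

Lifecycle = Tuple[date, date]  # (first day, last day), both inclusive


def co_occurrence_days(a: Lifecycle, b: Lifecycle) -> int:
    """ASSUMED co-occurrence measure: days on which both events are active."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, (end - start).days + 1)


def time_window_similarity(
    current: Lifecycle,
    historical: Lifecycle,
    preset_multiple: float = 30.0,  # placeholder for the preset multiple
) -> float:
    """Scale the co-occurrence measure down by the preset multiple."""
    return co_occurrence_days(current, historical) / preset_multiple
```

Life cycles that do not overlap yield a similarity of 0.0 under this reading.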
Optionally, the step of obtaining a text category similarity value between the current event and each historical event according to the text category of the current event and the text category of the historical event of each historical event includes: respectively judging whether the current event text type is the same as the historical event text type of each historical event; if the current event text type is the same as the historical event text type, the text type similarity value is a first preset value; and if the current event text type is different from the historical event text type, the text type similarity value is a second preset value, and the second preset value is smaller than the first preset value.
Optionally, the subject term semantic similarity value of the current event and each historical event is calculated as follows:
$$s_{1k} = \cos\left(\vec{V}_{c}, \vec{V}_{hk}\right) = \frac{\vec{V}_{c} \cdot \vec{V}_{hk}}{\lVert \vec{V}_{c} \rVert \, \lVert \vec{V}_{hk} \rVert}$$
$$\vec{V}_{c} = \sum_{p=1}^{P} \vec{v}_{c,p}$$
$$\vec{V}_{hk} = \sum_{q=1}^{Q} \vec{v}_{hk,q}$$
where $s_{1k}$ is the subject term semantic similarity value of the current event and the kth historical event; $\vec{V}_{c}$ is the current event subject term semantic vector; $P$ is the number of current event subject term vectors; $\vec{v}_{c,p}$ is the current event subject term vector corresponding to the pth subject term of the current event; $\vec{V}_{hk}$ is the historical event subject term semantic vector corresponding to the kth historical event; $Q$ is the number of historical event subject term vectors corresponding to the kth historical event; and $\vec{v}_{hk,q}$ is the historical event subject term vector corresponding to the qth subject term of the kth historical event.
Optionally, the abstract sentence semantic similarity value of the current event and each historical event is calculated as follows:
$$s_{2k} = \cos\left(\vec{S}_{c}, \vec{S}_{hk}\right) = \frac{\vec{S}_{c} \cdot \vec{S}_{hk}}{\lVert \vec{S}_{c} \rVert \, \lVert \vec{S}_{hk} \rVert}$$
$$\vec{S}_{c} = \sum_{l=1}^{L} \vec{s}_{c,l}$$
$$\vec{S}_{hk} = \sum_{m=1}^{M} \vec{s}_{hk,m}$$
where $s_{2k}$ is the abstract sentence semantic similarity value of the current event and the kth historical event; $\vec{S}_{c}$ is the sentence semantic vector of the current event abstract; $L$ is the number of current abstract sentence vectors in the current event abstract; $\vec{s}_{c,l}$ is the current abstract sentence vector corresponding to the lth sentence of the current event abstract; $\vec{S}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $M$ is the number of historical abstract sentence vectors corresponding to the kth historical event; and $\vec{s}_{hk,m}$ is the historical abstract sentence vector corresponding to the mth abstract sentence of the kth historical event.
Optionally, the syntactic similarity value of the current event and each historical event is calculated as follows:
[Formula image not reproduced: syntactic similarity value $s_{3k}$]
where $s_{3k}$ is the syntactic similarity value of the current event and the kth historical event; $t_{1}$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_{1}, t_{2k})$ is the title edit distance between the current event title and the historical event title corresponding to the kth historical event.
Optionally, the argument similarity value of the current event and each historical event is calculated as follows:
[Formula image not reproduced: argument similarity value $s_{4k}$]
where $s_{4k}$ is the argument similarity value of the current event and the kth historical event; $\vec{W}_{ca}$ is the current event argument word vector corresponding to the current event; $\vec{W}_{hak}$ is the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument corresponding to the kth historical event.
Optionally, the trigger word similarity value of the current event and each historical event is calculated as follows:
[Formula image not reproduced: trigger word similarity value $s_{5k}$]
where $s_{5k}$ is the trigger word similarity value of the current event and the kth historical event; $\vec{W}_{ct}$ is the current event trigger word vector corresponding to the current event; $\vec{W}_{htk}$ is the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word corresponding to the kth historical event.
Optionally, the time window similarity value of the current event and each historical event is calculated as follows:
[Formula image not reproduced: time window similarity value $s_{6k}$]
where $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_{c}$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_{c}, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle corresponding to the kth historical event; and $T$ is the preset multiple.
Optionally, the text category similarity value of the current event and each historical event is calculated as follows:
[Formula image not reproduced: text category similarity value $s_{7k}$]
where $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $Tablee_{c}$ is the current event text category corresponding to the current event; and $Tablee_{h}(k)$ is the historical event text category corresponding to the kth historical event.
Optionally, after the step of sorting the historical events according to the final similarity score value to obtain the similar event sorting result of the current event, the method further includes: acquiring a similar event matching number; and selecting, from the similar event sorting result, the historical events with the largest final similarity score values, up to the similar event matching number, to obtain that number of similar historical events.
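Selecting the requested number of best matches from the final scores can be sketched with the standard library, as below. The function name and the dictionary of final scores are hypothetical; `heapq.nlargest` returns the selected items already ordered from highest to lowest score.

```python
import heapq
from typing import Dict, List, Tuple


def top_matches(final_scores: Dict[str, float], match_count: int) -> List[Tuple[str, float]]:
    """Return up to `match_count` (event_id, score) pairs, best first."""
    return heapq.nlargest(match_count, final_scores.items(), key=lambda item: item[1])
```

If fewer historical events exist than the matching number, all of them are returned.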
According to a second aspect, an embodiment of the present invention provides a multi-dimensional feature fusion similar event calculation system, including:
the first acquisition module is used for acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger term, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
the second acquisition module is used for acquiring a historical event subject term, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
the first processing module is used for obtaining a subject term semantic similarity value of the current event and each historical event according to the subject term of the current event and the subject term of each historical event;
the second processing module is used for obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
the third processing module is used for obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
the fourth processing module is used for obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
the fifth processing module is used for obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
the sixth processing module is used for obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
the seventh processing module is used for obtaining a text type similarity value of the current event and each historical event according to the text type of the current event and the text type of the historical event of each historical event;
the eighth processing module is used for respectively carrying out weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event;
the judging module is used for respectively judging whether the similarity score value of each historical event is greater than a preset threshold value;
the ninth processing module is used for keeping the similarity score value of the historical event unchanged if the similarity score value of the historical event is smaller than or equal to the preset threshold value, and taking the similarity score value as the final similarity score value of the current event and the historical event;
the tenth processing module is used for, if the similarity score value of the historical event is greater than the preset threshold value, performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event;
and the eleventh processing module is used for sequencing the historical events according to the final similar score value to obtain a similar event sequencing result of the current event.
Optionally, the tenth processing module includes:
the first judgment submodule is used for judging whether the historical event occurrence place of a historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entity of the historical event and the current event named entity share an entity;
the first processing submodule is used for keeping the similarity score value of the historical event unchanged if the two occurrence places do not belong to the same region and the named entities share no entity;
the second processing submodule is used for performing region weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the two occurrence places belong to the same region and the named entities share no entity;
the third processing submodule is used for performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the two occurrence places do not belong to the same region and the named entities share an entity;
and the fourth processing submodule is used for performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the two occurrence places belong to the same region and the named entities share an entity.
Optionally, the second processing sub-module includes: the first processing unit is used for calculating the region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; and the second processing unit is used for carrying out region weighting on the similarity score value according to the region distance to obtain the weighted similarity score value.
Optionally, the third processing sub-module includes: the third processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the fourth processing unit is used for weighting the named entities of the similarity score values according to the number of the same entities to obtain weighted similarity score values.
Optionally, the fourth processing submodule includes: the fifth processing unit is used for calculating the region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; the sixth processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the seventh processing unit is used for carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
Optionally, the weighted similarity score value is calculated as follows:
Figure BDA0003201772910000071
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the region distance between the current event and the kth historical event; η is the named entity weighting constant; n(k) is the number of identical named entities shared by the current event and the kth historical event.
Optionally, the first processing module includes: the fifth processing submodule is used for respectively inputting the current event subject term and the historical event subject term of each historical event into the pre-training term vector model to obtain a current event subject term vector contained in the current event and a historical event subject term vector contained in each historical event; the sixth processing submodule is used for adding the current event subject word vectors contained in the current event to obtain the current event subject word semantic vector corresponding to the current event; the seventh processing submodule is used for respectively adding the historical event subject word vectors contained in each historical event to obtain a historical event subject word semantic vector corresponding to each historical event; and the eighth processing submodule is used for performing cosine similarity calculation on the semantic vector of the current event subject term and each semantic vector of the historical event subject terms respectively to obtain a subject term semantic similarity value of the current event and each historical event.
Optionally, the second processing module includes: a ninth processing submodule, configured to input a current event summary corresponding to the current event and a historical event summary corresponding to each historical event into the bert pre-training sentence vector model, respectively, to obtain a current summary sentence vector corresponding to the current event summary and a historical summary sentence vector corresponding to each historical event summary; the tenth processing submodule is used for adding the current abstract sentence vectors corresponding to the current event abstract to obtain the sentence semantic vectors of the current event abstract; the eleventh processing submodule is used for respectively adding the historical abstract sentence vectors corresponding to each historical event abstract to obtain the sentence semantic vector of each historical event abstract; and the twelfth processing submodule is used for respectively carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain an abstract sentence semantic similarity value of the current event and each historical event.
Optionally, the third processing module includes: the thirteenth processing submodule is used for respectively obtaining the title editing distance of the current event title and each historical event title according to the current event title and each historical event title of each historical event; and the fourteenth processing submodule is used for respectively carrying out normalization processing on the title editing distance of the current event title and each historical event title to obtain a syntactic similarity value of the current event and each historical event.
Optionally, the fourth processing module includes: a fifteenth processing sub-module, configured to input the current event argument and the historical event argument of each historical event into a pre-training word vector model, respectively, to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument; a sixteenth processing sub-module, configured to obtain argument edit distances of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event, respectively; and the seventeenth processing submodule is used for obtaining an argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector and the argument editing distance.
Optionally, the fifth processing module includes: the eighteenth processing submodule is used for respectively inputting the current event trigger word and the historical event trigger word of each historical event into the pre-training word vector model to obtain a current event trigger word vector corresponding to the current event trigger word and a historical event trigger word vector corresponding to each historical event trigger word; the nineteenth processing sub-module is used for respectively obtaining the trigger word edit distance of the current event trigger word and each historical event trigger word according to the current event trigger word and each historical event trigger word of each historical event; and the twentieth processing submodule is used for obtaining the trigger word similarity value of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the editing distance of the trigger word.
Optionally, the sixth processing module includes: the twenty-first processing submodule is used for calculating the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and the twenty-second processing submodule is used for reducing the event co-occurrence distance by a preset multiple to obtain a time window similarity value of the current event and each historical event.
Optionally, the seventh processing module includes: the second judgment submodule is used for respectively judging whether the current event text category is the same as the historical event text category of each historical event; the twenty-third processing submodule is used for setting the text category similarity value to a first preset value if the current event text category is the same as the historical event text category; and the twenty-fourth processing submodule is used for setting the text category similarity value to a second preset value if the current event text category is different from the historical event text category, the second preset value being smaller than the first preset value.
Optionally, the calculation formula of the semantic similarity value of the subject term of the current event and each historical event is as follows:
Figure BDA0003201772910000081
Figure BDA0003201772910000082
Figure BDA0003201772910000083
wherein s1k is the subject term semantic similarity value between the current event and the kth historical event;
Figure BDA0003201772910000084
is the semantic vector of the current event subject terms; p is the number of current event subject term vectors;
Figure BDA0003201772910000085
is the current event subject term vector corresponding to the pth subject term in the current event;
Figure BDA0003201772910000086
is the semantic vector of the historical event subject terms corresponding to the kth historical event; q is the number of historical event subject term vectors corresponding to the kth historical event;
Figure BDA0003201772910000087
is the historical event subject term vector corresponding to the qth subject term in the kth historical event.
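The computation described above (sum the subject term vectors of each event, then take the cosine similarity of the two sums) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, vectors are plain Python lists, and the word vectors themselves are assumed to come from a pre-trained word vector model that is not shown here.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors (the event's semantic vector)."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_word_similarity(current_vectors, historical_vectors):
    """Hypothetical helper: s1k for one historical event.

    current_vectors: word vectors of the current event's subject terms.
    historical_vectors: word vectors of the kth historical event's subject terms.
    """
    return cosine_similarity(sum_vectors(current_vectors),
                             sum_vectors(historical_vectors))
```

The same two helpers would apply unchanged to the abstract sentence vectors, since that similarity is described with the same sum-then-cosine structure.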
Optionally, the calculation formula of the semantic similarity value of the summary sentence of the current event and each historical event is as follows:
Figure BDA0003201772910000088
Figure BDA0003201772910000089
Figure BDA00032017729100000810
wherein s2k is the abstract sentence semantic similarity value of the current event and the kth historical event;
Figure BDA00032017729100000811
is the sentence semantic vector of the current event abstract; l is the number of current abstract sentence vectors in the current event abstract;
Figure BDA00032017729100000812
is the current event abstract sentence vector corresponding to the lth sentence in the current event abstract;
Figure BDA00032017729100000813
is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; m is the number of historical event abstract sentence vectors corresponding to the kth historical event;
Figure BDA00032017729100000814
is the historical event abstract sentence vector corresponding to the mth abstract sentence in the kth historical event.
Optionally, the syntactic similarity value calculation formula of the current event and each historical event is as follows:
Figure BDA0003201772910000091
wherein s3k is the syntactic similarity value of the current event and the kth historical event; t1 is the current event title corresponding to the current event; t2k is the historical event title corresponding to the kth historical event; ed(t1, t2k) is the title edit distance between the current event title and the historical event title corresponding to the kth historical event.
Optionally, the argument similarity value calculation formula of the current event and each historical event is as follows:
Figure BDA0003201772910000092
wherein s4k is the argument similarity value of the current event and the kth historical event;
Figure BDA0003201772910000093
is the current event argument word vector corresponding to the current event;
Figure BDA0003201772910000094
is the historical event argument word vector corresponding to the kth historical event; Wca is the current event argument corresponding to the current event; Whak is the historical event argument corresponding to the kth historical event; ed(Wca, Whak) is the argument edit distance between the current event argument and the historical event argument corresponding to the kth historical event.
Optionally, the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
Figure BDA0003201772910000095
wherein s5k is the trigger word similarity value of the current event and the kth historical event;
Figure BDA0003201772910000096
is the current event trigger word vector corresponding to the current event;
Figure BDA0003201772910000097
is the historical event trigger word vector corresponding to the kth historical event; Wct is the current event trigger word corresponding to the current event; Whtk is the historical event trigger word corresponding to the kth historical event; ed(Wct, Whtk) is the trigger word edit distance between the current event trigger word and the historical event trigger word corresponding to the kth historical event.
Optionally, the time window similarity value calculation formula of the current event and each historical event is as follows:
Figure BDA0003201772910000098
wherein s6k is the time window similarity value of the current event and the kth historical event; tc is the current event life cycle corresponding to the current event; thk is the historical event life cycle corresponding to the kth historical event; td(tc, thk) is the co-occurrence distance between the current event life cycle and the historical event life cycle corresponding to the kth historical event; t is a preset multiple.
Optionally, the text category similarity value calculation formula of the current event and each historical event is as follows:
Figure BDA0003201772910000099
wherein s7k is the text category similarity value of the current event and the kth historical event; tableec is the current event text category corresponding to the current event; tableeh(k) is the historical event text category corresponding to the kth historical event.
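The text category rule described above (a first preset value when the categories match, a smaller second preset value otherwise) can be sketched as below. The function name and the default values 1.0 and 0.0 are assumptions chosen for illustration; the source only requires that the second preset value be smaller than the first.

```python
def text_category_similarity(current_category, historical_category,
                             same_value=1.0, different_value=0.0):
    """Hypothetical helper for s7k.

    same_value / different_value stand in for the first and second preset
    values; the defaults are illustrative assumptions, not source values.
    """
    if different_value >= same_value:
        raise ValueError("second preset value must be smaller than the first")
    if current_category == historical_category:
        return same_value
    return different_value
```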
Optionally, the system further comprises: the third acquisition module, used for acquiring the matching number of similar events; and the twelfth processing module, used for selecting, from the similar event sequencing result and according to the matching number of similar events, the historical events with the largest final similarity score values, to obtain that number of similar historical events.
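Selecting the requested number of most similar historical events can be sketched as follows. The function and parameter names are hypothetical, and the final similarity score values are assumed to be held in a dictionary keyed by a historical event identifier.

```python
import heapq


def select_top_similar(final_scores, match_count):
    """Return the match_count (event_id, score) pairs with the largest scores.

    final_scores: dict mapping a historical event id to its final similarity
    score value (an assumed data layout, not specified by the source).
    """
    return heapq.nlargest(match_count, final_scores.items(),
                          key=lambda item: item[1])
```

`heapq.nlargest` returns the pairs in descending score order, so the output is also a valid prefix of the full ranking.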
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the multi-dimensional feature fusion similar event calculation method as described in any of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute the multi-dimensional feature fusion similar event calculation method described in any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
The embodiment of the invention provides a multi-dimensional feature fusion similar event calculation method, system, electronic equipment and storage medium, wherein the method comprises the following steps: acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event; acquiring a historical event subject term, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event; obtaining a subject term semantic similarity value of the current event and each historical event according to the current event subject term and the historical event subject term of each historical event; obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event; obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event; obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event; obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event; obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event; obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event; respectively carrying out weighted fusion on the subject term semantic similarity value, the abstract sentence semantic similarity value, the syntactic similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event; respectively judging whether the similarity score value of each historical event is greater than a preset threshold value; if the similarity score value of the historical event is smaller than or equal to the preset threshold value, keeping the similarity score value of the historical event unchanged and taking it as the final similarity score value of the current event and the historical event; if the similarity score value of the historical event is greater than the preset threshold value, performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event; and sequencing the historical events according to the final similarity score value to obtain a similar event sequencing result of the current event.
First, the subject term semantic similarity value, abstract sentence semantic similarity value, syntactic similarity value, argument similarity value, trigger word similarity value, time window similarity value and text category similarity value of the current event and each historical event are obtained from the subject terms, abstracts, titles, arguments, trigger words, life cycles and text categories of the current event and the historical events. Second, these similarity values are fused by weighting to obtain the similarity score value of the current event and each historical event. When the similarity score value is not greater than the preset threshold value, it is kept unchanged; when it is greater than the preset threshold value, named entity weighting and region weighting are performed according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value. The historical events are then sequenced according to the final similarity score value to obtain the sequencing result of historical similar events of the current event, and the historical events most similar to the current event are determined from that result. Comparing the current event with the historical events across multiple features enriches the similarity comparison, allows event similarity to be judged from multiple angles, improves the accuracy of the event similarity, and makes the retrieved historical events more similar to the current event.
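The overall flow summarized above (weighted fusion, threshold check, optional named entity and region weighting, then sorting) can be sketched as below. This is only an outline under stated assumptions: the function names, the fusion weights, and the data layout are hypothetical, and the weighting step is passed in as a caller-supplied function `boost` because its formula is not reproduced in this text.

```python
def fuse(similarities, weights):
    """Weighted fusion of the seven per-feature similarity values."""
    return sum(w * s for w, s in zip(weights, similarities))


def rank_historical_events(history, weights, threshold, boost):
    """Return (event_id, final_score) pairs sorted by descending final score.

    history: dict mapping event id -> (similarities, context), where
        similarities holds the seven similarity values and context carries
        whatever the weighting step needs (places, named entities).
    boost: hypothetical callable (score, context) -> weighted score; it
        stands in for the named entity and region weighting.
    """
    final_scores = {}
    for event_id, (similarities, context) in history.items():
        score = fuse(similarities, weights)
        # Only scores strictly above the threshold are weighted further.
        if score > threshold:
            score = boost(score, context)
        final_scores[event_id] = score
    return sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
```

Passing the weighting step as a function keeps the threshold logic separate from whichever region and named entity weighting is actually used.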
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a specific example of a multi-dimensional feature fusion similar event calculation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another specific example of a multi-dimensional feature fusion similar event calculation method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a specific example of a multi-dimensional feature fusion similar events computing system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a multi-dimensional feature fusion similar event calculation method, which comprises the steps of S1-S14 as shown in FIG. 1.
Step S1: acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of the current event.
In this embodiment, subject term extraction is performed on the current event to obtain the current event subject terms. A specific extraction method may be as follows: idf (inverse document frequency) values are trained on a massive data set; the term frequency tf of each word in the current event is counted and multiplied by the trained idf of that word to give the word's weight, namely tf-idf; the words are sorted by weight, and the top N keywords are taken as the subject terms of the text.
Abstract extraction is performed on the current event to obtain the current event abstract; a specific extraction method may be to take the three longest sentences of the text as its abstract information.
Title extraction is performed on the current event to obtain the current event title. A specific extraction method may be to extract a sentence that appears repeatedly in the event text, namely a common sentence, as the title of the event; if several sentences are repeated, the one with the highest occurrence frequency is taken as the title.
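The tf-idf ranking described above can be sketched as follows, assuming the text has already been segmented into words and that an idf table trained elsewhere is available as a dictionary. The function name, the `default_idf` fallback for unseen words, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def extract_topic_words(tokens, idf, top_n=5, default_idf=0.0):
    """Return the top_n words of one event text ranked by tf * idf.

    tokens: segmented words of the event text.
    idf: dict mapping a word to its pre-trained idf value.
    default_idf: assumed weight for words missing from the idf table.
    """
    tf = Counter(tokens)  # raw term frequency within this event
    weights = {word: count * idf.get(word, default_idf)
               for word, count in tf.items()}
    # Sort by descending weight; ties broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]
```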
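A minimal sketch of the longest-three-sentences rule follows. The sentence splitting on common Latin and Chinese terminators and the choice to return the selected sentences in their original order are assumptions; the source only states that the three longest sentences are used.

```python
import re


def extract_abstract(text, sentence_count=3):
    """Return the sentence_count longest sentences, in original text order."""
    # Assumed splitting rule: a sentence ends at . ! ? or their Chinese forms.
    pieces = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    sentences = [piece.strip() for piece in pieces if piece.strip()]
    longest = sorted(range(len(sentences)),
                     key=lambda i: len(sentences[i]),
                     reverse=True)[:sentence_count]
    return [sentences[i] for i in sorted(longest)]
```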
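The repeated-sentence rule for title extraction can be sketched as below, assuming the event text has already been split into sentences. The function name is hypothetical, and returning `None` when no sentence repeats is an assumption, since the source does not say what happens in that case.

```python
from collections import Counter


def extract_title(sentences):
    """Return the most frequently repeated sentence, or None if none repeats."""
    counts = Counter(s.strip() for s in sentences if s.strip())
    repeated = {sentence: n for sentence, n in counts.items() if n > 1}
    if not repeated:
        return None  # assumed fallback; not specified by the source
    # Highest occurrence count wins; on a tie, the sentence seen first.
    return max(repeated, key=repeated.get)
```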
Unsupervised training is carried out through word2vec to extract the contextual semantic features of words in the text set. Unsupervised pre-training is carried out through a BERT model to extract the semantic features of sentences in the text set. The tf-idf of the text is calculated from the trained idf to obtain the keywords of the text, and the three longest sentences are extracted as the abstract information of the text. Based on the extracted information and the unsupervised pre-training, semantic similarity is calculated at both the word level and the sentence level, which addresses similarity at the semantic level.
Argument extraction is performed on the current event to obtain the current event argument. A specific extraction method may be to perform word segmentation and part-of-speech tagging, then count word frequencies over the title and the first 100 characters of the article, and take the noun with the highest frequency. An argument is the subject of an event, specifically the noun part of the event's subject, that is, the object serving as the subject of the event; for example, in a "free-renting event", the word mentioned most frequently in the whole text is "free", which is the subject and can be regarded as the argument of the event.
Trigger word extraction is performed on the current event to obtain the current event trigger word; specifically, the verb part following the subject word is taken as the trigger word. A trigger word is a verb that triggers the occurrence of an event. For example, in "Xiao Ming fell and got his trousers dirty", "fell" is the trigger word.
Event attributes are divided into event trigger words and event arguments. Adding attribute similarity calculation makes it possible to judge whether the main arguments and the actions of the events are consistent, so that event similarity is better judged from the perspective of the event itself.
The current event life cycle is determined according to the start time and the end time of the current event.
The current event text category is the text classification of the current text. Classification follows the channel category obtained during data acquisition: education content is classified as education and medical content as medical treatment, with category names unified. Specific categories include education, medical treatment, sports, and so on; these are illustrative only, and this embodiment is not limited thereto. Named entity recognition is performed on the current event to obtain the current event named entity; specifically, the named entities may be recognized with the Stanford CoreNLP tool.
Place name extraction is performed on the current event to obtain the current event occurrence place; specifically, the place may be identified with the Baidu LDA tool.
Step S2: acquiring a historical event subject word, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event.
In this embodiment, the historical event subject term, historical event abstract, historical event title, historical event argument, historical event trigger word, historical event life cycle, historical event text category, historical event named entity and historical event occurrence place of each historical event are determined in the same way as described in step S1, and the specific process is not repeated here.
Step S3: obtaining the subject term semantic similarity value of the current event and each historical event according to the current event subject term and the historical event subject term of each historical event.
In this embodiment, the current event subject terms are mapped into current event subject term vectors through a pre-trained word vector model; the historical event subject terms corresponding to each historical event are likewise mapped into historical event subject term vectors through the pre-trained word vector model; cosine similarity is then calculated between the current event subject term vector and the historical event subject term vector corresponding to each historical event, to obtain the subject term semantic similarity value of the current event and each historical event.
Specifically, the pre-trained word vector model may be a word2vec model; in other embodiments, it may be another existing word vector model. This embodiment is illustrative only and is not limited thereto.
Step S4: obtaining the abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event.
In this embodiment, the current event abstract is mapped into current event abstract sentence vectors through a pre-trained sentence vector model; the historical event abstract corresponding to each historical event is likewise mapped into historical event abstract sentence vectors through the pre-trained sentence vector model; cosine similarity is then calculated between the current event abstract sentence vector and the historical event abstract sentence vector corresponding to each historical event, to obtain the abstract sentence semantic similarity value of the current event and each historical event.
Specifically, the pre-trained sentence vector model may be a BERT model; in other embodiments, it may be another existing sentence vector model. This embodiment is illustrative only and is not limited thereto.
Step S5: obtaining the syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event.
In this embodiment, the syntactic similarity value between the current event title and each historical event title is obtained by calculating the title edit distance between the current event title and each historical event title of each historical event, and normalizing the title edit distance corresponding to each historical event.
Specifically, the Edit Distance (Edit Distance), also called Levenshtein Distance, refers to the minimum number of editing operations required to change one character string into another character string. Permitted editing operations include replacing one character with another, inserting one character, and deleting one character. Generally, the smaller the edit distance, the greater the similarity of two character strings.
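The edit distance defined above can be computed with the standard dynamic-programming recurrence, sketched below. The normalization in `normalized_title_similarity` (one minus the distance divided by the longer length) is an assumption used only to illustrate mapping a distance to a similarity in [0, 1]; the normalization formula of this document is not reproduced in the text, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_title_similarity(title_1, title_2):
    """Assumed normalization: 1 - ed / max(len); identical titles give 1.0."""
    longest = max(len(title_1), len(title_2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(title_1, title_2) / longest
```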
Step S6: obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event.
In this embodiment, the current event argument is mapped into a current event argument word vector through a pre-trained word vector model, and the historical event argument corresponding to each historical event is mapped into a historical event argument word vector through the same model. The argument edit distance between the current event argument and the historical event argument of each historical event is obtained through edit distance calculation.
Similarity is then calculated from the current event argument word vector, the historical event argument word vector corresponding to each historical event, and the argument edit distance, to obtain the argument similarity value of the current event and each historical event.
Step S7: obtaining a trigger word similarity value between the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event.
In this embodiment, the current event trigger word is mapped to a current event trigger word vector by a pre-trained word vector model, and the historical event trigger word of each historical event is mapped to a corresponding historical event trigger word vector by the pre-trained word vector model. The trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event is obtained by distance calculation.
Similarity is then calculated from the current event trigger word vector, the historical event trigger word vector of each historical event, and the trigger word edit distance to obtain the trigger word similarity value between the current event and each historical event.
Step S8: obtaining a time window similarity value between the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event.
In this embodiment, the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event is calculated and then scaled to obtain the time window similarity value between the current event and each historical event.
Step S9: obtaining a text category similarity value between the current event and each historical event according to the current event text category and the historical event text category of each historical event.
In this embodiment, the current event text category is compared with the historical event text category of each historical event. If the two categories are consistent, the text category similarity value is 1 and it participates in the subsequent weighted fusion; if they are inconsistent, the text category similarity value is 0 and it does not participate in the subsequent weighted fusion.
Step S10: performing weighted fusion, for each historical event, on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntactic similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value to obtain a similarity score value between the current event and each historical event.
In this embodiment, the linear parameters for weighted fusion are preset through statistical analysis. The subject word semantic similarity value, abstract sentence semantic similarity value, syntactic similarity value, argument similarity value, trigger word similarity value, time window similarity value and text category similarity value obtained through the similarity calculations are linearly fused to obtain a fusion weighted value between the current event and each historical event. Global exponential normalization is then applied to obtain the similarity score value between the current event and each historical event, that is, the event similarity score, which lays the groundwork for the subsequent score-based weighted ranking.
The similarity score value between the current event and a historical event is calculated as follows.
sim(k)=λ1*s1k+λ2*s2k+λ3*s3k+λ4*s4k+λ5*s5k+λ6*s6k+λ7*s7k
$$\mathrm{score}(k)=\frac{e^{\mathrm{sim}(k)}}{\sum_{j=1}^{Q} e^{\mathrm{sim}(j)}}$$
where score(k) is the similarity score value between the current event and the kth historical event; Q is the total number of historical events; sim(k) is the fusion weighted value between the current event and the kth historical event; λ1 is the linear parameter of the subject word semantic similarity value; s1k is the subject word semantic similarity value between the current event and the kth historical event; λ2 is the linear parameter of the abstract sentence semantic similarity value; s2k is the abstract sentence semantic similarity value between the current event and the kth historical event; λ3 is the linear parameter of the syntactic similarity value; s3k is the syntactic similarity value between the current event and the kth historical event; λ4 is the linear parameter of the argument similarity value; s4k is the argument similarity value between the current event and the kth historical event; λ5 is the linear parameter of the trigger word similarity value; s5k is the trigger word similarity value between the current event and the kth historical event; λ6 is the linear parameter of the time window similarity value; s6k is the time window similarity value between the current event and the kth historical event; λ7 is the linear parameter of the text category similarity value; and s7k is the text category similarity value between the current event and the kth historical event.
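The fusion of step S10 can be sketched in Python as follows. The sketch reads "global exponential normalization" as a softmax over the fusion weighted values of all Q historical events, which is an assumption. The weights in `LAMBDAS` are made-up placeholders, since the text only says the linear parameters are preset by statistical analysis, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

# Hypothetical linear parameters λ1..λ7; the text presets them by
# statistical analysis and does not disclose concrete values.
LAMBDAS = (0.2, 0.2, 0.1, 0.15, 0.15, 0.1, 0.1)


def fuse(similarities: Sequence[float],
         weights: Sequence[float] = LAMBDAS) -> float:
    """sim(k) = λ1*s1k + λ2*s2k + ... + λ7*s7k for one historical event."""
    if len(similarities) != len(weights):
        raise ValueError("one similarity value is required per weight")
    return sum(w * s for w, s in zip(weights, similarities))


def normalize_scores(sims: Sequence[float]) -> List[float]:
    """Exponential normalization over all Q historical events.

    Assumed to be a softmax: score(k) = exp(sim(k)) / sum_j exp(sim(j)).
    """
    if not sims:
        return []
    peak = max(sims)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `normalize_scores([fuse(s) for s in per_event_similarities])` would yield one score per historical event, where `per_event_similarities` is a hypothetical list of seven-value tuples (s1k to s7k).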
Step S11: judging, for each historical event, whether its similarity score value is greater than a preset threshold. If the similarity score value is not greater than the preset threshold, step S12 is executed; if the similarity score value is greater than the preset threshold, step S13 is executed.
In this embodiment, the preset threshold is 0.7. This value is merely illustrative and is not limiting; in other embodiments, the threshold may be set reasonably according to actual needs.
The weighting and ranking operation is performed only when the score is greater than the preset threshold. A score not greater than the preset threshold indicates that the similarity between the current event and the historical event is low and that the historical event is unlikely to be a similar event of the current event, so no further similarity judgment is needed. Only events whose semantic similarity is very high are regarded as similar events, and among these similar events, those that are closer in distance are then sought, that is, the regional relevance of the similar events is examined. When the score is below the threshold, the events are not similar, and there is no need to examine regional relevance.
Step S12: if the similarity score value of the historical event is less than or equal to the preset threshold, keeping the similarity score value of the historical event unchanged and taking it as the final similarity score value between the current event and the historical event.
In this embodiment, a similarity score value less than or equal to the preset threshold indicates that the similarity between the current event and the historical event is low. No operation is performed, the fused similarity score of the historical event is kept unchanged, and the similarity score value is taken as the final similarity score value between the current event and the historical event.
Step S13: if the similarity score value of the historical event is greater than the preset threshold, performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value between the current event and the historical event.
In this embodiment, a similarity score value greater than the preset threshold indicates that the similarity between the current event and the historical event is high. To find historical events that are even more similar, named entity weighting and region weighting are performed on the similarity score value of the historical event according to the named entities and occurrence places of the current event and the historical event, so that the events are further screened by named entity similarity and occurrence place similarity. The weighted similarity score value obtained in this way is taken as the final similarity score value between the current event and the historical event.
Step S14: ranking the historical events according to their final similarity score values to obtain a similar event ranking result for the current event.
In this embodiment, the final similarity score value represents the similarity between the current event and the historical event. The larger the final similarity score value, the higher the similarity between the current event and the historical event; conversely, the smaller the final similarity score value, the lower the similarity between the current event and the historical event.
In this embodiment, the historical events are sorted in descending order of final similarity score value, that is, from the largest value to the smallest, to obtain the similar event ranking result for the current event. Of course, in other embodiments, the events may be sorted in ascending order, as reasonably set according to needs.
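Steps S11 to S14 can be sketched in Python as follows. The threshold 0.7 is the value given in this embodiment. The `weigh` callback is a hypothetical stand-in for the named entity and region weighting of step S13, and the function names and event identifiers are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

THRESHOLD = 0.7  # preset threshold used in this embodiment


def final_score(score: float,
                weigh: Callable[[float], float],
                threshold: float = THRESHOLD) -> float:
    """Steps S11 to S13: weight only scores above the threshold.

    Scores less than or equal to the threshold are kept unchanged (S12);
    scores above it are passed to the weighting callback (S13).
    """
    return weigh(score) if score > threshold else score


def rank_events(final_scores: Dict[str, float],
                descending: bool = True) -> List[Tuple[str, float]]:
    """Step S14: order historical events by final similarity score."""
    return sorted(final_scores.items(),
                  key=lambda item: item[1],
                  reverse=descending)
```

Descending order is the default, as in this embodiment; passing `descending=False` gives the ascending arrangement mentioned as an alternative.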
In the above steps, the subject word semantic similarity value, abstract sentence semantic similarity value, syntactic similarity value, argument similarity value, trigger word similarity value, time window similarity value and text category similarity value between the current event and each historical event are first obtained according to the subject words, abstracts, titles, arguments, trigger words, life cycles and text categories of the current event and the historical events. These similarity values are then fused by weighting to obtain the similarity score value between the current event and each historical event. When the similarity score value is not greater than the preset threshold, it is kept unchanged; when it is greater than the preset threshold, named entity weighting and region weighting are performed according to the occurrence places and named entities of the current event and the historical event to obtain a weighted similarity score value. The historical events are ranked by final similarity score value to obtain the ranking result of historical similar events for the current event, from which the historical events most similar to the current event are determined. Because similarity values are obtained for multiple features of the current event and the historical events, event similarity is judged from multiple angles, which improves the accuracy of the event similarity and yields historical events of higher similarity.
As an exemplary embodiment, in step S13, performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value includes steps S131 to S135.
Step S131: judging whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entities of the historical event and the current event named entities have the same entity.
In this embodiment, the same region is determined by area division: event occurrence places that fall within the same area are considered to belong to the same region.
Named entities mentioned by the events, such as regions, names and organizations, are extracted. Events that share a named entity are considered to involve overlapping subjects and may therefore be similar events. For the extracted regions, events that occur in the same region and are semantically similar can be regarded as key similar events, because events that unfold in the same way are likely to occur there.
Specifically, the same region may be divided by provincial administrative region. For example, if the current event occurrence place is X province, F city and the historical event occurrence place is X province, J city, both occurrence places are within X province and therefore belong to the same region. As another example, if the current event occurrence place is X province, F city and the historical event occurrence place is Y province, I city, O district, the occurrence places are in different provinces, one in X province and one in Y province, and therefore do not belong to the same region. Of course, in other embodiments, the same region may be divided by city-level administrative region: for example, events occurring in X province, F city belong to the same region, whereas an event occurring in X province, Y city and an event occurring in Y province, I city belong to different regions. This embodiment is merely illustrative and is not limiting.
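The same-region judgment can be sketched in Python as follows, assuming each occurrence place has already been resolved into province, city and district fields. The `Place` type and the `same_region` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Place(NamedTuple):
    """Occurrence place resolved into administrative levels."""
    province: str
    city: str = ""
    district: str = ""


def same_region(a: Place, b: Place, level: str = "province") -> bool:
    """Return True when both places fall in the same administrative area.

    level="province" compares provinces only; level="city" also requires
    the same (non-empty) city within the same province.
    """
    if level == "province":
        return a.province == b.province
    if level == "city":
        return a.city != "" and (a.province, a.city) == (b.province, b.city)
    raise ValueError(f"unsupported level: {level}")
```

Using the examples above, `same_region(Place("X", "F"), Place("X", "J"))` is true at the provincial level and false at the city level, while `same_region(Place("X", "F"), Place("Y", "I", "O"))` is false at both levels.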
The historical event named entities of the historical event are matched against the current event named entities to determine whether the same named entity exists; named entities that appear in both events are the crossed named entities.
For example, suppose the historical event named entities of a certain historical event are A, B and C, and the current event named entities of the current event are A, B, D and E. Comparison yields two shared named entities, A and B, so the historical event named entities and the current event named entities have the same entity. As another example, suppose the historical event named entities of a certain historical event are A, B and C, and the current event named entities of the current event are D and E. Comparison shows that the two events share no named entity, so the historical event named entities and the current event named entities do not have the same entity.
Step S132: if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entities of the historical event and the current event named entities do not have the same entity, keeping the similarity score value of the historical event unchanged.
In this embodiment, when the occurrence places of the current event and the historical event are not in the same region and the two events share no named entity, neither region weighting nor named entity weighting is performed on the similarity score value, and the similarity score value between the current event and the historical event remains unchanged.
Step S133: if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entities of the historical event and the current event named entities do not have the same entity, performing region weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the occurrence places of the current event and the historical event are in the same region but the two events share no named entity, region weighting is performed on the similarity score value between the current event and the historical event and named entity weighting is not performed; the weighted similarity score value is obtained after the region weighting.
Step S134: if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entities of the historical event and the current event named entities have the same entity, performing named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the occurrence places of the current event and the historical event are not in the same region but the two events share a named entity, named entity weighting is performed on the similarity score value between the current event and the historical event and region weighting is not performed; the weighted similarity score value is obtained after the named entity weighting.
Step S135: if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entities of the historical event and the current event named entities have the same entity, performing region weighting and named entity weighting on the similarity score value of the historical event to obtain the weighted similarity score value.
In this embodiment, when the occurrence places of the current event and the historical event are in the same region and the two events share a named entity, both region weighting and named entity weighting are performed on the similarity score value between the current event and the historical event; the weighted similarity score value is obtained after the combined region and named entity weighting.
In the above steps, the occurrence places of the current event and the historical event are compared by region, the named entities of the current event and the historical event are compared by entity, and the similarity score value is further weighted according to the two comparison results. Historical events that occur in the same region and historical events with crossed subjects are weighted upward in the ranking: their similarity score values are increased, which raises their event similarity.
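The four cases of steps S132 to S135 can be sketched in Python as follows. The region and named entity weighting factors are passed in as already computed numbers. The sketch assumes that, when both conditions hold, the combined factor is the sum of the two individual factors; this is an illustrative reading, not a formula confirmed by the text. The function name `weight_score` is hypothetical.

```python
def weight_score(score: float,
                 in_same_region: bool,
                 shared_entities: int,
                 region_factor: float,
                 entity_factor: float) -> float:
    """Apply region and/or named entity weighting to a similarity score.

    S132: neither condition holds   -> score unchanged
    S133: same region only          -> score * (1 + region_factor)
    S134: shared entities only      -> score * (1 + entity_factor)
    S135: both conditions hold      -> score * (1 + region_factor + entity_factor)
          (additive combination is an assumption of this sketch)
    """
    factor = 0.0
    if in_same_region:
        factor += region_factor
    if shared_entities > 0:
        factor += entity_factor
    return score * (1.0 + factor)
```

Because both factors are non-negative, the weighting can only raise a score, which is consistent with events in the same region or with crossed subjects moving up in the ranking.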
As an exemplary embodiment, in step S133, performing region weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes steps S1331 to S1332.
Step S1331: calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place corresponding to the historical event and the current event occurrence place.
In this embodiment, the distance between the historical event occurrence place corresponding to the historical event and the current event occurrence place is calculated to obtain the region distance between them. Specifically, the straight-line distance between the two event occurrence places on a map may be calculated and used as the region distance; of course, in other embodiments, the actual distance between the two event occurrence places may be calculated from their longitude and latitude coordinates and used as the region distance. This embodiment is merely illustrative and is not limiting; in other embodiments, the region distance may also be obtained by other distance calculation methods in the prior art.
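One way to compute a distance from longitude and latitude coordinates is the haversine great-circle formula, sketched below in Python. The text does not name a specific method, so haversine is a swapped-in choice for illustration; the function name `haversine_km` and the use of a mean Earth radius of 6371 km are assumptions of the sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float,
                 lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given
    in decimal degrees (latitude, longitude)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

The result could serve as the region distance d(k) between the current event and the kth historical event once both occurrence places have been geocoded.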
Step S1332: performing region weighting on the similarity score value according to the region distance to obtain the weighted similarity score value.
In this embodiment, region weighting may be performed by determining a region weighting factor according to the region distance and weighting the similarity score value by that factor. The smaller the region distance, the larger the resulting weighted similarity score value.
Specifically, the region weighting factor may be
[Formula image BDA0003201772910000161: region weighting factor expressed in terms of δ and d(k); not reproduced in the text]
where δ is the region weighting constant, with a value range of 0 to 1, and d(k) is the region distance between the current event and the kth historical event.
The similarity score value after region weighting is as follows.
weighted similarity score value = similarity score value before weighting × (1 + region weighting factor)
In the above steps, historical events whose occurrence places are in the same region as that of the current event are weighted and ranked according to the calculated region distance: the closer the occurrence places, the greater the weighted similarity score value and the higher the similarity to the current event.
As an exemplary embodiment, in step S134, performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes steps S1341 to S1342.
Step S1341: comparing the historical event named entities of the historical event with the current event named entities to obtain the number of same entities contained in both.
In this embodiment, the historical event named entities of the historical event are compared with the current event named entities, the named entities contained in both events are found, and they are counted to obtain the number of same entities contained in the historical event named entities and the current event named entities.
Step S1342: performing named entity weighting on the similarity score value according to the number of same entities to obtain the weighted similarity score value.
In this embodiment, named entity weighting may be performed by determining a named entity weighting factor according to the number of same entities and weighting the similarity score value by that factor. The larger the number of same entities, the larger the resulting weighted similarity score value.
Specifically, the named entity weighting factor may be η × n(k), where η is the named entity weighting constant, with a value range of 0 to 1, and n(k) is the number of same named entities of the current event and the kth historical event.
The similarity score value after named entity weighting is as follows.
weighted similarity score value = similarity score value before weighting × (1 + named entity weighting factor)
In the above steps, historical events that share entities with the current event are weighted and ranked according to the calculated number of same named entities: the larger the number of same named entities, the greater the weighted similarity score value and the higher the similarity to the current event.
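Steps S1341 and S1342 can be sketched in Python as follows, using set intersection to count the same named entities and the factor η × n(k) given above. The function names are hypothetical, and treating the range of η as the closed interval from 0 to 1 is an assumption.

```python
from typing import Iterable


def shared_entity_count(current_entities: Iterable[str],
                        historical_entities: Iterable[str]) -> int:
    """n(k): number of distinct named entities found in both events."""
    return len(set(current_entities) & set(historical_entities))


def entity_weighted_score(score: float, n_shared: int, eta: float) -> float:
    """score * (1 + η * n(k)), with η the named entity weighting constant."""
    if not 0.0 <= eta <= 1.0:  # closed interval assumed
        raise ValueError("eta is expected to lie between 0 and 1")
    return score * (1.0 + eta * n_shared)
```

For the earlier example with current entities A, B, D, E and historical entities A, B, C, `shared_entity_count` returns 2, so with a hypothetical η of 0.1 a score of 0.8 would be scaled by 1.2.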
As an exemplary embodiment, in step S135, performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes steps S1351 to S1353.
Step S1351: calculating the region distance between the historical event occurrence place and the current event occurrence place according to the historical event occurrence place corresponding to the historical event and the current event occurrence place.
In this embodiment, the distance between the historical event occurrence place corresponding to the historical event and the current event occurrence place is calculated to obtain the region distance between them. Specifically, the straight-line distance between the two event occurrence places on a map may be calculated and used as the region distance; of course, in other embodiments, the actual distance between the two event occurrence places may be calculated from their longitude and latitude coordinates and used as the region distance. This embodiment is merely illustrative and is not limiting; in other embodiments, the region distance may also be obtained by other distance calculation methods in the prior art.
Step S1352: comparing the historical event named entities of the historical event with the current event named entities to obtain the number of same entities contained in both.
In this embodiment, the historical event named entities of the historical event are compared with the current event named entities, the named entities contained in both events are found, and they are counted to obtain the number of same entities contained in the historical event named entities and the current event named entities.
Step S1353: performing region weighting and named entity weighting on the similarity score value according to the region distance and the number of same entities to obtain the weighted similarity score value.
In this embodiment, region weighting and named entity weighting may be performed by determining a region named entity weighting factor according to the region distance and the number of same named entities, and applying that factor to the similarity score value as a dual weighting of region and named entity. The smaller the region distance and the larger the number of same entities, the larger the resulting weighted similarity score value.
Specifically, the region named entity weighting factor may be 1+ δ + η n (k), where η is the named entity weighting constant, with a value range of 0 to 1, and n(k) is the number of same named entities of the current event and the kth historical event.
The similarity score value after the combined region and named entity weighting is as follows.
weighted similarity score value = similarity score value before weighting × (1 + region named entity weighting factor)
In the above steps, historical events that occur in the same region as the current event and share entities with it are weighted and ranked according to the calculated region distance and the number of same named entities: the smaller the region distance and the larger the number of same named entities, the greater the weighted similarity score value and the higher the similarity to the current event.
As an exemplary embodiment, the weighted similarity score value is calculated as follows:
[Formula image BDA0003201772910000171: score_new(k) expressed in terms of score(k), δ, d(k), η and n(k); not reproduced in the text]
where score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the region distance between the current event and the kth historical event; η is the named entity weighting constant; and n(k) is the number of same named entities of the current event and the kth historical event.
As an exemplary embodiment, in step S3, obtaining the subject word semantic similarity value between the current event and each historical event according to the current event subject words and the historical event subject words of each historical event includes steps S301 to S304.
Step S301: inputting the current event subject words and the historical event subject words of each historical event into a pre-trained word vector model, respectively, to obtain the current event subject word vectors contained in the current event and the historical event subject word vectors contained in each historical event.
In this embodiment, the current event subject words are input into a word2vector pre-trained word vector model to obtain the subject word vectors corresponding to the current event, and the historical event subject words of each historical event are input into the word2vector pre-trained word vector model to obtain the subject word vectors corresponding to each historical event.
Step S302: adding the current event subject word vectors contained in the current event to obtain the current event subject word semantic vector corresponding to the current event.
In this embodiment, the current event subject word vectors contained in the current event are added, that is, the elements of corresponding dimensions are summed, to obtain the current event subject word semantic vector.
The current event subject word semantic vector corresponding to the current event is calculated as follows.
$$\vec{V}_{c}=\sum_{p=1}^{P}\vec{v}_{p}$$
where $\vec{V}_{c}$ is the current event subject word semantic vector; P is the number of current event subject word vectors; and $\vec{v}_{p}$ is the current event subject word vector corresponding to the pth subject word in the current event.
Step S303: adding the historical event subject word vectors contained in each historical event, respectively, to obtain the historical event subject word semantic vector corresponding to each historical event.
In this embodiment, the historical event subject word vectors contained in each historical event are added to obtain the historical event subject word semantic vector of that historical event.
The historical event subject word semantic vector corresponding to a historical event is calculated as follows.
Figure BDA0003201772910000184
Wherein the content of the first and second substances,
Figure BDA0003201772910000185
semantic vectors of historical event subject terms corresponding to the kth historical event; q is the number of the historical event topic word vectors corresponding to the kth historical event;
Figure BDA0003201772910000186
and the historical event topic word vector corresponding to the qth topic word in the kth historical event is obtained.
Step S304: respectively performing cosine similarity calculation on the current event subject word semantic vector and the historical event subject word semantic vector of each historical event to obtain the subject word semantic similarity value of the current event and each historical event.
In this embodiment, the subject word semantic similarity between the current event and a historical event is calculated as follows:
$$s_{1k} = \frac{\vec{V}_{c} \cdot \vec{V}_{hk}}{\lVert \vec{V}_{c} \rVert \, \lVert \vec{V}_{hk} \rVert}$$
where $s_{1k}$ is the subject word semantic similarity value between the current event and the kth historical event, $\vec{V}_{c}$ is the subject word semantic vector of the current event, and $\vec{V}_{hk}$ is the subject word semantic vector of the kth historical event.
In the above step, the subject word semantic vector of the current event and the subject word semantic vector of each historical event are obtained by adding all the subject word vectors of the event article, that is, adding the corresponding dimension elements, and then the cosine calculation is performed on the subject word semantic vector of the current event and the subject word semantic vector of each historical event respectively to obtain the subject word semantic similarity value of the current event and each historical event.
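Steps S301-S304 can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical stand-ins for the output of the word2vector model, and the function names `sum_vectors` and `cosine_similarity` are illustrative only; returning 0.0 for an all-zero vector is an assumption, since the source does not discuss that case.

```python
import math

def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors (the event's semantic vector)."""
    return [sum(components) for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical subject word vectors (real ones would come from the trained model).
current_topic_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
historical_topic_vectors = {
    "event_1": [[1.0, 1.0, 3.0]],
    "event_2": [[0.0, 3.0, 0.0], [1.0, 0.0, 0.0]],
}

# S302: semantic vector of the current event.
current_semantic = sum_vectors(current_topic_vectors)

# S303 + S304: semantic vector of each historical event, then cosine similarity.
s1 = {
    name: cosine_similarity(current_semantic, sum_vectors(vectors))
    for name, vectors in historical_topic_vectors.items()
}
```

The same two helpers would apply unchanged to the abstract sentence vectors of steps S401-S404, since that procedure also sums vectors and takes a cosine.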
As an exemplary embodiment, the step S4 of obtaining the semantic similarity value of the summary sentence of the current event and each historical event according to the summary of the current event and the summary of the historical event of each historical event includes steps S401-S404.
Step S401: respectively inputting the current event abstract corresponding to the current event and the historical event abstract corresponding to each historical event into a pre-trained sentence vector model to obtain the current abstract sentence vectors corresponding to the current event abstract and the historical abstract sentence vectors corresponding to each historical event abstract.
In this embodiment, the current event abstract is mapped to current event abstract sentence vectors by a BERT pre-trained sentence vector model. Likewise, the historical event abstract of each historical event is mapped to the historical event abstract sentence vectors of that historical event by the BERT pre-trained sentence vector model.
Step S402: adding the current abstract sentence vectors corresponding to the current event abstract to obtain the sentence semantic vector of the current event abstract.
In this embodiment, the sentence semantic vector of the current event abstract is calculated as follows:
$$\vec{S}_{c} = \sum_{i=1}^{l} \vec{u}_{c,i}$$
where $\vec{S}_{c}$ is the sentence semantic vector of the current event abstract; $l$ is the number of current abstract sentence vectors in the current event abstract; and $\vec{u}_{c,i}$ is the current event abstract sentence vector corresponding to the $i$-th sentence in the current event abstract.
Step S403: respectively adding the historical abstract sentence vectors corresponding to each historical event abstract to obtain the sentence semantic vector of each historical event abstract.
In this embodiment, the sentence semantic vector of the historical event abstract corresponding to a historical event is calculated as follows:
$$\vec{S}_{hk} = \sum_{j=1}^{m} \vec{u}_{hk,j}$$
where $\vec{S}_{hk}$ is the sentence semantic vector of the historical event abstract corresponding to the kth historical event; $m$ is the number of historical event abstract sentence vectors corresponding to the kth historical event; and $\vec{u}_{hk,j}$ is the historical event abstract sentence vector corresponding to the $j$-th abstract sentence in the kth historical event.
Step S404: respectively performing cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain the abstract sentence semantic similarity value of the current event and each historical event.
In this embodiment, the abstract sentence semantic similarity between the current event and a historical event is calculated as follows:
$$s_{2k} = \frac{\vec{S}_{c} \cdot \vec{S}_{hk}}{\lVert \vec{S}_{c} \rVert \, \lVert \vec{S}_{hk} \rVert}$$
where $s_{2k}$ is the abstract sentence semantic similarity value between the current event and the kth historical event, $\vec{S}_{c}$ is the sentence semantic vector of the current event abstract, and $\vec{S}_{hk}$ is the sentence semantic vector of the abstract of the kth historical event.
The sentence semantic vector of the current event and the sentence semantic vector of each historical event are obtained by adding the sentence vectors of the event abstract sentences, namely adding the elements of corresponding dimensionality, and then the cosine calculation is respectively carried out on the sentence semantic vector of the current event and the sentence semantic vector of each historical event to obtain the abstract sentence semantic similarity value of the current event and each historical event.
As an exemplary embodiment, the step S5 of obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event includes steps S501-S502.
Step S501: respectively obtaining the title edit distance between the current event title and the historical event title of each historical event.
In this embodiment, the title edit distance between the current event title and a historical event title is obtained by comparing the character string difference between the two titles.
Step S502: respectively normalizing the title edit distance between the current event title and each historical event title to obtain the syntactic similarity value of the current event and each historical event.
In this embodiment, a syntactic similarity value calculation formula of the current event and the historical event is as follows.
Figure BDA0003201772910000198
where s3k is the syntactic similarity value of the current event and the kth historical event; $t_{1}$ is the current event title corresponding to the current event; $t_{2k}$ is the historical event title corresponding to the kth historical event; and $ed(t_{1}, t_{2k})$ is the title edit distance between the current event title and the historical event title of the kth historical event.
In the above steps, the edit distance between the title of the current event and the title of each historical event is calculated and normalized to obtain the syntactic similarity value.
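Steps S501-S502 can be sketched as follows. The edit distance is the standard Levenshtein distance; the normalization `1 - ed / max(len)` is an assumption chosen for illustration, because the exact normalization used for s3k is not recoverable from the text. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def title_similarity(current_title, historical_title):
    """Assumed normalization: 1 - ed / max(len); two empty titles count as identical."""
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(current_title, historical_title) / longest
```

Under this assumed normalization, identical titles score 1 and titles with no characters in common at any position score 0, so the value is directly comparable with the cosine-based similarity values.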
As an exemplary embodiment, step S6 of obtaining the argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event includes steps S601-S603.
Step S601: respectively inputting the current event argument and the historical event argument of each historical event into a pre-trained word vector model to obtain the current event argument word vector corresponding to the current event argument and the historical event argument word vector corresponding to each historical event argument.
In this embodiment, the current event argument is mapped to the current event argument word vector by the word2vector pre-trained word vector model. Likewise, the historical event argument of each historical event is mapped to the historical event argument word vector of that historical event by the word2vector pre-trained word vector model.
Step S602: respectively obtaining the argument edit distance between the current event argument and the historical event argument of each historical event.
In this embodiment, the argument edit distance between the current event argument and the historical event argument of each historical event is obtained by edit distance calculation.
Step S603: obtaining the argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector and the argument edit distance.
In this embodiment, a calculation formula of argument similarity values of the current event and the historical event is as follows.
Figure BDA0003201772910000201
where s4k is the argument similarity value of the current event and the kth historical event;
Figure BDA0003201772910000202
is the current event argument word vector corresponding to the current event;
Figure BDA0003201772910000203
is the historical event argument word vector corresponding to the kth historical event; $W_{ca}$ is the current event argument corresponding to the current event; $W_{hak}$ is the historical event argument corresponding to the kth historical event; and $ed(W_{ca}, W_{hak})$ is the argument edit distance between the current event argument and the historical event argument of the kth historical event.
In the above steps, the argument similarity is calculated from the semantic similarity value and the syntactic distance value of the current event argument and the historical event argument.
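The way the semantic part and the syntactic part are combined into s4k is not recoverable from the text. Purely as an illustration, the sketch below blends a cosine similarity of the word vectors with a normalized edit-distance similarity of the argument strings; the blend, the mixing weight `alpha`, and all function names are assumptions and not the formula of this embodiment.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new_row = [i]
        for j, cb in enumerate(b, 1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]

def argument_similarity(cur_word, cur_vec, hist_word, hist_vec, alpha=0.5):
    """Hypothetical blend: alpha * semantic part + (1 - alpha) * syntactic part."""
    semantic = cosine(cur_vec, hist_vec)
    longest = max(len(cur_word), len(hist_word))
    syntactic = 1.0 - edit_distance(cur_word, hist_word) / longest if longest else 1.0
    return alpha * semantic + (1.0 - alpha) * syntactic
```

Because the trigger word similarity of steps S701-S703 takes the same kinds of input (two words, their word vectors, and their edit distance), the same sketch could be reused there under the same assumptions.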
As an exemplary embodiment, step S7 of obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event includes steps S701-S703.
Step S701: respectively inputting the current event trigger word and the historical event trigger word of each historical event into a pre-trained word vector model to obtain the current event trigger word vector corresponding to the current event trigger word and the historical event trigger word vector corresponding to each historical event trigger word.
In this embodiment, the current event trigger word is input into the word2vector pre-trained word vector model to obtain the trigger word vector corresponding to the current event; likewise, the historical event trigger word of each historical event is input into the word2vector pre-trained word vector model to obtain the trigger word vector corresponding to each historical event.
Step S702: respectively obtaining the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event.
In this embodiment, the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event is obtained by edit distance calculation.
Step S703: obtaining the trigger word similarity value of the current event and each historical event according to the current event trigger word vector, the historical event trigger word vector and the trigger word edit distance.
In this embodiment, a calculation formula of the trigger word similarity value of the current event and the historical event is as follows.
Figure BDA0003201772910000211
where s5k is the trigger word similarity value of the current event and the kth historical event;
Figure BDA0003201772910000212
is the current event trigger word vector corresponding to the current event;
Figure BDA0003201772910000213
is the historical event trigger word vector corresponding to the kth historical event; $W_{ct}$ is the current event trigger word corresponding to the current event; $W_{htk}$ is the historical event trigger word corresponding to the kth historical event; and $ed(W_{ct}, W_{htk})$ is the trigger word edit distance between the current event trigger word and the historical event trigger word of the kth historical event.
In the above steps, the trigger word similarity is calculated from the semantic similarity value and the syntactic distance value of the current event trigger word and the historical event trigger word.
As an exemplary embodiment, step S8 of obtaining the time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event includes steps S801-S802.
Step S801: calculating the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event.
In this embodiment, the co-occurrence distance measures how much the two time windows overlap: the more they overlap, the larger the distance, and the overlapping part is the distance. For example, if event 1 has the life cycle 2018.02.03-2018.04.01 and event 2 has the life cycle 2018.01.02-2018.04.01, the years are removed and the overlapping part of the months is the distance.
Step S802: reducing the event co-occurrence distance by a preset multiple to obtain the time window similarity value of the current event and each historical event.
In this embodiment, the time window similarity value of the current event and a historical event is calculated as follows:
$$s_{6k} = \frac{td(t_{c}, t_{hk})}{T}$$
where $s_{6k}$ is the time window similarity value of the current event and the kth historical event; $t_{c}$ is the current event life cycle corresponding to the current event; $t_{hk}$ is the historical event life cycle corresponding to the kth historical event; $td(t_{c}, t_{hk})$ is the co-occurrence distance between the current event life cycle and the historical event life cycle of the kth historical event; and $T$ is the preset multiple.
In this embodiment, the preset multiple ranges from 10 to 20; specifically, the preset multiple is 12. This embodiment is described only schematically and is not limited thereto.
In the above steps, the event co-occurrence distance td between the life cycle of the current event and the life cycle of the historical event is calculated and then scaled down to obtain the time window similarity value.
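Steps S801-S802 can be sketched with the two example events above. The sketch assumes one particular reading of "the overlapping part of the months": the co-occurrence distance is taken as the number of calendar months (year ignored) that both life cycles touch, and it is then divided by the preset multiple 12. The month-counting rule and the function names are assumptions.

```python
from datetime import date

def months_covered(start, end):
    """Set of calendar months (1-12) touched by [start, end], ignoring the year."""
    months = set()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.add(month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def time_window_similarity(cur_cycle, hist_cycle, preset_multiple=12):
    """Co-occurrence distance td = number of shared months; similarity = td / T."""
    td = len(months_covered(*cur_cycle) & months_covered(*hist_cycle))
    return td / preset_multiple

# The two life cycles from the example in this embodiment.
event_1 = (date(2018, 2, 3), date(2018, 4, 1))
event_2 = (date(2018, 1, 2), date(2018, 4, 1))
s6 = time_window_similarity(event_1, event_2)  # months 2, 3, 4 overlap -> 3 / 12
```

With a preset multiple of 12 and a month-based distance, the value stays in the range 0 to 1, which keeps it on the same scale as the other similarity values.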
As an exemplary embodiment, the step S9 of obtaining the similarity value between the text category of the current event and the text category of each historical event according to the text category of the current event and the text category of each historical event includes steps S901 to S903.
Step S901: respectively judging whether the current event text category is the same as the historical event text category of each historical event.
In this embodiment, the label corresponding to the current event text category is compared with the label corresponding to the historical event text category of each historical event; if the labels are consistent, step S902 is executed; if the labels are not consistent, step S903 is executed.
Step S902: if the current event text category is the same as the historical event text category, the text category similarity value is a first preset value.
In this embodiment, if the current event text category is the same as the historical event text category, the two events belong to the same category and their similarity is high. Specifically, the first preset value is 1; when the text category similarity value is 1, the historical event participates in the subsequent weighted fusion of the multiple similarity values.
Step S903: if the current event text category is different from the historical event text category, the text category similarity value is a second preset value, and the second preset value is smaller than the first preset value.
In this embodiment, if the current event text category is different from the historical event text category, the two events belong to different categories and their similarity is low. Specifically, the second preset value is 0; when the text category similarity value is 0, the historical event does not participate in the subsequent weighted fusion of the multiple similarity values.
The text category similarity value of the current event and a historical event is calculated as follows:
$$s_{7k} = \begin{cases} 1, & \mathrm{label}_{ec} = \mathrm{label}_{eh}(k) \\ 0, & \mathrm{label}_{ec} \neq \mathrm{label}_{eh}(k) \end{cases}$$
where $s_{7k}$ is the text category similarity value of the current event and the kth historical event; $\mathrm{label}_{ec}$ is the current event text category corresponding to the current event; and $\mathrm{label}_{eh}(k)$ is the historical event text category corresponding to the kth historical event.
In the above steps, the event category of the current event is compared with the event category of each historical event; a historical event whose category is consistent participates in the subsequent fusion weighting, and one whose category is inconsistent does not.
As an exemplary embodiment, after step S14 of sorting the historical events according to the final similarity score value to obtain the similar event sorting result of the current event, the method further includes steps S15-S16.
Step S15: acquiring the matching number of similar events.
In this embodiment, the matching number of similar events is determined according to the user requirement. Specifically, the matching number may be 5, 10, etc., which is only schematically described in this embodiment and is not limited thereto.
Step S16: selecting, from the similar event sorting result, the historical events with the largest final similarity score values according to the matching number of similar events, to obtain that number of similar historical events.
In this embodiment, according to the matching number of similar events, the historical events with the larger similarity score values are taken as the historical similar events corresponding to the current event.
In the above steps, the matching number of historical events with the largest final similarity score values are selected as the historical similar events corresponding to the current event.
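Steps S15-S16 amount to a top-N selection over the final similarity score values. A minimal sketch, assuming the scores are held in a dictionary keyed by a hypothetical historical event identifier:

```python
import heapq

def top_similar_events(final_scores, match_count):
    """Return the match_count (event_id, score) pairs with the highest scores."""
    return heapq.nlargest(match_count, final_scores.items(), key=lambda item: item[1])

# Hypothetical final similarity score values for four historical events.
final_scores = {"h1": 0.42, "h2": 0.91, "h3": 0.77, "h4": 0.15}
top_2 = top_similar_events(final_scores, 2)
```

`heapq.nlargest` returns the selected pairs in descending score order, so the result can be shown to the user directly; if the matching number exceeds the number of historical events, all events are returned.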
A detailed description is given below with a specific example, and the flowchart is shown in fig. 2.
Unsupervised training is performed on a massive text data set with a word2vector model. The historical data is segmented into words with the jieba word segmenter, and the word is taken as the minimum semantic unit. Through contextual understanding of the massive text data, the semantic features of each word are learned, and the model is saved.
Unsupervised training is performed on a massive text data set with a BERT model. Word vectors and sentence vectors are obtained by training on the context of the text.
The subject words are extracted and mapped into word vectors, and cosine similarity is calculated to obtain a similarity weight.
All subject word vectors of an event article are added, that is, the elements of corresponding dimensions are added, to obtain the subject word semantic vector of the current event and the subject word semantic vector of the historical event; a cosine calculation on the two vectors gives the subject word semantic similarity value.
The text abstract is extracted, the abstract sentences are mapped into sentence vectors, and cosine similarity is calculated to obtain a similarity weight. The sentence vectors of all abstract sentences of the event article are added, that is, the elements of corresponding dimensions are added, to obtain the sentence semantic vector of the current event and the sentence semantic vector of the historical event; a cosine calculation on the two vectors gives the sentence semantic similarity value.
The edit distance between the title of the current event and the title of the historical event is calculated and normalized to obtain the syntactic similarity value.
The argument similarity is calculated from the semantic similarity value and the syntactic distance value of the current event argument and the historical event argument.
The trigger word similarity is calculated from the semantic similarity value and the syntactic distance value of the current event trigger word and the historical event trigger word.
The event co-occurrence distance td between the life cycle of the current event and the life cycle of the historical event is calculated and scaled down to obtain the time window similarity value.
The event category of the current event is compared with the historical event category; if they are consistent, the historical event participates in the subsequent fusion weighting, and if they are not consistent, it does not.
Linear parameters are preset through statistical analysis, and the similarity values obtained by the above calculations are linearly fused; global exponential normalization is then performed to obtain the event similarity score, which lays the groundwork for the score-based weighted sorting in the next step.
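The fusion step can be sketched as a weighted sum of the seven similarity values followed by a normalization across all historical events. The weights below are hypothetical (the source only says they are preset through statistical analysis), and reading the global normalization as a softmax-style exponential normalization is an assumption. How a text category similarity of 0 excludes an event from fusion is not shown here.

```python
import math

# Hypothetical preset linear parameters for s1..s7; the source gives no values.
WEIGHTS = [0.25, 0.25, 0.10, 0.10, 0.10, 0.10, 0.10]

def linear_fusion(similarities, weights=WEIGHTS):
    """Weighted sum of the seven per-feature similarity values of one historical event."""
    if len(similarities) != len(weights):
        raise ValueError("expected one weight per similarity value")
    return sum(w * s for w, s in zip(weights, similarities))

def exponential_normalize(raw_scores):
    """Softmax over all historical events (assumed form of the global normalization)."""
    peak = max(raw_scores.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - peak) for k, v in raw_scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical similarity values s1..s7 for two historical events.
per_event = {
    "h1": [0.9, 0.8, 0.5, 0.6, 0.7, 0.25, 1.0],
    "h2": [0.2, 0.3, 0.1, 0.0, 0.1, 0.00, 0.0],
}
raw = {k: linear_fusion(v) for k, v in per_event.items()}
scores = exponential_normalize(raw)
```

A softmax preserves the ordering of the fused scores, so the later sorting is unaffected by the choice of normalization; only the preset threshold would need to be chosen on the normalized scale.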
The similarity score is boosted by weighting in four cases; the weighting is applied only if the score is greater than a preset threshold, and the final sorting is performed after the weighting: (1) the current event and the historical event have the same region: weighting is performed according to the region distance, and the closer the distance, the larger the score; (2) the current event and the historical event have crossed subjects, that is, they relate to the same person or organization: weighting is performed according to the number of co-occurring subjects, and the larger the number, the larger the score; (3) both the same region and crossed subjects are present: the weightings of (1) and (2) are combined; (4) neither the same region nor crossed subjects are present: the score is unchanged.
The final scores are sorted to obtain the final ranking. The top Z events are the Z most similar events.
In this method, similarity is calculated from the semantic features, syntactic features, subject features and event attribute features of the events, and the similarity weights of all features are linearly fused and then boosted and sorted to obtain the several events with the highest similarity. Through the fusion and comparison of multi-dimensional information, historical similar events can be acquired accurately, and the search precision for similar events is improved.
This embodiment also provides a multi-dimensional feature fusion similar event calculation system, which is used for implementing the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
The embodiment also provides a system for computing a multi-dimensional feature fusion similar event, as shown in fig. 3, including:
the first acquisition module 1 is used for acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger term, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
the second acquisition module 2 is used for acquiring a history event subject term, a history event abstract, a history event title, a history event argument, a history event trigger term, a history event life cycle, a history event text category, a history event named entity and a history event occurrence place of each history event;
the first processing module 3 is used for obtaining subject term semantic similarity values of the current event and each historical event according to the subject terms of the current event and the subject terms of the historical events;
the second processing module 4 is used for obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
the third processing module 5 is configured to obtain a syntactic similarity value between the current event and each historical event according to the current event title and the historical event title of each historical event;
the fourth processing module 6 is used for obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
the fifth processing module 7 is configured to obtain a trigger word similarity value between the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
the sixth processing module 8 is configured to obtain a time window similarity value between the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
a seventh processing module 9, configured to obtain a text category similarity value between the current event and each historical event according to the text category of the current event and the text category of the historical event of each historical event;
an eighth processing module 10, configured to perform weighted fusion on the subject term semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger term similarity value, the time window similarity value, and the text category similarity value of each historical event, respectively, to obtain a similarity score value between the current event and each historical event;
the judging module 11 is configured to respectively judge whether the similarity score value of each historical event is greater than a preset threshold;
the ninth processing module 12 is configured to, if the similarity score value of the historical event is less than or equal to the preset threshold, keep the similarity score value of the historical event unchanged, and use the similarity score value as a final similarity score value of the current event and the historical event;
a tenth processing module 13, configured to, if the similarity score value of the historical event is greater than the preset threshold, perform named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence location, the historical event named entity, and the historical event occurrence location to obtain a weighted similarity score value, and use the weighted similarity score value as a final similarity score value of the current event and the historical event;
and the eleventh processing module 14 is configured to sort the historical events according to the final similar score value, and obtain a similar event sorting result of the current event.
Optionally, the tenth processing module includes:
the first judgment submodule, used for judging whether the historical event occurrence place of a historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entities of the historical event and the current event named entities have an entity in common;
the first processing submodule, used for keeping the similarity score value of the historical event unchanged if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entities and the current event named entities have no entity in common;
the second processing submodule, used for performing region weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entities and the current event named entities have no entity in common;
the third processing submodule, used for performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the historical event occurrence place and the current event occurrence place do not belong to the same region and the historical event named entities and the current event named entities have an entity in common;
and the fourth processing submodule, used for performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value if the historical event occurrence place and the current event occurrence place belong to the same region and the historical event named entities and the current event named entities have an entity in common.
Optionally, the second processing sub-module includes: the first processing unit is used for calculating the region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; and the second processing unit is used for carrying out region weighting on the similarity score value according to the region distance to obtain the weighted similarity score value.
Optionally, the third processing sub-module includes: the third processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the fourth processing unit is used for weighting the named entities of the similarity score values according to the number of the same entities to obtain weighted similarity score values.
Optionally, the fourth processing submodule includes: the fifth processing unit is used for calculating the region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event; the sixth processing unit is used for comparing the historical event named entity in the historical event with the current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity; and the seventh processing unit is used for carrying out region weighting and named entity weighting on the similarity score value according to the region distance and the number of the same entities to obtain the weighted similarity score value.
Optionally, the weighted similarity score value is calculated as follows:
Figure BDA0003201772910000241
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the regional distance between the current event and the kth historical event; η is the named entity weighting constant; and n(k) is the number of identical named entities shared by the current event and the kth historical event.
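The four weighting cases handled by the processing sub-modules above can be sketched as a small dispatch function. This is a minimal illustration, not the formula of the embodiment: the weighting formula itself is not reproduced in this text, so the two weighting steps are passed in as caller-supplied functions. The names `weighted_score`, `region_weight` and `entity_weight` are hypothetical, and applying the region step before the entity step when both conditions hold is an assumption.

```python
from typing import Callable


def weighted_score(
    score: float,
    same_region: bool,
    region_distance: float,
    shared_entities: int,
    region_weight: Callable[[float, float], float],
    entity_weight: Callable[[float, int], float],
) -> float:
    """Apply region and/or named entity weighting to one similarity score.

    region_weight(score, d) and entity_weight(score, n) are hypothetical
    stand-ins for the weighting formula, which is not reproduced here.
    """
    result = score
    # Region weighting applies only when both events occur in the same region.
    if same_region:
        result = region_weight(result, region_distance)
    # Named entity weighting applies only when at least one entity is shared.
    if shared_entities > 0:
        result = entity_weight(result, shared_entities)
    # With neither condition met, the score is returned unchanged.
    return result
```

When neither condition holds the score passes through unchanged, which matches the first of the four cases.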
Optionally, the first processing module includes: a fifth processing sub-module, configured to input the current event subject terms and the historical event subject terms of each historical event into a pre-trained word vector model, respectively, to obtain the current event subject term vectors contained in the current event and the historical event subject term vectors contained in each historical event; a sixth processing sub-module, configured to add the current event subject term vectors contained in the current event to obtain the current event subject term semantic vector corresponding to the current event; a seventh processing sub-module, configured to add the historical event subject term vectors contained in each historical event, respectively, to obtain the historical event subject term semantic vector corresponding to each historical event; and an eighth processing sub-module, configured to calculate the cosine similarity between the current event subject term semantic vector and each historical event subject term semantic vector, respectively, to obtain the subject term semantic similarity value of the current event and each historical event.
Optionally, the second processing module includes: a ninth processing sub-module, configured to input the current event summary corresponding to the current event and the historical event summary corresponding to each historical event into a BERT pre-trained sentence vector model, respectively, to obtain the current summary sentence vectors corresponding to the current event summary and the historical summary sentence vectors corresponding to each historical event summary; a tenth processing sub-module, configured to add the current summary sentence vectors corresponding to the current event summary to obtain the sentence semantic vector of the current event summary; an eleventh processing sub-module, configured to add the historical summary sentence vectors corresponding to each historical event summary, respectively, to obtain the sentence semantic vector of each historical event summary; and a twelfth processing sub-module, configured to calculate the cosine similarity between the sentence semantic vector of the current event summary and the sentence semantic vector of each historical event summary, respectively, to obtain the summary sentence semantic similarity value of the current event and each historical event.
Optionally, the third processing module includes: a thirteenth processing sub-module, configured to obtain the title edit distance between the current event title and the historical event title of each historical event, respectively; and a fourteenth processing sub-module, configured to normalize the title edit distance between the current event title and each historical event title, respectively, to obtain the syntactic similarity value of the current event and each historical event.
Optionally, the fourth processing module includes: a fifteenth processing sub-module, configured to input the current event argument and the historical event argument of each historical event into a pre-trained word vector model, respectively, to obtain the current event argument word vector corresponding to the current event argument and the historical event argument word vector corresponding to each historical event argument; a sixteenth processing sub-module, configured to obtain the argument edit distance between the current event argument and the historical event argument of each historical event, respectively; and a seventeenth processing sub-module, configured to obtain the argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector and the argument edit distance.
Optionally, the fifth processing module includes: an eighteenth processing sub-module, configured to input the current event trigger word and the historical event trigger word of each historical event into a pre-trained word vector model, respectively, to obtain the current event trigger word vector corresponding to the current event trigger word and the historical event trigger word vector corresponding to each historical event trigger word; a nineteenth processing sub-module, configured to obtain the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event, respectively; and a twentieth processing sub-module, configured to obtain the trigger word similarity value of the current event and each historical event according to the current event trigger word vector, the historical event trigger word vector and the trigger word edit distance.
Optionally, the sixth processing module includes: a twenty-first processing sub-module, configured to calculate the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event; and a twenty-second processing sub-module, configured to reduce the event co-occurrence distance by a preset multiple to obtain the time window similarity value of the current event and each historical event.
Optionally, the seventh processing module includes: a second judgment sub-module, configured to judge whether the current event text category is the same as the historical event text category of each historical event, respectively; a twenty-third processing sub-module, configured to set the text category similarity value to a first preset value if the current event text category is the same as the historical event text category; and a twenty-fourth processing sub-module, configured to set the text category similarity value to a second preset value if the current event text category is different from the historical event text category, the second preset value being smaller than the first preset value.
Optionally, the subject term semantic similarity value of the current event and each historical event is calculated as follows:

$$s_{1k} = \cos\left(\vec{W}_c, \vec{W}_{hk}\right) = \frac{\vec{W}_c \cdot \vec{W}_{hk}}{\lVert \vec{W}_c \rVert \, \lVert \vec{W}_{hk} \rVert}$$

$$\vec{W}_c = \sum_{i=1}^{p} \vec{w}_{c,i}$$

$$\vec{W}_{hk} = \sum_{j=1}^{q} \vec{w}_{hk,j}$$

wherein $s_{1k}$ is the subject term semantic similarity value of the current event and the kth historical event; $\vec{W}_c$ is the current event subject term semantic vector; $p$ is the number of current event subject term vectors; $\vec{w}_{c,i}$ is the current event subject term vector corresponding to the ith subject term of the current event; $\vec{W}_{hk}$ is the historical event subject term semantic vector corresponding to the kth historical event; $q$ is the number of historical event subject term vectors corresponding to the kth historical event; and $\vec{w}_{hk,j}$ is the historical event subject term vector corresponding to the jth subject term of the kth historical event.
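The summation and cosine steps described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the word vectors have already been produced by some pre-trained model and are plain lists of floats; the function names are hypothetical, and returning 0.0 for a zero-length vector is an assumption the text does not address.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sum_vectors(vectors: List[Vector]) -> List[float]:
    """Element-wise sum of equally sized word vectors."""
    if not vectors:
        raise ValueError("at least one word vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all word vectors must have the same dimension")
    return [sum(v[i] for v in vectors) for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def subject_term_similarity(
    current_vectors: List[Vector], historical_vectors: List[Vector]
) -> float:
    """Sum each event's subject term vectors, then compare the two sums."""
    return cosine_similarity(
        sum_vectors(current_vectors), sum_vectors(historical_vectors)
    )
```

The same two helpers would apply to the summary sentence vectors, since that similarity value is built from the same sum-then-cosine steps.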
Optionally, the summary sentence semantic similarity value of the current event and each historical event is calculated as follows:

$$s_{2k} = \cos\left(\vec{V}_c, \vec{V}_{hk}\right) = \frac{\vec{V}_c \cdot \vec{V}_{hk}}{\lVert \vec{V}_c \rVert \, \lVert \vec{V}_{hk} \rVert}$$

$$\vec{V}_c = \sum_{i=1}^{l} \vec{v}_{c,i}$$

$$\vec{V}_{hk} = \sum_{j=1}^{m} \vec{v}_{hk,j}$$

wherein $s_{2k}$ is the summary sentence semantic similarity value of the current event and the kth historical event; $\vec{V}_c$ is the sentence semantic vector of the current event summary; $l$ is the number of current summary sentence vectors in the current event summary; $\vec{v}_{c,i}$ is the current summary sentence vector corresponding to the ith sentence of the current event summary; $\vec{V}_{hk}$ is the sentence semantic vector of the historical event summary corresponding to the kth historical event; $m$ is the number of historical summary sentence vectors corresponding to the kth historical event; and $\vec{v}_{hk,j}$ is the historical summary sentence vector corresponding to the jth summary sentence of the kth historical event.
Optionally, the syntactic similarity value of the current event and each historical event is calculated as follows:
Figure BDA0003201772910000267
wherein s3k is the syntactic similarity value of the current event and the kth historical event; t_1 is the current event title corresponding to the current event; t_2k is the historical event title corresponding to the kth historical event; and ed(t_1, t_2k) is the title edit distance between the current event title and the historical event title corresponding to the kth historical event.
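The title comparison above rests on an edit distance followed by a normalization step. The sketch below uses the standard Levenshtein distance and, as an assumption, normalizes it as 1 − ed / max(len(t_1), len(t_2k)); this is one common choice and may differ from the normalization in the formula above. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def title_similarity(current_title: str, historical_title: str) -> float:
    """Map the edit distance into [0, 1]; 1.0 means identical titles.

    Assumed normalization: divide by the longer title's length.
    """
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0  # assumption: two empty titles count as identical
    return 1.0 - edit_distance(current_title, historical_title) / longest
```

Dividing by the longer title keeps the result in the unit interval, because the edit distance can never exceed that length.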
Optionally, the argument similarity value of the current event and each historical event is calculated as follows:
Figure BDA0003201772910000268
wherein s4k is the argument similarity value of the current event and the kth historical event;
Figure BDA0003201772910000269
is the current event argument word vector corresponding to the current event;
Figure BDA00032017729100002610
is the historical event argument word vector corresponding to the kth historical event; W_ca is the current event argument corresponding to the current event; W_hak is the historical event argument corresponding to the kth historical event; and ed(W_ca, W_hak) is the argument edit distance between the current event argument and the historical event argument corresponding to the kth historical event.
Optionally, the trigger word similarity value of the current event and each historical event is calculated as follows:
Figure BDA00032017729100002611
wherein s5k is the trigger word similarity value of the current event and the kth historical event;
Figure BDA00032017729100002612
is the current event trigger word vector corresponding to the current event;
Figure BDA00032017729100002613
is the historical event trigger word vector corresponding to the kth historical event; W_ct is the current event trigger word corresponding to the current event; W_htk is the historical event trigger word corresponding to the kth historical event; and ed(W_ct, W_htk) is the trigger word edit distance between the current event trigger word and the historical event trigger word corresponding to the kth historical event.
Optionally, the time window similarity value of the current event and each historical event is calculated as follows:
Figure BDA00032017729100002614
wherein s6k is the time window similarity value of the current event and the kth historical event; t_c is the current event life cycle corresponding to the current event; t_hk is the historical event life cycle corresponding to the kth historical event; td(t_c, t_hk) is the co-occurrence distance between the current event life cycle and the historical event life cycle corresponding to the kth historical event; and T is the preset multiple.
Optionally, the text category similarity value of the current event and each historical event is calculated as follows:
Figure BDA0003201772910000271
wherein s7k is the text category similarity value of the current event and the kth historical event; table_ec is the current event text category corresponding to the current event; and table_eh(k) is the historical event text category corresponding to the kth historical event.
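The text category comparison reduces to a two-valued function. The sketch below is illustrative only: the function name is hypothetical, and the default values 1.0 and 0.0 are assumptions standing in for the first and second preset values, which this text does not state.

```python
def text_category_similarity(
    current_category: str,
    historical_category: str,
    same_value: float = 1.0,       # assumed first preset value
    different_value: float = 0.0,  # assumed second preset value
) -> float:
    """Return the first preset value for matching categories, else the second."""
    if different_value >= same_value:
        raise ValueError("the second preset value must be smaller than the first")
    if current_category == historical_category:
        return same_value
    return different_value
```

The guard at the top enforces the only constraint the text places on the two values, namely that the second is smaller than the first.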
Optionally, the system further includes: a third acquisition module, configured to acquire the number of similar events to be matched; and a twelfth processing module, configured to select, from the similar event sequencing result and according to that number, the historical events with the largest final similarity score values, thereby obtaining that number of similar historical events.
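Selecting the best-matching historical events amounts to sorting by final similarity score and keeping the first entries. A minimal sketch follows, assuming the final scores are held in a dictionary keyed by a hypothetical event identifier; the function name is likewise hypothetical.

```python
from typing import Dict, List, Tuple


def top_similar_events(
    final_scores: Dict[str, float], match_count: int
) -> List[Tuple[str, float]]:
    """Return the match_count historical events with the highest final scores."""
    if match_count < 0:
        raise ValueError("match_count must be non-negative")
    # Sort by score, highest first; Python's sort is stable, so events with
    # equal scores keep their original insertion order.
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:match_count]
```

If fewer historical events exist than the requested number, the slice simply returns all of them.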
The multi-dimensional feature fusion similar event calculation system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device. As shown in FIG. 4, the electronic device includes one or more processors 71 and a memory 72; one processor 71 is taken as an example in FIG. 4.
The electronic device may further include an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or by other means; a bus connection is taken as an example in FIG. 4.
The processor 71 may be a Central Processing Unit (CPU). The processor 71 may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof. A general purpose processor may be a microprocessor or any conventional processor.
The memory 72, which is a non-transitory computer readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the multidimensional feature fusion similar event calculation method in the embodiments of the present application. The processor 71 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 72, namely, implementing the multi-dimensional feature fusion similar event calculation method of the above-described method embodiment.
The memory 72 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72, which when executed by the one or more processors 71 perform the method shown in FIG. 1.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above embodiments of the multi-dimensional feature fusion similar event calculation method. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kinds described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A multi-dimensional feature fusion similar event calculation method is characterized by comprising the following steps:
acquiring a current event subject term, a current event summary, a current event title, a current event argument, a current event trigger word, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
acquiring a historical event subject term, a historical event summary, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
obtaining a subject term semantic similarity value of the current event and each historical event according to the subject term of the current event and the subject term of each historical event;
obtaining a semantic similarity value of a summary sentence of the current event and each historical event according to the current event summary and the historical event summary of each historical event;
obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
respectively carrying out weighted fusion on the subject term semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger term similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event;
respectively judging whether the similarity score value of each historical event is greater than a preset threshold value;
if the similarity score value of the historical event is smaller than or equal to the preset threshold value, keeping the similarity score value of the historical event unchanged, and taking the similarity score value as the final similarity score value of the current event and the historical event;
if the similarity score value of the historical event is larger than the preset threshold value, performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event;
and sorting the historical events according to the final similarity score values to obtain a similar event sequencing result of the current event.
2. The multi-dimensional feature fusion similar event calculation method according to claim 1, wherein the step of performing named entity and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain the weighted similarity score value comprises:
judging whether the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and judging whether the historical event named entity of the historical event and the current event named entity have an identical entity;
if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entity of the historical event and the current event named entity do not have an identical entity, keeping the similarity score value of the historical event unchanged;
if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entity of the historical event and the current event named entity do not have an identical entity, performing region weighting on the similarity score value of the historical event to obtain a weighted similarity score value;
if the historical event occurrence place of the historical event and the current event occurrence place do not belong to the same region, and the historical event named entity of the historical event and the current event named entity have an identical entity, performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value;
and if the historical event occurrence place of the historical event and the current event occurrence place belong to the same region, and the historical event named entity of the historical event and the current event named entity have an identical entity, performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value.
3. The multi-dimensional feature fusion similar event calculation method according to claim 2,
the step of performing regional weighting on the similarity score value of the historical event to obtain the weighted similarity score value comprises the following steps:
calculating a region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event;
performing regional weighting on the similarity score value according to the regional distance to obtain a weighted similarity score value;
optionally, the step of performing named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes:
comparing the historical event named entity of the historical event with the current event named entity to obtain the number of identical entities contained in both;
performing named entity weighting on the similarity score value according to the number of identical entities to obtain the weighted similarity score value;
optionally, the step of performing region weighting and named entity weighting on the similarity score value of the historical event to obtain a weighted similarity score value includes:
calculating a region distance between the historical event occurrence point and the current event occurrence point according to the historical event occurrence point and the current event occurrence point corresponding to the historical event;
comparing a historical event named entity in a historical event with a current event named entity to obtain the number of the same entities contained in the historical event named entity and the current event named entity;
and carrying out region weighting and named entity weighting on the similarity score values according to the region distance and the number of the same entities to obtain weighted similarity score values.
4. The multi-dimensional feature fusion similar event calculation method according to claim 3,
the weighted similarity score value is calculated as follows:
Figure FDA0003201772900000031
wherein score_new(k) is the weighted similarity score value of the kth historical event; score(k) is the similarity score value of the kth historical event before weighting; δ is the region weighting constant; d(k) is the regional distance between the current event and the kth historical event; η is the named entity weighting constant; and n(k) is the number of identical named entities shared by the current event and the kth historical event.
5. The method of multi-dimensional feature fusion similar event computation of claim 1,
the step of obtaining the subject term semantic similarity value of the current event and each historical event according to the current event subject term and the historical event subject term of each historical event comprises:
respectively inputting the current event subject term and the historical event subject term of each historical event into a pre-trained word vector model to obtain the current event subject term vectors contained in the current event and the historical event subject term vectors contained in each historical event;
adding current event subject word vectors contained in a current event to obtain a current event subject word semantic vector corresponding to the current event;
adding the historical event subject word vectors contained in each historical event to obtain a historical event subject word semantic vector corresponding to each historical event;
respectively carrying out cosine similarity calculation on the semantic vector of the subject term of the current event and the semantic vector of the subject term of each historical event to obtain a subject term semantic similarity value of the current event and each historical event;
optionally, the step of obtaining a summary sentence semantic similarity value of the current event and each historical event according to the current event summary and the historical event summary of each historical event includes:
respectively inputting the current event summary corresponding to the current event and the historical event summary corresponding to each historical event into a BERT pre-trained sentence vector model to obtain the current summary sentence vectors corresponding to the current event summary and the historical summary sentence vectors corresponding to each historical event summary;
adding the current abstract sentence vectors corresponding to the current event abstract to obtain a sentence semantic vector of the current event abstract;
adding the historical abstract sentence vectors corresponding to each historical event abstract respectively to obtain the sentence semantic vector of each historical event abstract;
respectively carrying out cosine similarity calculation on the sentence semantic vector of the current event abstract and the sentence semantic vector of each historical event abstract to obtain abstract sentence semantic similarity values of the current event and each historical event;
optionally, the step of obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event includes:
respectively obtaining the title edit distance between the current event title and the historical event title of each historical event;
respectively normalizing the title edit distance between the current event title and each historical event title to obtain the syntactic similarity value of the current event and each historical event;
optionally, the step of obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event includes:
respectively inputting the current event argument and the historical event argument of each historical event into a pre-training word vector model to obtain a current event argument word vector corresponding to the current event argument and a historical event argument word vector corresponding to each historical event argument;
respectively obtaining argument edit distances of the current event argument and each historical event argument according to the current event argument and the historical event argument of each historical event;
obtaining an argument similarity value of the current event and each historical event according to the current event argument word vector, the historical event argument word vector and the argument edit distance;
optionally, the step of obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event includes:
respectively inputting the current event trigger word and the historical event trigger word of each historical event into a pre-trained word vector model to obtain the current event trigger word vector corresponding to the current event trigger word and the historical event trigger word vector corresponding to each historical event trigger word;
respectively obtaining the trigger word edit distance between the current event trigger word and the historical event trigger word of each historical event;
obtaining a trigger word similarity value of the current event and each historical event according to the trigger word vector of the current event, the trigger word vector of the historical event and the editing distance of the trigger word;
optionally, the step of obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event includes:
calculating the event co-occurrence distance between the current event life cycle and the historical event life cycle of each historical event;
reducing the event co-occurrence distance by a preset multiple to obtain the time window similarity value of the current event and each historical event;
optionally, the step of obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event includes:
respectively judging whether the current event text category is the same as the historical event text category of each historical event;
if the current event text category is the same as the historical event text category, setting the text category similarity value to a first preset value;
and if the current event text category is different from the historical event text category, setting the text category similarity value to a second preset value, the second preset value being smaller than the first preset value.
6. The multi-dimensional feature fusion similar event calculation method according to claim 5,
the calculation formula of the semantic similarity value of the subject term of the current event and each historical event is as follows:
Figure FDA0003201772900000051
Figure FDA0003201772900000052
Figure FDA0003201772900000053
wherein s1k is the subject term semantic similarity value of the current event and the k-th historical event; wc d1 is the subject term semantic vector of the current event; p is the number of current event subject word vectors;
the symbol in Figure FDA0003201772900000054 is the current event subject word vector corresponding to the p-th subject word in the current event;
the symbol in Figure FDA0003201772900000055 is the historical event subject term semantic vector corresponding to the k-th historical event; q is the number of historical event subject word vectors corresponding to the k-th historical event;
the symbol in Figure FDA0003201772900000056 is the historical event subject word vector corresponding to the q-th subject word in the k-th historical event;
optionally, the calculation formula of the semantic similarity value of the summary sentence of the current event and each historical event is as follows:
Figure FDA0003201772900000057
Figure FDA0003201772900000058
Figure FDA0003201772900000061
wherein s2k is the abstract sentence semantic similarity value of the current event and the k-th historical event;
the symbol in Figure FDA0003201772900000062 is the sentence semantic vector of the current event abstract; l is the number of current event abstract sentence vectors in the current event abstract;
the symbol in Figure FDA0003201772900000063 is the current event abstract sentence vector corresponding to the l-th sentence in the current event abstract;
the symbol in Figure FDA0003201772900000064 is the sentence semantic vector of the historical event abstract corresponding to the k-th historical event; m is the number of historical event abstract sentence vectors corresponding to the k-th historical event;
the symbol in Figure FDA0003201772900000065 is the historical event abstract sentence vector corresponding to the m-th abstract sentence in the k-th historical event;
optionally, the calculation formula of the syntactic similarity value of the current event and each historical event is as follows:
Figure FDA0003201772900000066
wherein s3k is the syntactic similarity value of the current event and the k-th historical event; t1 is the current event title corresponding to the current event; t2k is the historical event title corresponding to the k-th historical event; ed(t1, t2k) is the title edit distance between the current event title and the historical event title of the k-th historical event;
optionally, the calculation formula of the argument similarity value of the current event and each historical event is as follows:
Figure FDA0003201772900000067
wherein s4k is the argument similarity value of the current event and the k-th historical event;
the symbol in Figure FDA0003201772900000068 is the current event argument word vector corresponding to the current event;
the symbol in Figure FDA0003201772900000069 is the historical event argument word vector corresponding to the k-th historical event; Wca is the current event argument corresponding to the current event; Whak is the historical event argument corresponding to the k-th historical event; ed(Wca, Whak) is the argument edit distance between the current event argument and the historical event argument of the k-th historical event;
optionally, the calculation formula of the trigger word similarity value of the current event and each historical event is as follows:
Figure FDA00032017729000000610
wherein s5k is the trigger word similarity value of the current event and the k-th historical event;
the symbol in Figure FDA00032017729000000611 is the current event trigger word vector corresponding to the current event;
the symbol in Figure FDA00032017729000000612 is the historical event trigger word vector corresponding to the k-th historical event; Wct is the current event trigger word corresponding to the current event; Whtk is the historical event trigger word corresponding to the k-th historical event; ed(Wct, Whtk) is the trigger word edit distance between the current event trigger word and the historical event trigger word of the k-th historical event;
optionally, the time window similarity value calculation formula of the current event and each historical event is as follows:
Figure FDA0003201772900000071
wherein s6k is the time window similarity value of the current event and the k-th historical event; tc is the current event life cycle corresponding to the current event; thk is the historical event life cycle corresponding to the k-th historical event; td(tc, thk) is the co-occurrence distance between the current event life cycle and the historical event life cycle of the k-th historical event; T is the preset multiple;
optionally, the text category similarity value calculation formula of the current event and each historical event is as follows:
Figure FDA0003201772900000072
wherein s7k is the text category similarity value of the current event and the k-th historical event; table_ec is the current event text category corresponding to the current event; and table_eh(k) is the historical event text category corresponding to the k-th historical event.
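The formulas of this claim survive here only as figure references, so the following Python sketch is one plausible reading of the symbol definitions and not the claimed formulas themselves. It assumes that a semantic vector is the arithmetic mean of its word (or sentence) vectors, that two semantic vectors are compared by cosine similarity, and that the title edit distance ed(t1, t2k) is a Levenshtein distance normalized by the longer title. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(current_vectors, historical_vectors):
    """Assumed form of s1k and s2k: cosine similarity of the averaged vectors."""
    return cosine_similarity(mean_vector(current_vectors),
                             mean_vector(historical_vectors))


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            substitution = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + substitution))
        previous = current
    return previous[-1]


def title_similarity(current_title, historical_title):
    """Assumed form of s3k: one minus the length-normalized edit distance."""
    longest = max(len(current_title), len(historical_title))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(current_title, historical_title) / longest
```

The argument and trigger word similarity values combine a word vector comparison with an edit distance in the same spirit; how the two terms are combined is given only by the figures and is therefore not sketched here.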
7. The multi-dimensional feature fusion similar event calculation method according to claim 1, wherein after the step of sorting the historical events according to the final similarity score value to obtain a similar event sorting result of the current event, the method further comprises:
acquiring a matching number of similar events;
and selecting, from the similar event sorting result, the historical events with the highest final similarity score values according to the matching number of similar events, to obtain that number of similar historical events.
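As a minimal sketch of the selection step of this claim, the following Python function sorts the historical events by final similarity score value in descending order and keeps the first entries up to the matching number. The dictionary representation and the names used are assumptions made for illustration.

```python
def top_similar_events(final_scores, matching_number):
    """Return the `matching_number` highest-scoring (event_id, score) pairs.

    `final_scores` maps a historical event identifier to its final
    similarity score value; both names are hypothetical.
    """
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:matching_number]
```

If fewer historical events exist than the matching number, the slice simply returns all of them.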
8. A multi-dimensional feature fusion similar event computing system, comprising:
the first acquisition module is used for acquiring a current event subject term, a current event abstract, a current event title, a current event argument, a current event trigger term, a current event life cycle, a current event text category, a current event named entity and a current event occurrence place of a current event;
the second acquisition module is used for acquiring a historical event subject term, a historical event abstract, a historical event title, a historical event argument, a historical event trigger word, a historical event life cycle, a historical event text category, a historical event named entity and a historical event occurrence place of each historical event;
the first processing module is used for obtaining a subject term semantic similarity value of the current event and each historical event according to the subject term of the current event and the subject term of each historical event;
the second processing module is used for obtaining an abstract sentence semantic similarity value of the current event and each historical event according to the current event abstract and the historical event abstract of each historical event;
the third processing module is used for obtaining a syntactic similarity value of the current event and each historical event according to the current event title and the historical event title of each historical event;
the fourth processing module is used for obtaining an argument similarity value of the current event and each historical event according to the current event argument and the historical event argument of each historical event;
the fifth processing module is used for obtaining a trigger word similarity value of the current event and each historical event according to the current event trigger word and the historical event trigger word of each historical event;
the sixth processing module is used for obtaining a time window similarity value of the current event and each historical event according to the current event life cycle and the historical event life cycle of each historical event;
the seventh processing module is used for obtaining a text category similarity value of the current event and each historical event according to the current event text category and the historical event text category of each historical event;
the eighth processing module is used for respectively carrying out weighted fusion on the subject word semantic similarity value, the abstract sentence semantic similarity value, the syntax similarity value, the argument similarity value, the trigger word similarity value, the time window similarity value and the text category similarity value of each historical event to obtain a similarity score value of the current event and each historical event;
the judging module is used for respectively judging whether the similarity score value of each historical event is greater than a preset threshold value;
the ninth processing module is used for keeping the similarity score value of the historical event unchanged if the similarity score value of the historical event is smaller than or equal to the preset threshold value, and taking the similarity score value as the final similarity score value of the current event and the historical event;
the tenth processing module is used for performing named entity weighting and region weighting on the similarity score value of the historical event according to the current event named entity, the current event occurrence place, the historical event named entity and the historical event occurrence place to obtain a weighted similarity score value, and taking the weighted similarity score value as the final similarity score value of the current event and the historical event;
and the eleventh processing module is used for sorting the historical events according to the final similarity score value to obtain a similar event sorting result of the current event.
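The flow through the eighth to eleventh processing modules can be illustrated with the Python sketch below. It is a sketch under stated assumptions, not the claimed implementation: the seven similarity values are taken as already computed, the named entity and region weighting is assumed to apply only when the fused score exceeds the preset threshold (as the judging module suggests), and that weighting is modeled as hypothetical multiplicative boost factors because the claim does not state its form. All names, keys and default values are illustrative.

```python
def fuse(similarities, weights):
    """Weighted fusion of the seven per-dimension similarity values."""
    return sum(w * s for w, s in zip(weights, similarities))


def final_score(score, threshold, shares_entity, same_place,
                entity_boost=1.1, region_boost=1.1):
    """Keep the score unchanged at or below the threshold, else apply boosts.

    The boost factors are assumptions standing in for the named entity and
    region weighting, whose exact form is not given in the claim.
    """
    if score <= threshold:
        return score
    if shares_entity:
        score *= entity_boost
    if same_place:
        score *= region_boost
    return score


def rank_historical_events(current, historical_events, weights, threshold):
    """Return (event_id, final_score) pairs sorted from most to least similar."""
    results = []
    for event in historical_events:
        score = fuse(event["similarities"], weights)
        shares_entity = bool(current["entities"] & event["entities"])
        same_place = current["place"] == event["place"]
        results.append(
            (event["id"], final_score(score, threshold, shares_entity, same_place))
        )
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Applying the extra weighting only to candidates that already pass the threshold keeps weakly related events from being promoted merely because they share a named entity or an occurrence place with the current event.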
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the multi-dimensional feature fusion similar events computation method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the multi-dimensional feature fusion similar event calculation method according to any one of claims 1 to 7.
CN202110906530.2A 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment Active CN113722478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110906530.2A CN113722478B (en) 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110906530.2A CN113722478B (en) 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment

Publications (2)

Publication Number Publication Date
CN113722478A (en) 2021-11-30
CN113722478B (en) 2023-09-19

Family

ID=78675183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110906530.2A Active CN113722478B (en) 2021-08-09 2021-08-09 Multi-dimensional feature fusion similar event calculation method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN113722478B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180276351A1 (en) * 2012-10-02 2018-09-27 Banjo, Inc. System and method for event-based vehicle operation
CN110633409A (en) * 2018-06-20 2019-12-31 上海财经大学 Rule and deep learning fused automobile news event extraction method
CN111104794A (en) * 2019-12-25 2020-05-05 同方知网(北京)技术有限公司 Text similarity matching method based on subject words
CN111382575A (en) * 2020-03-19 2020-07-07 电子科技大学 Event extraction method based on joint labeling and entity semantic information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KATIE MCCONKY等: "improving event co-reference by context extraction and dynamic feature weighting", 2012 IEEE INTERNATIONAL MULTI-DISCIPLINARY CONFERENCE ON COGNITIVE METHODS IN SITUATION AWARENESS AND DECISION SUPPORT, pages 978 - 983 *
刘铭;郑子豪;秦兵;刘一仝;李阳;: "基于篇章级事件表示的文本相关度计算方法", 中国科学:信息科学, vol. 50, no. 07, pages 1033 - 1054 *
许旭阳;韩永峰;宋文政;: "事件抽取技术的回顾与展望", 信息工程大学学报, vol. 12, no. 01, pages 113 - 118 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676346A (en) * 2022-03-17 2022-06-28 平安科技(深圳)有限公司 News event processing method and device, computer equipment and storage medium
CN114925757A (en) * 2022-05-09 2022-08-19 中国电信股份有限公司 Multi-source threat intelligence fusion method, device, equipment and storage medium
CN114925757B (en) * 2022-05-09 2023-10-03 中国电信股份有限公司 Multisource threat information fusion method, device, equipment and storage medium
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116167352B (en) * 2023-04-03 2023-07-21 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN117520484A (en) * 2024-01-04 2024-02-06 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics
CN117520484B (en) * 2024-01-04 2024-04-16 中国电子科技集团公司第十五研究所 Similar event retrieval method, system, equipment and medium based on big data semantics

Also Published As

Publication number Publication date
CN113722478B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN110968699B (en) Logic map construction and early warning method and device based on fact recommendation
US10482136B2 (en) Method and apparatus for extracting topic sentences of webpages
CN106156204B (en) Text label extraction method and device
CN105824959B (en) Public opinion monitoring method and system
US20210216576A1 (en) Systems and methods for providing answers to a query
CN111708873A (en) Intelligent question answering method and device, computer equipment and storage medium
US8150822B2 (en) On-line iterative multistage search engine with text categorization and supervised learning
CN105045875B (en) Personalized search and device
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN106202294B (en) Related news computing method and device based on keyword and topic model fusion
CN109388743B (en) Language model determining method and device
CN111090771B (en) Song searching method, device and computer storage medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
WO2021112984A1 (en) Feature and context based search result generation
CN110569419A (en) question-answering system optimization method and device, computer equipment and storage medium
CN109766447A (en) A kind of method and apparatus of determining sensitive information
Dutta et al. PNRank: Unsupervised ranking of person name entities from noisy OCR text
CN106570196B (en) Video program searching method and device
CN112269852B (en) Method, system and storage medium for generating public opinion themes
CN114281942A (en) Question and answer processing method, related equipment and readable storage medium
CN113688633A (en) Outline determination method and device
CN110222156B (en) Method and device for discovering entity, electronic equipment and computer readable medium
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium
CN113656575A (en) Training data generation method and device, electronic equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant