CN111949838A - Data propagation path generation method, device, equipment and storage medium - Google Patents

Info

Publication number
CN111949838A
Authority
CN
China
Prior art keywords
event
data
type
event data
target
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010835470.5A
Other languages
Chinese (zh)
Inventor
张发恩
姜勇越
王建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Wisdom Shanghai Technology Co ltd
AInnovation Shanghai Technology Co Ltd
Original Assignee
Innovation Wisdom Shanghai Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The application provides a method, an apparatus, a device and a storage medium for generating a data propagation path, and relates to the technical field of computers. The method comprises the following steps: acquiring target event data; clustering the target event data to determine the event type of the target event data in an event library; and determining the source and the report time of the target event data. With the method, a user who queries a certain event type can clearly and completely view the propagation paths of the event data under that event type, and can then perform operations such as propagation tracing and propagation count statistics on the event type based on those paths. From the tracing and counting results, the user can judge whether a piece of event data is plagiarized and whether the event is a hot spot.

Description

Data propagation path generation method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating a propagation path of data.
Background
Information data comes from many sources: traditional media such as newspapers, magazines, television and radio; networks such as news portals, social platforms and forums; and even intelligence agencies. With the rapid development of the network, network information data has become the most important source of information data, because it is released promptly, updated quickly and reaches a wide audience.
However, these same characteristics make network information data difficult to monitor. For example, the source of network information data is hard to determine, which allows the data to be plagiarized; the propagation trend of the data is unknown; and the relationships among the entities that propagate multiple pieces of network information data are unclear.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a device and a storage medium for generating a data propagation path, which extract source information from data and generate a data propagation path according to that source information, so that plagiarism of the data can be identified according to the data propagation path.
To this end, a first aspect of the present application discloses a method for generating a propagation path of data, the method comprising the steps of:
acquiring target event data;
clustering the target event data to determine the event type of the target event data in an event library;
determining a source and a reporting time of the target event data;
determining a propagation path of the target event data according to the source and the report time of the target event data;
and updating the propagation path of the event type according to the propagation path of the target event data, wherein the propagation path of the event type at least comprises the propagation path of the target event data.
In the first aspect of the present application, the target event data is acquired and clustered to determine its event type, and the source and the report time of the target event data are determined to obtain its propagation path. The propagation path of the event type is then updated according to the propagation path of the target event data. As a result, when a user queries a certain event type, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. The user can then perform operations such as propagation tracing and propagation count statistics on the event type according to those paths, and can judge from the results whether the data of a certain event is plagiarized and whether the event is a hot spot.
In the first aspect of the present application, as an optional implementation manner, the clustering the target event data to determine an event type of the target event data in an event library includes:
calculating to obtain a central vector of each event type in the event library;
calculating to obtain a vector of the target event data;
comparing the vector of the target event data with the central vector of each event type in the event library, and determining the similarity between the target event data and each event type in the event library;
matching event types with the similarity meeting a first preset threshold from the event library according to the similarity between the target event data and each event type in the event library;
and taking the event type with the similarity meeting a first preset threshold as the event type of the target event data in the event library.
In this optional embodiment, vectorizing the target event data yields its vector, which is compared with the central vector of each event type in the event library to determine the similarity between the target event data and each event type. Each similarity is then compared with the first preset threshold, and the event type whose similarity satisfies the first preset threshold is matched from the event library.
In the first aspect of the present application, as an optional implementation manner, the calculating to obtain the vector of the target event data includes:
calculating to obtain a vector of the target event data according to a paragraph vector algorithm;
and the step of calculating the central vector of each event type in the event library comprises the following steps:
calculating vectors of all data of each event type in the event library;
and carrying out weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library.
In this alternative embodiment, the vector of the target event data can be calculated by a paragraph vector algorithm, and the central vector of an event type can be obtained by weighting and averaging the vectors of all event data in the event type.
In the first aspect of the present application, as an optional implementation manner, before the performing a weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library, the method further includes:
sequencing all data of each event type in the event library according to a TextRank algorithm, and obtaining the weight of each event data of each event type in the event library;
and performing weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library, wherein the weighted average calculation comprises the following steps:
and according to the weight of each event data of each event type in the event library, carrying out weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library.
In the optional embodiment, all data of each event type in the event library can be sorted according to the TextRank algorithm, the weight of each event data of each event type in the event library is obtained, and the weight required in the weighted average calculation process is further determined.
In the first aspect of the present application, as an optional implementation manner, after the taking the event type of which the similarity satisfies the first preset threshold as the event type of the target event data in the event library, the method further includes:
adjusting the first preset threshold value to a second preset threshold value;
when the target event data is added to the event type as one item of event data, so that the event type is updated, calculating the central vector of the updated event type;
calculating a vector of each item of event data of the updated event type;
comparing the vector of each item of event data of the updated event type with the central vector of the updated event type, and determining the similarity between each item of event data and the updated event type;
and removing, according to the similarity between each item of event data of the updated event type and the updated event type, the event data whose similarity is smaller than the second preset threshold from the updated event type.
In this optional embodiment, after the target event data is associated with the event type, all event data in the event type can be screened, so that the similarity among the event data in the event type is improved, and the accuracy of the propagation path of the event type is further improved.
In the first aspect of the present application, as an optional implementation manner, after the removing of the event data whose similarity is smaller than the second preset threshold from the updated event type, the method further includes:
comparing the event data whose similarity is smaller than the second preset threshold in the updated event type with the other event types in the event library to obtain a comparison result;
and re-determining, according to the comparison result, the event type of the event data that does not satisfy the second preset threshold.
In this optional embodiment, when one item of event data is disassociated from an event type, the event data can be compared with the other event types in the event library to re-determine its event type.
In the first aspect of the present application, as an optional implementation manner, the determining the source and the reporting time of the target event data includes the sub-steps of:
extracting the reprint information of the target event data according to a preset entity library, wherein the entity library comprises information of portal websites, information of social platforms, media person information, contributor information and information of users of the social platforms;
determining the source of the target event data as one of a portal website type, a social platform type, a contributor type and a media person type according to the reprint information of the target event data;
and determining the report time of the target event data.
In this optional embodiment, the reprint information of the target event data can be extracted through a preset entity library, where the entity library includes information of a portal website, information of a social platform, media person information, contributor information, and information of a user of the social platform, and meanwhile, report time of the target event data can also be determined.
In the first aspect of the present application, as an optional implementation manner, determining a source of the target event data as one of a portal type, a social platform type, a contributor type, and a media person type according to the reprinting information of the target event data includes:
when the reprint information of the target event data comprises two or more different pieces of media person information or two or more different pieces of contributor information, performing a meta search on those pieces of media person information or contributor information to obtain a search result;
and determining, according to the search result, one of the two or more pieces of media person information or contributor information as the source of the target event data, the source being of the contributor type or the media person type.
When the target event data is reprinted from a media person or a contributor, the media person or contributor may appear under several similar user names in the same event data, because their use of user names is somewhat arbitrary. A meta search over the candidate names makes it possible to determine which of them is the actual source.
In a second aspect of the present application, there is provided a data propagation path processing apparatus, including:
the acquisition module is used for acquiring target event data;
the clustering module is used for clustering the target event data to determine the event type of the target event data in the event library;
the first determining module is used for determining the source and the report time of the target event data;
the second determining module is used for determining a propagation path of the target event data according to the source and the report time of the target event data;
and the updating module is used for updating the propagation path of the event type according to the propagation path of the target event data, wherein the propagation path of the event type at least comprises the propagation path of the target event data.
The apparatus of the second aspect of the present application executes the data propagation path generation method. It acquires the target event data, clusters it to determine the event type, determines the source and the report time of the target event data to obtain its propagation path, and updates the propagation path of the event type accordingly. When a user queries a certain event type, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. The user can then perform operations such as propagation tracing and propagation count statistics on the event type according to those paths, and can judge from the results whether a piece of event data is plagiarized and whether an event is a hot spot.
A third aspect of the present application discloses a data propagation path generation device, the device including:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute the data propagation path generation method of the first aspect of the present application.
The device of the third aspect of the present application executes the data propagation path generation method. It acquires the target event data, clusters it to determine the event type, determines the source and the report time of the target event data to obtain its propagation path, and updates the propagation path of the event type accordingly. When a user queries a certain event type, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. The user can then perform operations such as propagation tracing and propagation count statistics on the event type according to those paths, and can judge from the results whether a piece of event data is plagiarized and whether an event is a hot spot.
A fourth aspect of the present application discloses a computer storage medium, where a computer instruction is stored, and when the computer instruction is called, the computer storage medium is used to execute the method for generating a propagation path of data according to the first aspect of the present application.
By executing the data propagation path generation method, the computer storage medium of the fourth aspect of the present application makes it possible to acquire the target event data, cluster it to determine the event type, determine the source and the report time of the target event data to obtain its propagation path, and update the propagation path of the event type accordingly. When a user queries a certain event type, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. The user can then perform operations such as propagation tracing and propagation count statistics on the event type according to those paths, and can judge from the results whether a piece of event data is plagiarized and whether an event is a hot spot.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a method for generating a propagation path of data according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a data propagation path generation apparatus disclosed in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a data propagation path generation device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for generating a data propagation path according to an embodiment of the present disclosure. As shown in fig. 1, a method for generating a data propagation path according to an embodiment of the present application includes the steps of:
101. acquiring target event data;
102. clustering the target event data to determine the event type of the target event data in an event library;
103. determining the source and the report time of the target event data;
104. determining a propagation path of the target event data according to the source and the report time of the target event data;
105. and updating the propagation path of the event type according to the propagation path of the target event data, wherein the propagation path of the event type at least comprises the propagation path of the target event data.
In the embodiment of the application, by acquiring the target event data, the target event data can be clustered and the event type can be determined, meanwhile, the source and the report time of the target event data can be determined and the propagation path of the target event data can be determined, and then the propagation path of the event type can be updated according to the propagation path of the target event data.
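Steps 101 to 105 can be sketched as a small pipeline. This is only an illustrative sketch: the names `EventData`, `EventLibrary`, `process`, `classify` and `find_source_and_time` are hypothetical, and the patent does not prescribe any particular data structure. Here a propagation path is assumed to be a list of (source, publisher, report time) hops kept in time order.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Tuple

# All names are hypothetical; the patent does not define data structures.
Hop = Tuple[str, str, datetime]  # (source, publisher, report_time)


@dataclass
class EventData:
    publisher: str  # platform or account where the item was collected
    text: str


@dataclass
class EventLibrary:
    # event type -> hops of every item filed under that type
    paths: Dict[str, List[Hop]] = field(default_factory=dict)


def process(
    item: EventData,  # step 101: the acquired target event data
    library: EventLibrary,
    classify: Callable[[EventData], str],
    find_source_and_time: Callable[[EventData], Tuple[str, datetime]],
) -> Hop:
    event_type = classify(item)                        # step 102
    source, report_time = find_source_and_time(item)   # step 103
    hop = (source, item.publisher, report_time)        # step 104
    hops = library.paths.setdefault(event_type, [])    # step 105
    hops.append(hop)
    hops.sort(key=lambda h: h[2])  # keep the type's path in time order
    return hop
```

Passing the clustering and source-extraction logic in as callables keeps the sketch independent of how those steps are actually implemented.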
In this embodiment, the target event data may be electronic news, such as news on Weibo or news on Toutiao.
In the embodiment of the application, the event library is composed of all event data of a plurality of event types. For example, the event library holds event data of the event type "earthquake occurs in place A" and event data of the event type "city in front of line B", where the event data of the event type "earthquake occurs in place A" may include news from Tencent, Toutiao, Weibo or other platforms.
In the embodiment of the present application, the target event data may be crawled from platforms such as Tencent, Toutiao and Weibo by means of crawler technology. Crawler technology is known in the prior art and is not described in detail in the embodiment of the present application.
In this embodiment of the present application, the propagation path of an event type may be formed by splicing the propagation paths of a plurality of pieces of event data. For example, if the propagation paths of two pieces of event data are "Tencent Sports forwarded from Sina Sports" and "Toutiao reprinted from Sina Sports" respectively, the propagation path of the event type consists of both "Tencent Sports forwarded from Sina Sports" and "Toutiao reprinted from Sina Sports".
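The splicing described above, and the tracing and counting operations it enables, might look like the following sketch. It assumes each item's path is reduced to a (reprinter, source) pair; the function names and the single-letter platform names used in the usage note are placeholders, not part of the patent.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (reprinter, source); hypothetical representation


def splice(paths: List[Edge]) -> Dict[str, List[str]]:
    """Merge per-item paths into one map: source -> list of reprinters."""
    graph: Dict[str, List[str]] = {}
    for reprinter, source in paths:
        graph.setdefault(source, []).append(reprinter)
    return graph


def origins(graph: Dict[str, List[str]]) -> List[str]:
    """Sources that never appear as a reprinter (candidate original publishers)."""
    reprinters = {r for rs in graph.values() for r in rs}
    return sorted(s for s in graph if s not in reprinters)


def reprint_counts(graph: Dict[str, List[str]]) -> Counter:
    """Number of direct reprints per source, usable as a rough hot-spot signal."""
    return Counter({s: len(rs) for s, rs in graph.items()})
```

For instance, `splice([("B", "A"), ("C", "A")])` groups both reprinters under source `"A"`, so `origins` reports `"A"` as the only candidate origin.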
In the embodiment of the present application, as an optional implementation manner, step 102: clustering the target event data to determine an event type of the target event data in an event repository, comprising the sub-steps of:
calculating to obtain a central vector of each event type in the event library;
calculating to obtain a vector of the target event data;
comparing the vector of the target event data with the central vector of each event type in the event library, and determining the similarity between the target event data and each event type in the event library;
matching event types with the similarity meeting a first preset threshold from the event library according to the similarity between the target event data and each event type in the event library;
and taking the event type with the similarity meeting a first preset threshold as the event type of the target event data in the event library.
In this optional embodiment, vectorizing the target event data yields its vector, which is compared with the central vector of each event type in the event library to determine the similarity between the target event data and each event type. Each similarity is then compared with the first preset threshold, and the event type whose similarity satisfies the first preset threshold is matched from the event library.
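The matching sub-steps can be illustrated with a short sketch. Several points are assumptions rather than statements from the patent: cosine similarity is used as the similarity measure, "satisfying" the threshold is read as "greater than or equal to", the best-scoring type is chosen when several qualify, and `None` is returned when none qualifies. The function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_event_type(
    vec: Sequence[float],
    centers: Dict[str, Sequence[float]],  # event type -> central vector
    first_threshold: float = 0.85,
) -> Optional[str]:
    """Return the most similar event type whose similarity meets the threshold."""
    best_type, best_sim = None, first_threshold
    for event_type, center in centers.items():
        sim = cosine(vec, center)
        if sim >= best_sim:
            best_type, best_sim = event_type, sim
    return best_type
```

What should happen when no event type meets the threshold (for example, creating a new event type) is left open here, since the text does not specify it.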
In the embodiment of the present application, as an optional implementation manner, the step of calculating the vector of the target event data comprises the sub-step of:
calculating the vector of the target event data according to a paragraph vector algorithm.
In the embodiment of the present application, as an optional implementation manner, the step of calculating the central vector of each event type in the event library comprises the following sub-steps:
calculating vectors of all data of each event type in the event library;
and performing weighted average calculation on the vectors of all data of each event type in the event library to obtain the central vector of each event type in the event library.
The vector of the target event data can thus be calculated by a paragraph vector algorithm, and the central vector of an event type can be obtained by taking the weighted average of the vectors of all event data under that event type.
In the embodiment of the present application, the paragraph vector algorithm is the "doc2vec" algorithm.
In the embodiment of the present application, as an optional implementation manner, before the step of performing weighted average calculation on the vectors of all data of each event type in the event library to obtain the central vector of each event type, the method of the embodiment of the present application further includes the step of:
ranking all data of each event type in the event library according to the TextRank algorithm to obtain the weight of each item of event data of each event type in the event library.
Correspondingly, the step of performing weighted average calculation on the vectors of all data of each event type in the event library to obtain the central vector of each event type comprises the sub-step of:
performing, according to the weight of each item of event data of each event type in the event library, weighted average calculation on the vectors of all data of each event type in the event library to obtain the central vector of each event type in the event library.
In this way, all data of each event type in the event library can be ranked according to the TextRank algorithm to obtain the weight of each item of event data, which supplies the weights required by the weighted average calculation.
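A minimal sketch of the weighting and averaging follows. It assumes a TextRank-style score is computed by running weighted PageRank over a precomputed pairwise similarity matrix of the items in one event type; how that matrix is built, the damping factor and the iteration count are all assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def textrank_weights(
    sim: Sequence[Sequence[float]], damping: float = 0.85, iters: int = 50
) -> List[float]:
    """Weighted PageRank over a symmetric similarity matrix (diagonal ignored)."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total edge weight leaving each node, used to normalise contributions.
    out_sum = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def weighted_center(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of the item vectors, one component at a time."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]
```

Items that are similar to many other items in the same event type receive a larger weight, so they pull the central vector toward themselves.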
In the embodiment of the present application, as an optional implementation manner, after the step of taking the event type whose similarity satisfies the first preset threshold as the event type of the target event data in the event library, the method of the embodiment of the present application further includes the steps of:
adjusting the first preset threshold value to a second preset threshold value;
when the target event data is added to the event type as one item of event data, so that the event type is updated, calculating the central vector of the updated event type;
calculating a vector of each item of event data of the updated event type;
comparing the vector of each item of event data of the updated event type with the central vector of the updated event type, and determining the similarity between each item of event data and the updated event type;
and removing, according to the similarity between each item of event data of the updated event type and the updated event type, the event data whose similarity is smaller than the second preset threshold from the updated event type.
In the embodiment of the application, after the target event data is associated with the event type, all event data in the event type can be screened, so that the similarity among the event data in the event type is improved, and the accuracy of the propagation path of the event type is further improved.
In the embodiment of the present application, the first preset threshold may be 0.85, and the second preset threshold may be 0.89, wherein the maximum value of the second preset threshold does not exceed 0.95.
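The screening pass with the second threshold could be sketched as below. For brevity the updated centre is an unweighted mean rather than the weighted average described elsewhere in the text, cosine similarity is assumed as the similarity measure, and `prune` is a hypothetical name.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune(
    members: Dict[str, Sequence[float]],  # item id -> vector, target item included
    second_threshold: float = 0.89,
) -> Tuple[Dict[str, Sequence[float]], Dict[str, Sequence[float]]]:
    """Recompute the centre of the updated event type, then split members
    into (kept, removed) by comparing each one with that centre."""
    dim = len(next(iter(members.values())))
    # Simplification: plain mean instead of a weighted average.
    center = [sum(v[d] for v in members.values()) / len(members) for d in range(dim)]
    kept: Dict[str, Sequence[float]] = {}
    removed: Dict[str, Sequence[float]] = {}
    for item_id, vec in members.items():
        target = kept if cosine(vec, center) >= second_threshold else removed
        target[item_id] = vec
    return kept, removed
```

The sketch performs a single pass; whether the centre should be recomputed again after removal is not stated in the text.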
In the embodiment of the present application, as an optional implementation manner, after the step of removing the event data whose similarity is smaller than the second preset threshold from the updated event type, the method of the embodiment of the present application further includes the steps of:
comparing the event data whose similarity is smaller than the second preset threshold in the updated event type with the other event types in the event library to obtain a comparison result;
and re-determining, according to the comparison result, the event type of the event data that does not satisfy the second preset threshold.
When one item of event data is disassociated from an event type, the event data can be compared with the other event types in the event library to re-determine its event type.
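Re-determining the event type of a removed item can be sketched as a search over the remaining event types. Which threshold applies at this stage is not stated in the text; the sketch assumes the second preset threshold and cosine similarity, and `reassign` is a hypothetical name.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reassign(
    vec: Sequence[float],
    centers: Dict[str, Sequence[float]],  # event type -> central vector
    old_type: str,
    threshold: float = 0.89,  # assumption: reuse the second preset threshold
) -> Optional[str]:
    """Compare a removed item with every other event type and return the
    best match, or None if no other type is similar enough."""
    sims = {t: cosine(vec, c) for t, c in centers.items() if t != old_type}
    if not sims:
        return None
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```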
In the embodiment of the present application, as an optional implementation manner, step 103: determining the source and reporting time of the target event data, comprising the substeps of:
extracting the reprint information of the target event data according to a preset entity library, wherein the entity library comprises information of portal websites, information of social platforms, media person information, contributor information and information of users of the social platforms;
determining the source of the target event data as one of a portal website type, a social platform type, a contributor type and a media person type according to the reprint information of the target event data;
and determining the report time of the target event data.
In this optional embodiment, the reprint information of the target event data can be extracted through a preset entity library, where the entity library includes information of a portal website, information of a social platform, media person information, contributor information, and information of a user of the social platform, and meanwhile, report time of the target event data can also be determined.
In the embodiment of the present application, as an optional implementation manner, the steps of: determining the source of the target event data as one of a portal type, a social platform type, a contributor type and a media person type according to the reprint information of the target event data, comprising the substeps of:
when the reprinting information of the target event data comprises more than two different pieces of media person information or more than two different contributor information, performing meta search on the more than two different pieces of media person information or the more than two different contributor information to obtain a search result;
and determining the source of the target event data as a contributor type or a media person type from more than two different pieces of media person information or more than two different contributor information according to the search result.
When the target event data has been reprinted by a media person or a contributor, that media person or contributor may appear under several similar user names in the same event data, because media persons and contributors are not consistent in the user names they use.
For example, suppose the target event data contains a field stating that the text is reprinted from a post published on a microblog under one user name, and a field naming the author of the text under a similar but different user name. The entity library then matches both user names at the same time. To determine the final source of the target event data, a meta search is performed on both user names to count how often each occurs, and one of them is selected according to the statistical result.
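The counting step can be sketched as follows. This is only an illustration: the meta search itself is not shown, `search_results` stands for text snippets that such a search is assumed to return, and the function name and sample names are hypothetical. Longer names are matched first so that a short user name is not also counted inside a longer, similar one.

```python
from collections import Counter


def pick_source_name(candidates, search_results):
    """Count occurrences of each candidate user name and return the most frequent."""
    if not candidates:
        raise ValueError("at least one candidate name is required")
    counts = Counter({name: 0 for name in candidates})
    # Match longer names first, then blank them out of the text.
    ordered = sorted(candidates, key=len, reverse=True)
    for text in search_results:
        for name in ordered:
            counts[name] += text.count(name)
            text = text.replace(name, " ")
    best_name = counts.most_common(1)[0][0]
    return best_name, dict(counts)


# Hypothetical user names and search snippets.
name, counts = pick_source_name(
    ["Li Ming", "Li Ming News"],
    [
        "Li Ming News reports on the earthquake",
        "Post by Li Ming News",
        "Li Ming commented",
    ],
)
```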
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data propagation path generating device according to an embodiment of the present disclosure. As shown in fig. 2, the data propagation path generating apparatus according to the embodiment of the present application includes:
an obtaining module 201, configured to obtain target event data;
the clustering module 202 is configured to cluster the target event data to determine an event type of the target event data in an event library;
a first determining module 203, configured to determine a source and a reporting time of the target event data;
a second determining module 204, configured to determine a propagation path of the target event data according to a source of the target event data and the reporting time;
the updating module 205 is configured to update a propagation path of an event type according to a propagation path of the target event data, where the propagation path of the event type at least includes the propagation path of the target event data.
By executing the data propagation path generation method, the device of the embodiment of the application can acquire target event data, cluster it, and determine its event type. It can also determine the source and report time of the target event data, and update the propagation path of the event type according to the propagation path of the target event data. As a result, when a user queries a certain type of event, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. Based on these propagation paths, the user can perform operations such as propagation tracing and propagation count statistics on the event type, and can use the results to judge, for example, whether a piece of event data is plagiarized and whether an event is a hot spot.
In this embodiment, the target event data may be electronic news, such as news on a microblog or on Toutiao (Today's Headlines).
In the embodiment of the application, the event library is composed of all event data of a plurality of event types. For example, the event library holds event data of the event type "earthquake occurs in place A" and event data of the event type "city in front of line B", where the event data of the event type "earthquake occurs in place A" may include news from Tencent Net, Toutiao, a microblog, or other platforms.
In the embodiment of the present application, the target event data may be crawled from platforms such as Tencent Net, Toutiao, and microblogs using crawler technology. Crawler technology is known from the prior art and is not described in detail in the embodiment of the present application.
In this embodiment of the present application, the propagation path of an event type may be formed by splicing together the propagation paths of a plurality of pieces of event data. For example, if the propagation paths of two pieces of event data are "Tencent Sports forwarded by Sina Sports" and "Toutiao transferred to Sina Sports" respectively, then the propagation path of the event type consists of both of these paths.
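One possible way to represent this splicing is sketched below. The data model is an assumption made for illustration: each piece of event data contributes one `Hop` record (publisher, named source, report time), and the event type's path is the list of hops ordered by report time. The class name, field names, and sample values are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hop:
    publisher: str         # where the event data appeared
    reprinted_from: str    # source named in its reprint information
    reported_at: datetime  # report time of the event data


def update_type_path(type_path, new_hop):
    """Return the event type's path with new_hop spliced in, ordered by report time."""
    if new_hop in type_path:
        return list(type_path)  # already part of the path
    return sorted([*type_path, new_hop], key=lambda hop: hop.reported_at)


# Hypothetical hops for two pieces of event data of the same event type.
first = Hop("Portal X", "Portal Y", datetime(2020, 8, 1, 9, 0))
second = Hop("Platform Z", "Portal Y", datetime(2020, 8, 1, 8, 30))
path = update_type_path([first], second)
```

Ordering by report time is a design choice of this sketch; it keeps the spliced path readable as a timeline, which suits tracing and counting.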
In this embodiment of the present application, as an optional implementation manner, the clustering module 202 performs clustering on the target event data to determine an event type of the target event data in the event library in a specific manner:
calculating to obtain a central vector of each event type in the event library;
calculating to obtain a vector of the target event data;
comparing the vector of the target event data with the central vector of each event type in the event library, and determining the similarity between the target event data and each event type in the event library;
matching event types with the similarity meeting a first preset threshold from the event library according to the similarity between the target event data and each event type in the event library;
and taking the event type with the similarity meeting a first preset threshold as the event type of the target event data in the event library.
In this optional embodiment, the vector of the target event data is obtained by vectorizing the target event data, so that the vector of the target event data can be compared with the central vector of each event type in the event library, the similarity between the target event data and each event type in the event library can be determined according to the comparison result, the similarity between the target event data and each event type in the event library can be compared with a first preset threshold, and finally, the event type with the similarity meeting the first preset threshold is matched from the event library.
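The matching described above can be sketched in a few lines of Python. Cosine similarity is an assumption, since the embodiment does not name the similarity measure, and choosing the highest-scoring event type when several meet the threshold is likewise an assumption; `match_event_type` and the sample vectors are hypothetical.

```python
import math

FIRST_THRESHOLD = 0.85  # example value given in this embodiment


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_event_type(target_vec, centers, threshold=FIRST_THRESHOLD):
    """Return the event type whose central vector is most similar, or None."""
    best_type, best_sim = None, -1.0
    for event_type, center in centers.items():
        sim = cosine(target_vec, center)
        if sim > best_sim:
            best_type, best_sim = event_type, sim
    return best_type if best_sim >= threshold else None


# Hypothetical central vectors of two event types.
centers = {"earthquake": [1.0, 0.0], "football match": [0.0, 1.0]}
matched = match_event_type([0.95, 0.2], centers)
unmatched = match_event_type([0.7, 0.7], centers)
```

Here the first vector lies close to the "earthquake" centre, while the second is equally far from both centres and meets the threshold for neither.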
In this embodiment of the present application, as an optional implementation manner, a specific manner in which the clustering module 202 performs calculation to obtain a vector of the target event data is as follows:
and calculating to obtain a vector of the target event data according to a paragraph vector algorithm.
In this embodiment of the present application, as an optional implementation manner, a specific manner for the clustering module 202 to calculate the center vector of each event type in the event library is as follows:
calculating vectors of all data of each event type in the event library;
and carrying out weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library.
The vector of the target event data can be calculated by a paragraph vector algorithm, and the central vector of the event type can be obtained by weighting and averaging the vectors of all the event data under the same event type.
In the embodiment of the present application, the paragraph vector algorithm is the "doc2vec" algorithm.
In this application embodiment, as an optional implementation manner, the apparatus in this application embodiment further includes a sorting module, where:
and the sorting module is used for sorting all the data of each event type in the event library according to the TextRank algorithm and obtaining the weight of each event data of each event type in the event library.
In this embodiment of the present application, correspondingly, the specific way for the clustering module 202 to calculate the center vector of each event type in the event library is as follows:
and carrying out weighted average calculation on vectors of all data of each event type in the event library according to the weight of each event data of each event type in the event library to obtain a central vector of each event type in the event library.
Therefore, all data of each event type in the event library can be sequenced according to the TextRank algorithm, the weight of each event data of each event type in the event library is obtained, and the weight required in the weighted average calculation process is further determined.
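The weighted average itself can be written as below. This sketch takes the weights as given (for example, scores produced by a TextRank ranking, which is not implemented here); the function name and the sample numbers are hypothetical.

```python
def weighted_center(vectors, weights):
    """Weighted average of equal-length vectors, one weight per vector."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Two hypothetical event-data vectors; the first carries three times the weight.
center = weighted_center([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

Dividing by the sum of the weights means the weights do not need to be normalised beforehand.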
In this application embodiment, as an optional implementation manner, the apparatus in this application embodiment further includes a modification module, where the modification module is configured to:
adjusting the first preset threshold value to a second preset threshold value;
when the target event data is used as one item of event data in the event types and the event types are updated, calculating a central vector of the updated event types;
calculating a vector of each item of event data of the updated event type;
comparing the vector of each item of event data of the updated event type with the central vector of the updated event type, and determining the similarity of each item of event data of the updated event type and the updated event type;
and removing, according to the similarity between each item of event data of the updated event type and the updated event type, the event data whose similarity is smaller than the preset second threshold value from the updated event type.
In the embodiment of the application, after the target event data is associated with the event type, all event data in the event type can be screened, which improves the similarity among the event data in the event type and, in turn, the accuracy of the propagation path of the event type.
In the embodiment of the present application, the first preset threshold may be 0.85, and the second preset threshold may be 0.89, wherein the maximum value of the second preset threshold does not exceed 0.95.
In this embodiment of the present application, as an optional implementation manner, the clustering module 202 is further configured to:
comparing the event data in the updated event type whose similarity is smaller than the preset second threshold value with the other event types in the event library to obtain a comparison result;
and re-determining the event type of the data which does not meet the preset second threshold value according to the comparison result.
When a piece of event data is disassociated from its event type, the event data can be compared with the other event types in the event library to re-determine its event type.
In the embodiment of the present application, as an optional implementation manner, the specific manner for the first determining module 203 to determine the source and the reporting time of the target event data is as follows:
extracting the reprint information of the target event data according to a preset entity library, wherein the entity library comprises information of a portal website, information of a social platform, media person information, contributor information and information of users of the social platform;
determining the source of the target event data as one of a portal website type, a social platform type, a contributor type and a media person type according to the reprinting information of the target event data;
the reporting time of the target event data is determined.
In this optional embodiment, the reprint information of the target event data can be extracted through a preset entity library, where the entity library includes information of a portal website, information of a social platform, media person information, contributor information, and information of a user of the social platform, and meanwhile, report time of the target event data can also be determined.
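As an illustration of how an entity library might be used to extract reprint information, the sketch below models the library as a plain mapping from entity name to source type and scans the text for known names. The mapping, the entity names, and the function name are hypothetical; the embodiment does not describe the library's internal structure or the matching technique.

```python
# Hypothetical entity library: entity name -> source type.
ENTITY_LIBRARY = {
    "Example Portal": "portal website",
    "Example Microblog": "social platform",
    "Reporter A": "media person",
    "Writer B": "contributor",
}


def extract_reprint_entities(text, library=ENTITY_LIBRARY):
    """Return (name, source type) for every library entity mentioned in the text."""
    return [(name, kind) for name, kind in library.items() if name in text]


found = extract_reprint_entities("Reprinted from Example Portal; author: Writer B")
```

When several entities of the media person or contributor type are found, a further disambiguation step is needed before the source can be fixed.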
In this embodiment of the present application, as an optional implementation manner, the specific manner in which the first determining module 203 determines the source of the target event data as one of a portal type, a social platform type, a contributor type, and a media person type according to the reprinting information of the target event data is as follows:
when the reprinting information of the target event data comprises more than two different pieces of media person information or more than two different contributor information, performing meta search on the more than two different pieces of media person information or the more than two different contributor information to obtain a search result;
and determining the source of the target event data as a contributor type or a media person type from more than two different pieces of media person information or more than two different contributor information according to the search result.
When the target event data has been reprinted by a media person or a contributor, that media person or contributor may appear under several similar user names in the same event data, because media persons and contributors are not consistent in the user names they use.
For example, suppose the target event data contains a field stating that the text is reprinted from a post published on a microblog under one user name, and a field naming the author of the text under a similar but different user name. The entity library then matches both user names at the same time. To determine the final source of the target event data, a meta search is performed on both user names to count how often each occurs, and one of them is selected according to the statistical result.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a data propagation path generating device according to an embodiment of the present application. As shown in fig. 3, the data propagation path generating device according to the embodiment of the present application includes:
a memory 301 storing executable program code;
a processor 302 coupled to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the data propagation path generation method according to the first embodiment of the present application.
By executing the data propagation path generation method, the device of the embodiment of the application can acquire target event data, cluster it, and determine its event type. It can also determine the source and report time of the target event data, and update the propagation path of the event type according to the propagation path of the target event data. As a result, when a user queries a certain type of event, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. Based on these propagation paths, the user can perform operations such as propagation tracing and propagation count statistics on the event type, and can use the results to judge, for example, whether a piece of event data is plagiarized and whether an event is a hot spot.
Example four
The embodiment of the application discloses a computer storage medium, wherein a computer instruction is stored in the computer storage medium, and when the computer instruction is called, the computer instruction is used for executing the data propagation path generation method in the first embodiment of the application.
By executing the data propagation path generation method, the storage medium of the embodiment of the application can acquire target event data, cluster it, and determine its event type. It can also determine the source and report time of the target event data, determine the propagation path of the target event data, and update the propagation path of the event type according to the propagation path of the target event data. As a result, when a user queries a certain type of event, the propagation paths of the multiple pieces of event data under that event type can be viewed clearly and completely. Based on these propagation paths, the user can perform operations such as propagation tracing and propagation count statistics on the event type, and can use the results to judge, for example, whether a piece of event data is plagiarized and whether an event is a hot spot.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of one logic function, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above embodiments are merely examples of the present application and are not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for generating a propagation path of data, the method comprising:
acquiring target event data;
clustering the target event data to determine the event type of the target event data in an event library;
determining a source and a reporting time of the target event data;
determining a propagation path of the target event data according to the source and the report time of the target event data;
and updating the propagation path of the event type according to the propagation path of the target event data, wherein the propagation path of the event type at least comprises the propagation path of the target event data.
2. The method of claim 1, wherein said clustering the target event data to determine an event type of the target event data in an event repository comprises:
calculating to obtain a central vector of each event type in the event library;
calculating to obtain a vector of the target event data;
comparing the vector of the target event data with the central vector of each event type in the event library, and determining the similarity between the target event data and each event type in the event library;
matching event types with the similarity meeting a first preset threshold from the event library according to the similarity between the target event data and each event type in the event library;
and taking the event type with the similarity meeting a first preset threshold as the event type of the target event data in the event library.
3. The method of claim 2, wherein said computing a vector of said target event data comprises:
calculating to obtain a vector of the target event data according to a paragraph vector algorithm;
and the step of calculating the central vector of each event type in the event library comprises the following steps:
calculating vectors of all data of each event type in the event library;
and carrying out weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library.
4. The method of claim 3, wherein prior to said computing a weighted average of vectors for all data for each event type in the event repository to obtain a center vector for each event type in the event repository, the method further comprises:
sequencing all data of each event type in the event library according to a TextRank algorithm, and obtaining the weight of each event data of each event type in the event library;
and performing weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library, wherein the weighted average calculation comprises the following steps:
and according to the weight of each event data of each event type in the event library, carrying out weighted average calculation on vectors of all data of each event type in the event library to obtain a central vector of each event type in the event library.
5. The method of claim 2, wherein after the event type for which the similarity satisfies a first preset threshold is taken as the event type of the target event data in the event library, the method further comprises:
adjusting the first preset threshold value to a second preset threshold value;
when the target event data is used as one item of event data in the event types and the event types are updated, calculating the updated central vector of the event types;
calculating a vector of each item of event data of the updated event type;
comparing the updated vector of each item of event data of the event type with the updated central vector of the event type, and determining the similarity of each item of event data of the event type and the updated event type;
and removing, according to the similarity between each item of event data of the updated event type and the updated event type, the event data whose similarity is smaller than the preset second threshold value from the updated event type.
6. The method according to claim 5, wherein after the removing event data of the updated event type whose similarity is smaller than the preset second threshold, the method further comprises:
comparing the event data in the updated event type whose similarity is smaller than the preset second threshold value with the other event types in the event library to obtain a comparison result;
and re-determining the event type of the data which does not meet the preset second threshold value according to the comparison result.
7. The method of claim 1, wherein said determining a source and a reporting time for said target event data comprises:
extracting the reprinting information of the target event data according to a preset entity library, wherein the entity library comprises information of a portal website, information of a social platform, media person information, contributor information and information of a user of the social platform;
determining the source of the target event data as one of a portal website type, a social platform type, a contributor type and a media person type according to the reprinting information of the target event data;
determining a reporting time for the target event data;
and determining the source of the target event data as one of a portal website type, a social platform type, a contributor type and a media person type according to the reprint information of the target event data, wherein the determining comprises the following steps:
when the reprinting information of the target event data comprises more than two different pieces of media person information or more than two different pieces of contributor information, performing meta search on the more than two different pieces of media person information or the more than two different pieces of contributor information to obtain a search result;
determining the source of the target event data as a contributor type or a media person type from the two or more different media person information or the two or more different contributor information according to the search result.
8. A propagation path processing apparatus for data, characterized in that the apparatus comprises:
the acquisition module is used for acquiring target event data;
the clustering module is used for clustering the target event data to determine the event type of the target event data in an event library;
the first determining module is used for determining the source and the report time of the target event data;
the second determination module is used for determining a propagation path of the target event data according to the source and the report time of the target event data;
and the updating module is used for updating the propagation path of the event type according to the propagation path of the target event data, wherein the propagation path of the event type at least comprises the propagation path of the target event data.
9. A propagation path processing apparatus of data, characterized in that the apparatus comprises:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the propagation path generation method of data according to any one of claims 1 to 7.
10. A storage medium storing computer instructions for performing a propagation path generation method of data according to any one of claims 1 to 7 when the computer instructions are called.
CN202010835470.5A 2020-08-19 2020-08-19 Data propagation path generation method, device, equipment and storage medium Pending CN111949838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010835470.5A CN111949838A (en) 2020-08-19 2020-08-19 Data propagation path generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111949838A true CN111949838A (en) 2020-11-17

Family

ID=73342830

Country Status (1)

Country Link
CN (1) CN111949838A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221010A (en) * 2021-05-26 2021-08-06 支付宝(杭州)信息技术有限公司 Event propagation state display method and device and electronic equipment
CN113612749A (en) * 2021-07-27 2021-11-05 华中科技大学 Intrusion behavior-oriented tracing data clustering method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156232A (en) * 2015-04-24 2016-11-23 阿里巴巴集团控股有限公司 A kind of monitoring method and apparatus of spreading network information
CN106557551A (en) * 2016-10-27 2017-04-05 西南石油大学 Scale forecast method and system is propagated based on the microblogging that microblogging affair clustering is modeled
CN111143655A (en) * 2019-12-30 2020-05-12 创新奇智(青岛)科技有限公司 Method for calculating news popularity
CN111324789A (en) * 2020-02-13 2020-06-23 创新奇智(上海)科技有限公司 Method for calculating network information data heat

Similar Documents

Publication Publication Date Title
CN107436875B (en) Text classification method and device
US7269544B2 (en) System and method for identifying special word usage in a document
JP2017123168A (en) Method for making entity mention in short text associated with entity in semantic knowledge base, and device
Zhang et al. Efficient partial-duplicate detection based on sequence matching
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
Molino et al. Cota: Improving the speed and accuracy of customer support through ranking and deep networks
CN103313248A (en) Method and device for identifying junk information
Cordobés et al. Graph-based techniques for topic classification of tweets in Spanish
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
Alassi et al. Effectiveness of template detection on noise reduction and websites summarization
CN110069769A (en) Application label generation method, device and storage device
CN111949838A (en) Data propagation path generation method, device, equipment and storage medium
CN108509545B (en) Method and system for processing comments of article
US9558462B2 (en) Identifying and amalgamating conditional actions in business processes
CN111586695A (en) Short message identification method and related equipment
JP5079642B2 (en) History processing apparatus, history processing method, and history processing program
CN112287111B (en) Text processing method and related device
CN113934848A (en) Data classification method and device and electronic equipment
JP2016212879A (en) Information processing method and information processing apparatus
Wei et al. Online education recommendation model based on user behavior data analysis
CN111046169B (en) Method, device, equipment and storage medium for extracting subject term
US9824140B2 (en) Method of creating classification pattern, apparatus, and recording medium
CN112487181A (en) Keyword determination method and related equipment
CN115495587A (en) Alarm analysis method and device based on knowledge graph
CN112926297B (en) Method, apparatus, device and storage medium for processing information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination