CN116610758A - Information tracing method, system and storage medium - Google Patents

Publication number
CN116610758A
Authority
CN
China
Prior art keywords
target
publisher
information
entity
tracing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310478299.0A
Other languages
Chinese (zh)
Inventor
吕东
李艺涛
王媛媛
段东圣
段运强
井雅琪
王子涵
任博雅
佟玲玲
李鹏霄
王立强
艾政阳
侯炜
王红兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202310478299.0A
Publication of CN116610758A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention relate to an information tracing method, an information tracing system and a storage medium. The method comprises the following steps: obtaining the same kind of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text; creating a target map of the target text; performing entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text; and performing traceability analysis on the publisher set to determine the target publisher of the target subject information. By tracking and analyzing subject information on a plurality of platforms across platforms and across spaces, the source, evolution and propagation path of the subject information are determined and users are provided with comprehensive information reference and decision support, so that multi-platform information tracing is achieved.

Description

Information tracing method, system and storage medium
Technical Field
The embodiment of the invention relates to the technical field of information identification, in particular to an information tracing method, an information tracing system and a storage medium.
Background
With the continuous development of internet technology, more and more people begin to publish content and share information, such as social media, blogs, forums and the like, using different platforms. Subject information (e.g., discussion of a topic, event, or product) on these platforms is often widely discussed and forwarded, but due to differences in platforms and propagation of information, it is difficult to trace the source, evolution, and propagation path of such subject information.
Multi-platform subject information tracing technology has made some progress, but challenges and shortcomings remain. Platform differences: information formats, data structures and user behaviors differ greatly between platforms, which makes cross-platform tracing difficult. Data quality: information on the Internet contains a large amount of noise, false information and misinterpretation, which affects the accuracy and credibility of multi-platform subject information tracing. Data scale: the volume of information on the Internet is enormous, and processing and analyzing it effectively is a further difficulty in multi-platform subject information tracing.
Disclosure of Invention
In view of this, in order to solve the technical problem of tracing the multi-platform theme information, the embodiments of the present invention provide an information tracing method, system and storage medium.
In a first aspect, an embodiment of the present invention provides an information tracing method, including:
obtaining the same kind of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text;
creating a target map of the target text;
performing entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text;
and performing traceability analysis on the publisher set to determine the target publisher of the target subject information.
In one possible implementation manner, the preprocessing the target subject information to obtain target text includes:
performing cosine similarity data extraction on the target subject information to obtain associated data without information loss;
and carrying out data sorting processing on the associated data according to the time parameters to obtain a target text.
In one possible implementation manner, the creating the target map of the target text includes:
performing entity extraction on the target text to obtain an entity extraction result of the target text;
extracting the attribute of the target text to obtain an attribute extraction result of the target text;
based on the entity extraction result, extracting entity relation from the target text to obtain a relation extraction result of the target text;
and creating a target map of the target text based on the entity extraction result, the attribute extraction result and the relation extraction result.
In one possible implementation manner, the performing entity link processing on the target map to obtain a publisher set of the target text includes:
matching the entity extraction result in the target map with a standard knowledge base to obtain a candidate entity set in the standard knowledge base;
respectively carrying out score evaluation on all entities in the candidate entity set to obtain a candidate entity score set;
taking the entity corresponding to the highest score in the candidate entity score set as a target entity;
and taking all target entities corresponding to the target text as a publisher set.
In one possible implementation manner, the performing a traceability analysis on the publisher set to determine a target publisher of the target topic information includes:
obtaining the comment quantity and the forwarding quantity corresponding to each publisher in the publisher set, and obtaining the propagation path quantity in the target text;
performing weighted average processing on the comment quantity, the forwarding quantity and the propagation path quantity to obtain a tracing score corresponding to each publisher;
and comparing the traceability scores corresponding to each publisher in the publisher set to determine the target publisher of the target subject information.
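The weighted-average scoring step can be illustrated with a minimal Python sketch. The weights, the `PublisherStats` structure and the sample counts are hypothetical; the source does not specify how the three quantities are weighted.

```python
from dataclasses import dataclass


@dataclass
class PublisherStats:
    """Per-publisher counts; the field names are illustrative only."""
    name: str
    comments: int
    forwards: int
    paths: int


def tracing_score(p: PublisherStats, weights=(0.3, 0.3, 0.4)) -> float:
    """Weighted average of comment, forwarding and propagation-path counts.

    The default weights are an assumption, not values from the source.
    """
    w_c, w_f, w_p = weights
    total = w_c + w_f + w_p
    return (w_c * p.comments + w_f * p.forwards + w_p * p.paths) / total


# Hypothetical sample data
publishers = [
    PublisherStats("A", comments=120, forwards=300, paths=12),
    PublisherStats("B", comments=80, forwards=150, paths=30),
]
scores = {p.name: tracing_score(p) for p in publishers}
```

In practice the three counts live on different scales, so a real system would probably normalize them before weighting; that choice is left open here.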
In one possible implementation manner, the comparing the traceability scores corresponding to each publisher in the publisher set to determine the target publisher of the target topic information includes:
acquiring a first tracing score corresponding to a first publisher in the publisher set and a second tracing score corresponding to a second publisher;
obtaining the difference between the first tracing score and the second tracing score as the tracing difference;
judging whether the tracing difference is larger than a preset first threshold value to determine the target publisher, wherein the first threshold value represents the degree of tracing similarity of the two publishers;
and when the tracing difference is larger than the first threshold value, taking the publisher with the larger tracing score as the target publisher.
In one possible implementation manner, the determining whether the tracing difference is greater than a preset first threshold value, and determining the target publisher further includes:
when the tracing difference is smaller than or equal to the first threshold value, acquiring the first propagation path number of the first publisher and the second propagation path number of the second publisher;
comparing the first propagation path number with the second propagation path number to determine a target publisher;
when the first propagation path number is greater than or equal to the second propagation path number, determining that the target publisher is the first publisher;
and when the first propagation path number is smaller than the second propagation path number, determining that the target publisher is the second publisher.
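The pairwise decision rule described above (compare the tracing difference with the first threshold, then fall back to the propagation path numbers) can be sketched as follows. The function name, the tuple layout and the use of the absolute difference are assumptions made for illustration.

```python
def pick_target(first, second, threshold):
    """Choose the target publisher from two candidates.

    Each candidate is a (name, tracing_score, path_count) tuple.
    Taking the absolute difference of the scores is an assumption;
    the source only speaks of "a difference value".
    """
    name1, score1, paths1 = first
    name2, score2, paths2 = second
    diff = abs(score1 - score2)
    if diff > threshold:
        # Scores are clearly different: the larger tracing score wins.
        return name1 if score1 > score2 else name2
    # Scores are similar: fall back to the number of propagation paths,
    # with ties going to the first publisher.
    return name1 if paths1 >= paths2 else name2
```

For example, `pick_target(("A", 130.8, 12), ("B", 81.0, 30), 10.0)` selects "A" on score alone, while two publishers whose scores differ by less than the threshold are separated by their path counts.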
In one possible embodiment, the method further comprises:
and displaying entity information of the target map corresponding to the target publisher and the propagation path information through a visualization device.
In a second aspect, an embodiment of the present invention provides an information tracing system applied to the information tracing method described in the first aspect, including:
the preprocessing module is used for acquiring the same type of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text;
the creating map module is used for creating a target map of the target text;
the entity link module is used for carrying out entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text;
and the tracing module is used for carrying out tracing analysis on the publisher set and determining the target publisher of the target subject information.
In a third aspect, an embodiment of the present invention provides a storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the information tracing method in any one of the first aspects.
According to the information tracing scheme provided by the embodiment of the invention, the same type of target subject information is obtained from a plurality of platforms, and the target subject information is preprocessed to obtain a target text; creating a target map of the target text; performing entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text; and performing traceability analysis on the publisher set to determine the target publisher of the target subject information. The method and the system have the advantages that the source, the evolution and the propagation paths of the theme information are determined by carrying out cross-platform and cross-space tracking and analysis on the theme information on a plurality of platforms, and comprehensive information reference and decision support are provided for users.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flow chart of an information tracing method according to an embodiment of the present invention;
Fig. 2 is a flow chart of another information tracing method according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of determining a target publisher according to an embodiment of the present invention;
Fig. 4 is a flow chart of another method for determining a target publisher according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an information tracing system according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "comprising" and "having" in embodiments of the present invention are used in an open-ended, inclusive sense and mean that there may be additional elements/components/etc. in addition to the listed elements/components/etc.; the terms "first" and "second" and the like are used merely as labels, and are not intended to limit the number of their objects. Furthermore, the various elements and regions in the figures are only schematically illustrated and thus the present invention is not limited to the dimensions or distances illustrated in the figures.
For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the embodiments of the invention.
Fig. 1 is a schematic flow chart of an information tracing method according to an embodiment of the present invention. According to the diagram provided in fig. 1, the information tracing method specifically includes:
S101, obtaining the same kind of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text.
The invention is applied in the technical field of information processing and relates in particular to information tracing. Subject information of the same type is acquired from a plurality of platforms to obtain a target text; a target map of the target text is created, representing the relations between entities and attributes in the target text; a publisher set is obtained through entity link processing; and the publishers in the set are compared and traced to finally obtain the target publisher, thereby tracing subject information across a plurality of platforms.
The plurality of platforms referred to herein may be understood as different information publishing platforms, such as social media, blogs and forums. The target subject information may be understood as text data that carries a particular theme. Preprocessing may be understood as processing such as de-duplication and information extraction applied to the target subject information.
Further, target subject information of the same sentiment category or the same subject, stored on a plurality of platforms, is obtained through search software or a database. The obtained target subject information is then de-duplicated, stripped of pictures, de-noised and subjected to information extraction to obtain a target text related to the subject information.
S102, creating a target map of the target text.
The target map may be understood as a knowledge graph representing the relations between entities, between events, or between events and entities in the data.
Further, after the target text is obtained, relations are extracted from the sorted entity data and event attributes contained in the target text, establishing the association relations between entities and events in the target text. These relations are expressed in the form of a map, giving the target map corresponding to the target text in preparation for analyzing the target publisher in the next step.
S103, performing entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text.
Entity link processing may be understood as the operation of linking an entity object extracted from the target text to the corresponding correct entity object in a knowledge base. For a given entity mention, a set of candidate entity objects meeting the requirements is selected from the knowledge base, and the mention is then linked to the correct entity object through similarity calculation. The publisher set may be understood as the set obtained by linking the entities in the target map: the correct entity obtained by linking each entity in the target map is taken as publisher information.
Further, entity linking is carried out for each entity in the target map, and the knowledge-base entity with the highest similarity to each entity is taken as its publisher information. All publisher information obtained in this way forms the publisher set, which also contains the text propagation path information corresponding to each entity, in preparation for determining the target publisher in the next step.
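The candidate-selection and scoring steps of entity linking can be sketched in Python as below. The knowledge base contents, the entity identifiers, the `min_ratio` cut-off and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration; the source does not name a specific scoring function.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge base: entity id -> canonical publisher name
KNOWLEDGE_BASE = {
    "Q1": "Daily Tech News",
    "Q2": "Daily Tech Review",
    "Q3": "City Sports Weekly",
}


def candidates(mention, kb, min_ratio=0.5):
    """Return (entity_id, score) pairs whose name is similar enough to the mention."""
    scored = []
    for entity_id, name in kb.items():
        ratio = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if ratio >= min_ratio:
            scored.append((entity_id, ratio))
    return scored


def link(mention, kb):
    """Link a mention to the highest-scoring candidate, or None if there is none."""
    scored = candidates(mention, kb)
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])[0]


# The publisher set is the set of linked target entities for all mentions.
mentions = ["daily tech news", "City Sports Weekly"]
publisher_set = {link(m, KNOWLEDGE_BASE) for m in mentions}
```

A production system would typically score candidates with context features as well as name similarity; the string ratio here only stands in for that score.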
S104, performing traceability analysis on the publisher set to determine target publishers of the target topic information.
Traceability analysis may be understood as the process of judging the release source of information from data such as the release time and forwarding quantity of the text corresponding to each publisher. The target publisher may be understood as the source publisher of the forwarded text among all publishers in the publisher set contained in the target map.
Further, according to the obtained publisher sets corresponding to all the entities in the target map, whether each publisher is a source publisher is respectively judged, the text publication source is analyzed according to the publication time and the forwarding amount information of the text information corresponding to each publisher, and the original publisher is found to serve as the target publisher, so that the aim of tracing is achieved.
According to the information tracing scheme provided by the embodiment of the invention, the same type of target subject information is obtained from a plurality of platforms, and the target subject information is preprocessed to obtain a target text; creating a target map of the target text; carrying out entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text; and performing traceability analysis on the publisher set to determine the target publisher of the target subject information. The method and the system have the advantages that the source, the evolution and the propagation paths of the subject information are determined by carrying out cross-platform and cross-space tracking and analysis on the subject information on a plurality of platforms, and comprehensive information reference and decision support are provided for users.
Fig. 2 is a flow chart of another information tracing method according to an embodiment of the present invention. Fig. 2 is presented on the basis of the above embodiment. Referring to the diagram provided in fig. 2, the information tracing method specifically further includes:
S201, obtaining the same kind of target subject information from a plurality of platforms.
S202, performing cosine similarity data extraction on the target subject information to obtain associated data without information loss.
S203, carrying out data sorting processing on the associated data according to the time parameters to obtain a target text.
The plurality of platforms referred to herein may be understood as different information publishing platforms, such as social media, blogs and forums. The target subject information may be understood as text data that carries a particular theme. Preprocessing may be understood as processing such as de-duplication and information extraction applied to the target subject information. Data extraction may be understood as a feature extraction process performed by analyzing the content of the target subject information. The associated data may be understood as the data obtained after data preprocessing and missing value processing, such as the publisher information, release time information, release address information and released subject content of the target subject information.
Further, target subject information of the same sentiment category or the same subject, stored on a plurality of platforms, is obtained through search software or a database. Cosine similarity is calculated on the obtained target subject information to perform a preliminary data extraction, so that associated data without information loss is obtained. The extracted data are then sorted by platform release time and the subject is extracted to obtain the target text.
In one possible example scenario, target subject information is collected from multiple platforms: a crawler is used to collect, in a centralized manner, the content published on the major Internet platforms for a certain type of subject.
The collected content is shown in Table 1. In addition, part of the target subject information content is labeled manually; the labeled content is usually denoted paper_num. Labeling makes the content of the target subject information explicit and supports the subsequent filtering of the collected content.
TABLE 1
After the acquisition format of the target subject information is set, text similarity is calculated: the cosine similarity between the content of the collected target subject information and the content of the target subject information labeled in advance. A threshold parameter A is also set, with a value range of 0 to 1 and a default value of 0.8. The text of the collected target subject information is then filtered, and text content with low relevance is filtered out.
The collected target subject information content is denoted get_num, and i denotes one item of it. j denotes one item of subject information content in paper_num, and cos(i, j) is obtained by cosine similarity calculation according to formula 1:

$$\cos(i, j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{v}_j \rVert} \tag{1}$$

where $\mathbf{v}_i$ and $\mathbf{v}_j$ denote the text vectors of items i and j.

The cosine similarity between i and each labeled item in paper_num is calculated, and the calculated similarities are then summed to give $S_i$, as in formula 2:

$$S_i = \sum_{j \in \text{paper\_num}} \cos(i, j) \tag{2}$$

If $S_i$ is larger than the threshold parameter A, the target subject information is retained, thereby achieving information filtering.
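The filtering step can be illustrated with a short Python sketch. It assumes whitespace-tokenized term-count vectors as the text representation (the source does not say how texts are vectorized) and sums the similarities over the labeled items before comparing with threshold A, following the description above. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two texts using term-count vectors (an assumption)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_collected(collected, labeled, threshold_a=0.8):
    """Keep collected items whose summed similarity to the labeled items exceeds A."""
    kept = []
    for item in collected:
        total = sum(cosine(item, ref) for ref in labeled)
        if total > threshold_a:
            kept.append(item)
    return kept


# Hypothetical sample data
labeled = ["new phone battery recall", "phone battery recall notice"]
collected = ["phone battery recall announced", "weekend football results"]
kept = filter_collected(collected, labeled)
```

With these samples only the battery-recall item survives, since the football item shares no terms with the labeled content.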
The target subject information of each category is then counted. The proportion of missing values in a category is denoted missing_value and takes values between 0 and 1; the threshold parameter B likewise has a value range of 0 to 1. When missing_value is greater than 0.8, the category is deleted. Otherwise, the missing values are filled in, for example with the column mean, the column mode, or a value from the [minimum, maximum] interval of the column.
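A minimal Python sketch of this missing-value handling is given below. It treats one category as a list of values with `None` marking a missing entry, which is an assumed representation; the function name and the `strategy` parameter are hypothetical, and only the mean and mode fill options are shown.

```python
from statistics import mean, mode


def clean_column(values, max_missing=0.8, strategy="mean"):
    """Return a filled copy of the column, or None if it should be deleted.

    `None` entries are treated as missing. The column is dropped when the
    share of missing entries exceeds `max_missing` (0.8 in the description).
    """
    if not values:
        return None
    missing_ratio = sum(v is None for v in values) / len(values)
    if missing_ratio > max_missing:
        return None
    present = [v for v in values if v is not None]
    # Fill with the column mean (numeric data) or the column mode.
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]
```

For instance, `clean_column([1, None, 3])` fills the gap with the mean of the present values, while a column that is 90% missing is dropped.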
The target subject information is sorted by release time; when release times are the same, the content is sorted by acquisition time. This completes the preprocessing of the target subject information and yields the target text.
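The two-level ordering (release time first, acquisition time as tie-breaker) maps directly onto a tuple sort key. The record layout and field names below are hypothetical.

```python
from datetime import datetime

# Hypothetical records: "published" is the release time, "collected" the acquisition time
records = [
    {"id": "c", "published": datetime(2023, 4, 2, 9, 0), "collected": datetime(2023, 4, 3, 8, 0)},
    {"id": "a", "published": datetime(2023, 4, 1, 9, 0), "collected": datetime(2023, 4, 3, 8, 5)},
    {"id": "b", "published": datetime(2023, 4, 1, 9, 0), "collected": datetime(2023, 4, 3, 8, 1)},
]

# Sort by release time; records with equal release time are ordered by acquisition time.
target_text = sorted(records, key=lambda r: (r["published"], r["collected"]))
order = [r["id"] for r in target_text]
```

Records "a" and "b" share a release time, so "b" comes first because it was collected earlier.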
S204, entity extraction is carried out on the target text, and an entity extraction result of the target text is obtained.
S205, extracting the attributes of the target text to obtain an attribute extraction result of the target text.
S206, based on the entity extraction result, extracting entity relation from the target text to obtain a relation extraction result of the target text.
S207, creating a target map of the target text based on the entity extraction result, the attribute extraction result and the relation extraction result.
Entity extraction is understood herein to mean the process of extracting information from the content of the target text according to a set entity pattern. The attribute extraction is understood as a process of extracting information from the content of the target text according to a set attribute paradigm. The relation extraction is understood as a process of extracting information from the content of the target text according to a set association relation paradigm.
Further, entity extraction, attribute extraction and relationship extraction are carried out on the obtained target text to obtain a corresponding entity extraction result, an attribute extraction result and a relationship extraction result, and the three are integrated to obtain a target map, so that an upstream and downstream relationship map between the entity relationship and the attribute relationship in the target text is represented.
Optionally, entity extraction can be performed in various ways. Here, entities, attributes and relations are extracted from the sorted target text as triples. A paradigm is generally set, which can be expressed as a triple of the following form:
(subject,predicate,object)
Here, subject is the subject of the triple, whose value is usually an entity or an event; predicate is the predicate, whose value is usually a relation or an attribute; object is the object, whose value may be an entity, an event, a concept, or an ordinary value (such as a number or a character string). Data are extracted from the target text according to the set paradigm to obtain the triples that represent the target text.
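As a small illustration of how such triples can be assembled into a graph structure, the Python sketch below stores them in an adjacency map keyed by subject. The triples, predicate names and helper function are invented for the example and are not taken from the source.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples extracted from a target text
triples = [
    ("post_1", "published_by", "user_a"),
    ("post_2", "forwards", "post_1"),
    ("post_2", "published_by", "user_b"),
    ("user_a", "platform", "forum"),
]

# Adjacency map: subject -> list of (predicate, object) edges
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))


def objects(subject, predicate):
    """Return all objects linked from `subject` by `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]
```

Following `forwards` edges from a post and then `published_by` edges leads back toward the original publisher, which is the kind of path the later tracing steps rely on.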
Attribute extraction aims to collect attribute information of a specific entity from different information sources. For a public figure, for example, information such as nickname, birthday, nationality and educational background can be obtained from public information on the network. There are various methods for calculating attribute similarity; common ones include edit distance, set similarity and vector-based similarity.
Edit distance: Levenshtein, Wagner and Fischer, Edit Distance with Affine Gaps;
set similarity calculation: Jaccard coefficient, Dice coefficient;
vector-based similarity calculation: cosine similarity, TF-IDF similarity.
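Three of the measures listed above can be written compactly in Python. This is a generic sketch of the standard definitions (Levenshtein distance computed with the Wagner-Fischer dynamic program, and the Jaccard and Dice coefficients on sets), not code from the source; returning 1.0 for two empty sets is a convention chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the Wagner-Fischer dynamic program (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0
```

Edit distance suits short string attributes such as nicknames, while the set measures suit multi-valued attributes such as tag lists.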
The extraction method based on pattern matching, also called the rule-based extraction method, extracts entity-attribute pairs from text using a series of rules constructed in advance. Relevant extraction rules are defined, for example as labels following a tagging specification or as regular expressions; the rules are matched against the target text, and the extracted entities and their attributes are obtained from the matching results.
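A rule-based extractor of this kind can be sketched with regular expressions. The two rules, the attribute names and the sample sentence are hypothetical and only show the mechanism; real rule sets would be far larger and language-specific.

```python
import re

# Hypothetical extraction rules: attribute name -> pattern with named groups
NAME = r"(?P<entity>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
RULES = {
    "birthday": re.compile(NAME + r" was born on (?P<value>\d{4}-\d{2}-\d{2})"),
    "nationality": re.compile(NAME + r" is a citizen of (?P<value>[A-Z][a-z]+)"),
}


def extract(text):
    """Return (entity, attribute, value) tuples for every rule match in the text."""
    results = []
    for attribute, pattern in RULES.items():
        for m in pattern.finditer(text):
            results.append((m.group("entity"), attribute, m.group("value")))
    return results


sample = "Jane Doe was born on 1990-05-17. Jane Doe is a citizen of Canada."
found = extract(sample)
```

Each match yields an entity-attribute-value tuple, which fits the triple paradigm used to build the target map.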
Entity-attribute extraction methods based on pattern matching can be classified into three types according to how the patterns are defined: extraction based on manual definition, extraction based on supervised learning, and extraction based on iteration. In the manually defined method, a series of patterns is defined by hand by a person skilled in the relevant field. The supervised-learning-based method first collects relevant corpora to form a large-scale corpus, then learns patterns automatically from annotated unstructured examples, and builds a knowledge base containing a large number of entities and attributes. The iteration-based method first defines template tuples, then iterates over them to generate patterns automatically and so extract entity-attribute pairs.
The entity-attribute extraction method based on relation classification converts the attribute extraction problem into a relation classification problem. Two extracted entities are regarded as one sample and the direct relation between them as its label; features of the sample are constructed, the sample is classified according to these features, and the classification result is taken as the relation attribute between the entities. This method generally relies on machine learning, such as a support vector machine (SVM) or a neural network: a classification model is learned by training on a large corpus and then used to extract entity-attribute pairs. According to how the corpus is constructed, relation classification methods can be divided into distant supervision and full supervision. In distant supervision the corpus is built essentially by machine, whereas in full supervision the corpus is built by humans.
The entity-attribute extraction method based on clustering converts the attribute extraction problem into a clustering problem. Entity feature vectors are constructed first and then clustered with a suitable method; the resulting clusters are taken as the attributes of the entities. For example, a weakly supervised clustering method may be used for category attributes and an unsupervised clustering method for the corresponding product attributes; the specific implementation is not described here.
After entity-attribute extraction, the target text yields a series of discrete named entities. To obtain semantic subject information, the association relations between entities are extracted from the related text, and the entities are connected through these relations to form a net-like knowledge structure. Relation extraction can be realized by supervised, semi-supervised, unsupervised or open entity relation extraction.
The supervised learning method is to train a machine learning model on the basis of the labeled training data and then identify the relationship type of the test data. Supervised learning approaches include rule-based approaches, feature-based approaches, and kernel-function based approaches.
In the rule-based method, corresponding rules or templates are summarized, manually or by machine learning, according to the field of the corpus to be processed, and entity relation extraction is performed by template matching. Feature-vector-based methods extract useful information (including lexical and grammatical information) from the context of relation sentence instances as features, construct feature vectors, and train entity relation extraction models by computing the similarity of the feature vectors. Kernel-function-based entity relation extraction methods include the word sequence kernel, the dependency tree kernel, the shortest path dependency tree kernel, the convolution tree kernel and combinations of these kernels; they and the feature-based entity relation extraction methods can complement each other.
The Bootstrapping-based semi-supervised entity relation extraction method summarizes entity relation sequence patterns from the contexts that contain relation seeds, and then uses these patterns to find more relation seed instances, forming a new relation seed set. Collaborative learning (co-learning) based methods use two conditionally independent feature sets to provide different and complementary information, thereby reducing annotation errors.
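The Bootstrapping loop described above can be sketched as follows. This is a deliberately simplified illustration: the "pattern" is assumed to be just the literal text between the two entities of a seed pair, and the sentences and account names are hypothetical.

```python
import re


def bootstrap(sentences, seeds, rounds=2):
    """Grow a set of (head, tail) relation seeds from plain sentences."""
    seeds = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: induce patterns from the text between a known seed pair.
        for sentence in sentences:
            for head, tail in seeds:
                m = re.search(re.escape(head) + r"(.+?)" + re.escape(tail), sentence)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply patterns; any "X <pattern> Y" yields a new seed pair.
        for sentence in sentences:
            for pattern in patterns:
                m = re.search(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence)
                if m:
                    seeds.add((m.group(1), m.group(2)))
    return seeds, patterns


sentences = [
    "AccountA forwarded a post from AccountB.",
    "AccountC forwarded a post from AccountD.",
    "AccountE commented on AccountF.",
]
new_seeds, learned_patterns = bootstrap(sentences, [("AccountA", "AccountB")])
```

A real system would score and filter the induced patterns before reusing them, since unfiltered patterns tend to drift away from the intended relation over successive rounds.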
The unsupervised entity relation extraction method does not rely on a corpus annotated with entity relations; it comprises relation instance clustering and relation type word selection. Entity pairs are grouped according to the similarity of the contexts in which they appear, so that highly similar pairs fall into one category, and representative words are selected to label the relationship.
The open entity relation extraction method automatically completes the relation type discovery and relation extraction tasks. High-quality entity relation instances are mapped into large-scale text by means of an external domain-independent entity knowledge base (e.g., DBpedia, YAGO, OpenCyc, Freebase or another domain knowledge base), training data is obtained by the text alignment method, and the relation extraction problem is then solved using a supervised learning method.
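The text alignment step mentioned above can be sketched as below, under the simplifying assumption that a sentence mentioning both entities of a known triple expresses that triple's relation. The triples and sentences are hypothetical toy data, not entries from any of the named knowledge bases.

```python
def align_knowledge_base(triples, sentences):
    """Label each sentence that mentions both entities of a known triple."""
    training_data = []
    for sentence in sentences:
        for head, relation, tail in triples:
            # Assumption: co-occurrence of head and tail implies the relation.
            if head in sentence and tail in sentence:
                training_data.append((sentence, head, tail, relation))
    return training_data


# Hypothetical knowledge-base triples and corpus sentences.
triples = [("AccountA", "forwards_from", "AccountB")]
sentences = [
    "AccountA reposted the statement first issued by AccountB.",
    "AccountC posted an unrelated message.",
]
training_data = align_knowledge_base(triples, sentences)
```

The labeled sentences produced this way are noisy, which is why they are used as training data for a supervised model rather than as final extraction results.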
The entity extraction results, attribute extraction results and relation extraction results are obtained through the entity extraction, attribute extraction and relation extraction methods described above. A relation graph is then constructed from these three kinds of results to obtain the target graph of the target text, in preparation for the subsequent tracing analysis of the entities in the graph.
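One way to picture how the three extraction results are combined into a target graph is the sketch below. The dictionary-based graph layout and the sample values are assumptions made for illustration; the embodiment does not prescribe a particular data structure.

```python
def build_target_graph(entities, attributes, relations):
    """Assemble entity, attribute and relation results into one graph."""
    graph = {entity: {"attributes": {}, "edges": []} for entity in entities}
    # Attach each (entity, attribute name, value) result to its node.
    for entity, name, value in attributes:
        if entity in graph:
            graph[entity]["attributes"][name] = value
    # Add each (head, relation, tail) result as a directed, labeled edge.
    for head, relation, tail in relations:
        if head in graph and tail in graph:
            graph[head]["edges"].append((relation, tail))
    return graph


# Hypothetical extraction results.
entities = ["Zhang San", "Li Si"]
attributes = [("Zhang San", "forwarding_amount", 1000),
              ("Zhang San", "release_time", "2022-09-10")]
relations = [("Li Si", "forwarded_from", "Zhang San")]
target_graph = build_target_graph(entities, attributes, relations)
```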
S208, matching the entity extraction result in the target atlas with the standard knowledge base to obtain a quasi entity set in the standard knowledge base.
S209, performing score evaluation on all entities in the quasi entity set respectively to obtain a quasi entity score set.
A standard knowledge base is understood herein to mean a library of correct entity relationships stored by the system. Quasi-entity sets are understood herein to be sets that are initially identified as entities by matching. The term "score evaluation" as used herein is understood to mean the process of calculating the score of each quasi entity by a given operation and evaluating the correct entity by the score.
Further, after the target atlas is obtained, the entity extraction result in the target atlas is matched against the correct entities stored in the standard knowledge base, and any stored entity whose match falls within the threshold range is judged to be a quasi entity, thereby obtaining the quasi entity set corresponding to each entity. Score evaluation is then performed on all the entities in each quasi entity set to obtain a quasi entity score set, in preparation for evaluating the target entity in the next step.
S210, taking an entity corresponding to the highest score in the quasi entity score set as a target entity.
S211, taking all target entities corresponding to the target text as a publisher set.
Wherein the publisher set carries propagation path information of the target text.
Further, according to the quasi entity score set, determining the entity with the highest score, obtaining the final correct entity for one entity in the target map, and taking the correct entity as the target entity. All the entities in the target map are judged through the method, a set corresponding to the target entity is obtained and used as a target entity set, the target entity set is used as a publisher set, and publisher set information in the target map is further obtained, so that preparation is made for determining a target publisher in the next step.
In one possible example scenario, consider an entity in the target graph: Zhang San, whose text information has a forwarding amount of 1000 and a release time of 2022-09-10. This entity is compared with the entities stored in the standard knowledge base, such as Zhang San, Zhang Sanjiang, Zhang Sanguo, Zhang Sanli and Li San, with the threshold set to 70%. The similarity between Zhang San and Li San is 50%, which is lower than the threshold, so Li San is filtered out. The similarity between Zhang Sanjiang and Zhang San is 80%, which exceeds the 70% threshold, so Zhang Sanjiang is taken as a quasi entity; all quasi entities in the standard knowledge base are obtained in the same way. The release time and the forwarding amount of the text information are then matched by comparison: the entity whose release time is 2022-09-10 and whose forwarding amount is 1000 is Zhang San, so Zhang San is determined to be the correct entity and taken as the target entity. Matching all entities in this way yields the publisher set corresponding to all the correct entities.
Optionally, the target atlas is analyzed and mined after it is obtained. Through entity linking processing, a group of candidate entity objects is selected from a knowledge base according to a given entity term, the term is then linked to the correct entity object through similarity calculation, and the entity with the highest score is taken as the target entity through a scoring method. Entity linking is the operation of extracting an entity object from unstructured data (e.g., text) or semi-structured data (e.g., tables) and linking it to the corresponding correct entity object in the knowledge base.
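The candidate filtering and highest-score selection described above can be sketched as follows. The similarity measure is an assumption: the embodiment does not name one, so the sketch uses the character-level ratio from Python's standard `difflib`, whose values will differ from the percentages in the example scenario. The knowledge-base contents are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1] (an assumed stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def link_entity(mention, knowledge_base, threshold=0.7):
    """Return (quasi entity scores, target entity) for one extracted mention."""
    scored = {name: similarity(mention, name) for name in knowledge_base}
    # Entities at or above the threshold form the quasi entity set.
    quasi = {name: score for name, score in scored.items() if score >= threshold}
    # The quasi entity with the highest score becomes the target entity.
    target = max(quasi, key=quasi.get) if quasi else None
    return quasi, target


knowledge_base = ["Zhang San", "Zhang Sanjiang", "Li San"]
quasi_entities, target_entity = link_entity("Zhang San", knowledge_base)
```

Here the score is the string similarity alone; the example scenario additionally compares release time and forwarding amount, which would be folded into the score in a fuller implementation.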
S212, obtaining the number of comments and forwarding numbers corresponding to each publisher in the publisher set and obtaining the number of propagation paths in the target text.
S213, carrying out weighted average processing on the comment quantity, the forwarding quantity and the propagation path quantity to obtain the corresponding traceability score of each publisher.
The weighted average processing is understood herein to be a weighted calculation. The tracing score can be understood as the probability score with which each publisher is judged to be the target publisher: the higher the tracing score, the wider the forwarding and diffusion of the text information, and the more likely the publisher is the source publisher.
Further, according to the constructed target map, the number of the comments of the publishers, the forwarding number and other attributes are combined, meanwhile, the number of propagation paths of the published contents in the publisher set is analyzed, a traceability probability calculation algorithm is constructed, and the source of the publishing of the target topic information is comprehensively calculated, so that support is provided for traceability of topic information.
The number of comments is denoted as Number_of_comments, the number of forwarding is denoted as Number_of_forwarding, and the number of propagation paths is denoted as Number_of_propagation_paths.
The traceability score is obtained as a composite score, as shown in formula 3:
Composite score (composition_score) = α·Number_of_comments + β·Number_of_forwarding + γ·Number_of_propagation_paths    (3)
Here α, β and γ are the proportionality coefficients used for normalization of the target subject information, and they satisfy the condition α+β+γ=1.
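Formula 3 can be written as a small Python function. The default coefficient values below are hypothetical; the embodiment only requires that the three coefficients sum to 1.

```python
def composite_score(number_of_comments, number_of_forwarding,
                    number_of_propagation_paths,
                    alpha=0.2, beta=0.3, gamma=0.5):
    """Weighted traceability score following formula 3."""
    # The embodiment requires alpha + beta + gamma = 1.
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return (alpha * number_of_comments
            + beta * number_of_forwarding
            + gamma * number_of_propagation_paths)


# Hypothetical publisher: 100 comments, 1000 forwards, 40 propagation paths.
score = composite_score(100, 1000, 40)
```

With the assumed weights this evaluates to 0.2·100 + 0.3·1000 + 0.5·40 = 340.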
S214, comparing the traceability scores corresponding to each publisher in the publisher set to determine the target publisher of the target topic information.
Further, by comparing the traceability scores corresponding to each publisher in the publisher set, the highest score is found, and the publisher with the highest traceability score is used as the target publisher, so that the traceability processing of the target topic information is realized, and the traceability publisher information is obtained.
S215, entity information and propagation path information of the target map corresponding to the target publisher are displayed through the visualization equipment.
The visualization device can be understood as a display interface or a message reminding interface of a designated area.
Further, after the traceability target publisher of the target subject information is analyzed, the entity information, the attribute information and the entity-attribute relation information of the target map are obtained according to the entity-attribute relation corresponding to the target publisher, and are displayed through the visualization equipment, so that clear traceability processing results are provided for users, and reference basis is provided for follow-up real-time tracking and judging of forwarding trend of the target subject information.
The step of determining the target publisher in step S214 specifically includes:
S301, obtaining a first traceability score corresponding to a first publisher in a publisher set and obtaining a second traceability score corresponding to a second publisher.
A first publisher as referred to herein may be understood as one publisher in the publisher set. The first traceability score is understood as the traceability score obtained for the first publisher through calculation.
S302, obtaining a difference value between the first tracing score and the second tracing score to obtain a tracing difference.
S303, judging whether the tracing difference is larger than a preset first threshold value, and determining a target publisher, wherein the first threshold value represents the tracing similarity degree of the two publishers.
The tracing difference is understood as the difference between the tracing scores corresponding to any two publishers in the publisher set. The first threshold as referred to herein may be understood as characterizing the degree of traceability similarity of two publishers.
Further, two publishers are randomly selected from the publisher set, their tracing scores are calculated respectively, and the difference between the two tracing scores is used as the basis for judging which of the two is the target publisher.
And S304, when the tracing difference is larger than a first threshold value, using the publisher with the large tracing score as a target publisher.
Further, when the difference between the traceability scores corresponding to any two publishers in the publisher set is larger than the first threshold, the two publishers differ greatly. A larger traceability score indicates an earlier publication time and a larger forwarding quantity of the text information for the corresponding publisher, so the publisher with the larger traceability score is taken as the target publisher.
In a possible example scenario, publisher 1 and publisher 2 are obtained from the publisher set. Through the traceability calculation, the composite score of publisher 1 is 1000 points and that of publisher 2 is 3000 points, and the first threshold is set to 500. Because 3000-1000=2000, the tracing difference of 2000 is far greater than the threshold of 500, so it can be clearly judged from the forwarding number, forwarding time or number of forwarding paths of publisher 2 that publisher 2 is more likely to be the source of the target subject information, and publisher 2 is temporarily regarded as the target publisher. All publishers in the publisher set are judged by the same method, and the publisher with the highest traceability score is finally regarded as the target publisher.
The step of obtaining the target publisher in step S303 specifically further includes:
S401, when the tracing difference is smaller than or equal to a first threshold value, acquiring the first propagation path number of the first publisher and acquiring the second propagation path number of the second publisher.
S402, comparing the first propagation path number with the second propagation path number to determine a target publisher.
S403, when the number of the first propagation paths is larger than or equal to the number of the second propagation paths, determining that the target publisher is the first publisher.
S404, when the number of the first propagation paths is smaller than that of the second propagation paths, determining that the target publisher is the second publisher.
The number of propagation paths referred to herein may be understood as the total number of paths that the target subject information is forwarded by multiple publishers of multiple platforms.
Further, when the difference between the tracing scores corresponding to two publishers in the publisher set is smaller than or equal to the first threshold, the publication times and path forwarding quantities of the text information of the two publishers differ only slightly, and the target publisher cannot be determined from the tracing scores alone. The number of propagation paths is therefore given priority as a further basis for determining the target publisher: the propagation path numbers of the two publishers are compared, and the publisher with the larger number of propagation paths is taken as the target publisher, so that the tracing analysis of the target topic information is realized.
In one possible example scenario, the number of propagation paths of publisher A is 100 and that of publisher B is 40; the tracing score of publisher A is calculated to be 1000 and that of publisher B to be 1050, and the first threshold is set to 200. Because 1050-1000=50, the tracing difference of 50 is smaller than the first threshold of 200; owing to the balance of the proportionality coefficients, the true tracing result for publisher A and publisher B cannot be judged from their composite scores, so the number of propagation paths is given priority as the basis for judging the target publisher. Because the number of propagation paths of publisher A is larger than that of publisher B, publisher A is judged to be closer to the target publisher than publisher B and is taken as the target publisher. The source publisher of the target subject information is thus obtained as publisher A, realizing the technical effect of information tracing across multiple platforms.
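The comparison in steps S301 to S304 and the fallback in steps S401 to S404 can be sketched together as follows. The `Publisher` class, the function names and the use of a pairwise reduction over the whole set are assumptions made for illustration; the score and path values for publishers A and B are taken from the example scenario above.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Publisher:
    name: str
    score: float      # traceability score (composite score)
    path_count: int   # number of propagation paths


def pick_target(first, second, first_threshold):
    """Choose between two publishers per S301-S304, falling back to S401-S404."""
    if abs(first.score - second.score) > first_threshold:
        # Scores differ clearly: the larger traceability score wins.
        return first if first.score > second.score else second
    # Scores are too close: the larger number of propagation paths wins,
    # with ties going to the first publisher (S403).
    return first if first.path_count >= second.path_count else second


def trace_source(publishers, first_threshold):
    """Apply the pairwise comparison across the whole publisher set."""
    return reduce(lambda a, b: pick_target(a, b, first_threshold), publishers)


publisher_a = Publisher("A", 1000, 100)
publisher_b = Publisher("B", 1050, 40)
winner = pick_target(publisher_a, publisher_b, first_threshold=200)
```

Note that a pairwise rule with a threshold is not guaranteed to be order-independent, so the result of `trace_source` may depend on the order in which publishers are compared.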
According to the other information tracing method provided by the embodiment of the invention, the target topic information on multiple platforms is collected, and then the target text is obtained through data extraction and time sequencing; extracting an entity-attribute of the target text to obtain a target map corresponding to the target text, and obtaining a target entity in the target map according to entity link processing to obtain a publisher set; and calculating the tracing score of each publisher, comparing, setting the number of the propagation paths as priority, obtaining the target publisher of the target subject information, and determining the origin, evolution and propagation paths of the subject information by carrying out cross-platform and cross-space tracking and analysis on the subject information on a plurality of platforms, so as to provide comprehensive information reference and decision support for users, thereby realizing the technical effect of multi-platform information tracing.
Fig. 5 is a schematic structural diagram of an information tracing system according to an embodiment of the present invention. According to the diagram provided in fig. 5, the information tracing system specifically includes:
the preprocessing module 51 is configured to acquire target subject information of the same type from multiple platforms, and perform preprocessing on the target subject information to obtain a target text;
a create map module 52 for creating a target map of the target text;
the entity link module 53 is configured to perform entity link processing on the target graph to obtain a publisher set of the target text, where the publisher set carries propagation path information of the target text;
and the tracing module 54 is configured to perform tracing analysis on the publisher set, and determine a target publisher of the target topic information.
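The data flow between the four modules can be pictured with the following skeleton. It is only a structural sketch: the class name, the callable-based wiring and the trivial stand-in functions are all hypothetical and do not implement the actual preprocessing, graph creation, entity linking or tracing logic.

```python
class InformationTracingSystem:
    """Chains the four modules of fig. 5 in order (modules 51 to 54)."""

    def __init__(self, preprocess, create_graph, link_entities, trace):
        self.preprocess = preprocess        # preprocessing module 51
        self.create_graph = create_graph    # create map module 52
        self.link_entities = link_entities  # entity link module 53
        self.trace = trace                  # tracing module 54

    def run(self, topic_info):
        target_text = self.preprocess(topic_info)
        target_graph = self.create_graph(target_text)
        publisher_set = self.link_entities(target_graph)
        return self.trace(publisher_set)


# Trivial stand-ins, used only to show how data moves between modules.
system = InformationTracingSystem(
    preprocess=lambda posts: sorted(posts, key=lambda p: p["time"]),
    create_graph=lambda text: {p["author"]: p for p in text},
    link_entities=lambda graph: list(graph),
    trace=lambda publishers: publishers[0] if publishers else None,
)
posts = [{"author": "B", "time": "2022-09-11"},
         {"author": "A", "time": "2022-09-10"}]
target_publisher = system.run(posts)
```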
The information tracing system provided in this embodiment may be an information tracing system as shown in fig. 5, and may perform all steps of the information tracing method as shown in fig. 1-2, so as to achieve the technical effects of the information tracing method as shown in fig. 1-2, and the detailed description will be omitted herein for brevity.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium here stores one or more programs. Wherein the storage medium may comprise volatile memory, such as random access memory; the memory may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid state disk; the memory may also comprise a combination of the above types of memories.
The one or more programs in the storage medium may be executed by one or more processors to implement the information tracing method performed on the information tracing device side.
The processor is used for executing the information tracing program stored in the memory so as to realize the following steps of the information tracing method executed on the information tracing equipment side:
obtaining the same kind of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text; creating a target map of the target text; carrying out entity link processing on the target atlas to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text; and performing traceability analysis on the publisher set to determine the target publisher of the target subject information.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments illustrates the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. The information tracing method is characterized by comprising the following steps:
obtaining the same kind of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text;
creating a target map of the target text;
performing entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text;
and performing traceability analysis on the publisher set to determine the target publisher of the target subject information.
2. The method according to claim 1, wherein preprocessing the target subject information to obtain target text includes:
performing cosine similarity data extraction on the target subject information to obtain associated data without information loss;
and carrying out data sorting processing on the associated data according to the time parameters to obtain a target text.
3. The method of claim 1, wherein the creating the target atlas of the target text comprises:
performing entity extraction on the target text to obtain an entity extraction result of the target text;
extracting the attribute of the target text to obtain an attribute extraction result of the target text;
based on the entity extraction result, extracting entity relation from the target text to obtain a relation extraction result of the target text;
and creating a target map of the target text based on the entity extraction result, the attribute extraction result and the relation extraction result.
4. The method of claim 3, wherein the performing entity linking processing on the target graph to obtain the publisher set of the target text includes:
matching the entity extraction result in the target atlas with a standard knowledge base to obtain a quasi entity set in the standard knowledge base;
respectively carrying out score evaluation on all entities in the quasi entity set to obtain a quasi entity score set;
taking an entity corresponding to the highest score in the quasi entity score set as a target entity;
and taking all target entities corresponding to the target text as a publisher set.
5. The method of claim 4, wherein the performing a traceability analysis on the set of publishers to determine a target publisher of the target topic information comprises:
the comment quantity and the forwarding quantity corresponding to each publisher in the publisher set are obtained, and the propagation path quantity in the target text is obtained;
performing weighted average processing on the comment quantity, the forwarding quantity and the propagation path quantity to obtain a tracing score corresponding to each publisher;
and comparing the traceability scores corresponding to each publisher in the publisher set to determine the target publisher of the target subject information.
6. The method of claim 5, wherein comparing the traceability scores corresponding to each of the publishers in the set of publishers to determine the target publisher of the target topic information comprises:
acquiring a first traceability score corresponding to a first publisher in the publisher set and acquiring a second traceability score corresponding to a second publisher;
obtaining a difference value between the first tracing score and the second tracing score to obtain a tracing difference;
judging whether the tracing difference is larger than a preset first threshold value or not, and determining a target publisher, wherein the first threshold value represents the tracing similarity degree of the two publishers;
and when the tracing difference is larger than the first threshold value, taking the publisher with the large tracing score as a target publisher.
7. The method of claim 6, wherein the judging whether the tracing difference is larger than a preset first threshold value and determining a target publisher further comprises:
when the tracing difference is smaller than or equal to the first threshold value, acquiring the first propagation path number of the first publisher and the second propagation path number of the second publisher;
comparing the first propagation path number with the second propagation path number to determine a target publisher;
when the first propagation path number is greater than or equal to the second propagation path number, determining that the target publisher is the first publisher;
and when the first propagation path number is smaller than the second propagation path number, determining that the target publisher is the second publisher.
8. The method according to claim 1, characterized in that the method further comprises:
and displaying entity information of the target map corresponding to the target publisher and the propagation path information through a visualization device.
9. An information tracing system applied to the information tracing method of claim 1, comprising:
the preprocessing module is used for acquiring the same type of target subject information from a plurality of platforms, and preprocessing the target subject information to obtain a target text;
the creating map module is used for creating a target map of the target text;
the entity link module is used for carrying out entity link processing on the target map to obtain a publisher set of the target text, wherein the publisher set carries the propagation path information of the target text;
and the tracing module is used for carrying out tracing analysis on the publisher set and determining the target publisher of the target subject information.
10. A storage medium storing one or more programs executable by one or more processors to implement the information tracing method of any one of claims 1-8.
CN202310478299.0A 2023-04-28 2023-04-28 Information tracing method, system and storage medium Pending CN116610758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310478299.0A CN116610758A (en) 2023-04-28 2023-04-28 Information tracing method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310478299.0A CN116610758A (en) 2023-04-28 2023-04-28 Information tracing method, system and storage medium

Publications (1)

Publication Number Publication Date
CN116610758A true CN116610758A (en) 2023-08-18

Family

ID=87675631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310478299.0A Pending CN116610758A (en) 2023-04-28 2023-04-28 Information tracing method, system and storage medium

Country Status (1)

Country Link
CN (1) CN116610758A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118170994A (en) * 2024-05-15 2024-06-11 北京搜狐互联网信息服务有限公司 Resource data processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN106503055B (en) A kind of generation method from structured text to iamge description
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
CN109614550A (en) Public sentiment monitoring method, device, computer equipment and storage medium
CN109543034B (en) Text clustering method and device based on knowledge graph and readable storage medium
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN108959418A (en) Character relation extraction method and device, computer device and computer readable storage medium
Chang et al. Research on detection methods based on Doc2vec abnormal comments
CN111581376B (en) Automatic knowledge graph construction system and method
CN106547875B (en) Microblog online emergency detection method based on emotion analysis and label
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
CN110795568A (en) Risk assessment method and device based on user information knowledge graph and electronic equipment
Wang et al. A machine learning analysis of Twitter sentiment to the Sandy Hook shootings
Edwards et al. Identifying wildlife observations on twitter
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN110147552B (en) Education resource quality evaluation mining method and system based on natural language processing
CN111460145A (en) Learning resource recommendation method, device and storage medium
CN111898038B (en) Social media false news detection method based on man-machine cooperation
CN111428503A (en) Method and device for identifying and processing same-name person
CN112907358A (en) Loan user credit scoring method, loan user credit scoring device, computer equipment and storage medium
CN116610758A (en) Information tracing method, system and storage medium
JP2022541199A (en) A system and method for inserting data into a structured database based on image representations of data tables.
CN115391570A (en) Method and device for constructing emotion knowledge graph based on aspects
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
CN112989167A (en) Method, device and equipment for identifying transport account and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination