Summary of the Invention
The present invention provides a method and an apparatus for determining the propagation path of microblog data, to solve the problem in the prior art that microblog data spreads quickly and carries a large amount of information, which makes its propagation path difficult to control. By analyzing microblog data, the invention extracts the forwarding relation chains along which the data was reposted, so that microblog data spread over the Internet can be traced to its source and its propagation path obtained, thereby safeguarding the information security interests of the state and the public.
The present invention provides a method for determining the propagation path of microblog data, including:
collecting microblog data, where the microblog data includes content information of the microblog data and attribute information of the microblog data, and the attribute information includes a publisher ID of the microblog data and a content ID uniquely corresponding to the content information of the microblog data;
parsing each piece of collected microblog data, and determining, from the content information of each piece of microblog data, whether the microblog data includes forwarded microblog data;
obtaining the author ID of the forwarded microblog data, and obtaining the original content ID uniquely corresponding to the content information of the forwarded microblog data; determining whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, and forming a forwarding relation chain;
determining, according to the original content ID, all forwarding relation chains corresponding to the original content ID among all the microblog data; and
performing a deduplication operation on all forwarding relation chains corresponding to each original content ID, to obtain the propagation path of the microblog data corresponding to each original content ID.
Optionally, determining whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, and forming a forwarding relation chain, includes:
determining whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID;
if so, forming a forwarding sequence according to the order in which the forwarding user IDs are arranged, placing the author ID at the start of the forwarding sequence and the publisher ID at the end of the forwarding sequence, to form the forwarding relation chain;
if not, forming a forwarding relation chain that runs only from the author ID to the publisher ID.
Optionally, determining whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID includes:
locating a text editing field in the content information of the microblog data;
determining whether a forwarding marker exists in the text editing field;
if so, extracting the forwarding user ID identified by the forwarding marker.
Optionally, the attribute information of the microblog data further includes:
the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data.
Accordingly, before each piece of collected microblog data is parsed, the method further includes:
classifying and sorting the collected microblog data according to at least one of the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data.
Parsing each piece of collected microblog data then includes:
parsing the collected microblog data piece by piece, in the order that results from the classification and sorting.
Optionally, performing the deduplication operation on all forwarding relation chains corresponding to each original content ID, to obtain the propagation path of the microblog data corresponding to each original content ID, includes:
comparing, pairwise, all forwarding relation chains corresponding to each original content ID, and removing any forwarding relation chain whose forwarding user IDs, and the order in which they are arranged, starting from the first position in the chain, are completely contained in another forwarding relation chain.
The present invention also provides an apparatus for determining the propagation path of microblog data, including:
a collection module, configured to collect microblog data, where the microblog data includes content information of the microblog data and attribute information of the microblog data, and the attribute information includes a publisher ID of the microblog data and a content ID uniquely corresponding to the content information of the microblog data;
a parsing module, configured to parse each piece of collected microblog data;
a determining module, configured to determine, from the content information of each piece of microblog data, whether the microblog data includes forwarded microblog data;
an acquisition module, configured to obtain the author ID of the forwarded microblog data, and to obtain the original content ID uniquely corresponding to the content information of the forwarded microblog data;
the determining module being further configured to determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, and to form a forwarding relation chain; and to determine, according to the original content ID, all forwarding relation chains corresponding to the original content ID among all the microblog data; and
a deduplication module, configured to perform a deduplication operation on all forwarding relation chains corresponding to each original content ID, to obtain the propagation path of the microblog data corresponding to each original content ID.
Optionally, the determining module includes:
an ID determination sub-module, configured to determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID;
a sequence determination sub-module, configured to, after the ID determination sub-module determines that forwarding user IDs lying between the publisher ID and the author ID exist, form a forwarding sequence according to the order in which the forwarding user IDs are arranged, place the author ID at the start of the forwarding sequence and the publisher ID at the end of the forwarding sequence, and form the forwarding relation chain;
the sequence determination sub-module being further configured to, after the ID determination sub-module determines that no forwarding user IDs lying between the publisher ID and the author ID exist, form a forwarding relation chain that runs only from the author ID to the publisher ID.
Optionally, the determining module includes:
a locating sub-module, configured to locate a text editing field in the content information of the microblog data;
a marker determination sub-module, configured to determine whether a forwarding marker exists in the text editing field;
an extraction sub-module, configured to extract the forwarding user ID identified by the forwarding marker after the marker determination sub-module determines that the forwarding marker exists.
Optionally, the attribute information of the microblog data further includes:
the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data.
Accordingly, the apparatus further includes:
a classification and sorting module, configured to classify and sort the collected microblog data according to at least one of the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data;
the parsing module being specifically configured to parse the collected microblog data piece by piece, in the order that results from the classification and sorting.
Optionally, the deduplication module is specifically configured to compare, pairwise, all forwarding relation chains corresponding to each original content ID, and to remove any forwarding relation chain whose forwarding user IDs, and the order in which they are arranged, starting from the first position in the chain, are completely contained in another forwarding relation chain.
In the method and apparatus for determining the propagation path of microblog data provided by the present invention, microblog data is collected and each piece of collected microblog data is parsed, so that the forwarded microblog data it contains is determined from its content information, and the author ID and original content ID of the forwarded microblog data are determined from the forwarded microblog data. It is then determined whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, to form a forwarding relation chain for that piece of microblog data. Next, according to the original content ID, all forwarding relation chains corresponding to the original content ID are determined among all the microblog data. Finally, a deduplication operation is performed on all forwarding relation chains corresponding to each original content ID, yielding the propagation path of the microblog data corresponding to each original content ID. Microblog data spread over the Internet can thus be traced to its source and its propagation path established, safeguarding the information security interests of the state and the public.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the embodiments of the present invention. It should be noted that, in the drawings and the description, similar or identical elements use the same reference numerals.
Figure 1A is a flowchart of Embodiment 1 of the method for determining the propagation path of microblog data according to the present invention. As shown in Figure 1A, the method includes:
Step 101: Collect microblog data.
In this step, the microblog data includes content information of the microblog data and attribute information of the microblog data, where the attribute information includes the publisher ID of the microblog data and a content ID uniquely corresponding to the content information of the microblog data. The microblog data may be electronic data of any form on any Internet platform, for example pictures, text, or video. The publisher ID may be the user ID, or the user name corresponding to that user ID, of the user on the Internet platform that published the microblog data; for example, the user name may be the microblog user "Zhang San", whose user ID may be "80651236". The content ID is identification information that identifies the content of each piece of microblog data a user sends. It can be obtained by generating, for each piece of microblog data, a data string that uniquely corresponds to it, for example a Message-Digest Algorithm 5 (MD5) code. The content ID and the content of the corresponding microblog data are in one-to-one correspondence, so the content of the microblog data can be identified from its content ID.
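As an illustrative sketch only, the content ID generation described in step 101 could be written as follows in Python. The function name content_id and the choice of UTF-8 encoding are assumptions made for this example and are not specified in the description.

```python
import hashlib


def content_id(content: str) -> str:
    """Return the hex MD5 digest of the content, used here as its content ID.

    Assumption: the content is text and is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(content.encode("utf-8")).hexdigest()


# Identical content yields the same ID; different content yields a different ID.
same_a = content_id("hello weibo")
same_b = content_id("hello weibo")
different = content_id("hello weibo!")
```

Because a digest cannot be inverted, looking up content from its ID in practice would rely on a stored mapping from ID to content.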
Step 102: Parse each piece of collected microblog data, and determine, from the content information of each piece of microblog data, whether the microblog data includes forwarded microblog data.
In this step, all collected microblog data is analyzed piece by piece to build an attribute information table for each piece of microblog data, making the individual characteristics of each piece explicit. The attribute information table may include the microblog ID of the microblog (equivalent to the content ID uniquely corresponding to the content information of the microblog data described above), the microblog content (equivalent to the content information of the microblog data), the microblog user ID (equivalent to the publisher ID of the microblog data), the publication time, the source website (the platform on which the microblog was published, such as Sina or Tencent), the forwarding microblog ID (the ID of the forwarder who forwarded the microblog content), the Uniform Resource Locator (URL), and other information. If, while a piece of microblog data is being parsed, it is found to contain forwarded (reposted) microblog content, the microblog data is marked so that the forwarded microblog data and its propagation path information can be extracted from it later.
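One possible in-memory shape for a row of the attribute information table from step 102 is sketched below. All field names, the MicroblogRecord class, and the mark_if_forwarded helper are hypothetical; the description only lists which pieces of information the table may hold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroblogRecord:
    """One row of the attribute information table (field names are hypothetical)."""
    microblog_id: str                  # content ID uniquely tied to the content
    content: str                       # content information
    user_id: str                       # publisher ID
    publish_time: str
    source_site: str                   # publishing platform, e.g. "sina"
    url: str
    forwarder_id: Optional[str] = None
    contains_forward: bool = False     # set when forwarded content is found


def mark_if_forwarded(record: MicroblogRecord, forwarded_found: bool) -> MicroblogRecord:
    """Flag the record so that later steps know to extract the forwarded data."""
    record.contains_forward = forwarded_found
    return record


record = MicroblogRecord(
    microblog_id="m1",
    content="my comment @ZhangSan: original post",
    user_id="80651236",
    publish_time="2014-05-01 08:00:00",
    source_site="sina",
    url="https://example.com/m1",
)
mark_if_forwarded(record, forwarded_found=True)
```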
Step 103: Obtain the author ID of the forwarded microblog data, and obtain the original content ID uniquely corresponding to the content information of the forwarded microblog data; determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, and form a forwarding relation chain.
In this step, the author ID of the forwarded microblog data is extracted from the microblog data. Generally, when a piece of microblog data is forwarded, its author information stays bound to its content information, so the author ID can be obtained from the forwarded microblog data. For example, if the forwarded post begins with the marker "@Zhang San", then Zhang San is the author ID of the forwarded post. The content information of the microblog data includes two parts: one is the publisher's own statement of opinion, and the other is the forwarded post, originally written by someone else, that the publisher reposts. The original content ID is the ID uniquely corresponding to the content of the forwarded post. In addition, many platforms attach propagation path information to the forwarded post. According to the forwarding user marker preset by each platform, the forwarding user IDs lying between the publisher ID and the author ID can be determined in the content information of the microblog data, forming a forwarding relation chain of the form author ID → forwarding user ID 1 → forwarding user ID 2 → forwarding user ID 3 → publisher ID.
Step 104: Determine, according to the original content ID, all forwarding relation chains corresponding to the original content ID among all the microblog data.
In this step, according to the original content ID determined in step 103, the other microblog data is searched for further forwarding relation chains in which the forwarded microblog corresponding to the same original content ID was forwarded, so as to find all forwarding relation chains of the original microblog, for example the one posted by "Homeway.com" shown in Figure 1B.
Step 105: Perform a deduplication operation on all forwarding relation chains corresponding to each original content ID, to obtain the propagation path of the microblog data corresponding to each original content ID.
In this step, among the acquired forwarding relation chains of different lengths, if some are duplicates of, or are contained in, other chains, the contained chains can be removed and the longer chains retained. Since the purpose of the invention is to determine the propagation path of microblog data, for overlapping paths only the most complete path from beginning to end is kept and the redundant ones are removed, which reduces the amount of propagation path data that has to be counted. For example, if the forwarding relation chain obtained from one forwarded microblog is A → B → C → D and the chain obtained from another is A → B → C → D → E, then A → B → C → D → E is retained and A → B → C → D is removed. Because A → B → C → D → E already contains the forwarding path A → B → C → D, chains such as A → B → C → D, A → B → C, and A → B can all be removed.
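The deduplication rule of step 105 can be sketched in Python as a pairwise prefix check: a chain is dropped when, starting from its first element, it is completely contained in another chain. The names is_prefix and dedupe_chains, and the representation of a chain as a tuple of IDs, are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Chain = Tuple[str, ...]


def is_prefix(short: Sequence[str], long: Sequence[str]) -> bool:
    """True if `short` matches `long` element by element from the first position."""
    return len(short) <= len(long) and tuple(long[:len(short)]) == tuple(short)


def dedupe_chains(chains: List[Chain]) -> List[Chain]:
    """Keep only chains that are not a prefix of some other, different chain."""
    unique = list(dict.fromkeys(chains))  # drop exact duplicates, keep order
    kept = []
    for chain in unique:
        covered = any(other != chain and is_prefix(chain, other) for other in unique)
        if not covered:
            kept.append(chain)
    return kept


chains = [
    ("A", "B", "C", "D"),
    ("A", "B", "C", "D", "E"),
    ("A", "B"),
    ("A", "B", "C"),
    ("A", "X"),
    ("A", "B", "C", "D", "E"),
]
paths = dedupe_chains(chains)
# paths == [("A", "B", "C", "D", "E"), ("A", "X")]
```

The pairwise comparison is quadratic in the number of chains per original content ID, which mirrors the two-by-two comparison described in the text; a faster structure would be a separate design choice.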
In the method for determining the propagation path of microblog data provided by this embodiment, microblog data is collected and each piece of collected microblog data is parsed, so that the forwarded microblog data it contains is determined from its content information, and the author ID and original content ID of the forwarded microblog data are determined from the forwarded microblog data. It is then determined whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, to form a forwarding relation chain for that piece of microblog data. Next, according to the original content ID, all forwarding relation chains corresponding to the original content ID are determined among all the microblog data. Finally, a deduplication operation is performed on all forwarding relation chains corresponding to each original content ID, yielding the propagation path of the microblog data corresponding to each original content ID. Microblog data spread over the Internet can thus be traced to its source and its propagation path established, safeguarding the information security interests of the state and the public.
Fig. 2 is a flowchart of Embodiment 2 of the method for determining the propagation path of microblog data according to the present invention. As shown in Fig. 2, on the basis of Embodiment 1 above, the method of this embodiment includes:
Step 201: Collect microblog data.
In this step, the microblog data includes content information of the microblog data and attribute information of the microblog data, where the attribute information includes the publisher ID of the microblog data and a content ID uniquely corresponding to the content information of the microblog data. In addition, the attribute information of the collected microblog data may further include the publication time of the microblog data, the source website of the microblog data, the URL of the microblog data, and so on.
Step 202: Classify and sort the collected microblog data according to at least one of the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data.
In this step, the way the collected microblog data is classified and sorted can be set by a person skilled in the art according to the goal of the analysis. For example, if the propagation paths of microblog data published on a particular network platform are to be analyzed, the microblog data can be classified by source website. The microblog data can also be sorted chronologically, or divided into segments by time period.
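A minimal Python sketch of step 202 is given below, assuming that records are dictionaries with hypothetical keys "source" and "time" and that timestamps use the format "%Y-%m-%d %H:%M:%S". It classifies by source website and then sorts each class chronologically; other classification criteria would follow the same pattern.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative records; the keys and values are made up for this example.
records = [
    {"id": "m1", "source": "sina", "time": "2014-05-02 10:00:00"},
    {"id": "m2", "source": "tencent", "time": "2014-05-01 09:30:00"},
    {"id": "m3", "source": "sina", "time": "2014-05-01 08:00:00"},
]


def classify_and_sort(items):
    """Group records by source website, then sort each group by publication time."""
    groups = defaultdict(list)
    for item in items:
        groups[item["source"]].append(item)
    for group in groups.values():
        group.sort(key=lambda r: datetime.strptime(r["time"], "%Y-%m-%d %H:%M:%S"))
    return dict(groups)


grouped = classify_and_sort(records)
```

Parsing would then iterate over each group in order, which corresponds to parsing piece by piece in the order produced by classification and sorting.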
Step 203: Parse the collected microblog data piece by piece, in the order that results from the classification and sorting, to determine, from the content information of each piece of microblog data, whether the microblog data includes forwarded microblog data.
In this step, collected microblog data generally takes one of three forms. The first contains only content A originally created by the publisher, which can be any form of electronic data such as pictures, video, or text. The second contains only original content B created by someone else and forwarded by the publisher. The third contains both original content B created by someone else and forwarded by the publisher, and the publisher's comment on the forwarded content; the comment can be regarded as the publisher's original content A. The three forms are therefore: 1) content A only; 2) content B only; 3) both content A and content B.
Step 204: Obtain the author ID of the forwarded microblog data, and obtain the original content ID uniquely corresponding to the content information of the forwarded microblog data.
In this step, each network platform generally uses a specific marker to identify forwarded microblog data. For example, forwarded content on Sina Weibo contains an "@XX" marker, and forwarded content on Tencent Weibo also contains an "@XX" marker, where "XX" denotes the author ID of the forwarded content. The marker is located at the beginning of the forwarded content, so by recognizing the platform-specific marker and locating the position where it occurs, the author ID of the forwarded content can be determined. The original content ID is determined in the same way: according to the attribute settings of each network platform, the location of the original content ID uniquely corresponding to the content information of the forwarded microblog data is found and the ID is obtained. For example, many network platforms place the original content ID in the URL of the original content, so the ID uniquely corresponding to the content can be obtained by parsing that URL. It should be noted that each network platform may have its own standard for marking the author ID and the original content ID, which is not limited in this application.
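The extraction in step 204 might be sketched as follows. This is an assumption-laden illustration: the regular expression treats a leading "@XX" followed by whitespace or a colon as the author marker, and extract_original_content_id assumes the content ID is the last path segment of the URL. Real platforms define their own formats, and the URL used here is a placeholder.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed marker form: "@" at the start of the forwarded content, followed by the ID.
AUTHOR_MARK = re.compile(r"^\s*@([^\s:：]+)")


def extract_author_id(forwarded_text: str) -> Optional[str]:
    """Return the ID following a leading '@' marker, or None if there is none."""
    match = AUTHOR_MARK.match(forwarded_text)
    return match.group(1) if match else None


def extract_original_content_id(original_url: str) -> Optional[str]:
    """Assumption: the content ID is the last non-empty path segment of the URL."""
    segments = [s for s in urlparse(original_url).path.split("/") if s]
    return segments[-1] if segments else None


author = extract_author_id("@ZhangSan: original post text")
original_id = extract_original_content_id("https://example.com/u/80651236/abc123")
```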
Step 205: Determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID. If so, perform step 206; if not, perform step 207.
In this step, the content information of the microblog data, in particular the publisher's original content part A, may record the path along which the forwarded original content traveled from the author to the publisher. For example, the Sina Weibo platform marks the forwarding path as "//@AXX//@BXX//@CXX", and the Tencent Weibo platform marks it as "||@AXX||@BXX||@CXX". The "AXX", "BXX", and "CXX" following each "//@" or "||@" form the chain of users who forwarded the original content. This can be implemented by locating the text editing field in the content information of the microblog data, determining whether a forwarding marker exists in the text editing field, and, if so, extracting the forwarding user ID identified by the forwarding marker. The information indicating the propagation path of the original content is generally contained in the publisher's original content part A, that is, the part in which the publisher can comment or edit text. The publisher can therefore choose whether to disclose the forwarding user chain, and can also modify or delete it. Accordingly, to locate the chain, the text editing field in the content information of the microblog data, such as the "text" field, is found first; the forwarding marker, such as "//@" or "||@", is then looked up in that field, and the forwarding user IDs that follow the forwarding marker are extracted to obtain the forwarding user chain.
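The marker-based extraction of step 205 can be illustrated with a short Python sketch. It assumes the record is a dictionary whose "text" key holds the text editing field, and that a forwarding user ID ends at whitespace, a colon, "/" or "|"; both are assumptions of this example rather than rules stated in the description.

```python
import re
from typing import List

# Matches the "//@" or "||@" forwarding marker and captures the ID that follows it.
FORWARD_MARK = re.compile(r"(?://|\|\|)@([^\s:：/|]+)")


def extract_forwarders(record: dict) -> List[str]:
    """Return forwarding user IDs in the order they appear in the text field."""
    text = record.get("text", "")
    return FORWARD_MARK.findall(text)


sina_style = {"text": "my view //@AXX//@BXX//@CXX"}
tencent_style = {"text": "my view ||@AXX||@BXX||@CXX"}
no_forward = {"text": "just my own post"}

sina_ids = extract_forwarders(sina_style)        # ["AXX", "BXX", "CXX"]
tencent_ids = extract_forwarders(tencent_style)  # ["AXX", "BXX", "CXX"]
```

An empty result corresponds to the "not present" branch, in which the chain runs directly from the author to the publisher.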
Step 206: Form a forwarding sequence according to the order in which the forwarding user IDs are arranged, place the author ID at the start of the forwarding sequence and the publisher ID at the end of the forwarding sequence, and form the forwarding relation chain.
In this step, the forwarding user chain obtained in the previous step generally indicates only the forwarders between the author and the publisher. To make the chain complete, the author ID is placed at the start of the forwarding sequence and the publisher ID at the end, forming the complete forwarding relation chain.
Step 207: Form a forwarding relation chain that runs only from the author ID to the publisher ID.
In this step, as noted in step 205, the forwarding relation chain is usually contained in the publisher's original content part A, that is, the part in which the publisher can comment or edit text. The publisher can therefore choose whether to disclose the forwarding user chain, and can also modify or delete it. It is thus quite possible that the propagation path information of the original content cannot be obtained from part A. In that case the propagation path is taken to be the shortest one, running directly from the author to the publisher, and a forwarding relation chain that runs only from the author ID to the publisher ID is formed.
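Steps 206 and 207 can be covered by one small helper, sketched here in Python. The function name build_forwarding_chain is hypothetical, and the sketch assumes the forwarding user IDs are passed in order from the author side to the publisher side; if a platform lists the most recent forwarder first, the list would need to be reversed before the call, a point the description does not specify.

```python
from typing import List, Sequence


def build_forwarding_chain(
    author_id: str, publisher_id: str, forwarders: Sequence[str] = ()
) -> List[str]:
    """Place the author first, the publisher last, and any forwarders in between."""
    return [author_id, *forwarders, publisher_id]


# Step 206: forwarding user IDs were found.
full_chain = build_forwarding_chain("author", "publisher", ["u1", "u2", "u3"])

# Step 207: none were found, so the chain is the shortest path.
shortest_chain = build_forwarding_chain("author", "publisher")
```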
Step 208: Determine, according to the original content ID, all forwarding relation chains corresponding to the original content ID among all the microblog data.
In this step, because the original content ID uniquely corresponds to the content, all microblog data containing that original content ID can be found through it. All forwarding relation chains related to the original content ID are then extracted from that microblog data, and a forwarding relationship topology graph corresponding to the original content ID can be formed from all the forwarding relation chains, for example in the form shown in Figure 1B.
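The grouping in step 208 and the derivation of a topology from the chains might look as follows in Python. The pair format (original content ID, chain) and the function names are assumptions for this sketch; the topology is represented simply as a set of directed edges between consecutive IDs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Chain = Tuple[str, ...]


def group_chains(parsed: Iterable[Tuple[str, Chain]]) -> Dict[str, List[Chain]]:
    """Collect every forwarding relation chain under its original content ID."""
    groups = defaultdict(list)
    for original_id, chain in parsed:
        groups[original_id].append(chain)
    return dict(groups)


def topology_edges(chains: Iterable[Chain]) -> Set[Tuple[str, str]]:
    """Turn chains into directed edges (forwarded-from, forwarded-by)."""
    edges = set()
    for chain in chains:
        edges.update(zip(chain, chain[1:]))
    return edges


parsed = [
    ("c1", ("A", "B", "C")),
    ("c1", ("A", "D")),
    ("c2", ("X", "Y")),
]
groups = group_chains(parsed)
edges_c1 = topology_edges(groups["c1"])
```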
Step 209: Compare, pairwise, all forwarding relation chains corresponding to each original content ID, and remove any forwarding relation chain whose forwarding user IDs, and the order in which they are arranged, starting from the first position in the chain, are completely contained in another forwarding relation chain.
In this step, a deduplication operation is performed on all the forwarding relation chains to reduce the complexity of the forwarding relationship topology graph. The deduplication rule can be set by a person skilled in the art according to actual statistical needs, or it can be to remove any forwarding relation chain whose forwarding user IDs, and the order in which they are arranged, starting from the first position in the chain, are completely contained in another forwarding relation chain. For example, if the forwarding relation chain obtained from one forwarded microblog is A → B → C → D and the chain obtained from another is A → B → C → D → E, then A → B → C → D → E is retained and A → B → C → D is removed. Because A → B → C → D → E already contains the forwarding path A → B → C → D, chains such as A → B → C → D, A → B → C, and A → B can all be removed.
Fig. 3 is a schematic structural diagram of Embodiment 1 of the apparatus for determining the propagation path of microblog data according to the present invention. As shown in Fig. 3, the apparatus of this embodiment includes: a collection module 31, configured to collect microblog data, where the microblog data includes content information of the microblog data and attribute information of the microblog data, and the attribute information includes the publisher ID of the microblog data and a content ID uniquely corresponding to the content information of the microblog data; a parsing module 32, configured to parse each piece of collected microblog data; a determining module 33, configured to determine, from the content information of each piece of microblog data, whether the microblog data includes forwarded microblog data; an acquisition module 34, configured to obtain the author ID of the forwarded microblog data and the original content ID uniquely corresponding to the content information of the forwarded microblog data; the determining module 33 being further configured to determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, to form a forwarding relation chain, and to determine, according to the original content ID, all forwarding relation chains corresponding to the original content ID among all the microblog data; and a deduplication module 35, configured to perform a deduplication operation on all forwarding relation chains corresponding to each original content ID, to obtain the propagation path of the microblog data corresponding to each original content ID.
The apparatus of this embodiment can be used to carry out the technical solution of method Embodiment 1 shown in Figure 1A. Its implementation principle and technical effects are similar and are not repeated here.
In the apparatus for determining the propagation path of microblog data provided by this embodiment, microblog data is collected and each piece of collected microblog data is parsed, so that the forwarded microblog data it contains is determined from its content information, and the author ID and original content ID of the forwarded microblog data are determined from the forwarded microblog data. It is then determined whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID, to form a forwarding relation chain for that piece of microblog data. Next, according to the original content ID, all forwarding relation chains corresponding to the original content ID are determined among all the microblog data. Finally, a deduplication operation is performed on all forwarding relation chains corresponding to each original content ID, yielding the propagation path of the microblog data corresponding to each original content ID. Microblog data spread over the Internet can thus be traced to its source and its propagation path established, safeguarding the information security interests of the state and the public.
Fig. 4 is a schematic structural diagram of Embodiment 2 of the apparatus for determining the propagation path of microblog data according to the present invention. As shown in Fig. 4, the apparatus of this embodiment builds on the apparatus shown in Fig. 3. Further, the determining module 33 includes: an ID determination sub-module 331, configured to determine whether the content information of the microblog data contains forwarding user IDs lying between the publisher ID and the author ID; and a sequence determination sub-module 332, configured to, after the ID determination sub-module 331 determines that forwarding user IDs lying between the publisher ID and the author ID exist, form a forwarding sequence according to the order in which the forwarding user IDs are arranged, place the author ID at the start of the forwarding sequence and the publisher ID at the end of the forwarding sequence, and form the forwarding relation chain. The sequence determination sub-module 332 is further configured to, after the ID determination sub-module 331 determines that no forwarding user IDs lying between the publisher ID and the author ID exist, form a forwarding relation chain that runs only from the author ID to the publisher ID.
Optionally, the determining module 33 includes: a locating sub-module 333, configured to locate a text editing field in the content information of the microblog data; a marker determination sub-module 334, configured to determine whether a forwarding marker exists in the text editing field; and an extraction sub-module 335, configured to extract the forwarding user ID identified by the forwarding marker after the marker determination sub-module 334 determines that the forwarding marker exists.
Optionally, the attribute information of the microblog data further includes the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data. Accordingly, the apparatus further includes a classification and sorting module 36, configured to classify and sort the collected microblog data according to at least one of the publication time of the microblog data, the source website of the microblog data, and the URL of the microblog data. The parsing module 32 is specifically configured to parse the collected microblog data piece by piece, in the order that results from the classification and sorting.
Optionally, the deduplication module 35 is specifically configured to compare, pairwise, all forwarding relation chains corresponding to each original content ID, and to remove any forwarding relation chain whose forwarding user IDs, and the order in which they are arranged, starting from the first position in the chain, are completely contained in another forwarding relation chain.
The apparatus of this embodiment can be used to carry out the technical solution of method Embodiment 2 shown in Fig. 2. Its implementation principle and technical effects are similar and are not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.