CN110162580A

CN110162580A - Data mining and depth analysis method and application based on distributed early warning platform

Info

Publication number: CN110162580A
Application number: CN201910440837.0A
Authority: CN
Inventors: 张玉兰; 朱世伟; 于俊凤; 魏墨济; 李晨; 李宪毅; 杨爱芹
Original assignee: Hefei Pengyu Data Technology Service Co Ltd
Current assignee: Hefei Pengyu Data Technology Service Co Ltd
Priority date: 2019-05-24
Filing date: 2019-05-24
Publication date: 2019-08-23

Abstract

The present disclosure provides a data mining and in-depth analysis method based on a distributed early warning platform, and an application thereof. The data mining and in-depth analysis method comprises: acquiring and clustering social media data to obtain clustered topics; extracting topic features of the clustered topics and mapping them to meta-topics to describe user interests, the meta-topics taking known categories of cybercrime as the classification benchmark; on the basis of identifying the accounts that publish viewpoints, extracting user viewpoint features of the clustered topics and mining user expression rules with the Apriori algorithm to characterize the user viewpoint features; describing user social features using known user social networks; and, for accounts that publish legally prohibited content, constructing user features comprising user interest features, user viewpoint features and user social features, and computing the identity and organizational relationships of a user account to be detected from the similarity of the user features.

Description

Data mining and in-depth analysis method based on a distributed early warning platform, and application thereof

Technical field

The disclosure belongs to the field of data mining and analysis, and in particular relates to a data mining and in-depth analysis method based on a distributed early warning platform, and an application thereof.

Background art

The statements in this section merely provide background information related to the disclosure and do not necessarily constitute prior art.

Massive amounts of information are generated on the internet every day, and this information may pose "content threats", such as violent and pornographic information. The inventors found that, because internet information is enormous in volume and highly varied in type, identifying suspicious accounts and their relationship networks is inefficient and inaccurate, so that suspicious accounts cannot be placed under "task tracking" monitoring in a timely manner.

Summary of the invention

To solve the above problem, a first aspect of the disclosure provides a data mining and in-depth analysis method based on a distributed early warning platform, which processes information efficiently and accurately.

The technical solution of the data mining and in-depth analysis method based on a distributed early warning platform of the first aspect of the disclosure is as follows:

A data mining and in-depth analysis method based on a distributed early warning platform, comprising:

acquiring and clustering social media data to obtain clustered topics;

extracting topic features of the clustered topics, and mapping the topic features of the clustered topics to meta-topics to describe user interests, the meta-topics taking known categories of cybercrime as the classification benchmark;

on the basis of identifying the accounts that publish viewpoints, extracting user viewpoint features of the clustered topics, and mining user expression rules with the Apriori algorithm to characterize the user viewpoint features;

describing user social features using known user social networks;

for accounts that publish legally prohibited content, constructing user features comprising user interest features, user viewpoint features and user social features, and computing the identity and organizational relationships of a user account to be detected from the similarity of the user features.

To solve the above problem, a second aspect of the disclosure provides a distributed early warning platform, which processes information efficiently and accurately.

The technical solution of the distributed early warning platform of the second aspect of the disclosure is as follows:

A distributed early warning platform, comprising:

a master node and Map nodes and Reduce nodes connected to it, the master node, Map nodes and Reduce nodes each comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform described above.

To solve the above problem, a third aspect of the disclosure provides a computer-readable storage medium, which processes information efficiently and accurately.

The technical solution of the computer-readable storage medium of the third aspect of the disclosure is as follows:

A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform described above.

The beneficial effects of the disclosure are:

(1) For accounts that publish legally prohibited content, the disclosure constructs user features comprising user interest features, user viewpoint features and user social features. From the similarity of the user features, the identity and organizational relationships of a user account to be detected can be computed accurately and efficiently, so that suspicious accounts and their relationship networks are monitored accurately.

(2) For the interest, viewpoint and social features, the disclosure assigns a different weight to each feature. Through multi-dimensional similarity calculation, in which the dimensions are weighted differently, the similarity between user accounts is obtained; then, according to a set threshold, the likelihood that two different virtual user accounts among all user accounts belong to the same user can be accurately determined.

(3) The disclosure realizes the identification and monitoring of online sensitive events, accounts and account organizational relationship information, realizes the self-evolution function of the system, and achieves timely and effective monitoring of sensitive network information.

Brief description of the drawings

The accompanying drawings, which form a part of the disclosure, are provided for further understanding of the disclosure. The illustrative embodiments of the disclosure and their descriptions serve to explain the disclosure and do not unduly limit it.

Fig. 1 is a flowchart of a data mining and in-depth analysis method based on a distributed early warning platform provided by an embodiment of the disclosure.

Specific embodiment

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the disclosure. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the disclosure belongs.

It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the disclosure. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. In addition, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

This embodiment uses forums, microblogs, news, blogs, social networking sites and the like on the internet as data sources.

The distributed early warning platform mines and analyzes the relevant page data. A platform for mining sensitive network information and for self-evolving early warning is developed, which completes the collection, judgment and early warning of sensitive network information.

In this embodiment, the distributed early warning platform includes a distributed server cluster, configured as follows:

(1) Hardware configuration

CPU: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz;

Memory: more than 2 GB occupied at run time.

(2) Operating system

CentOS 5.5 or later, Ubuntu 10.04 or later, RHEL 4 or later.

(3) Runtime environment configuration

Java environment: 64-bit JDK 1.7;

Web server: Apache Tomcat 7, JBoss 6.0 or later, or WebLogic Server 11g or later;

Hadoop integration environment: Hadoop 2.7.3 is used as the cluster runtime environment;

Memcached environment: Memcached 1.4.5 is used for storage management of the underlying data;

Spark integration environment: Spark 2.1.0 is used as the cluster runtime environment;

Storm: Storm 0.9.7 is used as the cluster runtime environment;

Zookeeper environment: to coordinate the distributed cluster, the system uses Zookeeper 3.4.6 for coordinated management of the cluster.

As shown in Fig. 1, the data mining and in-depth analysis method based on a distributed early warning platform of this embodiment comprises at least:

S101: Acquire and cluster social media data to obtain clustered topics.

In this embodiment, the social media data are data from popular news websites, microblogs, communities, forums and the like on the internet.

K-means is a typical partition-based method whose purpose is to group data into several clusters, such that objects within the same cluster are highly similar and objects in different clusters are as different as possible. The algorithm first selects K random center points, each of which initially represents the center (mean) of a cluster. Each remaining document is then assigned, iteratively, to the nearest cluster according to its distance to the cluster centers (the distance is calculated as described for text similarity detection below), after which the mean of each cluster is recomputed and the cluster centers are adjusted. This process is repeated until every object has been assigned to a cluster.

The time complexity of K-means is O(nkt), where t is the number of iterations, n is the number of documents and k is the number of clusters. Usually k, t << n, so K-means is highly efficient. The main advantages of K-means are that the idea is clear, the implementation is simple, the algorithm is efficient, and good clustering results are obtained for convex data. Its disadvantages are that the clustering result depends heavily on the choice of K and of the initial centers, and that the result is only a locally optimal solution.
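The K-means loop described above can be sketched on a single machine as follows. This is a minimal illustration, assuming the documents have already been converted to numeric vectors and that squared Euclidean distance is used; the function name `kmeans` and the optional `init_centers` parameter are illustrative and do not come from the disclosure.

```python
import random

def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Plain K-means on numeric tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # K random initial centers unless the caller supplies them (hypothetical option).
    centers = list(init_centers) if init_centers else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:      # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = kmeans(points, 2, init_centers=[(0.0, 0.0), (5.0, 5.0)])
```

Each iteration touches every point once per center, which is where the O(nkt) cost comes from. The dependence on the initial centers noted above shows up here as the `init_centers` and `seed` arguments.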

Therefore, the distributed early warning platform of this embodiment, which includes a master node and Map nodes and Reduce nodes connected to it, clusters the social media data into topics using the K-modes clustering algorithm.

The detailed procedure of the K-modes clustering algorithm distributed over MapReduce is shown in Table 1.

Table 1: K-modes clustering algorithm based on MapReduce

K-modes clustering is an iterative task; when the task converges, good results can be obtained.
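Because the body of Table 1 is not reproduced here, the following is only a hedged sketch of how one K-modes iteration might be split into a map phase and a reduce phase. It assumes categorical records compared by simple matching (Hamming) distance, and it simulates both phases in one Python process; the function names are illustrative and are not a Hadoop API.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of attribute positions in which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def map_assign(partition, modes):
    """Map phase: emit (index of nearest mode, record) for each record of one partition."""
    return [(min(range(len(modes)), key=lambda c: hamming(r, modes[c])), r)
            for r in partition]

def reduce_modes(pairs, modes):
    """Reduce phase: the new mode of a cluster is the most common value per attribute."""
    groups = defaultdict(list)
    for cluster, record in pairs:
        groups[cluster].append(record)
    new_modes = list(modes)
    for cluster, members in groups.items():
        new_modes[cluster] = tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members))
    return new_modes

def kmodes(partitions, modes, max_iter=20):
    """Repeat map and reduce until the modes stop changing (convergence)."""
    for _ in range(max_iter):
        pairs = [kv for part in partitions for kv in map_assign(part, modes)]
        new_modes = reduce_modes(pairs, modes)
        if new_modes == modes:
            break
        modes = new_modes
    return modes

# Hypothetical records: (topic word, media type, time of day).
partitions = [
    [("fraud", "link", "night"), ("gambling", "image", "day"), ("fraud", "text", "night")],
    [("fraud", "link", "night"), ("gambling", "image", "day"), ("gambling", "text", "day")],
]
final_modes = kmodes(partitions, [("fraud", "text", "night"), ("gambling", "text", "day")])
```

In an actual MapReduce job the current modes would be broadcast to every Map node at the start of each iteration, and the driver on the master node would decide whether to launch another round.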

S102: Extract the topic features of the clustered topics, and map the topic features of the clustered topics to meta-topics to describe user interests; the meta-topics take known categories of cybercrime as the classification benchmark.

Here, keywords, headline words, indicator words and the like are extracted as the topic features of the clustered topics.

The topic features of the clustered topics are mapped to meta-topics to describe user interests using the LSA method.

Latent semantic analysis (LSA) is one of the basic techniques of topic modeling. Its core idea is to decompose the document-term matrix into a document-topic matrix and a topic-term matrix that are independent of each other.

The first step is to generate the document-term matrix. Given m documents and a vocabulary of n words, an m × n matrix A can be constructed in which each row represents a document and each column represents a word. In the simplest version of LSA, each entry is simply the raw count of occurrences of the j-th word in the i-th document. In practice, however, raw counts do not work well because they cannot account for the importance of each word in the document. For example, compared with "test", the word "nuclear" is probably more indicative of the topic of a given article.

Therefore, LSA models usually replace the raw counts in the document-term matrix with tf-idf scores. Tf-idf, i.e. term frequency-inverse document frequency, assigns a corresponding weight to term j in document i.

That is, the more frequently a term occurs in a document, the greater its weight; and the less frequently the term occurs across the corpus, the greater its weight.
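The weighting just described can be illustrated with a short sketch. The tf-idf formula itself is not reproduced in the text, so the code assumes the common form w(i, j) = tf(i, j) × log(N / df(j)), where N is the number of documents and df(j) is the number of documents containing term j. The whitespace tokenizer and the function name `tfidf_matrix` are also assumptions.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with matrix[i][j] = tf(i, j) * log(N / df(j))."""
    tokenized = [d.lower().split() for d in docs]          # naive whitespace tokenizer
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(w for doc in tokenized for w in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)                                  # raw term counts in this document
        matrix.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, matrix

vocab, A = tfidf_matrix(["nuclear test", "test result", "test plan"])
```

In this toy corpus "test" occurs in every document, so its weight is zero everywhere, while "nuclear" occurs in only one document and receives the largest weight in that row.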

Once the document-term matrix A is available, the latent topics can be considered. To find the small number of latent topics that capture the relationships between words and documents, the dimensionality of A must be reduced.

This dimensionality reduction can be performed with truncated SVD. SVD, i.e. singular value decomposition, is a technique from linear algebra that factorizes an arbitrary matrix M into the product of three matrices, $M = U S V^{T}$, where S is the diagonal matrix of the singular values of M. Truncated SVD reduces the dimensionality by selecting the t largest singular values and keeping only the first t columns of U and V. Here t is a hyperparameter that can be selected and adjusted according to the number of topics sought.

Intuitively, truncated SVD can be regarded as keeping only the t most important dimensions of the transformed space.
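Written out, the rank-t truncation amounts to the approximation below. The subscript t, marking the truncated factors, is notation introduced here for illustration; the singular values are assumed to be sorted in decreasing order.

```latex
A \approx A_t = U_t \, S_t \, V_t^{T}, \qquad
U_t \in \mathbb{R}^{m \times t}, \quad
S_t = \operatorname{diag}(\sigma_1, \ldots, \sigma_t) \in \mathbb{R}^{t \times t}, \quad
V_t \in \mathbb{R}^{n \times t}, \quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_t \ge 0
```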

In this case, $U \in \mathbb{R}^{m \times t}$ is the document-topic matrix and $V \in \mathbb{R}^{n \times t}$ is the term-topic matrix. In both U and V, each column corresponds to one of the t topics. In U, the rows are document vectors expressed in terms of topics; in V, the rows are term vectors expressed in terms of topics.

With these document vectors and term vectors, measures such as cosine similarity can be used to assess: 1) the similarity of different documents; 2) the similarity of different words; 3) the similarity of terms (or "queries") and documents, which is very useful for information retrieval, i.e. for retrieving the passages most relevant to a query. The advantage of the LSA method is that it is fast and efficient.
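As a small illustration of the cosine measure mentioned above, the sketch below compares vectors in a hypothetical two-topic space. The vectors are made-up values, not output of an actual LSA run, and the zero-vector convention is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical rows of the document-topic matrix (one row per document).
doc_a = [1.0, 2.0]
doc_b = [2.0, 4.0]   # same direction as doc_a, so the topic mix is identical
doc_c = [3.0, 0.0]
sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
```

Because cosine similarity depends only on direction, `doc_a` and `doc_b` count as maximally similar even though their magnitudes differ.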

S103: On the basis of identifying the accounts that publish viewpoints, extract the user viewpoint features of the clustered topics, and mine user expression rules with the Apriori algorithm to characterize the user viewpoint features.

Here, subjective words, sentiment words and the like are extracted as user viewpoint features.

The main idea of the Apriori algorithm is a level-wise search: frequent 1-itemsets are found first, frequent 2-itemsets are then found from the frequent 1-itemsets, and so on until the frequent N-itemsets are found.

Two steps are especially important in finding candidate frequent itemsets.

Step 1, called the join step, finds all possible N-itemsets from the frequent (N-1)-itemsets already found, i.e. it merges two qualifying lower-order itemsets into a higher-order itemset. The condition is that the two selected frequent (N-1)-itemsets must have N-2 items in common; the N-itemset produced by the merge contains all items of the two frequent (N-1)-itemsets. Merging all such pairs yields the complete set of candidate N-itemsets.

Step 2 is called the prune step. It relies on the following fact: if any subset of an itemset is infrequent, then the itemset itself is infrequent. For example, if {n1, n2} is not a frequent itemset, then {n1, n2, n3} cannot be a frequent itemset, because the frequency of {n1, n2} does not reach the minimum support set by the user, and so the frequency of {n1, n2, n3} cannot reach it either. Based on this fact, candidate itemsets that have an infrequent subset can be discarded directly, which improves the efficiency of the algorithm.

The detailed procedure of the Apriori algorithm is:

1. Scan the data set to obtain the candidate 1-itemsets.

2. On the basis of step 1, obtain the N-itemsets from the (N-1)-itemsets. This step is repeated in a loop; its two main operations are the join step and the prune step.

3. The loop continues until no new results are generated.
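The join and prune steps can be sketched as follows. This is a minimal single-machine illustration, assuming that the minimum support is given as an absolute count and that transactions are small sets of items; the function name `apriori` and the sample data are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the data set per level.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: candidate 1-itemsets, filtered by support.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Join step: merge two frequent (k-1)-itemsets that share k-2 items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: n for c, n in count(candidates).items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

data = [{"n1", "n2", "n3"}, {"n1", "n2"}, {"n1", "n3"}, {"n2", "n3"}, {"n1", "n2", "n3"}]
frequent_itemsets = apriori(data, min_support=3)
```

With a minimum support count of 3, every single item and every pair is frequent in this sample, but {n1, n2, n3} occurs in only two transactions and is rejected, at which point the loop ends because no new results are generated.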

In a specific implementation, the process of mining user expression rules with the Apriori algorithm to characterize the user viewpoint features is:

S1031: The user viewpoint features of the clustered topics are stored in the original sequence database, and the original sequence database is divided evenly into n disjoint sub-sequence databases, where n is a positive integer;

So that the original sequence database need not be scanned every time candidate sequence patterns are counted, and to reduce I/O overhead, each sub-sequence database should fit into an in-memory RDD.

S1032: The master node of the distributed early warning platform dispatches the n sub-sequence databases to different Map worker nodes. Each Map worker node executes the sequential pattern mining algorithm and, according to the set minimum support, scans the sub-sequence database held in its memory and computes the local sequence patterns;

S1033: The local sequence patterns obtained are passed to the Reduce worker nodes, which merge them to obtain the global candidate sequence patterns;

S1034: The original sequence database is scanned again to find the sequence patterns whose support is not less than the preset minimum support, thereby obtaining the characterization of the user viewpoint features.
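Steps S1031 to S1034 can be sketched in a single process as follows. The sketch rests on several simplifying assumptions: each sequence is a tuple of single items rather than of itemsets, local mining is done by brute-force enumeration of subsequences up to `max_len` items, and the Map and Reduce workers are simulated by ordinary function calls. All names are illustrative.

```python
from itertools import combinations

def contains(sequence, pattern):
    """True if pattern occurs in sequence as an order-preserving subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def local_patterns(partition, min_support, max_len=3):
    """Map side (S1032): patterns whose support within one partition is >= min_support."""
    candidates = {p for seq in partition
                  for n in range(1, max_len + 1)
                  for p in combinations(seq, n)}
    return {p for p in candidates
            if sum(contains(s, p) for s in partition) / len(partition) >= min_support}

def mine(database, n_parts, min_support):
    # S1031: split the database into n disjoint sub-databases of nearly equal size.
    parts = [database[i::n_parts] for i in range(n_parts)]
    # S1032: every (simulated) Map worker mines its own partition.
    local = [local_patterns(p, min_support) for p in parts if p]
    # S1033: the (simulated) Reduce worker merges local patterns into global candidates.
    global_candidates = set().union(*local)
    # S1034: rescan the full database and keep candidates that reach the minimum support.
    supports = {p: sum(contains(s, p) for s in database) / len(database)
                for p in global_candidates}
    return {p: s for p, s in supports.items() if s >= min_support}

# Hypothetical viewpoint-word sequences, one per account.
db = [("angry", "blame", "threat"), ("angry", "threat"),
      ("praise", "blame"), ("angry", "blame", "threat")]
patterns = mine(db, n_parts=2, min_support=0.5)
```

The final rescan is needed because a pattern that is frequent in one partition may be infrequent overall, as ("praise",) is here. Conversely, a pattern that is frequent overall must be frequent in at least one partition, so the merged candidates miss nothing.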

In step S1032, the process by which each Map worker node executes the sequential pattern mining algorithm is:

Given a minimum support ξ, if the support of a sequence S in the sequence database is not less than ξ, the sequence S is called a sequence pattern;

Here, the support of a sequence S in the sequence database is the percentage of sequences in the database that contain S; the support count of S in the sequence database is the number of sequences in the database that contain S.

The sequential pattern mining algorithm uses the following definitions:

Definition 1: A non-empty set $I = \{i_k \mid k = 1, 2, \ldots, n\}$ is called an itemset, where each $i_k$ is called an item.

Definition 2: A sequence is an ordered arrangement of itemsets. A sequence S can be expressed as $S = \langle I_1, I_2, \ldots, I_n \rangle$. The number of items a sequence contains is called the length of the sequence; a sequence of length L is denoted an L-sequence.

Definition 3: A sequence database consists of tuples <Sid, S>, where Sid is the sequence number and S is the sequence.

Table 2 shows an example sequence database.

Table 2: Sequence database

Definition 4: Let sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ and sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$. If there exist integers $1 \le j_1 < j_2 < \cdots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$, then α is called a subsequence of β, and β is said to contain α.
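The containment test of Definition 4 and the two support measures can be sketched as below. The sample database is hypothetical (the body of Table 2 is not reproduced here), and the function names are illustrative. Each sequence is a list of itemsets, and the database is a list of (Sid, S) pairs as in Definition 3.

```python
def is_subsequence(alpha, beta):
    """Definition 4: every itemset of alpha is a subset of a distinct itemset of beta,
    with the matched itemsets of beta appearing in the same order."""
    j = 0
    for a in alpha:
        # Greedily advance to the first remaining itemset of beta that contains a.
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support_count(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for _sid, seq in database)

def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return support_count(database, pattern) / len(database)

# Hypothetical sequence database of (Sid, S) pairs.
seq_db = [
    (1, [{"a"}, {"a", "b", "c"}, {"a", "c"}]),
    (2, [{"a", "d"}, {"c"}]),
    (3, [{"e", "f"}, {"a", "b"}]),
]
pattern = [{"a"}, {"c"}]
count = support_count(seq_db, pattern)
```

Greedy matching is sufficient here: taking the earliest itemset of β that contains the current itemset of α never rules out a later match. With a minimum support ξ of 0.5, the pattern above would qualify as a sequence pattern, since two of the three sequences contain it.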

S104: Describe the user social features using known user social networks.

In a specific embodiment, the user social features are attributes of the social network in which the user account is located, such as the name and number of the social network.

The user social network can be constructed by mining organizational relationships.

S105: For accounts that publish legally prohibited content, construct user features comprising user interest features, user viewpoint features and user social features, and compute the identity and organizational relationships of a user account to be detected from the similarity of the user features.

In a specific implementation, the process of computing the organizational relationships of a user account to be detected from the similarity of the user features is:

S1051: According to the multi-dimensional characteristics of the user interest features, user viewpoint features and user social features, a user interest preference matrix, a user viewpoint matrix and a social relationship matrix are constructed respectively, and the user interest preference similarity matrix, the user viewpoint similarity matrix and the social relationship matrix are computed with the cosine algorithm;

Here, the interest features include time, place and time dimensions, the viewpoint features include event and attitude dimensions, and the social features include keyword co-occurrence and viewpoint co-citation dimensions.

S1052: Corresponding weights are assigned to the user interest preference similarity matrix, the user viewpoint similarity matrix and the social relationship matrix respectively, and the three similarity matrices are linearly weighted to obtain the weighted result;

S1053: The weighted result is compared with a set threshold to obtain, for each user account, the top k most closely related people, which gives the organizational relationships of that user account; k is a positive integer.
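The linear weighting and top-k selection of S1052 and S1053 can be sketched as follows. The weights, the threshold and the three small similarity matrices are hypothetical values chosen only for illustration, and the function names are not from the disclosure.

```python
def fuse(sim_interest, sim_viewpoint, sim_social, weights):
    """S1052: element-wise weighted sum of three account-by-account similarity matrices."""
    w1, w2, w3 = weights
    n = len(sim_interest)
    return [[w1 * sim_interest[i][j] + w2 * sim_viewpoint[i][j] + w3 * sim_social[i][j]
             for j in range(n)] for i in range(n)]

def top_k_related(fused, account, k, threshold):
    """S1053: the k most similar other accounts whose fused score reaches the threshold."""
    scored = [(score, other) for other, score in enumerate(fused[account])
              if other != account and score >= threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # highest score first
    return [other for _score, other in scored[:k]]

# Hypothetical cosine-similarity matrices for three accounts (indices 0, 1, 2).
interest = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
viewpoint = [[1.0, 0.8, 0.3], [0.8, 1.0, 0.1], [0.3, 0.1, 1.0]]
social = [[1.0, 0.7, 0.0], [0.7, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused = fuse(interest, viewpoint, social, weights=(0.5, 0.3, 0.2))
related_to_0 = top_k_related(fused, account=0, k=2, threshold=0.5)
```

Here account 1 passes the threshold for account 0 while account 2 does not, so fewer than k accounts are returned. How the weights and threshold are actually chosen is not specified in the text.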

It should be noted that the order of steps S102 to S104 can be adjusted arbitrarily without affecting the final result of the overall data mining and in-depth analysis method based on a distributed early warning platform.

In another embodiment, before the social media data are acquired, the method further comprises:

initializing the data sources, and extracting new URL parts from the raw data as candidate URLs for judging whether a data source is new;

calculating the string similarity between the candidate URL and the initialized data sources using an edit-distance matching algorithm.

Edit distance is the minimum number of insertion, deletion and substitution operations needed to convert one string into another; it is also a measure of the similarity between strings. Edit distance is denoted ed.

In other words, the edit distance is the minimum number of edit operations needed to change string S into string T. The edit distance between two strings is defined as the minimum cost over all operation sequences. In essence, finding the edit distance between two strings is a process of seeking an optimal solution.

The formula for the similarity of two data source strings based on edit distance is:

where ld is the edit distance between the two strings, m and n are the lengths of the two strings, and a larger Sim value indicates that the two strings are more similar.
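A minimal sketch of the edit-distance computation and a similarity score follows. The similarity formula is not reproduced in the text, so the code assumes the common normalization Sim = 1 - ld / max(m, n); that choice, the function names and the sample URLs are assumptions.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))           # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def similarity(s, t):
    """Assumed normalization: 1 - ld / max(m, n); two empty strings count as identical."""
    if not s and not t:
        return 1.0
    return 1 - edit_distance(s, t) / max(len(s), len(t))

# Hypothetical initialized data source and candidate URL.
score = similarity("http://news.example.com", "http://blog.example.com")
```

Under this normalization the score lies between 0 and 1, and a smaller edit distance yields a larger similarity, which matches the statement that a larger Sim value means more similar strings.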

The Storm-based data source evolution module has three parts: data access, data processing and data landing.

(1) Data access: stream data is unpredictable and may be highly concurrent. To be able to carry high-speed stream data, Kafka is used as the message middleware; it receives the data of each website extracted by the distributed data collection module based on the Hadoop distributed platform and then passes it to Storm as a data stream.

(2) Data processing: the spout receives data from Kafka and generates the first tuple, which contains two fields whose values are the original URL and the extracted URL. The first bolt takes this tuple as input, obtains the similarity of the two strings with the edit-distance-based string similarity algorithm, and determines from the set threshold whether the URL is a new data source; if so, it emits a second tuple, whose field value is the new data source, to the second bolt;

(3) Data landing: the second bolt writes the second tuple; after serialization it is converted into a byte stream and written to Memcached, which reduces storage pressure.
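The data flow through the three parts can be mimicked with plain Python generators, as in the sketch below. This is not Storm, Kafka or Memcached code: the spout and bolts are ordinary functions, the message list stands in for a Kafka topic, a dictionary stands in for Memcached, JSON is an assumed serialization format, and the similarity function is passed in by the caller. All names are illustrative.

```python
import json

def spout(messages):
    """Stand-in for the Kafka-fed spout: emit tuples with two fields."""
    for source_url, candidate_url in messages:
        yield {"source": source_url, "candidate": candidate_url}

def similarity_bolt(tuples, similarity, threshold):
    """First bolt: forward a candidate as a new data source if it reaches the threshold."""
    for tup in tuples:
        if similarity(tup["source"], tup["candidate"]) >= threshold:
            yield {"new_source": tup["candidate"]}

def storage_bolt(tuples, store):
    """Second bolt: serialize each tuple to bytes and write it to the key-value store."""
    for tup in tuples:
        store[tup["new_source"]] = json.dumps(tup).encode("utf-8")
    return store

# Toy similarity used only for this demonstration (not the edit-distance measure).
def toy_similarity(source, candidate):
    return 1.0 if candidate.endswith(".example") else 0.0

messages = [("a.example", "b.example"), ("a.example", "zzz")]
store = storage_bolt(similarity_bolt(spout(messages), toy_similarity, 0.5), {})
```

Chaining generators keeps each stage independent of the others, which loosely mirrors how a topology wires a spout to successive bolts.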

If the string similarity between the candidate URL and the initialized data sources is not less than the preset similarity threshold, the data at the candidate URL is initialized and used as social media topic data; the smaller the edit distance, the greater the similarity.

In another embodiment, a distributed early warning platform is provided, comprising a master node and Map nodes and Reduce nodes connected to it, the master node, Map nodes and Reduce nodes each comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform shown in Fig. 1.

In another embodiment, a computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform shown in Fig. 1.

Those skilled in the art will appreciate that embodiments of the disclosure may be provided as a method, a system or a computer program product. Accordingly, the disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.

The disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Those of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.

Although specific embodiments of the disclosure have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the disclosure. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the disclosure without creative effort still fall within the scope of protection of the disclosure.

Claims

1. A data mining and in-depth analysis method based on a distributed early warning platform, characterized by comprising:

acquiring and clustering social media data to obtain clustered topics;

extracting topic features of the clustered topics, and mapping the topic features of the clustered topics to meta-topics to describe user interests, the meta-topics taking known categories of cybercrime as the classification benchmark;

on the basis of identifying the accounts that publish viewpoints, extracting user viewpoint features of the clustered topics, and mining user expression rules with the Apriori algorithm to characterize the user viewpoint features;

describing user social features using known user social networks;

for accounts that publish legally prohibited content, constructing user features comprising user interest features, user viewpoint features and user social features, and computing the identity and organizational relationships of a user account to be detected from the similarity of the user features.

2. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 1, characterized in that the process of computing the organizational relationships of a user account to be detected from the similarity of the user features is:

according to the multi-dimensional characteristics of the user interest features, user viewpoint features and user social features, constructing a user interest preference matrix, a user viewpoint matrix and a social relationship matrix respectively, and computing the user interest preference similarity matrix, the user viewpoint similarity matrix and the social relationship matrix with the cosine algorithm;

assigning corresponding weights to the user interest preference similarity matrix, the user viewpoint similarity matrix and the social relationship matrix respectively, and linearly weighting the three similarity matrices to obtain the weighted result;

comparing the weighted result with a set threshold to obtain, for each user account, the top k most closely related people, thereby obtaining the organizational relationships of that user account, where k is a positive integer.

3. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 1, characterized in that the social media data are clustered into topics using the K-modes clustering algorithm.

4. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 1, characterized in that the topic features of the clustered topics are mapped to meta-topics to describe user interests using the LSA method.

5. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 1, characterized in that the process of mining user expression rules with the Apriori algorithm to characterize the user viewpoint features is:

storing the user viewpoint features of the clustered topics in the original sequence database, and dividing the original sequence database evenly into n disjoint sub-sequence databases, where n is a positive integer;

dispatching, by the master node of the distributed early warning platform, the n sub-sequence databases to different Map worker nodes, each Map worker node executing the sequential pattern mining algorithm and, according to the set minimum support, scanning the sub-sequence database held in its memory and computing the local sequence patterns;

passing the local sequence patterns obtained to the Reduce worker nodes, which merge them to obtain the global candidate sequence patterns;

scanning the original sequence database again to find the sequence patterns whose support is not less than the preset minimum support, thereby obtaining the characterization of the user viewpoint features.

6. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 5, characterized in that the process by which each Map worker node executes the sequential pattern mining algorithm is:

given a minimum support ξ, if the support of a sequence S in the sequence database is not less than ξ, the sequence S is called a sequence pattern;

wherein the support of a sequence S in the sequence database is the percentage of sequences in the database that contain S, and the support count of S in the sequence database is the number of sequences in the database that contain S.

7. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 1, characterized in that, before the social media data are acquired, the method further comprises:

initializing the data sources, and extracting new URL parts from the raw data as candidate URLs for judging whether a data source is new;

8. The data mining and in-depth analysis method based on a distributed early warning platform according to claim 7, characterized in that, if the string similarity between the candidate URL and the initialized data sources is not less than the preset similarity threshold, the data at the candidate URL is initialized and used as social media topic data; the smaller the edit distance, the greater the similarity.

9. A distributed early warning platform, comprising a master node and Map nodes and Reduce nodes connected to it, the master node, Map nodes and Reduce nodes each comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform according to any one of claims 1-8.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the data mining and in-depth analysis method based on a distributed early warning platform according to any one of claims 1-8.