CN109284436A

CN109284436A - Path planning method for searching an unknown information network, and network piracy discovery system

Info

Publication number: CN109284436A
Application number: CN201811285660.3A
Authority: CN
Inventors: 金哲凡
Original assignee: Zhejiang University of Media and Communications
Current assignee: Zhejiang University of Media and Communications
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-01-29
Anticipated expiration: 2038-10-31
Also published as: CN109284436B

Abstract

The present invention provides a path planning method for searching an unknown information network, applied to an information network in which the attributes of all nodes are initially unknown. The method comprises the following steps: S1, if a node is found to have a particular attribute, its associated value is set to a positive value, and the associated values of its surrounding nodes are also given positive values whose magnitude decreases with distance from the source node; S2, nodes with larger positive values are accessed first, and if an accessed node has the particular attribute, step S1 is repeated. The method applies when an intelligent system searches an unknown information network for nodes having a particular attribute. The purpose of the invention is to plan the search path rationally so as to improve search efficiency, and at the same time to enable the search-based discovery of network piracy.

Description

Path planning method for searching an unknown information network, and network piracy discovery system

Technical field

The present invention relates to the field of information technology, and in particular to a path planning method for searching an unknown information network and to a network piracy discovery system that uses the method.

Background art

An information network generally consists of nodes and the connections between them. Each node contains two kinds of information: (1) content information and (2) link information. Content information may take the form of text, images, sound, video and so on, and its meaning depends on the specific application. Link information points to other nodes, and the system can use it to find them. Link information is sometimes called a link, an address, etc.

In general, an "attribute of a node" refers to some characteristic of the node's content information, for example whether a text is an advertisement or not; whether a sound is speech, music or the noise of a busy market; whether a video contains illegal content; and so on. Determining whether a node has a given attribute generally costs (human or machine) resources.

In general, the information network is unknown to the search system. The system comes to know the network gradually, step by step. During this process, the system's knowledge of a node can be in one of the following states:

One, hidden: the system is completely unaware that the node exists;

Two, discovered but not accessed: the system has learned of the node's existence from a neighboring node but has not yet obtained its data, let alone analyzed its information;

Three, links known but content unknown: the system knows the node's link information but does not know its content (whether it has a given attribute);

Four, content known but links unknown: the system knows the node's content but does not know its link information;

Five, fully known.

The search system discovers the network by accessing nodes one by one, and in the process the information of hidden nodes is gradually revealed. The system internally stores information on a set of nodes, each of which is in one of states two to five above. The system must decide where to go next, that is, select one node from the many nodes in states two, three and four, obtain its information or carry out further analysis, and repeat. The system's goal is to find the nodes with the particular attribute correctly and as quickly as possible in the unknown network, and the quality of this decision determines the system's efficiency.
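The five knowledge states and the choice of the next node can be illustrated with a minimal Python sketch. The names `NodeState` and `candidates` are hypothetical and do not appear in the original text; the sketch only shows that the next step is drawn from nodes in states two to four.

```python
from enum import Enum

class NodeState(Enum):
    HIDDEN = 1         # state one: the system does not know the node exists
    DISCOVERED = 2     # state two: known via a neighbor, data not yet obtained
    LINKS_KNOWN = 3    # state three: link information known, content not known
    CONTENT_KNOWN = 4  # state four: content known, link information not known
    FULLY_KNOWN = 5    # state five: both content and links known

def candidates(known):
    """Return the nodes the system may pick as its next step (states two to four)."""
    selectable = {NodeState.DISCOVERED, NodeState.LINKS_KNOWN, NodeState.CONTENT_KNOWN}
    return [node for node, state in known.items() if state in selectable]

# Hypothetical snapshot of what the system has stored so far.
known = {
    "a": NodeState.FULLY_KNOWN,
    "b": NodeState.DISCOVERED,
    "c": NodeState.LINKS_KNOWN,
}
print(candidates(known))
```

Hidden nodes never appear in `known`, since by definition the system holds no record of them.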

In the prior art, the techniques related to the network discovery described above are as follows:

One, depth-first search and breadth-first search; two, methods based on content clustering; three, link analysis methods represented by PageRank.

There are two basic search paths: breadth-first and depth-first. The network structure is an undirected graph in the graph-theoretic sense. The breadth-first strategy starts from any vertex v0 of a connected undirected graph. After visiting v0, it visits in turn each not-yet-visited neighbor w1, w2, w3, ... of v0, then each unvisited neighbor of w1, then each unvisited neighbor of w2, and so on. In other words, starting from v0 and moving from near to far, it visits layer by layer the vertices reachable from v0, with path length increasing from 1, until every vertex in the graph has been visited once. Depth-first search first visits any vertex v in the graph, then visits a vertex w1 that is adjacent to v and not yet visited, then a vertex w2 that is adjacent to w1 and not yet visited, then w3, and so on. When it can go no further, it backtracks step by step to the most recently visited vertex; if that vertex still has an unvisited neighbor, the search process is repeated from there, until all vertices in the graph have been visited. Both methods search the network in a predefined order and do nothing to optimize for the goal of discovering nodes with a particular attribute.
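The two traversal orders described above can be sketched in Python as follows. The adjacency-list graph and its vertex names are illustrative assumptions; the point is that neither order depends on any node attribute.

```python
from collections import deque

def bfs_order(graph, start):
    """Visit vertices layer by layer, nearest first."""
    visited, seen = [start], {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                visited.append(w)
                queue.append(w)
    return visited

def dfs_order(graph, start):
    """Go as deep as possible, then backtrack to the most recent vertex."""
    visited, seen = [], set()
    stack = [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        visited.append(v)
        # Push in reverse so the first-listed neighbor is explored first.
        for w in reversed(graph[v]):
            if w not in seen:
                stack.append(w)
    return visited

# Small undirected graph given as adjacency lists (illustrative only).
graph = {
    "v0": ["w1", "w2"],
    "w1": ["v0", "x1"],
    "w2": ["v0", "x2"],
    "x1": ["w1"],
    "x2": ["w2"],
}
print(bfs_order(graph, "v0"))
print(dfs_order(graph, "v0"))
```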

Methods based on content clustering require a way of computing the distance between pieces of content information. Nodes that are close to one another are considered to revolve around the same "topic", and links issued from such nodes are given higher priority and accessed earlier. The Fish-Search and Shark-Search methods used by web crawlers both belong to this class. De Bra et al. first proposed the Fish-Search method, in which the system maintains a list of links sorted by priority and selects the next search target from it. During the search, links belonging to nodes with higher relevance are given higher priority. Building on Fish-Search, Hersovici et al. proposed the Shark-Search method, which uses the vector space model to compute the similarity of nodes; it judges similarity by comparing the distance between vectors and is in effect a form of text clustering.
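The general idea of a priority-ordered link list driven by content similarity can be sketched as below. This is not the published Fish-Search or Shark-Search algorithm; the term-frequency cosine similarity, the `Frontier` class and the sample pages are all assumptions chosen for illustration.

```python
import heapq
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of term-frequency vectors (a simple vector space model)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

class Frontier:
    """Link list ordered by priority; the highest priority is popped first."""
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: earlier links win among equal priorities

    def push(self, url, priority):
        heapq.heappush(self._heap, (-priority, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

topic = "traditional chinese medicine herbs"
pages = {  # page -> (text, outgoing links); hypothetical data
    "page_a": ("chinese medicine herbs and remedies", ["a1", "a2"]),
    "page_b": ("football scores today", ["b1"]),
}
frontier = Frontier()
for page, (text, links) in pages.items():
    score = cosine_similarity(topic, text)
    for link in links:
        frontier.push(link, score)  # links inherit the relevance of their page

first = frontier.pop()
print(first)
```

Links from the on-topic page are popped before links from the off-topic page, which is exactly the behavior that presupposes a measurable similarity between contents.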

Link analysis methods are represented by PageRank, proposed in 1997 by Google founders Larry Page and Sergey Brin. It was first used in the Google search engine, where it computes the importance of web pages from their in-link and out-link relationships and ranks the pages accordingly. When link analysis is introduced into web search, access priority is built from page importance, so that important pages are accessed first.
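For orientation, a textbook-style power iteration for a PageRank-like score is sketched below. The damping factor of 0.85, the iteration count and the four-page link graph are assumptions for illustration and are not taken from the original text. Note that the computation needs the complete link structure in advance.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a fully known link graph (page -> list of linked pages)."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes  # a page without out-links spreads rank evenly
            share = damping * rank[v] / len(targets)
            for w in targets:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical graph: every link target is itself a key.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```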

For searching an unknown information network for nodes with a particular attribute, all three methods have weaknesses. Breadth-first and depth-first search are basic search modes and do nothing to optimize for finding the target. Methods based on content clustering require a measurable similarity between nodes, as in "nodes about Chinese medicine"; for attributes without measurable similarity, such as "nodes containing pirated text", they are helpless, because the content covered by the attribute "piracy" is scattered and need not be mutually similar. Link analysis was originally used in the Google engine to compute page importance, on the condition that all node information has already been obtained, that is, all nodes are in state five above, so that the system can compute the ranking among them at leisure. In searching an unknown information network, the system reveals node information gradually, step by step; a large number of nodes are hidden or have incomplete information during the process, and it is difficult to reconstruct node importance accurately by link analysis.

Summary of the invention

The technical problem solved by the present invention is to provide a method for the following situation: an intelligent system searches an unknown information network for nodes having a particular attribute. The purpose of the invention is to plan the search path rationally so as to improve search efficiency, and at the same time to enable the search-based discovery of network piracy.

To achieve the above purpose, the present invention adopts the following technical solutions:

A path planning method for searching an unknown information network, applied to an information network in which the attributes of all nodes are initially unknown, comprising the following steps:

S1, if a node is found to have a particular attribute, setting its associated value to a positive value, and also assigning positive values to the associated values of its surrounding nodes, the magnitude of the positive value decreasing with distance from the source node;

S2, preferentially accessing nodes with larger positive values, and if an accessed node has the particular attribute, repeating step S1.

Further, S1 specifically comprises:

S11, associating a P value with each node in the information network, where P is non-negative and initially 0, and P(V) denotes the P value of node V;

S12, setting constants M and L, where M is a positive number that denotes the increment of P when the particular attribute is found on a node, and L is the influence capability coefficient, 0 < L < M; when P(V) < L, associated values are no longer assigned to the nodes surrounding V;

S13, when a node V is determined to have the particular attribute, increasing its P value: P(V)' = P(V) + M, and correspondingly increasing the P values of its surrounding nodes, the increase in the P value of a surrounding node decreasing with its distance from node V;

and S2 specifically comprises:

S14, accumulating the P value of each node, and determining the access order of the nodes by sorting the P values from high to low.

Further, S13 specifically comprises: taking node V as the root, traversing the n layers of nodes around V in breadth-first fashion, and increasing the P value of each visited node.

Specifically, let V_ij be the j-th node of layer i and ΔP_ij the increment of P(V_ij). The value of ΔP_ij in each layer decays from the previous layer by the coefficient α, so that ΔP_1j = αM, ΔP_2j = α²M, ΔP_3j = α³M, ...; P(V_ij)' = P(V_ij) + ΔP_ij, where 0 < α < 1.
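Step S13 can be sketched in Python as a breadth-first propagation with exponential decay. The function name `propagate`, the dictionary-based graph and the default values are assumptions. The `threshold` parameter is one possible reading of the coefficient L (an increment below L is not passed on to further layers); the original text does not fix this detail.

```python
from collections import deque

def propagate(graph, p, v, m=100.0, alpha=0.6, n=5, threshold=1.0):
    """Sketch of S13: add m to P(v), then add alpha**i * m to each layer-i node."""
    p[v] = p.get(v, 0.0) + m
    seen = {v}
    queue = deque([(v, 0, m)])  # (node, layer, increment the node received)
    while queue:
        node, layer, received = queue.popleft()
        # Assumed reading of L: an increment below the threshold is not passed on.
        if layer == n or received < threshold:
            continue
        for w in graph.get(node, ()):
            if w not in seen:
                seen.add(w)
                delta = alpha ** (layer + 1) * m
                p[w] = p.get(w, 0.0) + delta  # P values accumulate across calls
                queue.append((w, layer + 1, delta))
    return p

# Illustrative neighborhood of a node V found to have the attribute.
graph = {"V": ["a", "b"], "a": ["V", "c"], "b": ["V"], "c": ["a"]}
p = propagate(graph, {}, "V")
for node in ("V", "a", "b", "c"):
    print(node, round(p[node], 3))
```

Because `p` is updated in place, calling `propagate` again for another positive node adds to the existing values, which corresponds to the accumulation in S14.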

Preferably, M ranges from 50 to 500 and L ranges from 0 to 0.1M.

Further, the particular attribute includes the node involving pirated content, illegal content, or a diffusing hotspot of public opinion.

The present invention also provides a network piracy discovery system, comprising an interconnected database server, service server and evidence-collection server. The database server records information on original works, web crawler working information and system operation information; the service server performs data crawling through a web crawler, executes the search strategy and detects infringement; and the evidence-collection server performs evidence-collection actions.

The web crawler comprises a basic crawler unit, a feature management unit and a strategy execution unit. The basic crawler unit performs data crawling; the feature management unit performs feature matching between the content crawled by the basic crawler unit and the original works, and judges whether a node contains pirated content; the strategy execution unit executes the search strategy using the path planning method described above, based on the feature matching and judgment results.

Further, data crawling by the basic crawler includes downloading web page content and filtering it into text, and downloading the images in the web page; feature matching by the feature management unit includes matching the filtered text against original text works and matching the downloaded images against original image works.

Further, the web crawler working information and system operation information recorded by the database server include: URLs, link relationships and infringement discovery results.

Further, the system comprises a master-slave computer cluster formed by one database server and several service servers; the evidence-collection server and the service server are deployed on the same hardware or distributed at different locations on the Internet; and the service server and the evidence-collection server connect to the Internet through the egress of a local area network.

Beneficial effects of the present invention: the event "a node with a certain attribute has been found" itself carries information about the network, and the present invention makes full use of this information to serve the subsequent search. The method is suitable for cases where node attributes do not cluster but are nevertheless associated in some way, typically pirated content, certain illegal content, and diffusing hotspots of public opinion. Adjusting the parameters M, L and α allows the invention to adapt to a variety of situations. In a search that would otherwise have no direction, the method identifies paths with a high probability of success and so improves the efficiency of searching an unknown information network.

The network piracy discovery system of the present invention can effectively find pirated content on the network, collect evidence and record it. It addresses two problems: the sheer size and the scattered, convoluted structure of the network make it difficult for rights holders to discover infringement, and because legal provisions on electronic evidence lag behind, it is difficult to collect evidence even when infringement is found.

Brief description of the drawings

Fig. 1 is a schematic diagram of the influence of a target node on surrounding nodes in the path planning method of the present invention.

Fig. 2 is a schematic diagram of the composition of an embodiment of the network piracy discovery system of the present invention.

Fig. 3 shows the functional layer structure of the service server in an embodiment of the network piracy discovery system of the present invention.

Detailed description of the embodiments

For a further understanding of the present invention, preferred embodiments are described below. It should be understood that these descriptions only further illustrate the features and advantages of the invention and do not limit its claims.

Embodiment 1

This embodiment provides a path planning method for searching an unknown information network, applied to an information network in which the attributes of all nodes are initially unknown. The method comprises steps S1 and S2 as set out in the Summary of the invention.

The above method plans the path used when searching an unknown information network for nodes with a particular attribute, so as to improve search effectiveness. Its basic principle is probabilistic: the attributes of all nodes are initially unknown, and once the attribute of some node is determined to be positive, the probability that the attributes of the surrounding nodes are also positive increases. In other words, a node with the particular attribute (for example, pirated content) is regarded as exerting a certain influence on the nodes around it, and this influence decreases with distance. Quantifying this influence and its decay with distance, and accessing the most strongly influenced nodes first, yields an optimized search strategy.

As a further preferred embodiment, S1 specifically comprises steps S11 to S13 and S2 specifically comprises step S14, as set out in the Summary of the invention.

In the present invention, a node whose attribute is positive is assumed to influence the nodes around it, and the degree of influence attenuates with distance until it falls to a certain low level, at which point the influence ceases. As shown in Fig. 1, the influence of the point with value P1 on the points along the link path varies as an exponential function of the coefficient α, while the intermediate point is influenced simultaneously by the two points with values P1 and P2, and the influence values accumulate. The point connected by a dashed line is a hidden node; if P1 > L when it is discovered (as is the case here), the influence it receives will be αP1.
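The accumulation described for Fig. 1 can be checked with a few lines of Python. The function name `accumulated_influence` and the values P1 = P2 = 100 and α = 0.6 are assumptions chosen for illustration; the figure itself gives no numbers.

```python
def accumulated_influence(sources, alpha=0.6):
    """Sum the influence a node receives from several positive nodes.

    sources: iterable of (source_value, hops) pairs, where hops is the
    link distance from the positive node to the influenced node.
    """
    return sum(value * alpha ** hops for value, hops in sources)

P1, P2 = 100.0, 100.0  # assumed values of the two positive points
middle = accumulated_influence([(P1, 1), (P2, 1)])  # adjacent to both points
hidden = accumulated_influence([(P1, 1)])           # newly discovered neighbor of P1
print(round(middle, 3), round(hidden, 3))
```

With these assumed values the intermediate point ends up above a single increment of 100, which illustrates why the P value of a node is not bounded by M.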

On this basis, as a further preferred embodiment, S13 specifically comprises: taking node V as the root, traversing the n layers (for example n = 5) of nodes around V in breadth-first fashion, and increasing the P value of each visited node.

Specifically, let V_ij be the j-th node of layer i and ΔP_ij the increment of P(V_ij). The value of ΔP_ij in each layer decays from the previous layer by the coefficient α, so that ΔP_1j = αM, ΔP_2j = α²M, ΔP_3j = α³M, ... Preferably, M ranges from 50 to 500 and L ranges from 0 to 0.1M. For example, taking α = 0.6 and M = 100 gives (ΔP_1j, ΔP_2j, ΔP_3j, ΔP_4j, ΔP_5j) = (60, 36, 21.6, 12.96, 7.776);

P(V_ij)' = P(V_ij) + ΔP_ij.
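The worked example with α = 0.6 and M = 100 can be reproduced with a short Python sketch; the function name `layer_increments` is hypothetical.

```python
def layer_increments(m, alpha, n):
    """Increment for each of the n layers around a positive node: alpha**i * m."""
    return [alpha ** i * m for i in range(1, n + 1)]

# Rounded to three decimals to hide floating-point noise.
increments = [round(x, 3) for x in layer_increments(100, 0.6, 5)]
print(increments)
```

The rounded values agree with the five increments stated in the text.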

Preferably, the particular attribute may be that the node involves pirated content, illegal content, or a diffusing hotspot of public opinion. Adjusting the values of M, L and α allows the invention to adapt to a variety of situations.

Embodiment 2

This embodiment provides a network piracy discovery system which, as shown in Fig. 2, comprises an interconnected database server, service server and evidence-collection server. The database server records information on original works, web crawler working information and system operation information; the service server performs data crawling through a web crawler, executes the search strategy and detects infringement; and the evidence-collection server performs evidence-collection actions.

Fig. 3 shows the functional layer structure of the service server. The web crawler it uses comprises a basic crawler unit, a feature management unit and a strategy execution unit. The basic crawler unit performs data crawling, and WebMagic may be chosen for it in this embodiment; the feature management unit performs feature matching between the content crawled by the basic crawler unit and the original works, and judges whether a node contains pirated content; the strategy execution unit executes the search strategy using the path planning method of Embodiment 1, based on the feature matching and judgment results.

As a further preferred embodiment, data crawling by the basic crawler includes downloading web page content and filtering it into text, and downloading the images in the web page; feature matching by the feature management unit includes matching the filtered text against original text works or matching the downloaded images against original image works, so as to discover piracy of both text content and image content.
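The text branch of this pipeline (filter a page into text, then match it against an original work) might be sketched as follows. The original does not specify the matching algorithm; the k-word shingle overlap used here, and the names `TextExtractor`, `html_to_text` and `overlap`, are assumptions for illustration only. Image matching is not covered.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())  # normalize whitespace

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(page_text, original_text, k=3):
    """Fraction of the original's k-word shingles that also occur in the page."""
    orig = shingles(original_text, k)
    if not orig:
        return 0.0
    return len(orig & shingles(page_text, k)) / len(orig)

original = "the quick brown fox jumps over the lazy dog"  # stand-in for an original work
page = ("<html><body><script>var x = 1;</script>"
        "<p>Read: the quick brown fox jumps over the lazy dog today</p></body></html>")
score = overlap(html_to_text(page), original)
print(score)
```

A deployment would compare `score` with a tuned threshold to decide whether the node is treated as containing pirated content; that threshold is not given in the source.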

Optionally, the web crawler working information and system operation information recorded by the database server include URLs, link relationships and infringement discovery results, which can be used in computing the crawling strategy.

As a further preferred embodiment, the system in this embodiment comprises a master-slave computer cluster formed by one database server and several service servers; the evidence-collection server and the service server are deployed on the same hardware or distributed at different locations on the Internet; and the service server and the evidence-collection server connect to the Internet through the egress of a local area network.

In the present invention, a pirate website is regarded as a kind of "pollution source", so the path planning method of Embodiment 1 that it uses may also be called the pollution method. In actual implementation, the path planning method of Embodiment 1 is further described as follows:

One, an exponential function is chosen as the decay mode in Embodiment 1 because it is convenient to compute; other functions may also be considered in other embodiments.

Two, a node may be polluted multiple times and its pollution accumulates, so the P value of a node can exceed 100 (if M = 100).

Three, in the program implementation, because crawling and infringement detection are both batch processes, a node encountered on a diffusion path may be in one of several situations:

a) a node (URL) that has not yet been crawled, i.e. whose web page data has not yet been obtained;

b) a node (URL) whose web page has been obtained, which is further divided into two classes:

i. no infringement detection has been performed, so it is not known whether it contains piracy;

ii. detection has been performed, so it is known whether it contains piracy; the detection result is one of:

1. contains piracy; 2. does not contain piracy.

Nodes in situation b) cannot become candidate URLs. In situation b)-ii it is already known whether the current node is pirated, so whether it is "polluted" (its probability or likelihood) seems unimportant. One implementation does not distinguish between these situations and applies the above strategy uniformly, that is, the diffusion of pollution is not blocked because of the processing state of a node (URL). Unsuitable nodes are naturally filtered out when candidate URLs are selected for the crawler.
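The overall loop (diffuse pollution regardless of node state, then filter at candidate selection) might look like the following sketch. The toy network, the `is_pirated` oracle and the function names are hypothetical, and for brevity pollution follows only out-links revealed so far and the threshold L is omitted.

```python
from collections import deque

def propagate(links, p, source, m=100.0, alpha=0.6, n=5):
    """Diffuse pollution over known links; the state of a node does not block it."""
    p[source] = p.get(source, 0.0) + m
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, layer = queue.popleft()
        if layer == n:
            continue
        for w in links.get(node, ()):
            if w not in seen:
                seen.add(w)
                p[w] = p.get(w, 0.0) + alpha ** (layer + 1) * m
                queue.append((w, layer + 1))

def search(network, is_pirated, seeds, budget):
    links = {}                 # link information of crawled nodes (situation b)
    p = {}                     # P values, including those of crawled nodes
    discovered = list(seeds)   # discovery order, used to break ties
    crawled, found = [], []
    for _ in range(budget):
        # Filtering happens here: only nodes not yet crawled are candidates.
        candidate_urls = [u for u in discovered if u not in links]
        if not candidate_urls:
            break
        url = max(candidate_urls, key=lambda u: p.get(u, 0.0))  # highest P first
        links[url] = network[url]
        crawled.append(url)
        for w in network[url]:
            if w not in discovered:
                discovered.append(w)
        if is_pirated(url):
            found.append(url)
            propagate(links, p, url)
    return crawled, found

network = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1", "b2"],
           "a1": [], "b1": [], "b2": []}
pirated = {"b", "b2"}
crawled, found = search(network, lambda u: u in pirated, ["seed"], budget=5)
print(crawled, found)
```

In this toy run the neighbors of the pirated node "b" are crawled before "a1", although "a1" was discovered earlier; a plain breadth-first crawl would have taken "a1" first.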

Embodiment 3

An actual running test was carried out using the network piracy discovery system of the present invention. The measured results are shown in the table below:

Item	Value
Total number of seeds	3358304
Number of polluted seeds	20802
Pirate points found among polluted seeds	7
Pirate points found among unpolluted seeds	164
Piracy probability of polluted seeds	3.37*10^-4
Piracy probability of unpolluted seeds	4.91*10^-5

Over the whole run, the URL seed queue contained elements in the normal (unpolluted) state, among which 164 pirate points were found, and elements in the polluted state (P ≠ 0), among which 7 pirate points were found. The rate at which piracy was found among polluted seeds is 6.85 times that of the normal state, which indicates that the method used by the system is effective.
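The reported probabilities and the 6.85 ratio can be recomputed from the counts in the table. The sketch assumes that the number of unpolluted seeds is the total minus the polluted seeds, which is consistent with the reported value of 4.91*10^-5; the variable names are illustrative.

```python
total_seeds = 3_358_304
polluted_seeds = 20_802
pirated_in_polluted = 7
pirated_in_unpolluted = 164

# Assumption: unpolluted seeds = total seeds - polluted seeds.
unpolluted_seeds = total_seeds - polluted_seeds

rate_polluted = pirated_in_polluted / polluted_seeds
rate_unpolluted = pirated_in_unpolluted / unpolluted_seeds
ratio = rate_polluted / rate_unpolluted

print(f"{rate_polluted:.2e} {rate_unpolluted:.2e} {ratio:.2f}")
```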

The above description of the embodiments is only intended to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A path planning method for searching an unknown information network, applied to an information network in which the attributes of all nodes are initially unknown, characterized by comprising the following steps:

S1, if a node is found to have a particular attribute, setting its associated value to a positive value, and also assigning positive values to the associated values of its surrounding nodes, the magnitude of the positive value decreasing with distance from the source node;

S2, preferentially accessing nodes with larger positive values, and if an accessed node has the particular attribute, repeating step S1.

2. The path planning method for searching an unknown information network according to claim 1, characterized in that S1 specifically comprises:

S11, associating a P value with each node in the information network, where P is non-negative and initially 0, and P(V) denotes the P value of node V;

S12, setting constants M and L, where M is a positive number that denotes the increment of P when the particular attribute is found on a node, and L is the influence capability coefficient, 0 < L < M; when P(V) < L, associated values are no longer assigned to the nodes surrounding V;

S13, when a node V is determined to have the particular attribute, increasing its P value: P(V)' = P(V) + M, and correspondingly increasing the P values of its surrounding nodes, the increase in the P value of a surrounding node decreasing with its distance from node V;

and S2 specifically comprises:

S14, accumulating the P value of each node, and determining the access order of the nodes by sorting the P values from high to low.

3. The path planning method for searching an unknown information network according to claim 2, characterized in that S13 specifically comprises: taking node V as the root, traversing the n layers of nodes around V in breadth-first fashion, and increasing the P value of each visited node;

specifically, let V_ij be the j-th node of layer i and ΔP_ij the increment of P(V_ij); the value of ΔP_ij in each layer decays from the previous layer by the coefficient α, so that ΔP_1j = αM, ΔP_2j = α²M, ΔP_3j = α³M, ...; P(V_ij)' = P(V_ij) + ΔP_ij, where 0 < α < 1.

4. The path planning method for searching an unknown information network according to claim 3, characterized in that M ranges from 50 to 500 and L ranges from 0 to 0.1M.

5. The path planning method for searching an unknown information network according to any one of claims 1 to 4, characterized in that the particular attribute includes the node involving pirated content, illegal content, or a diffusing hotspot of public opinion.

6. A network piracy discovery system, characterized by comprising an interconnected database server, service server and evidence-collection server, wherein the database server records information on original works, web crawler working information and system operation information; the service server performs data crawling through a web crawler, executes the search strategy and detects infringement; and the evidence-collection server performs evidence-collection actions;

wherein the web crawler comprises a basic crawler unit, a feature management unit and a strategy execution unit; the basic crawler unit performs data crawling; the feature management unit performs feature matching between the content crawled by the basic crawler unit and the original works, and judges whether a node contains pirated content; and the strategy execution unit executes the search strategy using the path planning method according to any one of claims 1 to 5, based on the feature matching and judgment results.

7. The network piracy discovery system according to claim 6, characterized in that data crawling by the basic crawler includes downloading web page content and filtering it into text, and downloading the images in the web page; and feature matching by the feature management unit includes matching the filtered text against original text works or matching the downloaded images against original image works.

8. The network piracy discovery system according to claim 6, characterized in that the web crawler working information and system operation information recorded by the database server include: URLs, link relationships and infringement discovery results.

9. The network piracy discovery system according to claim 7 or 8, characterized in that the system comprises a master-slave computer cluster formed by one database server and several service servers; the evidence-collection server and the service server are deployed on the same hardware or distributed at different locations on the Internet; and the service server and the evidence-collection server connect to the Internet through the egress of a local area network.