CN102663077B - Web search results security sorting method based on Hits algorithm - Google Patents

Info

Publication number
CN102663077B
CN102663077B CN201210095140.2A CN201210095140A CN102663077B CN 102663077 B CN102663077 B CN 102663077B CN 201210095140 A CN201210095140 A CN 201210095140A CN 102663077 B CN102663077 B CN 102663077B
Authority
CN
China
Prior art keywords
webpage
page
collection
carry out
expressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210095140.2A
Other languages
Chinese (zh)
Other versions
CN102663077A (en)
Inventor
陈志德
郭扬富
许力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN201210095140.2A
Publication of CN102663077A
Application granted
Publication of CN102663077B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of network security, and specifically to a Web search results security sorting method based on the Hits algorithm. The method comprises the following steps: establishing a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ containing the feature codes of $n$ kinds of network viruses, Trojans, and vulnerabilities as they appear in webpages; expressing each feature code of the malicious feature library as a vector of $m$ components, namely $f_x = (f_{x1}, f_{x2}, f_{x3}, \ldots, f_{xm})$, where $x \in \{1, 2, \ldots, n\}$ and $f_x \in F$, the weight of each component being expressed by $f'_x$; and then combining a vector space model with the malicious feature library to rank web search results by security. The method lowers the ranking of malicious webpages in the search results and thereby reduces the probability of accessing insecure webpages.

Description

Web search results security sorting method based on Hits algorithm
Technical field
The present invention relates to the technical field of network security, and in particular to a Web search results security sorting method based on the Hits algorithm.
Background art
With the rapid development of the Internet, Web resources have grown exponentially, which makes them increasingly difficult to manage. Today, large numbers of malicious webpages concealing Trojans, viruses, illegal advertisements, and similar programs are rampant on the Web. These pages use deceptive means and exploit the limitations of search engines so that some malicious pages hide near the top of the search result ranking, which can seriously endanger the information security of users' computers and other terminals. Improving Web security has therefore become an urgent problem.
Summary of the invention
The object of the present invention is to provide a Web search results security sorting method based on the Hits algorithm that lowers the ranking of malicious webpages in search results and thereby reduces the probability of accessing insecure webpages.
The technical solution adopted by the present invention is a Web search results security sorting method based on the Hits algorithm. A malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, Trojans, and vulnerabilities as they appear in webpages. Each feature code $f_i$ of the malicious feature library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Webpages are then ranked by security, based on the Hits algorithm, in the following steps:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ webpages of the returned results are taken as the root set $R$. The webpages referenced by the root set $R$ and the webpages that reference the root set $R$ are added to $R$; after intrinsic links and irrelevant links have been processed, the root set $R$ is thus expanded into the set $G$. With the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$.
Step 2: The I operation is applied to $u$ to update its $a(u)$, and the O operation is applied to $v$ to update its $h(v)$. The I and O operations are, respectively:
I operation:
[Formula image for the I operation not reproduced in this text]
O operation:
[Formula image for the O operation not reproduced in this text]
In the above formulas, the first summation symbol (image not reproduced) denotes a sum over the pages in $V_1$, and the second summation symbol (image not reproduced) denotes a sum over the pages in $V_2$; $\mathrm{risk}(F, u)$ and $\mathrm{risk}(F, v)$ are calculated as follows:
[Formula image for $\mathrm{risk}(F, u)$ not reproduced in this text]
[Formula image for $\mathrm{risk}(F, v)$ not reproduced in this text]
In the above formulas, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, $\mu_i \in (0, 1)$. The text of page $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Likewise, the text of page $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$.
Step 3: As in Step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation to all pages in vertex set $V_1$. Afterwards, $a(u)$ and $h(v)$ are normalized by the following formulas:
[Formula image for the normalization of $a(u)$ not reproduced in this text]
[Formula image for the normalization of $h(v)$ not reproduced in this text]
In the above formulas, $q$ denotes the number of in-link nodes.
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge.
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
The beneficial effect of the present invention is that, on the basis of the Hits algorithm, the degree of risk of a webpage is evaluated by combining a vector space model with a malicious feature library. By limiting the Authority value of malicious webpages, the method lowers their ranking in search results, thereby reducing the probability of accessing insecure webpages and strengthening Web security.
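The idea of limiting the Authority value of risky pages can be illustrated with a minimal Python sketch. The formula images for the modified I and O operations are not reproduced in this text, so the `(1 - risk)` damping factor used here is purely an assumption for illustration and not the patented formula; the function names, page names, and risk values are likewise hypothetical.

```python
def risk_adjusted_pass(edges, a, h, risk):
    """One I/O pass over (hub, authority) links with an ASSUMED damping.

    edges: list of (v, u) pairs, a hyperlink from hub page v to authority page u.
    risk:  dict mapping a page to a risk score in [0, 1]; missing pages count as 0.
    The (1 - risk) factor is an illustrative assumption, not the patent's formula.
    """
    # I operation: sum the hub values of the pages that link to u.
    new_a = {u: 0.0 for u in a}
    for v, u in edges:
        new_a[u] += h[v]
    # Assumed damping: risky pages keep only part of their authority.
    new_a = {u: x * (1.0 - risk.get(u, 0.0)) for u, x in new_a.items()}

    # O operation: sum the (already damped) authority values that v links to.
    new_h = {v: 0.0 for v in h}
    for v, u in edges:
        new_h[v] += new_a[u]
    new_h = {v: x * (1.0 - risk.get(v, 0.0)) for v, x in new_h.items()}
    return new_a, new_h


def rank_by_authority(a):
    """Step 5: order pages by descending authority value (ties by name)."""
    return sorted(a, key=lambda page: (-a[page], page))


# Two hubs link to both a safe page and a page flagged as risky.
edges = [("hub1", "safe"), ("hub2", "safe"),
         ("hub1", "malicious"), ("hub2", "malicious")]
a = {"safe": 1.0, "malicious": 1.0}
h = {"hub1": 1.0, "hub2": 1.0}
risk = {"malicious": 0.8}

a, h = risk_adjusted_pass(edges, a, h, risk)
ranking = rank_by_authority(a)
```

Both pages have identical link structure, so plain Hits would score them equally; under the assumed damping the flagged page drops below the safe one.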
Brief description of the drawings
Fig. 1 is a working principle diagram of an embodiment of the present invention.
Detailed description of the embodiments
In the Web search results security sorting method based on the Hits algorithm according to the present invention, a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, Trojans, and vulnerabilities as they appear in webpages. Each feature code $f_i$ of the malicious feature library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Webpages are then ranked by security, based on the Hits algorithm, in the following steps:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ webpages of the returned results are taken as the root set $R$. The webpages referenced by the root set $R$ and the webpages that reference the root set $R$ are added to $R$; after intrinsic links and irrelevant links have been processed, the root set $R$ is thus expanded into the set $G$. With the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$.
Step 2: The I operation is applied to $u$ to update its $a(u)$, and the O operation is applied to $v$ to update its $h(v)$. The I and O operations are, respectively:
I operation:
[Formula image for the I operation not reproduced in this text]
O operation:
[Formula image for the O operation not reproduced in this text]
In the above formulas, the first summation symbol (image not reproduced) denotes a sum over the pages in $V_1$, and the second summation symbol (image not reproduced) denotes a sum over the pages in $V_2$; $\mathrm{risk}(F, u)$ and $\mathrm{risk}(F, v)$ are calculated as follows:
[Formula image for $\mathrm{risk}(F, u)$ not reproduced in this text]
[Formula image for $\mathrm{risk}(F, v)$ not reproduced in this text]
In the above formulas, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, $\mu_i \in (0, 1)$. The text of page $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Likewise, the text of page $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$.
Step 3: As in Step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation to all pages in vertex set $V_1$. Afterwards, $a(u)$ and $h(v)$ are normalized by the following formulas:
[Formula image for the normalization of $a(u)$ not reproduced in this text]
[Formula image for the normalization of $h(v)$ not reproduced in this text]
In the above formulas, $q$ denotes the number of in-link nodes.
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge.
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
The subject matter to which the present invention relates is described further below.
1. Hits algorithm
The Hits algorithm is a webpage link analysis algorithm proposed by Kleinberg of IBM. Its principle is, for a given search query, to find the authoritative pages relevant to the topic by means of link analysis. The basic idea of the algorithm is to derive a weight for each webpage through link analysis and thereby determine the authority of the webpage. The Hits algorithm divides webpages into two types: pages that express the authority of a given topic, called authority pages; and pages that link these authority pages together, called hub pages. The Hits algorithm involves two important weight concepts:
Authority: the weighted number of times an authoritative webpage is referenced by other webpages, that is, the weighted in-degree of that webpage. The more often a webpage is referenced, the larger its weighted in-degree and its Authority, and the more important the webpage.
Hub: the weighted number of other webpages that a webpage points to, that is, the weighted out-degree of that webpage; the page provides a collection of links pointing to authoritative webpages. The larger the weighted out-degree of a webpage, the larger its Hub value. A hub implicitly indicates the authority pages on a topic.
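As a minimal sketch of the in-degree and out-degree notions behind Authority and Hub, the unweighted degrees can be counted from a list of links as follows. The page names are hypothetical, and the counts are unweighted; in Hits the weights come from the iterative computation rather than from raw counts.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) for a list of (source, target) links."""
    in_degree = Counter(target for _, target in edges)
    out_degree = Counter(source for source, _ in edges)
    return in_degree, out_degree


# Hypothetical link structure: two hub pages and two authority pages.
edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
in_degree, out_degree = degree_counts(edges)
```

Here `auth1` is referenced twice and `hub1` points to two pages, so they have the largest in-degree and out-degree respectively.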
Ideally, the result set obtained for a search query has the following characteristics:
(1) it is relatively small;
(2) it is rich in relevant webpages;
(3) it contains most of the most valuable authority pages.
For a specific search query, the focused subgraph for that query is built as follows:
A text-based search engine (such as Hotbot or AltaVista) is used to retrieve a result set, and the top-ranked results are taken as a set called the root set (Root Set). The root set satisfies characteristics (1) and (2) but is far from satisfying characteristic (3), so it needs to be expanded.
The root set is expanded in two ways. First, all pages pointed to by pages in the root set are added, that is, in the graph model, the targets of all directed edges that start in the root set; the number of pages added this way is not limited. Second, from the pages that link to each page in the root set, an arbitrary subset of a fixed size is taken; this size is usually set to 50, and if there are no more than 50 such pages, all of them are taken. Adding these pages to the original root set forms the base set (Base Set). Such a set satisfies the three characteristics above reasonably well and generally contains between 1000 and 5000 pages.
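The expansion of the root set into the base set can be sketched in Python as follows. The function name, the parameter name `d` for the in-link limit, and the dictionary-based link tables are assumptions made for illustration; the text only states that the limit is usually 50 and that the subset is chosen arbitrarily, for which a seeded random sample is used here.

```python
import random


def expand_root_set(root, out_links, in_links, d=50, seed=0):
    """Expand a root set into a base set.

    out_links[p]: pages that page p links to.
    in_links[p]:  pages that link to page p.
    d:            maximum number of in-linking pages taken per root page.
    """
    rng = random.Random(seed)
    base = set(root)
    for page in root:
        # Every page a root page points to is added, without limit.
        base.update(out_links.get(page, ()))
        # Of the pages pointing to a root page, at most d are added.
        parents = sorted(in_links.get(page, ()))
        if len(parents) > d:
            parents = rng.sample(parents, d)
        base.update(parents)
    return base


# Hypothetical example: one root page with 2 out-links and 60 in-links.
root = {"r1"}
out_links = {"r1": ["a", "b"]}
in_links = {"r1": ["p%d" % i for i in range(60)]}
base = expand_root_set(root, out_links, in_links, d=50)
```

With these inputs the base set holds the root page, its 2 out-linked pages, and 50 of the 60 in-linking pages.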
To improve the results of the computation, the base set is processed further. Links are divided into two kinds: a link between two pages under different domain names is called a transverse link, and a link between two pages under the same domain name is called an intrinsic link.
Intrinsic links serve only for internal navigation and transmit almost no authority between webpages, so links of this kind are deleted from the base set. Irrelevant links, such as advertisements, are then removed as well, which yields the final graph. This graph can be regarded as a focused subgraph that satisfies the three characteristics above. The hub and authority values are then computed, and the pages are sorted by their converged authority values to obtain the desired result.
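Removing intrinsic links can be sketched as below. Comparing the host part of each URL is an assumption standing in for "same domain name"; a production filter might compare registered domains instead. The function name and URLs are hypothetical.

```python
from urllib.parse import urlparse


def drop_intrinsic_links(edges):
    """Keep only links whose two endpoints are on different hosts."""
    kept = []
    for source, target in edges:
        source_host = urlparse(source).netloc.lower()
        target_host = urlparse(target).netloc.lower()
        if source_host != target_host:
            kept.append((source, target))
    return kept


edges = [
    # Intrinsic link: both pages are under a.example.
    ("http://a.example/index.html", "http://a.example/about.html"),
    # Transverse link: the pages are under different hosts.
    ("http://a.example/index.html", "http://b.example/"),
]
transverse = drop_intrinsic_links(edges)
```

Only the link between `a.example` and `b.example` survives the filter.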
Authorities and hubs reinforce each other: a good hub page points to many good authorities, and a good authority page is pointed to by many good hubs.
The focused subgraph is expressed as a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$. The I operation is applied to $u$ and the O operation to $v$, updating $a(u)$ and $h(v)$ respectively, followed by normalization. The I and O operations below are repeated in this way until $a(u)$ and $h(v)$ converge.
I operation:
[Formula image for the I operation not reproduced in this text]
O operation:
[Formula image for the O operation not reproduced in this text]
In the above formulas, the first summation symbol (image not reproduced) denotes a sum over the pages in $V_1$, and the second summation symbol (image not reproduced) denotes a sum over the pages in $V_2$.
After each iteration, $a(u)$ and $h(v)$ must be normalized:
[Formula image for the normalization of $a(u)$ not reproduced in this text]
[Formula image for the normalization of $h(v)$ not reproduced in this text]
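The iteration described above can be sketched with the standard Hits updates as commonly stated for Kleinberg's algorithm: the I operation sets $a(u)$ to the sum of $h(v)$ over all pages $v$ that link to $u$, and the O operation sets $h(v)$ to the sum of $a(u)$ over all pages $u$ that $v$ links to. The formula images in this text are not reproduced, so treat the Euclidean-norm normalization, the tolerance `tol`, and the iteration cap `max_iter` as assumptions rather than the exact formulas of this document.

```python
import math


def hits(edges, max_iter=100, tol=1e-9):
    """Standard Hits on a list of (hub_page, authority_page) links."""
    if not edges:
        return {}, {}
    hubs = {v for v, _ in edges}
    auths = {u for _, u in edges}
    h = {v: 1.0 for v in hubs}    # initial Hub values
    a = {u: 1.0 for u in auths}   # initial Authority values
    for _ in range(max_iter):
        # I operation: authority = sum of hub values of in-linking pages.
        new_a = {u: 0.0 for u in auths}
        for v, u in edges:
            new_a[u] += h[v]
        # O operation: hub = sum of authority values of linked-to pages.
        new_h = {v: 0.0 for v in hubs}
        for v, u in edges:
            new_h[v] += new_a[u]
        # Normalization (assumed here: divide by the Euclidean norm).
        norm_a = math.sqrt(sum(x * x for x in new_a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in new_h.values())) or 1.0
        new_a = {u: x / norm_a for u, x in new_a.items()}
        new_h = {v: x / norm_h for v, x in new_h.items()}
        # Convergence check: largest change in any value.
        delta = max(
            max(abs(new_a[u] - a[u]) for u in auths),
            max(abs(new_h[v] - h[v]) for v in hubs),
        )
        a, h = new_a, new_h
        if delta < tol:
            break
    return a, h


# Hypothetical graph: a1 is linked from both hubs, a2 from only one.
edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
a, h = hits(edges)
```

In this small graph `a1` ends up with the higher Authority value and `h1`, which links to both authorities, with the higher Hub value.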
2. Web security based on the Hits algorithm
The security model works mainly by matching the malicious feature library against page source code. A model similar to the vector space model (VSM, Vector Space Model) is used to obtain the similarity between the feature codes and a page, that is, the risk. In this model, a document is represented by a vector, the feature codes in the document are represented by the components of the vector, and the value of each component is its weight.
[Formula image for the similarity not reproduced in this text]
In this formula, the two vectors are the feature vectors of the malicious feature library and of the document respectively; the remaining symbols (images not reproduced) denote the dimension of the feature vectors and an individual dimension of a feature vector.
In the same way, the similarity between the risk library $F$ and a document $d$ can be used to evaluate the risk of a page:
[Formula image for the page risk not reproduced in this text]
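A vector-space similarity of this kind can be sketched as follows. Cosine similarity is the measure most commonly used with the vector space model and is assumed here, since the similarity formula in this text is an image that is not reproduced. The way the hazard factors $\mu_i$ are combined with the similarities (taking the largest hazard-weighted similarity) is also an illustrative assumption, as are the function names and example vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def page_risk(features, hazards, page_components):
    """ASSUMED aggregate: largest hazard-weighted similarity between any
    feature code f_i and any page component u_k."""
    risk = 0.0
    for f_i, mu_i in zip(features, hazards):
        for u_k in page_components:
            risk = max(risk, mu_i * cosine_similarity(f_i, u_k))
    return risk


# Hypothetical library with n = 2 feature codes of m = 3 components each.
features = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
hazards = [0.9, 0.5]  # hazard factors mu_i in (0, 1)

infected_page = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # contains feature code 1
clean_page = [[0.0, 0.0, 0.0]]

infected_risk = page_risk(features, hazards, infected_page)
clean_risk = page_risk(features, hazards, clean_page)
```

A page component that matches a feature code exactly has similarity 1, so its risk equals that feature's hazard factor; a page with no overlap scores 0.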
The above are preferred embodiments of the present invention. All changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided the resulting function does not go beyond the scope of that technical solution.

Claims (1)

1. A Web search results security sorting method based on the Hits algorithm, characterized in that: a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, Trojans, and vulnerabilities as they appear in webpages; each feature code $f_i$ of the malicious feature library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$; webpages are then ranked by security, based on the Hits algorithm, in the following steps:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ webpages of the returned results are taken as the root set $R$; the webpages referenced by the root set $R$ and the webpages that reference the root set $R$ are added to $R$, and after intrinsic links and irrelevant links have been processed, the root set $R$ is thus expanded into the set $G$; with the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed; for any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of the webpage $v$ corresponding to vertex $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of the webpage $u$ corresponding to vertex $u$; initially $h(v) = a(u) = 1$;
Step 2: The I operation is applied to $u$ to update its $a(u)$, and the O operation is applied to $v$ to update its $h(v)$; the I and O operations are, respectively:
I operation:
[Formula image for the I operation not reproduced in this text]
O operation:
[Formula image for the O operation not reproduced in this text]
In the above formulas, the first summation symbol (image not reproduced) denotes a sum over the pages in $V_1$, and the second summation symbol (image not reproduced) denotes a sum over the pages in $V_2$; $\mathrm{risk}(F, u)$ and $\mathrm{risk}(F, v)$ are calculated as follows:
[Formula image for $\mathrm{risk}(F, u)$ and $\mathrm{risk}(F, v)$ not reproduced in this text]
In the above formulas, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, $\mu_i \in (0, 1)$; the text of webpage $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of webpage $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$; the text of webpage $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of webpage $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$;
Step 3: As in Step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation to all pages in vertex set $V_1$; afterwards, $a(u)$ and $h(v)$ are normalized by the following formulas:
[Formula image for the normalization of $a(u)$ and $h(v)$ not reproduced in this text]
In the above formulas, $q$ denotes the number of in-link nodes;
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge;
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
CN201210095140.2A 2012-03-31 2012-03-31 Web search results security sorting method based on Hits algorithm Expired - Fee Related CN102663077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210095140.2A CN102663077B (en) 2012-03-31 2012-03-31 Web search results security sorting method based on Hits algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210095140.2A CN102663077B (en) 2012-03-31 2012-03-31 Web search results security sorting method based on Hits algorithm

Publications (2)

Publication Number Publication Date
CN102663077A CN102663077A (en) 2012-09-12
CN102663077B true CN102663077B (en) 2014-03-12

Family

ID=46772568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210095140.2A Expired - Fee Related CN102663077B (en) 2012-03-31 2012-03-31 Web search results security sorting method based on Hits algorithm

Country Status (1)

Country Link
CN (1) CN102663077B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937974A (en) * 2012-10-17 2013-02-20 北京奇虎科技有限公司 Search server and search method
CN103761476B (en) * 2013-12-30 2016-11-09 北京奇虎科技有限公司 The method and device of feature extraction
CN108182186B (en) * 2016-12-08 2020-10-02 广东精点数据科技股份有限公司 Webpage sorting method based on random forest algorithm
CN107622048B (en) * 2017-09-06 2021-06-22 南京硅基智能科技有限公司 Text mode recognition method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101409634A (en) * 2007-10-10 2009-04-15 中国科学院自动化研究所 Quantitative analysis tools and method for internet news influence based on information retrieval

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
《Optimizing for Web Security Based on Search Engine》;Yangfu Guo et al;《2010 International Conference On Computer Design And Appliations (ICCDA 2010)》;20101231;第5卷;545-549 *
Analysis of Web search algorithm HITS;DAWEI HONG et al;《International Journal of Foundations of Computer Science》;20041231;第15卷(第4期);591-599 *
Associated word extraction system for search query expansion based on hits;Jung-Hun Lee et al;《CCIS》;20111231;649-662 *

Also Published As

Publication number Publication date
CN102663077A (en) 2012-09-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140312

Termination date: 20170331

CF01 Termination of patent right due to non-payment of annual fee