CN102663077B - Web search results security sorting method based on Hits algorithm - Google Patents
- Publication number: CN102663077B
- Application number: CN201210095140.2A
- Authority: CN (China)
- Prior art keywords: webpage, page, collection, carry out, expressed
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Classification: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of network security, and specifically to a Web search result security sorting method based on the Hits algorithm. The method comprises the following steps: establishing a malicious feature library F = (f1, f2, f3, ..., fn) containing n kinds of feature codes of network viruses, trojans and vulnerabilities that appear in webpages; expressing each feature code in the library as a vector of m components, namely fx = (fx1, fx2, fx3, ..., fxm), where x ∈ {1, 2, ..., n} and fx ∈ F, with the weight of each component denoted f'x; and then combining a vector space model with the malicious feature library to sort webpage search results by security. The method lowers the ranking of malicious webpages in the search results and thereby reduces the probability of accessing insecure webpages.
Description
Technical field
The present invention relates to the technical field of network security, and in particular to a Web search result security sorting method based on the Hits algorithm.
Background art
With the rapid development of the Internet, Web resources are growing exponentially, which makes them increasingly difficult to manage. Today, large numbers of malicious webpages that conceal trojans, viruses, illegal advertising and similar programs are rampant on the Web. These webpages use fraudulent means and exploit the limitations of search engines so that malicious pages appear near the top of search result rankings, which can seriously endanger the information security of users' computers and other terminals. Solving and improving Web security has therefore become an urgent problem.
Summary of the invention
The object of the present invention is to provide a Web search result security sorting method based on the Hits algorithm that lowers the ranking of malicious webpages in search results and thereby reduces the probability of accessing insecure webpages.
The technical solution adopted by the present invention is a Web search result security sorting method based on the Hits algorithm. A malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing $n$ kinds of feature codes of network viruses, trojans and vulnerabilities that appear in webpages. Each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Webpages are then sorted by security on the basis of the Hits algorithm as follows:
Step 1: Submit the search query to a text-based search engine and take the top $t$ webpages of the returned results as the root set $R$. Add to the root set $R$ the webpages cited by $R$ and the webpages that cite $R$; after intrinsic links and irrelevant links have been processed, $R$ is thereby expanded into the set $G$. Take the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, forming a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$;
Step 2: Perform the I operation on $u$ to update $a(u)$, and perform the O operation on $v$ to update $h(v)$. The I operation and the O operation are respectively:
[Formula images for the I operation and the O operation are not reproduced in this text.]
In the above formulas, the summation over $v$ ranges over all pages in $V_1$, the summation over $u$ ranges over all pages in $V_2$, and $risk(F, u)$ and $risk(F, v)$ are calculated as follows:
[Formula image for $risk$ is not reproduced in this text.]
In the above formula, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. The text set of page $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Likewise, the text set of page $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$;
Step 3: As in step 2, perform the I operation on all pages in vertex set $V_2$ and the O operation on all pages in vertex set $V_1$. Then normalize $a(u)$ and $h(v)$ by the following formula:
[Formula image for the normalization is not reproduced in this text.]
In the above formula, $q$ denotes the number of in-linking nodes;
Step 4: Repeat steps 2 and 3 iteratively until $a(u)$ and $h(v)$ converge;
Step 5: Finally, sort the pages by security according to the $a(u)$ value of each page.
The beneficial effect of the invention is that, on the basis of the Hits algorithm, the degree of risk of a webpage is evaluated by combining a vector space model with the malicious feature library. By limiting the Authority value of malicious webpages, the ranking of malicious webpages in the search results is lowered, which reduces the probability of accessing insecure webpages and strengthens Web security.
Brief description of the drawings
Fig. 1 is a schematic diagram of the working principle of an embodiment of the present invention.
Detailed description of the embodiments
In the Web search result security sorting method based on the Hits algorithm according to the present invention, a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing $n$ kinds of feature codes of network viruses, trojans and vulnerabilities that appear in webpages. Each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Webpages are then sorted by security on the basis of the Hits algorithm as follows:
Step 1: Submit the search query to a text-based search engine and take the top $t$ webpages of the returned results as the root set $R$. Add to the root set $R$ the webpages cited by $R$ and the webpages that cite $R$; after intrinsic links and irrelevant links have been processed, $R$ is thereby expanded into the set $G$. Take the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, forming a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$;
Step 2: Perform the I operation on $u$ to update $a(u)$, and perform the O operation on $v$ to update $h(v)$. The I operation and the O operation are respectively:
[Formula images for the I operation and the O operation are not reproduced in this text.]
In the above formulas, the summation over $v$ ranges over all pages in $V_1$, the summation over $u$ ranges over all pages in $V_2$, and $risk(F, u)$ and $risk(F, v)$ are calculated as follows:
[Formula image for $risk$ is not reproduced in this text.]
In the above formula, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. The text set of page $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Likewise, the text set of page $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$;
Step 3: As in step 2, perform the I operation on all pages in vertex set $V_2$ and the O operation on all pages in vertex set $V_1$. Then normalize $a(u)$ and $h(v)$ by the following formula:
[Formula image for the normalization is not reproduced in this text.]
In the above formula, $q$ denotes the number of in-linking nodes;
Step 4: Repeat steps 2 and 3 iteratively until $a(u)$ and $h(v)$ converge;
Step 5: Finally, sort the pages by security according to the $a(u)$ value of each page.
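Steps 2 to 5 can be illustrated with a minimal Python sketch. The exact I, O and normalization formulas are not available in this text, so the sketch rests on two loudly stated assumptions: each score is damped by a factor (1 - risk) of the page being updated, and scores are normalized to unit Euclidean length. The names `secure_hits`, `rank_by_security` and the `risk` dictionary are hypothetical and are not taken from the patent.

```python
import math

def secure_hits(edges, risk, iterations=50):
    """HITS with scores damped by a per-page risk value in [0, 1].

    ASSUMPTION: the (1 - risk) damping factor and the Euclidean
    normalization are illustrative stand-ins for the source formulas.
    edges -- list of (hub_page, authority_page) hyperlinks
    risk  -- dict: page -> risk value; missing pages count as risk 0
    """
    hubs = sorted({v for v, _ in edges})
    auths = sorted({u for _, u in edges})
    h = dict.fromkeys(hubs, 1.0)   # initial h(v) = 1
    a = dict.fromkeys(auths, 1.0)  # initial a(u) = 1
    for _ in range(iterations):
        # I operation: sum hub values of in-linking pages, damped by risk of u.
        a = {u: (1.0 - risk.get(u, 0.0)) * sum(h[v] for v, w in edges if w == u)
             for u in auths}
        # O operation: sum authority values of linked pages, damped by risk of v.
        h = {v: (1.0 - risk.get(v, 0.0)) * sum(a[w] for x, w in edges if x == v)
             for v in hubs}
        # Normalization step (assumed Euclidean norm).
        na = math.sqrt(sum(x * x for x in a.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {u: x / na for u, x in a.items()}
        h = {v: x / nh for v, x in h.items()}
    return a, h

def rank_by_security(a):
    """Step 5: order pages by descending authority value."""
    return sorted(a, key=a.get, reverse=True)

edges = [("h1", "good"), ("h2", "good"), ("h1", "bad"), ("h2", "bad")]
a, h = secure_hits(edges, risk={"bad": 0.9})
ranking = rank_by_security(a)
```

In this toy graph both authority pages receive identical links, so plain link analysis would score them equally; under the assumed damping the page with the higher risk value ends up ranked lower.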
The related content of the present invention is further described below.
1. The Hits algorithm
The Hits algorithm is a webpage link analysis algorithm proposed by Kleinberg of IBM. Given a search query, it uses link analysis to find the authoritative pages relevant to the query topic. The basic idea of the algorithm is to derive a weight for each webpage through link analysis and thereby determine the authority of the webpage. The Hits algorithm divides webpages into two types: pages that express authority on a certain topic, called authority pages; and pages that link these authority pages together, called hub pages. The Hits algorithm involves two important weight concepts:
Authority: the weighted number of times an authoritative webpage is cited by other webpages, i.e. the weighted in-degree of that webpage. The more often a webpage is cited, the larger its weighted in-degree and its Authority value, and the more important the webpage.
Hub: the weighted number of other webpages that a webpage points to, i.e. the weighted out-degree of that webpage; it provides a collection of links pointing to authoritative webpages. The larger the weighted out-degree of a webpage, the larger its Hub value. A hub implicitly indicates the authority pages of a topic.
For a specific search query, the focused subgraph for the query is built as follows:
Use a text-based search engine (such as Hotbot or AltaVista) to retrieve the result set for the query, and take the $t$ top-ranked results, called the root set (Root Set) $R$. $R$ satisfies features (1) and (2) but is far from satisfying feature (3), so it needs to be expanded.
The expansion of $R$ has two parts. First, all pages pointed to by pages in $R$ are added, that is, in the graph model every directed edge starting in $R$ is added, with no limit on the number of pages added. Second, for each page in $R$, an arbitrary subset of the pages linking to it is added; the size of this subset is conventionally capped at 50, and if no more than 50 pages link to the page, all of them are taken. Adding these pages to the original $R$ forms the base set (Base Set). Such a set satisfies the above three features reasonably well and typically contains between 1000 and 5000 pages.
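The two-part expansion described above can be sketched in Python as follows. The function name `expand_root_set` and the dictionary-based link maps are hypothetical choices made for illustration; only the unlimited out-link expansion and the cap of 50 in-linking pages per root page come from the text.

```python
import random

def expand_root_set(root, out_links, in_links, d=50, seed=0):
    """Grow a root set into a base set.

    root      -- iterable of page ids (the top-ranked search results)
    out_links -- dict: page id -> set of pages it links to
    in_links  -- dict: page id -> set of pages that link to it
    d         -- cap on in-linking pages added per root page
    """
    rng = random.Random(seed)
    base = set(root)
    for page in root:
        # Part one: every page cited by a root page is added, without limit.
        base.update(out_links.get(page, ()))
        # Part two: at most d of the pages citing a root page are added.
        citing = sorted(in_links.get(page, ()))
        if len(citing) > d:
            citing = rng.sample(citing, d)
        base.update(citing)
    return base

root = ["r1"]
out_links = {"r1": {"a", "b"}}
in_links = {"r1": {f"c{i}" for i in range(60)}}
base = expand_root_set(root, out_links, in_links, d=50)
```

The fixed seed only makes the arbitrary choice of in-linking pages repeatable; any selection of at most `d` pages would fit the description.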
To improve the computation, the base set is processed further. Links fall into two cases: links between two pages under different domain names, called transverse links; and links between two pages under the same domain name, called intrinsic links.
Intrinsic links serve only for navigation within a site and convey almost no authority between webpages, so links of this kind are deleted from the base set. Irrelevant links such as advertisements are then also removed, forming the set $G$.
$G$ can be regarded as a focused subgraph that satisfies the above three features. Hubs and authorities are computed on it, and the finally converged authority values are sorted to obtain the desired result.
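Removing intrinsic links can be sketched with the Python standard library. This is a simplified illustration that treats "same domain name" as "same host name"; a production system might instead compare registered domains. The function name `drop_intrinsic_links` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def drop_intrinsic_links(edges):
    """Keep only transverse links, i.e. links between different host names.

    edges -- list of (source_url, target_url) pairs
    """
    kept = []
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Links whose host cannot be parsed are dropped as well.
        if src_host and dst_host and src_host != dst_host:
            kept.append((src, dst))
    return kept

edges = [
    ("http://a.example/index.html", "http://a.example/about.html"),  # intrinsic
    ("http://a.example/index.html", "http://b.example/"),            # transverse
]
transverse = drop_intrinsic_links(edges)
```

Filtering irrelevant links such as advertisements would need a separate rule set, which the text does not specify and which is therefore left out of this sketch.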
Authorities and hubs reinforce each other: a good hub page points to many good authorities, and a good authority page is pointed to by many good hubs.
$G$ is expressed as a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of webpage $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of webpage $u$. Initially $h(v) = a(u) = 1$. The I operation is performed on $u$ and the O operation on $v$, updating $a(u)$ and $h(v)$ respectively, followed by normalization. The I and O operations below are repeated in this way until $a(u)$ and $h(v)$ converge.
I operation: [Formula image not reproduced in this text.]
O operation: [Formula image not reproduced in this text.]
In the above formulas, the summation over $v$ ranges over the pages in $V_1$ and the summation over $u$ ranges over the pages in $V_2$.
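For orientation, the iteration just described can be sketched using the standard Hits update rules as published by Kleinberg: the I operation sets $a(u)$ to the sum of $h(v)$ over all pages $v$ that link to $u$, and the O operation sets $h(v)$ to the sum of $a(u)$ over all pages $u$ that $v$ links to. Normalizing to unit Euclidean length and the tolerance-based stopping rule are assumptions of this sketch; the function name `hits` is hypothetical.

```python
import math

def hits(edges, max_iter=100, tol=1e-9):
    """Standard HITS on a list of (hub_page, authority_page) edges."""
    if not edges:
        return {}, {}
    hubs = {v for v, _ in edges}
    auths = {u for _, u in edges}
    h = {v: 1.0 for v in hubs}   # initial h(v) = 1
    a = {u: 1.0 for u in auths}  # initial a(u) = 1
    for _ in range(max_iter):
        # I operation: authority = sum of hub values of in-linking pages.
        new_a = {u: 0.0 for u in auths}
        for v, u in edges:
            new_a[u] += h[v]
        # O operation: hub = sum of authority values of linked pages.
        new_h = {v: 0.0 for v in hubs}
        for v, u in edges:
            new_h[v] += new_a[u]
        # Normalize each score vector to unit Euclidean length.
        na = math.sqrt(sum(x * x for x in new_a.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in new_h.values())) or 1.0
        new_a = {u: x / na for u, x in new_a.items()}
        new_h = {v: x / nh for v, x in new_h.items()}
        # Largest change in any score since the previous round.
        delta = max(
            max(abs(new_a[u] - a[u]) for u in auths),
            max(abs(new_h[v] - h[v]) for v in hubs),
        )
        a, h = new_a, new_h
        if delta < tol:
            break
    return a, h

a, h = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
```

In the small example, `a1` is cited by both hubs and `h1` cites both authorities, so each ends up with the larger score in its group.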
2. Web security based on the Hits algorithm
The security model is implemented mainly by matching the malicious feature library against page source code. A model similar to the vector space model (VSM, Vector Space Model) is used to compute the similarity between the feature codes and a page, i.e. the risk. In this model a document is represented by a vector, the feature codes in the document are represented by the components of the vector, and the component values are their weights.
[Formula image for the similarity is not reproduced in this text.]
Here the two vectors are the feature vectors of the malicious feature library and of the document respectively; the remaining symbols denote the dimension of the feature vectors and the $i$-th dimension of a feature vector.
Likewise, the similarity between the risk library $F$ and a document $D$ can be used to evaluate the risk of a page.
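A vector-space similarity of this kind is commonly computed as the cosine of the angle between two weight vectors, and the sketch below assumes that form since the similarity formula itself is not available in this text. The aggregation in `page_risk`, which weights the best match of each feature code by its hazard factor and caps the total at 1, is purely hypothetical and is shown only to make the idea concrete.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny) if nx and ny else 0.0

def page_risk(features, hazards, page_vectors):
    """Hypothetical aggregate risk of one page.

    features     -- list of feature-code vectors (the library)
    hazards      -- list of hazard factors, one per feature code
    page_vectors -- list of component vectors of the page text
    ASSUMPTION: hazard-weighted best match per feature code, capped at 1.
    """
    total = 0.0
    for f, mu in zip(features, hazards):
        best = max((cosine(f, u) for u in page_vectors), default=0.0)
        total += mu * best
    return min(1.0, total)

risk_value = page_risk([[1.0, 0.0]], [0.5], [[1.0, 0.0], [0.0, 1.0]])
```

A page whose text contains an exact match for a feature code then receives a risk equal to that code's hazard factor, while a page with no overlapping components receives zero.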
The above are preferred embodiments of the present invention. All changes made according to the technical solution of the present invention whose resulting functions do not go beyond the scope of that technical solution fall within the protection scope of the present invention.
Claims (1)
1. A Web search result security sorting method based on the Hits algorithm, characterized in that: a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing $n$ kinds of feature codes of network viruses, trojans and vulnerabilities that appear in webpages; each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$; webpages are then sorted by security on the basis of the Hits algorithm as follows:
Step 1: Submit the search query to a text-based search engine and take the top $t$ webpages of the returned results as the root set $R$. Add to the root set $R$ the webpages cited by $R$ and the webpages that cite $R$; after intrinsic links and irrelevant links have been processed, $R$ is thereby expanded into the set $G$. Take the Hub webpages in $G$ as vertex set $V_1$, the Authority webpages as vertex set $V_2$, and the hyperlinks from webpages in $V_1$ to webpages in $V_2$ as edge set $E$, forming a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of the webpage $v$ corresponding to vertex $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of the webpage $u$ corresponding to vertex $u$. Initially $h(v) = a(u) = 1$;
Step 2: Perform the I operation on $u$ to update $a(u)$, and perform the O operation on $v$ to update $h(v)$. The I operation and the O operation are respectively:
[Formula images for the I operation and the O operation are not reproduced in this text.]
In the above formulas, the summation over $v$ ranges over all pages in $V_1$, the summation over $u$ ranges over all pages in $V_2$, and $risk(F, u)$ and $risk(F, v)$ are calculated as follows:
[Formula image for $risk$ is not reproduced in this text.]
In the above formula, $\mu_i$ denotes the hazard factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. The text set of webpage $u$ is expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of webpage $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Likewise, the text set of webpage $v$ is expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of webpage $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$;
Step 3: As in step 2, perform the I operation on all pages in vertex set $V_2$ and the O operation on all pages in vertex set $V_1$. Then normalize $a(u)$ and $h(v)$ by the following formula:
[Formula image for the normalization is not reproduced in this text.]
In the above formula, $q$ denotes the number of in-linking nodes;
Step 4: Repeat steps 2 and 3 iteratively until $a(u)$ and $h(v)$ converge;
Step 5: Finally, sort the pages by security according to the $a(u)$ value of each page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210095140.2A CN102663077B (en) | 2012-03-31 | 2012-03-31 | Web search results security sorting method based on Hits algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102663077A | 2012-09-12 |
CN102663077B | 2014-03-12 |
Family
ID=46772568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210095140.2A Expired - Fee Related CN102663077B (en) | 2012-03-31 | 2012-03-31 | Web search results security sorting method based on Hits algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102663077B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937974A (en) * | 2012-10-17 | 2013-02-20 | 北京奇虎科技有限公司 | Search server and search method |
CN103761476B (en) * | 2013-12-30 | 2016-11-09 | 北京奇虎科技有限公司 | The method and device of feature extraction |
CN108182186B (en) * | 2016-12-08 | 2020-10-02 | 广东精点数据科技股份有限公司 | Webpage sorting method based on random forest algorithm |
CN107622048B (en) * | 2017-09-06 | 2021-06-22 | 南京硅基智能科技有限公司 | Text mode recognition method and system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101409634A (en) * | 2007-10-10 | 2009-04-15 | 中国科学院自动化研究所 | Quantitative analysis tools and method for internet news influence based on information retrieval |
Non-Patent Citations (3)
Title |
---|
Yangfu Guo et al. "Optimizing for Web Security Based on Search Engine". 2010 International Conference on Computer Design and Applications (ICCDA 2010), 2010, vol. 5, pp. 545-549. |
Dawei Hong et al. "Analysis of Web search algorithm HITS". International Journal of Foundations of Computer Science, 2004, vol. 15, no. 4, pp. 591-599. |
Jung-Hun Lee et al. "Associated word extraction system for search query expansion based on hits". CCIS, 2011, pp. 649-662. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | C06 | Publication | |
 | PB01 | Publication | |
 | C10 | Entry into substantive examination | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140312; Termination date: 20170331 |