CN102663077A - Web search results security sorting method based on Hits algorithm - Google Patents
- Publication number: CN102663077A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to the technical field of network security, and specifically to a method for security ranking of Web search results based on the Hits algorithm. The method comprises: establishing a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ that contains the feature codes of $n$ kinds of network viruses, trojans, and vulnerabilities appearing in web pages; expressing each feature code of the library as a vector of $m$ components, $f_x = (f_{x1}, f_{x2}, f_{x3}, \ldots, f_{xm})$, where $x \in \{1, 2, \ldots, n\}$ and $f_x \in F$, with the weight of each component denoted $f'_x$; and then combining a vector space model with the malicious feature library to rank web search results by security. The method lowers the rank of malicious web pages in the search results and thereby reduces the probability of accessing insecure pages.
Description
Technical field
The present invention relates to the field of network security technology, and in particular to a method for security ranking of Web search results based on the Hits algorithm.
Background art
With the rapid development of the Internet, Web resources are growing exponentially, which makes them increasingly difficult to manage. Today, large numbers of malicious web pages that conceal trojans, viruses, and programs such as illegal advertising are rampant on the Web. These pages use deceptive means to exploit the limitations of search engines, so that some malicious pages sit near the top of search result rankings. This can seriously endanger the information security of users' computers and other terminals. Improving Web security has therefore become an urgent problem.
Summary of the invention
The object of the present invention is to provide a method for security ranking of Web search results based on the Hits algorithm. The method lowers the rank of malicious web pages in search results and thereby reduces the probability of accessing insecure pages.
The technical scheme adopted by the present invention is a method for security ranking of Web search results based on the Hits algorithm. A malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, trojans, and vulnerabilities that appear in web pages. Each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Then, based on the Hits algorithm, web pages are ranked by security as follows:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ pages of the returned result set are taken as the root set $R$. The pages referenced by $R$ and the pages that reference $R$ are added to $R$; after intrinsic links and irrelevant links are processed, $R$ is thereby expanded into the set $G$. Taking the Hub pages in $G$ as vertex set $V_1$, the Authority pages as vertex set $V_2$, and the hyperlinks from pages in $V_1$ to pages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of page $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of page $u$. Initially, $h(v) = a(u) = 1$.
Step 2: The I operation is applied to $u$ to update $a(u)$, and the O operation is applied to $v$ to update $h(v)$. The I operation and the O operation are, respectively:
[formula images not reproduced]
In these formulas, the first summation runs over the pages in $V_1$ and the second summation runs over the pages in $V_2$. $Risk(F, u)$ and $Risk(F, v)$ are calculated by the following formula:
[formula image not reproduced]
In this formula, $\mu_i$ denotes the harm factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. Page $u$ is a text set expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Page $v$ is likewise a text set expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$.
Step 3: Following step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation is applied to all pages in vertex set $V_1$. Afterwards, $a(u)$ and $h(v)$ are normalized by the following formula:
[formula image not reproduced]
In this formula, $q$ denotes the number of linked nodes.
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge.
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
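Steps 1 to 5 can be sketched in Python as follows. Because the formula images for the I operation, the O operation, and the normalization are not preserved in this text, the sketch makes its own assumptions: each summed score is damped by the factor `1 - Risk`, and normalization divides by the Euclidean norm. The function names `risk_weighted_hits` and `rank_by_security`, and the `risk` dictionary standing in for $Risk(F, \cdot)$, are hypothetical and are not taken from the patent.

```python
import math

def risk_weighted_hits(edges, risk, max_iter=100, tol=1e-9):
    """Iterate hub/authority scores on a bipartite link graph.

    edges: iterable of (v, u) hyperlinks from hub page v to authority page u.
    risk:  dict mapping a page to a risk score in [0, 1]; missing pages count as 0.
    Returns (authority, hub) dictionaries.
    """
    edges = list(edges)
    if not edges:
        return {}, {}
    hubs = {v for v, _ in edges}
    auths = {u for _, u in edges}
    h = {v: 1.0 for v in hubs}    # Step 1: initial h(v) = 1
    a = {u: 1.0 for u in auths}   # Step 1: initial a(u) = 1

    for _ in range(max_iter):
        # I operation (ASSUMED form): sum hub values of in-linking pages,
        # then damp by (1 - risk of u).
        new_a = {u: 0.0 for u in auths}
        for v, u in edges:
            new_a[u] += h[v]
        for u in auths:
            new_a[u] *= 1.0 - risk.get(u, 0.0)

        # O operation (ASSUMED form): sum authority values of linked pages,
        # then damp by (1 - risk of v).
        new_h = {v: 0.0 for v in hubs}
        for v, u in edges:
            new_h[v] += new_a[u]
        for v in hubs:
            new_h[v] *= 1.0 - risk.get(v, 0.0)

        # Step 3: normalization (ASSUMED: divide by the Euclidean norm).
        norm_a = math.sqrt(sum(x * x for x in new_a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in new_h.values())) or 1.0
        new_a = {u: x / norm_a for u, x in new_a.items()}
        new_h = {v: x / norm_h for v, x in new_h.items()}

        # Step 4: stop once neither score vector changes noticeably.
        delta = max(
            max(abs(new_a[u] - a[u]) for u in auths),
            max(abs(new_h[v] - h[v]) for v in hubs),
        )
        a, h = new_a, new_h
        if delta < tol:
            break
    return a, h

def rank_by_security(authority):
    """Step 5: order pages by descending a(u)."""
    return sorted(authority, key=authority.get, reverse=True)
```

Under these assumptions, a page that carries a high risk score ends up with a smaller authority value than an otherwise identically linked page and therefore ranks lower.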
The beneficial effect of the invention is that, on the basis of the Hits algorithm, the degree of risk of a web page is evaluated by combining a vector space model with the malicious feature library. By limiting the Authority value of malicious web pages, the method lowers their rank in search results, which reduces the probability of accessing insecure pages and strengthens Web security.
Description of drawings
Fig. 1 is a schematic diagram of the working principle of an embodiment of the invention.
Embodiment
In the method of the present invention for security ranking of Web search results based on the Hits algorithm, a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, trojans, and vulnerabilities that appear in web pages. Each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$. Then, based on the Hits algorithm, web pages are ranked by security as follows:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ pages of the returned result set are taken as the root set $R$. The pages referenced by $R$ and the pages that reference $R$ are added to $R$; after intrinsic links and irrelevant links are processed, $R$ is thereby expanded into the set $G$. Taking the Hub pages in $G$ as vertex set $V_1$, the Authority pages as vertex set $V_2$, and the hyperlinks from pages in $V_1$ to pages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of page $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of page $u$. Initially, $h(v) = a(u) = 1$.
Step 2: The I operation is applied to $u$ to update $a(u)$, and the O operation is applied to $v$ to update $h(v)$. The I operation and the O operation are, respectively:
[formula images not reproduced]
In these formulas, the first summation runs over the pages in $V_1$ and the second summation runs over the pages in $V_2$. $Risk(F, u)$ and $Risk(F, v)$ are calculated by the following formula:
[formula image not reproduced]
In this formula, $\mu_i$ denotes the harm factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. Page $u$ is a text set expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Page $v$ is likewise a text set expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$.
Step 3: Following step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation is applied to all pages in vertex set $V_1$. Afterwards, $a(u)$ and $h(v)$ are normalized by the following formula:
[formula image not reproduced]
In this formula, $q$ denotes the number of linked nodes.
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge.
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
The content related to the present invention is described further below.
1. Hits algorithm
The Hits algorithm is a web link analysis algorithm proposed by Kleinberg of IBM. Given a search query, it finds the authoritative pages relevant to the topic through link analysis. The basic idea is to derive a weight for each web page from an analysis of the links between pages, and from that weight the authority of the page. The Hits algorithm divides web pages into two types: pages that express a certain topic authoritatively, called authority pages, and pages that link these authority pages together, called hub pages. The algorithm defines two important weight concepts:
Authority: the weighted number of times an authoritative page is referenced by other pages, i.e., the weighted in-degree of that page. The more often a page is referenced, the larger its weighted in-degree and its Authority value, and the more important the page.
Hub: the weighted number of other pages that a Web page points to, i.e., the weighted out-degree of that page; a hub provides links that point to a set of authoritative pages. The larger the weighted out-degree of a page, the larger its Hub value. Hubs implicitly indicate the authority pages for a topic.
For a specific retrieval query, the focused subgraph for that query is constructed as follows:
From the result set retrieved by a text-based search engine (such as Hotbot or AltaVista), the top-ranked $t$ results are taken as the root set $R$ (Root Set). $R$ satisfies properties (1) and (2) but falls far short of property (3), so $R$ needs to be expanded.
The expansion of $R$ has two parts. First, every page that a page in $R$ points to is added; in the graph model, this means adding the directed edges that start in $R$, with no limit on their number. Second, for each page in $R$, an arbitrary selection of the pages that link to it is added; the selection limit is usually set to 50, and if no more than 50 pages link to it, all of them are taken. Adding these pages to the original $R$ forms the base set (Base Set). Such a set satisfies the three properties above reasonably well, and it generally contains between 1000 and 5000 pages.
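The two-part expansion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the link structure is already available as two in-memory dictionaries (`out_links` and `in_links`, both hypothetical names), and the "arbitrary" choice of in-linking pages is made by seeded random sampling.

```python
import random

def expand_root_set(root, out_links, in_links, max_in=50, seed=0):
    """Expand a root set into a base set.

    root:      set of page identifiers (strings).
    out_links: dict page -> iterable of pages that the page points to.
    in_links:  dict page -> iterable of pages that point to the page.
    max_in:    cap on in-linking pages added per root page (50 in the text).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    base = set(root)
    for page in root:
        # Part 1: add every page the root page points to (no limit).
        base.update(out_links.get(page, ()))
        # Part 2: add at most max_in of the pages pointing to the root page.
        parents = sorted(in_links.get(page, ()))
        if len(parents) > max_in:
            parents = rng.sample(parents, max_in)
        base.update(parents)
    return base
```

For example, a root page with 2 outgoing links and 60 in-linking pages contributes 2 + 50 pages to the base set in addition to itself.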
To improve the quality of the computation, the base set is processed further. Links are divided into two kinds. In the first kind, the two linked pages are under different domain names; such a link is called a transverse link. In the second kind, the two linked pages are under the same domain name; such a link is called an intrinsic link.
Intrinsic links serve only for internal navigation and transfer almost no authority between pages, so links of this kind are deleted from the base set. Some irrelevant links, such as advertisements, are also removed, which forms $G$. $G$ can be regarded as a focused subgraph that satisfies the three properties above. The hub and authority values are then computed, and the final converged authority values are sorted to obtain the required result.
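Removing intrinsic links can be sketched as a simple filter over the edge list. The sketch assumes that "same domain name" means an identical host name as returned by `urllib.parse.urlparse`; a real system might instead compare registered domains, which is not attempted here. The function name `drop_intrinsic_links` is hypothetical.

```python
from urllib.parse import urlparse

def drop_intrinsic_links(edges):
    """Keep only transverse links, i.e. links between different host names.

    edges: iterable of (source_url, target_url) pairs.
    ASSUMPTION: two pages share a domain when their host names are identical.
    """
    kept = []
    for src, dst in edges:
        if urlparse(src).hostname != urlparse(dst).hostname:
            kept.append((src, dst))
    return kept
```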
Authorities and hubs reinforce each other: a good hub page points to many good authorities, and a good authority page is pointed to by many good hubs.
$G$ is expressed as a bipartite directed graph $S = (V_1, V_2, E)$. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of page $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of page $u$.
First, the I operation is applied to $u$ and the O operation is applied to $v$, updating $a(u)$ and $h(v)$ respectively, and the values are then normalized. The I and O operations below are repeated until $a(u)$ and $h(v)$ converge.
[formula images not reproduced]
In these formulas, the first summation runs over the pages in $V_1$ and the second summation runs over the pages in $V_2$.
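For reference, the I and O operations of the original Hits algorithm, as commonly stated following Kleinberg's formulation, are given below in the notation of this section. This is the standard textbook form and is offered as an assumption about what the missing formula images showed; it does not include the risk term that the invention adds.

```latex
% I operation: authority of u is the sum of hub values of pages linking to u
a(u) \leftarrow \sum_{v \,:\, (v,u) \in E} h(v)

% O operation: hub value of v is the sum of authority values of pages v links to
h(v) \leftarrow \sum_{u \,:\, (v,u) \in E} a(u)

% Normalization commonly used with Hits: scale so the squares sum to 1
\sum_{u \in V_2} a(u)^2 = 1, \qquad \sum_{v \in V_1} h(v)^2 = 1
```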
2. Web security based on the Hits algorithm
The security model works mainly by matching the malicious feature library against page source code. A similarity measure in the style of the vector space model (VSM, Vector Space Model) is computed between the feature codes and the page, and this similarity is taken as the risk. In this model, a document is represented as a vector, the feature codes in the document are represented as the components of the vector, and the component values are weights.
[formula image not reproduced]
Here the two vectors can be the feature vectors of the malicious feature library and of the document, respectively; the remaining symbols denote the dimension of the feature vectors and their components along each dimension.
In the same way, the similarity between the risk library $F$ and a document $D$ can be used to evaluate the risk of a page.
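The similarity measure most often used with the vector space model is the cosine of the angle between two vectors. The sketch below assumes that this is the measure intended, since the formula image is not preserved; the harm factors $\mu_i$ from the method steps are not modeled. The function name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two equal-length weight vectors.

    ASSUMPTION: the VSM similarity in the text is the usual cosine measure.
    Returns 0.0 when either vector is all zeros.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)
```

With non-negative weights the result lies between 0 and 1, so a page whose vector is close to a feature code vector yields a risk near 1, and a page with no matching components yields 0.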
The above are preferred embodiments of the present invention. Any change made according to the technical scheme of the present invention whose resulting function does not go beyond the scope of that technical scheme falls within the protection scope of the present invention.
Claims (1)
1. A method for security ranking of Web search results based on the Hits algorithm, characterized in that: a malicious feature library $F = (f_1, f_2, f_3, \ldots, f_n)$ is established, containing the feature codes of $n$ kinds of network viruses, trojans, and vulnerabilities that appear in web pages; each feature code $f_i$ of the library is expressed as a vector of $m$ components, $f_i = (f_{i1}, f_{i2}, f_{i3}, \ldots, f_{im})$, where $i \in \{1, 2, \ldots, n\}$ and $f_i \in F$; then, based on the Hits algorithm, web pages are ranked by security as follows:
Step 1: A search query is submitted to a text-based search engine, and the top $t$ pages of the returned result set are taken as the root set $R$. The pages referenced by $R$ and the pages that reference $R$ are added to $R$; after intrinsic links and irrelevant links are processed, $R$ is thereby expanded into the set $G$. Taking the Hub pages in $G$ as vertex set $V_1$, the Authority pages as vertex set $V_2$, and the hyperlinks from pages in $V_1$ to pages in $V_2$ as edge set $E$, a bipartite directed graph $S = (V_1, V_2, E)$ is formed. For any vertex $v$ in $V_1$, $h(v)$ denotes the Hub value of page $v$; for any vertex $u$ in $V_2$, $a(u)$ denotes the Authority value of page $u$. Initially, $h(v) = a(u) = 1$.
Step 2: The I operation is applied to $u$ to update $a(u)$, and the O operation is applied to $v$ to update $h(v)$. The I operation and the O operation are, respectively:
[formula images not reproduced]
In these formulas, the first summation runs over the pages in $V_1$ and the second summation runs over the pages in $V_2$. $Risk(F, u)$ and $Risk(F, v)$ are calculated by the following formula:
[formula image not reproduced]
In this formula, $\mu_i$ denotes the harm factor of the $i$-th kind of feature code in the malicious feature library, with $\mu_i \in (0, 1)$. Page $u$ is a text set expressed as the vector $u = (u_1, u_2, u_3, \ldots, u_p)$, and each component $u_k$ of page $u$ is expressed as a vector of $m$ components, $u_k = (u_{k1}, u_{k2}, u_{k3}, \ldots, u_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $u_k \in u$. Page $v$ is likewise a text set expressed as the vector $v = (v_1, v_2, v_3, \ldots, v_p)$, and each component $v_k$ of page $v$ is expressed as a vector of $m$ components, $v_k = (v_{k1}, v_{k2}, v_{k3}, \ldots, v_{km})$, where $k \in \{1, 2, \ldots, p\}$ and $v_k \in v$.
Step 3: Following step 2, the I operation is applied to all pages in vertex set $V_2$ and the O operation is applied to all pages in vertex set $V_1$. Afterwards, $a(u)$ and $h(v)$ are normalized by the following formula:
[formula image not reproduced]
In this formula, $q$ denotes the number of linked nodes.
Step 4: Steps 2 and 3 are repeated iteratively until $a(u)$ and $h(v)$ converge.
Step 5: Finally, the pages are ranked by security according to the $a(u)$ value of each page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210095140.2A CN102663077B (en) | 2012-03-31 | 2012-03-31 | Web search results security sorting method based on Hits algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102663077A (en) | 2012-09-12 |
CN102663077B (en) | 2014-03-12 |
Family ID: 46772568
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201210095140.2A | Expired - Fee Related, CN102663077B (en) | 2012-03-31 | 2012-03-31 | Web search results security sorting method based on Hits algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102663077B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014059852A1 (en) * | 2012-10-17 | 2014-04-24 | 北京奇虎科技有限公司 | Search server and search method |
CN103761476A (en) * | 2013-12-30 | 2014-04-30 | 北京奇虎科技有限公司 | Characteristic extraction method and device |
CN107622048A (en) * | 2017-09-06 | 2018-01-23 | 上海斐讯数据通信技术有限公司 | A kind of text mode recognition method and system |
CN108182186A (en) * | 2016-12-08 | 2018-06-19 | 广东精点数据科技股份有限公司 | A kind of Web page sequencing method based on random forests algorithm |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101409634A (en) * | 2007-10-10 | 2009-04-15 | 中国科学院自动化研究所 | Quantitative analysis tools and method for internet news influence based on information retrieval |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014059852A1 (en) * | 2012-10-17 | 2014-04-24 | 北京奇虎科技有限公司 | Search server and search method |
CN103761476A (en) * | 2013-12-30 | 2014-04-30 | 北京奇虎科技有限公司 | Characteristic extraction method and device |
CN103761476B (en) * | 2013-12-30 | 2016-11-09 | 北京奇虎科技有限公司 | The method and device of feature extraction |
CN108182186A (en) * | 2016-12-08 | 2018-06-19 | 广东精点数据科技股份有限公司 | A kind of Web page sequencing method based on random forests algorithm |
CN107622048A (en) * | 2017-09-06 | 2018-01-23 | 上海斐讯数据通信技术有限公司 | A kind of text mode recognition method and system |
CN107622048B (en) * | 2017-09-06 | 2021-06-22 | 南京硅基智能科技有限公司 | Text mode recognition method and system |
Also Published As
Publication number | Publication date |
---|---|
CN102663077B (en) | 2014-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20140312; Termination date: 20170331 |