CN110147472A - Method and apparatus for detecting cheating websites, and detection device for cheating websites - Google Patents
- Publication number
- CN110147472A (CN201710576240.XA)
- Authority
- CN
- China
- Prior art keywords
- website
- cheating
- page
- feature
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
This application provides a method and an apparatus for detecting cheating websites, and a detection device for cheating websites. The detection method includes: extracting, from the retrieval log and/or access log of a known cheating website, the page features of pages under that website; building a cheating detection model according to the cheating rules indicated by the page features, the model being used to detect whether a website cheats; and detecting, according to the cheating detection model, whether a website to be detected cheats. The embodiments of this application can improve the accuracy of cheating detection results for websites.
Description
Technical field
This application relates to the field of website detection technology, and in particular to a method for detecting cheating websites, an apparatus for detecting cheating websites, a detection device for cheating websites, and a computer-readable medium.
Background art
As users access the Internet more and more often, website cheating is also becoming more common. Website cheating means that a website manipulates search results so that web pages that do not belong in a user's query results nevertheless appear in them. In general, website cheating is mainly divided into content-based cheating, link cheating, crawler-deception cheating, and so on.
In the prior art, each web page under a website is generally analyzed, and whether the website cheats is determined from the analysis results.
Summary of the invention
In the course of research, the inventor found that, when analyzing web pages, the prior art relies on cheating techniques that have already been identified. If the web pages under a website use a cheating technique that has not been analyzed, the prior art cannot judge accurately whether that website cheats. Moreover, web pages are generally selected for analysis by random sampling, so a large number of unrepresentative pages may be analyzed, which leaves the cheating web page models trained in the prior art with insufficient precision and recall.
The inventor also found that, for a known cheating website, the historical search records of a search engine can be used. These include the retrieval log, which records retrievals of web pages under the website, and the access log, which records visits to them. A cheating detection model can be built from information such as the search result pages retrieved under the known cheating website, the access frequency of the visited pages, and the corresponding search terms. Such a model reflects the cheating rules of cheating websites and therefore allows more accurate cheating detection on other websites. In addition, because the model is built from users' retrieval logs and access logs in the search engine, that is, from the user's perspective, it is more consistent and representative.
On this basis, this application provides a method for detecting cheating websites, which may include:
extracting, from the retrieval log and/or access log of a known cheating website, the page features of pages under the known cheating website;
building a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
detecting, according to the cheating detection model, whether a website to be detected cheats.
The extracting, from the retrieval log and/or access log of a known cheating website, of the page features of pages under the known cheating website may include:
obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses to each accessed page; and
extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extracting of text features and/or structure features of the search result pages and/or accessed pages as the page features may include:
extracting the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
extracting the body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
The building of a cheating detection model according to the cheating rules indicated by the page features may include:
converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
The detecting, according to the cheating detection model, of whether a website to be detected cheats may include:
obtaining the pages to be detected of the website to be detected;
extracting the page features of the pages to be detected, and converting them into the feature vector to be detected of the website to be detected; and
detecting whether the website to be detected is a cheating website according to whether the feature vector to be detected satisfies the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain classes of websites; and
determining the websites whose manual annotation result is "cheating" as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
The method may further include:
demoting or deleting a website to be detected whose detection result is cheating.
This application also provides an apparatus to ensure the implementation and application of the above method in practice.
An apparatus for detecting cheating websites provided by the embodiments of this application includes:
an extraction unit, configured to extract, from the retrieval log and/or access log of a known cheating website, the page features of pages under the known cheating website;
a model construction unit, configured to build a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
a detection unit, configured to detect, according to the cheating detection model, whether a website to be detected cheats.
The extraction unit may include:
an obtaining subunit, configured to obtain the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses to each accessed page; and
an extraction subunit, configured to extract text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extraction subunit may include:
an information extraction subunit, configured to extract the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
a structure extraction subunit, configured to extract the body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
The model construction unit may include:
a conversion subunit, configured to convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
a construction subunit, configured to build the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
The detection unit may include:
an obtaining subunit, configured to obtain the pages to be detected of the website to be detected;
an extraction subunit, configured to extract the page features of the pages to be detected and convert them into the feature vector to be detected of the website to be detected; and
a detection subunit, configured to detect whether the website to be detected is a cheating website according to whether the feature vector to be detected satisfies the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain classes of websites; and
determining the websites whose manual annotation result is "cheating" as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
The apparatus may further include:
a cheating processing unit, configured to demote or delete a website to be detected whose detection result is cheating.
This application also provides a detection device for cheating websites, which includes a memory and one or more programs. The one or more programs are stored in the memory and configured to be executed by one or more processors, and include instructions for performing the following operations:
extracting, from the retrieval log and/or access log of a known cheating website, the page features of pages under the known cheating website;
building a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
detecting, according to the cheating detection model, whether a website to be detected cheats.
This application also provides a computer-readable medium storing instructions that, when executed by one or more processors, cause a device to perform one or more of the aforementioned methods for detecting cheating websites.
The extracting, from the retrieval log and/or access log of a known cheating website, of the page features of pages under the known cheating website may specifically include:
obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses to each accessed page; and
extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extracting of text features and/or structure features of the search result pages and/or accessed pages as the page features may specifically include:
extracting the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
extracting the body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
The building of a cheating detection model according to the cheating rules indicated by the page features may specifically include:
converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
The detecting, according to the cheating detection model, of whether a website to be detected cheats may specifically include:
obtaining the pages to be detected of the website to be detected;
extracting the page features of the pages to be detected, and converting them into the feature vector to be detected of the website to be detected; and
detecting whether the website to be detected is a cheating website according to whether the feature vector to be detected satisfies the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain classes of websites; and
determining the websites whose manual annotation result is "cheating" as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
The device may further be configured so that the one or more programs executed by the one or more processors include instructions for performing the following operation:
demoting or deleting a website to be detected whose detection result is cheating.
In the embodiments of this application, for a known cheating website, the features of all or some of the pages under the website are extracted from the retrieval log and/or access log saved by a search engine, and a cheating detection model is built according to the cheating features indicated by the extracted page features. When this model is used to detect whether a website to be detected cheats, it can detect other websites more accurately because it reflects the cheating features that cheating websites exhibit on their pages. In addition, because the model is built from users' retrieval logs and access logs in the search engine, a model established from the user's perspective captures cheating websites more consistently and representatively.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an exemplary flowchart of an embodiment of the method for detecting cheating websites of this application;
Fig. 2 is an exemplary structural block diagram of an embodiment of the apparatus for detecting cheating websites of this application;
Fig. 3 is a block diagram of a detection device 800 for cheating websites according to an exemplary embodiment of this application;
Fig. 4 is a schematic structural diagram of a server in an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
This application can be used in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor devices, and distributed computing environments that include any of the above devices.
This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Referring to Fig. 1, a flowchart of an embodiment of the method for detecting cheating websites of this application is shown. The detection flow in this embodiment may include a process of building the cheating model and a website detection process. The model building process includes steps 101 to 102, and the website detection process includes step 103. The overall flow of this embodiment includes the following steps 101 to 104:
Step 101: from the retrieval log and/or access log of a known cheating website, extract the page features of pages under the known cheating website.
In this embodiment, when a cheating website is known, the retrieval log, access log, and similar records of that website stored in a database can be used to extract the page features of all or some of the pages under the website, so that these page features can later be used to build the cheating detection model.
Specifically, suppose the database of a search engine stores all retrieval logs and access logs of the website "www.ABCD.com".
The retrieval log records how each page under the website was retrieved by users. It may include the search term entered in each retrieval and the corresponding search result pages.
The access log records which pages users clicked to access after each retrieval, or which pages users reached in other ways such as page recommendation. It may include the pages accessed by users and the number of accesses to each accessed page. For example, the page "www.ABCD.com/890/hty" under the website "www.ABCD.com" was accessed 10 times, the page "www.ABCD.com/855555/ef" under the same website was accessed 100 times, and so on.
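The two logs described above can be pictured as simple records. The following is a minimal sketch only: the `RetrievalRecord` type, the field names, and the search terms are hypothetical and do not come from the application; only the two page URLs and their access counts (10 and 100) are taken from the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalRecord:
    """One row of a retrieval log (hypothetical layout)."""
    term: str        # search term entered by the user
    result_url: str  # search result page returned for that term


# Hypothetical retrieval log rows for the example site www.ABCD.com.
retrieval_log = [
    RetrievalRecord("cheap tickets", "www.ABCD.com/890/hty"),
    RetrievalRecord("free movies", "www.ABCD.com/855555/ef"),
]

# Hypothetical raw access events: one entry per page visit.
access_events = ["www.ABCD.com/890/hty"] * 10 + ["www.ABCD.com/855555/ef"] * 100

# Access log view: number of accesses per accessed page.
access_counts = Counter(access_events)

# Retrieval log view: search terms grouped by the page they returned.
terms_by_page = {}
for record in retrieval_log:
    terms_by_page.setdefault(record.result_url, []).append(record.term)
```

Grouping terms by page keeps each page's search terms available alongside its access count when features are extracted later.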
After the retrieval log and access log of the known cheating website are obtained, each search result page and accessed page is obtained from them, and the page features of each search result page and of each accessed page are extracted. Page features may include text features and structure features. Text features characterize the text information in a page, and structure features characterize how the parts of the page are distributed structurally. For example, text features may include: whether the text in the page contains prohibited words (for example, "Falun Gong"), whether there is a large amount of repeated vocabulary, whether the sentences are not fluent, whether the context is unrelated, and so on. Structure features may represent the distribution of the parts of the page, for example: where the title is located on the page, whether the parts of the page are spliced together, whether the page length exceeds a preset length threshold, whether the distribution between the main frame that displays the body text and the other functional frames is reasonable, whether advertisements are distributed reasonably, whether advertisements cover the page body or title, and so on.
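A few of the text and structure features listed above can be approximated with simple checks. The sketch below is illustrative only: the function name, the feature names, the whitespace tokenization, and the length threshold are all assumptions, and fluency or layout analysis is not attempted.

```python
from collections import Counter


def extract_page_features(title, body, banned_words, max_length=5000):
    """Return a small dict of illustrative text and structure features."""
    words = body.lower().split()
    counts = Counter(words)
    # Frequency of the single most repeated word, as a crude repetition signal.
    top_count = counts.most_common(1)[0][1] if words else 0
    return {
        # Text feature: does the body contain any prohibited word?
        "has_banned_word": any(word in banned_words for word in words),
        # Text feature: share of the body taken up by its most frequent word.
        "top_word_ratio": top_count / len(words) if words else 0.0,
        # Text feature: is the title related to (here: contained in) the body?
        "title_in_body": title.lower() in body.lower(),
        # Structure feature: does the page exceed a preset length threshold?
        "too_long": len(body) > max_length,
    }


features = extract_page_features(
    "Cheap tickets", "win win win win prize now", banned_words={"prize"}
)
```

Real pages would need HTML parsing and language-aware tokenization; the point here is only that each feature reduces to a value that can later be placed in a vector.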
In practice, the page features may consist of text features only, structure features only, or both. Likewise, in this embodiment page features may be extracted only from the search result pages in the retrieval log, only from the accessed pages in the access log, or from both; a person skilled in the art can choose according to the needs of the actual scenario. Because a retrieval log or access log may contain a very large number of search result pages or accessed pages, a subset of them may also be selected at random for page feature extraction. The embodiments of this application do not limit this.
In practice, the known cheating website may be determined in an exemplary manner through the following steps A1 to A3:
Step A1: obtain a set of websites for which it is to be determined whether they cheat.
In this embodiment, to detect whether websites cheat, the websites in the set can first be clustered, and the resulting classes of websites can then be annotated manually, that is, each class is marked as cheating websites or normal websites. This yields a detection result for each website.
Step A2: cluster the websites in the set to obtain classes of websites.
For each website obtained in step A1, the features of each web page under the website are extracted according to the web pages the website includes and the search terms corresponding to those pages, and are converted into corresponding feature vectors. The process of extracting web page features and converting them into feature vectors is described in detail in step 102 below.
After the feature vectors of the web pages under each website are obtained, these feature vectors can be clustered. Clustering is the process of dividing a set of physical or abstract objects into multiple classes made up of similar objects. For example, this step may use the k-means clustering algorithm. In practice, the feature vectors of the web pages under cheating websites generally fall into one class, and the feature vectors of the web pages under normal websites fall into another.
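Step A2 names k-means as one possible clustering algorithm. The following is a minimal pure-Python sketch of k-means on toy two-dimensional vectors. The data, the fixed iteration count, and the deterministic choice of the first k points as initial centroids are assumptions made for illustration, not details from the application.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Assumption: use the first k points as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy feature vectors for four websites (hypothetical values).
site_vectors = [(0.9, 0.8), (0.1, 0.2), (0.85, 0.9), (0.15, 0.1)]
labels, centroids = kmeans(site_vectors, k=2)
```

The cluster indices carry no meaning by themselves; which cluster corresponds to cheating websites is decided by the manual annotation in step A3.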
Step A3: determine the websites whose manual annotation result is "cheating" as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
After the two classes of websites are obtained by clustering, they can be annotated manually, that is, a person marks which class consists of cheating websites and which consists of normal websites. Each website is then determined to be a cheating website or a normal website according to the manual annotation, and the websites annotated as cheating are used as the known cheating websites in step 101.
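Step A3 amounts to propagating a per-class manual label to every website in that class. A minimal sketch follows, in which the site names, the cluster indices, and the label strings are hypothetical.

```python
def known_cheating_sites(site_to_cluster, cluster_annotation):
    """Return the sites whose cluster was manually annotated as cheating."""
    return sorted(
        site
        for site, cluster in site_to_cluster.items()
        if cluster_annotation.get(cluster) == "cheating"
    )


# Hypothetical clustering output and manual annotation of the two classes.
site_to_cluster = {"a.example": 0, "b.example": 1, "c.example": 0}
cluster_annotation = {0: "cheating", 1: "normal"}

known = known_cheating_sites(site_to_cluster, cluster_annotation)
```

Annotating classes instead of individual websites is what keeps the manual effort small: one decision per cluster labels every member of that cluster.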
In addition to clustering unknown websites to determine known cheating websites as described above, the cheating websites detected in the subsequent step 103 may also be used as known cheating websites, so that page features can be extracted from the newly known cheating websites and the cheating detection model built in step 102 can be further updated. The cheating websites detected in step 103 may also be verified by manual annotation; if the manual annotation of a website detected in step 103 is also "cheating", that website is used as a known cheating website to further update the established cheating detection model.
After the page features of the known cheating website are extracted in step 101, the flow proceeds to step 102:
Step 102: build a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats.
Based on the cheating rules of the known cheating website indicated by the page features obtained in step 101, a classifier can be trained with a supervised machine learning method and used as the cheating detection model, which detects whether other websites are cheating websites.
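As one concrete illustration of a supervised classifier, the sketch below implements a weighted k-nearest-neighbor vote in pure Python; kNN is among the methods the description names for step B2, which also suggests giving access feature vectors a larger weight. The vectors, the weights, and the presence of labeled "normal" examples are assumptions for illustration; the application does not specify them.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=3):
    """Classify query by a weighted vote of its k nearest training rows.

    train is a list of (vector, label, weight) tuples.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = defaultdict(float)
    for _, label, weight in nearest:
        votes[label] += weight
    return max(votes, key=votes.get)


# Assumed weights: access feature vectors count more than retrieval ones.
ACCESS_WEIGHT, RETRIEVAL_WEIGHT = 2.0, 1.0

# Hypothetical training rows; the "normal" rows are an added assumption so
# that the classifier has two classes to separate.
train = [
    ((0.9, 0.8), "cheating", ACCESS_WEIGHT),
    ((0.8, 0.9), "cheating", RETRIEVAL_WEIGHT),
    ((0.1, 0.2), "normal", RETRIEVAL_WEIGHT),
    ((0.2, 0.1), "normal", RETRIEVAL_WEIGHT),
]

prediction = weighted_knn_predict(train, (0.85, 0.85))
```

A production system would more likely rely on an established machine learning library; the hand-written version is used here only to keep the example self-contained.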
Specifically, step 102 may be implemented through the following steps B1 to B2:
Step B1: convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively.
In this step, the page features of the search result pages and the page features of the accessed pages can first be converted into the retrieval feature vector and the access feature vector of the known cheating website, respectively. In practice, the model may be built from the retrieval feature vector only, from the access feature vector only, or from both.
Specifically, when converting to feature vectors, the characteristic value of each page feature in each web page under the website can be obtained first. Statistics are then computed to obtain, for each page feature, the range that its characteristic values span across all pages under the known cheating website. Take the structure feature "page length" as an example: its value in web page A is the characteristic value of page length for web page A, its value in web page B is the characteristic value of page length for web page B, and so on, until the characteristic value of page length has been obtained for every web page under the website.
The characteristic values of page length across the web pages are then analyzed to determine the range they span under the known cheating website. Suppose the statistics show that the maximum characteristic value is 1024 pixels and the minimum is 268 pixels. The range of page length across the web pages under the known cheating website is then 268 pixels to 1024 pixels. This range is converted into binary values that a computer can process, which gives the corresponding vector values; for example, the corresponding vector value range might be 000100 to 111111.
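Determining the characteristic value range of a page feature is a minimum and maximum computation over all pages. The sketch below reproduces the 268 to 1024 pixel example; the page names and the intermediate value 640 are hypothetical. The mapping from this range to the binary range quoted in the text is not specified by the application, so it is not reproduced here.

```python
def feature_range(values):
    """Return (minimum, maximum) of one page feature across all pages."""
    values = list(values)
    return min(values), max(values)


# Page length in pixels per web page; only 268 and 1024 come from the text.
page_lengths = {"A": 268, "B": 640, "C": 1024}

low, high = feature_range(page_lengths.values())
```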
Suppose a total of 100 page features are found across all web pages under the known cheating website. For a numeric page feature, such as page length with values like 268 or 1024, the values of that feature over all web pages under the known cheating website can be added together, and the sum is used as the characteristic value of that page feature for the website. Converting this characteristic value into a binary number yields the n-th vector value of a 1×N feature vector, where N is an integer greater than zero. For example, if the first vector value of the feature vector of the known cheating website corresponds to the page length feature, and the page lengths of all web pages sum to 8534, then the first vector value is the binary representation of 8534.
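The summation described above can be checked with a few lines of code. In this sketch the individual page lengths are hypothetical values chosen so that they add up to the 8534 used in the example; only the sum itself comes from the text.

```python
def summed_feature_as_binary(values):
    """Sum a numeric page feature over all pages and render it in binary."""
    total = sum(values)
    return total, format(total, "b")


# Hypothetical page lengths whose sum matches the example value 8534.
page_lengths = [268, 1024, 3000, 4242]

total, bits = summed_feature_as_binary(page_lengths)
```

Here `format(total, "b")` produces the plain base-2 digits without the `0b` prefix that `bin()` would add.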
For a non-numeric page feature, such as image clarity, whose characteristic values are "clear", "ordinary", and "unclear", a person skilled in the art can use the numeric codes 2, 1, and 0 to represent the three values, respectively. For example, suppose the vector values of the known cheating website's feature vector starting from the second represent the image clarity feature, and there are 5 web pages A to E under the website. The second to sixth vector values then represent the image clarity of these 5 web pages. If the second to sixth vector values are {0, 2, 1, 1, 2} and the preset page order is the alphabetical order A to E, then the clarity of web page A is unclear, that of web page B is clear, that of web page C is ordinary, that of web page D is ordinary, and that of web page E is clear.
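The encoding of the image clarity feature can be sketched as a lookup followed by a fixed page ordering. The dictionary and function names below are illustrative; the codes 2, 1, 0 and the pages A to E follow the example in the text.

```python
# Numeric codes for the three clarity values, as in the example.
CLARITY_CODES = {"clear": 2, "ordinary": 1, "unclear": 0}


def encode_clarity(clarity_by_page):
    """Encode per-page clarity in alphabetical page order."""
    return [CLARITY_CODES[clarity_by_page[page]] for page in sorted(clarity_by_page)]


clarity = {
    "A": "unclear",
    "B": "clear",
    "C": "ordinary",
    "D": "ordinary",
    "E": "clear",
}

encoded = encode_clarity(clarity)
```

Sorting the page names fixes the position of each page in the vector, which is what allows the values to be read back as "A is unclear, B is clear" and so on.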
And so on, whether the characteristic value according to the page feature of each webpage under known cheating website is numeric type, with
And under known cheating website webpage quantity, to finally obtain the feature vector of the corresponding 1*N dimension of known cheating website.
Certainly, aforesaid way is merely exemplary content, should not be construed as the restriction of the embodiment of the present application.
In practical applications, the number of page features in the feature vector of each webpage directly affects the accuracy and speed of model training. A feature vector generated by the above method can include only the important page features, so its lower dimension can effectively improve the efficiency of subsequent training and retrieval.
Step B2: build the cheating detection model according to the retrieval feature vectors and/or the access feature vectors.
After the page features of each webpage under the known cheating websites are converted into feature vectors, two groups of training data are available: the retrieval feature vector set, formed by the retrieval feature vectors of the known cheating websites, and the access feature vector set, formed by their access feature vectors. When building the cheating detection model, two models may be built, one from each set, or a single model may be built from both sets.
When a single cheating detection model is built, the access feature vectors in the access feature vector set may be given a larger weight, because accessed pages are pages that users clicked to view. Each access feature vector and each retrieval feature vector is used as an input object, the detection result is used as the desired output value (also called the supervisory signal), and a supervised machine learning method such as kNN (k-nearest neighbors) or a support vector machine (SVM) is used to train the cheating webpage detection model. The desired output value may be, for example, "the detection result is cheating" or "the probability that the detection result is cheating is 100%".
Steps 101 to 102 above are the process of building the cheating detection model in the embodiments of the present application. After the model has been built, step 103 below is performed whenever another website needs to be checked for cheating.
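The single-model construction of step B2, in which access feature vectors carry a larger weight than retrieval feature vectors, can be sketched with a small weighted kNN. This is an illustrative sketch, not the implementation of the application: the weight value 2.0, the function names and the toy vectors are assumptions, and the sketch also assumes that labeled non-cheating samples are available for comparison, which the text does not state.

```python
import math
from typing import List, Sequence, Tuple

# One training sample: (feature vector, label, weight); label 1 = cheating, 0 = not cheating.
Sample = Tuple[Sequence[float], int, float]


def make_samples(retrieval_vecs, access_vecs, label: int,
                 access_weight: float = 2.0) -> List[Sample]:
    """Merge both vector sets; access vectors get the larger (assumed) weight."""
    samples: List[Sample] = [(v, label, 1.0) for v in retrieval_vecs]
    samples += [(v, label, access_weight) for v in access_vecs]
    return samples


def cheating_probability(samples: List[Sample], query: Sequence[float],
                         k: int = 3) -> float:
    """Weighted share of 'cheating' votes among the k nearest training samples."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    total = sum(weight for _, _, weight in nearest)
    cheating = sum(weight for _, label, weight in nearest if label == 1)
    return cheating / total


# Toy data: one retrieval vector and one access vector from a known cheating site,
# plus two retrieval vectors from sites assumed to be non-cheating.
model = make_samples([[9.0, 1.0]], [[8.0, 2.0]], label=1)
model += make_samples([[1.0, 8.0], [2.0, 9.0]], [], label=0)

p = cheating_probability(model, [8.5, 1.5], k=3)
print(p)  # 0.75
```

In the example the three nearest samples carry weights 1.0 and 2.0 (cheating) and 1.0 (non-cheating), so the weighted vote is 3.0 out of 4.0.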
Step 103: detect whether the website to be detected is cheating according to the cheating detection model.
After the cheating webpage detection model has been trained, it can be used to detect whether other websites to be detected are cheating.
Specifically, the process of detecting whether the website to be detected is cheating may include steps C1 to C3:
Step C1: obtain the pages to be detected of the website to be detected.
First, every page under the website to be detected is obtained, or some of its pages are obtained at random, as the pages to be detected. In practical applications, all pages under the website may be detected; if there are too many pages, a portion of them, for example 60%, may be extracted as the pages to be detected. The percentage to extract may be set by those skilled in the art.
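Step C1 can be sketched as a simple sampling routine. The function name, the `max_pages` cut-off that decides when there are "too many" pages, and the `seed` parameter are assumptions for illustration; only the 60% figure comes from the text.

```python
import random
from typing import List, Optional


def select_pages_to_detect(pages: List[str], max_pages: int = 1000,
                           fraction: float = 0.6,
                           seed: Optional[int] = None) -> List[str]:
    """Return all pages, or a random fraction of them when there are too many."""
    if len(pages) <= max_pages:
        return list(pages)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sample_size = max(1, round(len(pages) * fraction))
    return rng.sample(pages, sample_size)


all_pages = [f"/page{i}" for i in range(10)]
everything = select_pages_to_detect(all_pages)                  # small site: keep all
subset = select_pages_to_detect(all_pages, max_pages=5, seed=0)  # large site: 60%
print(len(everything), len(subset))  # 10 6
```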
Step C2: extract the page features to be detected from the pages to be detected, and convert them into the feature vector to be detected of the website to be detected.
Next, the page features of the pages to be detected, which may include text features, structure features and so on, are extracted and converted into the feature vector to be detected of the website to be detected. For the extraction of page features, refer to the description of step 101; for the conversion into a feature vector, refer to the description of step B1. Details are not repeated here.
Step C3: detect whether the website to be detected is a cheating website according to whether the feature vector to be detected conforms to the page cheating rule.
The feature vector to be detected obtained in step C2 is then used as input to the cheating detection model built in step 102, and the model outputs the result, namely whether the website to be detected is a cheating website. In practical applications, depending on the supervised method used, the model may directly output whether the website is a cheating website, or it may output a predicted probability that the website is a cheating website, for example 80%. In the latter case, those skilled in the art may preset a probability threshold, for example 70%; if the probability output by the model is greater than the threshold, the website to be detected is confirmed to be a cheating website.
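The decision in step C3 can be sketched as follows. The function name is hypothetical; the sketch assumes the model returns either a boolean verdict or a probability in the range 0 to 1, and uses the 70% threshold from the example with a strict "greater than" comparison, as the text states.

```python
from typing import Union


def is_cheating_site(model_output: Union[bool, float],
                     threshold: float = 0.70) -> bool:
    """Interpret the model output as a final cheating / not-cheating verdict."""
    if isinstance(model_output, bool):
        return model_output                  # the model already gave a verdict
    return float(model_output) > threshold   # probability must exceed the threshold


print(is_cheating_site(0.80))   # True: 80% is greater than the 70% threshold
print(is_cheating_site(0.65))   # False
print(is_cheating_site(True))   # True
```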
After step 103 has detected whether other websites are cheating, step 104 below may optionally be performed.
Step 104: demote or delete the website to be detected whose detection result is cheating.
In this embodiment, if the website to be detected is a cheating website, the website may be demoted, or every page under it may be deleted directly, to reduce the likelihood that users retrieve its pages.
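Step 104 can be sketched against a toy index. The index layout (page URL mapped to a ranking score), the multiplicative `penalty` and the function name are all assumptions made for illustration; the text does not specify how demotion is carried out.

```python
from typing import Dict
from urllib.parse import urlsplit


def handle_cheating_site(index: Dict[str, float], site_host: str,
                         action: str = "demote",
                         penalty: float = 0.1) -> Dict[str, float]:
    """Return a new index in which the pages of `site_host` are demoted or deleted."""
    if action not in ("demote", "delete"):
        raise ValueError("action must be 'demote' or 'delete'")
    result: Dict[str, float] = {}
    for url, score in index.items():
        if urlsplit(url).hostname != site_host:
            result[url] = score            # pages of other sites are untouched
        elif action == "demote":
            result[url] = score * penalty  # lower the ranking score
        # action == "delete": the page is left out of the new index
    return result


index = {
    "http://spam.example/a": 0.9,
    "http://spam.example/b": 0.5,
    "http://good.example/": 0.8,
}
demoted = handle_cheating_site(index, "spam.example", action="demote")
deleted = handle_cheating_site(index, "spam.example", action="delete")
print(sorted(deleted))  # ['http://good.example/']
```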
In the embodiments of the present application, the features of all or some of the pages under a known cheating website are extracted from the retrieval log and/or access log saved by the search engine, and the cheating detection model is built according to the cheating features that the extracted page features represent. Because the model reflects how cheating websites cheat on their pages, detecting other websites on the basis of it is more accurate. In addition, because the model is built from the retrieval logs and access logs of users of the search engine, a model built from the user's perspective better captures the uniformity and representativeness of cheating websites.
For simplicity of description, the foregoing method embodiments are expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Corresponding to the method provided by the above embodiment of the method for detecting cheating websites, and referring to Fig. 2, the present application further provides an embodiment of an apparatus for detecting cheating websites. In this embodiment, the apparatus may include:
An extraction unit 201, configured to extract page features of pages under a known cheating website from the retrieval log and/or access log of the known cheating website.
The extraction unit 201 may include:
An obtaining subunit, configured to obtain the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses of each accessed page; and an extraction subunit, configured to extract text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extraction subunit may include:
An information extraction subunit, configured to extract body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and a structure extraction subunit, configured to extract body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
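What the two subunits extract can be sketched with Python's standard `html.parser` module. The text does not say which concrete structure features are used, so the tag counts below (links and headings inside the body) are assumptions, as are the class name, the function name and the sample HTML.

```python
from collections import Counter
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collect title text, body text and simple tag counts from one page."""

    def __init__(self):
        super().__init__()
        self.title_text = ""
        self.body_text = []
        self.tag_counts = Counter()
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif self._in_body:
            self.tag_counts[tag] += 1  # structure: count tags inside the body

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data
        elif self._in_body and data.strip():
            self.body_text.append(data.strip())


def extract_page_features(html: str) -> dict:
    """Return text features and (assumed) structure features of one page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    headings = ("h1", "h2", "h3", "h4", "h5", "h6")
    return {
        "title_text": parser.title_text.strip(),
        "body_text": " ".join(parser.body_text),
        "link_count": parser.tag_counts["a"],
        "heading_count": sum(parser.tag_counts[h] for h in headings),
    }


sample_html = (
    "<html><head><title>Cheap pills</title></head><body><h1>Buy now</h1>"
    "<p>Best <a href='x'>deal</a> <a href='y'>here</a></p></body></html>"
)
features = extract_page_features(sample_html)
print(features)
```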
The known cheating website may be determined as follows:
Obtain a set of websites for which it is to be determined whether they are cheating; cluster the websites in the set to obtain classes of websites; and determine the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
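The cluster-then-annotate procedure can be sketched as follows. The text does not name a clustering algorithm, so the greedy "leader" clustering, the `radius` parameter, the feature vectors and the function names are all assumptions; the manual annotation is represented by one boolean per cluster supplied by a human.

```python
import math
from typing import Dict, List, Sequence


def leader_cluster(site_vectors: Dict[str, Sequence[float]],
                   radius: float) -> List[List[str]]:
    """Greedy clustering: a site joins the first cluster whose leader is within `radius`."""
    clusters: List[List[str]] = []
    for site in sorted(site_vectors):
        for cluster in clusters:
            leader = cluster[0]
            if math.dist(site_vectors[site], site_vectors[leader]) <= radius:
                cluster.append(site)
                break
        else:
            clusters.append([site])  # no close leader found: start a new cluster
    return clusters


def known_cheating_sites(clusters: List[List[str]],
                         cluster_is_cheating: List[bool]) -> List[str]:
    """Keep the sites of every cluster that the annotator marked as cheating."""
    return [site
            for cluster, flag in zip(clusters, cluster_is_cheating) if flag
            for site in cluster]


sites = {
    "a.example": [0.9, 0.1],
    "b.example": [0.8, 0.2],
    "c.example": [0.1, 0.9],
}
clusters = leader_cluster(sites, radius=0.5)
known = known_cheating_sites(clusters, [True, False])  # labels given by a human
print(clusters)  # [['a.example', 'b.example'], ['c.example']]
print(known)     # ['a.example', 'b.example']
```

Annotating one label per cluster rather than per site is what keeps the manual effort small in this sketch.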
A model building unit 202, configured to build a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating.
The model building unit 202 may include:
A conversion subunit, configured to convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively; and a building subunit, configured to build the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
A detection unit 203, configured to detect whether a website to be detected is cheating according to the cheating detection model.
The detection unit 203 may include:
An obtaining subunit, configured to obtain the pages to be detected of the website to be detected; an extraction subunit, configured to extract the page features to be detected from the pages to be detected and convert them into the feature vector to be detected of the website to be detected; and a detection subunit, configured to detect whether the website to be detected is a cheating website according to whether the feature vector to be detected conforms to the page cheating rule.
The apparatus may further include:
A cheating processing unit 204, configured to demote or delete the website to be detected whose detection result is cheating.
As can be seen, in the embodiments of the present application, the features of all or some of the pages under a known cheating website are extracted from the retrieval log and/or access log saved by the search engine, and the cheating detection model is built according to the cheating rule that the extracted page features represent. Because the model reflects the cheating rules of cheating websites, other websites can be detected more accurately on the basis of it. In addition, because the model is built from the retrieval logs and access logs of users of the search engine, a model built from the user's perspective is more uniform and representative, and can also accurately detect unknown types of cheating.
As for the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 3 is a block diagram of an apparatus 800 for detecting cheating websites according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 3, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or some of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the apparatus 800. Examples of such data include instructions for any application or method operating on the apparatus 800, contact data, phonebook data, messages, pictures, videos and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power component 806 supplies power to the various components of the apparatus 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the apparatus 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the apparatus 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when the apparatus 800 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and so on. The buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor component 814 may detect the open/closed state of the apparatus 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor component 814 may also detect a change in position of the apparatus 800 or of one of its components, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is further provided, for example the memory 804 including instructions, which can be executed by the processor 820 of the apparatus 800 to perform the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device or the like.
A non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method for detecting cheating websites, the method comprising: extracting page features of pages under a known cheating website from the retrieval log and/or access log of the known cheating website; building a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating; and detecting whether a website to be detected is cheating according to the cheating detection model.
The extracting of page features of pages under the known cheating website from the retrieval log and/or access log of the known cheating website may specifically include:
Obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses of each accessed page; and
Extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extracting of text features and/or structure features of the search result pages and/or accessed pages as the page features may specifically include:
Extracting body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
Extracting body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
The building of the cheating detection model according to the cheating rule represented by the page features may specifically include:
Converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively; and
Building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
The detecting of whether the website to be detected is cheating according to the cheating detection model may specifically include:
Obtaining the pages to be detected of the website to be detected;
Extracting the page features to be detected from the pages to be detected, and converting them into the feature vector to be detected of the website to be detected; and
Detecting whether the website to be detected is a cheating website according to whether the feature vector to be detected conforms to the page cheating rule.
The known cheating website may be determined as follows:
Obtaining a set of websites for which it is to be determined whether they are cheating;
Clustering the websites in the set to obtain classes of websites; and
Determining the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
The apparatus 800 may further be configured so that the one or more programs, executed by one or more processors, include instructions for performing the following operation:
Demoting or deleting the website to be detected whose detection result is cheating.
Fig. 4 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. A program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The present invention is intended to cover any variations, uses or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for detecting cheating websites, comprising:
extracting page features of pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
building a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating; and
detecting whether a website to be detected is cheating according to the cheating detection model.
2. The method according to claim 1, wherein the extracting of page features of pages under the known cheating website from the retrieval log and/or access log of the known cheating website comprises:
obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses of each accessed page; and
extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
3. The method according to claim 2, wherein the extracting of text features and/or structure features of the search result pages and/or accessed pages as the page features comprises:
extracting body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
extracting body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
4. The method according to claim 3, wherein the building of the cheating detection model according to the cheating rule represented by the page features comprises:
converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively; and
building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
5. The method according to claim 4, wherein the detecting of whether the website to be detected is cheating according to the cheating detection model comprises:
obtaining the pages to be detected of the website to be detected;
extracting the page features to be detected from the pages to be detected, and converting them into the feature vector to be detected of the website to be detected; and
detecting whether the website to be detected is a cheating website according to whether the feature vector to be detected conforms to the page cheating rule.
6. The method according to claim 1, wherein the known cheating website is determined as follows:
obtaining a set of websites for which it is to be determined whether they are cheating;
clustering the websites in the set to obtain classes of websites; and
determining the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
7. The method according to claim 1, further comprising:
demoting or deleting the website to be detected whose detection result is cheating.
8. An apparatus for detecting cheating websites, comprising:
an extraction unit, configured to extract page features of pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
a model building unit, configured to build a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating; and
a detection unit, configured to detect whether a website to be detected is cheating according to the cheating detection model.
9. An apparatus for detecting cheating websites, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
extracting page features of pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
building a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating; and
detecting whether a website to be detected is cheating according to the cheating detection model.
10. A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the method for detecting cheating websites according to one or more of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710576240.XA CN110147472B (en) | 2017-07-14 | 2017-07-14 | Detection method and device for cheating sites and detection device for cheating sites |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110147472A (en) | 2019-08-20 |
CN110147472B (en) | 2021-10-15 |
Family
ID=67588038
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710576240.XA Active CN110147472B (en) | 2017-07-14 | 2017-07-14 | Detection method and device for cheating sites and detection device for cheating sites |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110147472B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101093510A (en) * | 2007-07-25 | 2007-12-26 | 北京搜狗科技发展有限公司 | Anti cheating method and system for aiming at cheat on web page |
CN101350011A (en) * | 2007-07-18 | 2009-01-21 | 中国科学院自动化研究所 | Method for detecting search engine cheat based on small sample set |
CN101777053A (en) * | 2009-01-08 | 2010-07-14 | 北京搜狗科技发展有限公司 | Method and system for identifying cheating webpages |
CN102243659A (en) * | 2011-07-18 | 2011-11-16 | 南京邮电大学 | Webpage junk detection method based on dynamic Bayesian model |
CN103064984A (en) * | 2013-01-25 | 2013-04-24 | 清华大学 | Spam webpage identifying method and spam webpage identifying system |
CN103150369A (en) * | 2013-03-07 | 2013-06-12 | 人民搜索网络股份公司 | Method and device for identifying cheat web-pages |
WO2016101737A1 (en) * | 2014-12-22 | 2016-06-30 | 北京奇虎科技有限公司 | Search query method and apparatus |
CN106326498A (en) * | 2016-10-13 | 2017-01-11 | 合网络技术(北京)有限公司 | Cheat video identification method and device |
-
2017
- 2017-07-14 CN CN201710576240.XA patent/CN110147472B/en active Active
Non-Patent Citations (1)
Title |
---|
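唐寿洪: "Research on web page cheating detection technology based on ant colony optimization", China Master's Theses Full-text Database, Information Science and Technology * |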
Also Published As
Publication number | Publication date |
---|---|
CN110147472B (en) | 2021-10-15 |
Similar Documents
Publication | Title |
---|---|
CN108009521B (en) | Face image matching method, device, terminal and storage medium |
CN108875781B (en) | Label classification method and device, electronic equipment and storage medium |
CN105488025B (en) | Template construction method and device, information identifying method and device |
CN106708282B (en) | A kind of recommended method and device, a kind of device for recommendation |
CN109993125A (en) | Model training method, face identification method, device, equipment and storage medium |
CN109871896A (en) | Data classification method, device, electronic equipment and storage medium |
CN109389162B (en) | Sample image screening technique and device, electronic equipment and storage medium |
CN109493852A (en) | A kind of evaluating method and device of speech recognition |
KR20210131211A (en) | Method for training a voiceprint extraction model and method for voiceprint recognition, and device and medium thereof, program |
CN113792207B (en) | Cross-modal retrieval method based on multi-level feature representation alignment |
CN108121736A (en) | A kind of descriptor determines the method for building up, device and electronic equipment of model |
CN107330019A (en) | Searching method and device |
CN109359056A (en) | A kind of applied program testing method and device |
CN109933714A (en) | A kind of calculation method, searching method and the relevant apparatus of entry weight |
CN107666536A (en) | A kind of method and apparatus for finding terminal, a kind of device for being used to find terminal |
CN111984749A (en) | Method and device for ordering interest points |
CN109471919A (en) | Empty anaphora resolution method and device |
CN110069624A (en) | Text handling method and device |
CN112784142A (en) | Information recommendation method and device |
CN108733718A (en) | Display methods, device and the display device for search result of search result |
WO2023029397A1 (en) | Training data acquisition method, abnormal behavior recognition network training method and apparatus, computer device, storage medium, computer program and computer program product |
CN110110207A (en) | A kind of information recommendation method, device and electronic equipment |
CN109521888A (en) | A kind of input method, device and medium |
CN110377808A (en) | Document processing method, device, electronic equipment and storage medium |
CN113919361A (en) | Text classification method and device |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |