CN110147472A - Method, apparatus, and device for detecting cheating websites - Google Patents

Method, apparatus, and device for detecting cheating websites

Info

Publication number
CN110147472A
Authority
CN
China
Prior art keywords
website
cheating
page
feature
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710576240.XA
Other languages
Chinese (zh)
Other versions
CN110147472B (en)
Inventor
李健
李毅
许静芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201710576240.XA
Publication of CN110147472A
Application granted
Publication of CN110147472B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

This application provides a method, an apparatus, and a device for detecting cheating websites. The detection method includes: extracting page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website; building a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and detecting, according to the cheating detection model, whether a website to be detected cheats. The embodiments of this application can improve the accuracy of cheating detection results for websites.

Description

Method, apparatus, and device for detecting cheating websites
Technical field
This application relates to the technical field of website detection, and in particular to a method and an apparatus for detecting cheating websites, a device for detecting cheating websites, and a computer-readable medium.
Background
Currently, as users use the Internet more and more frequently, website cheating is also becoming more and more common. Website cheating refers to a website manipulating search results so that web pages that do not belong in a user's query results nevertheless appear in them. In general, website cheating can be divided mainly into content-based cheating, link cheating, crawler-deception cheating, and so on.
In the prior art, each web page under a website is generally analyzed, and whether the website cheats is determined according to the analysis results.
Summary of the invention
In the course of research, the inventors found that, when analyzing web pages, the prior art relies on manually identified cheating techniques used by cheating websites. If the web pages under a website use a cheating technique that has not been analyzed, the prior art cannot judge accurately enough whether that website cheats. Moreover, web pages are generally selected for analysis by random sampling, so a large number of unrepresentative web pages may be used as analysis objects, which leaves the cheating web page models trained in the prior art with insufficient precision and recall.
The inventors also found that, for a known cheating website, the historical search records of a search engine can be used, including the retrieval log that records how web pages under the website were retrieved and the access log that records how they were accessed. Information such as the search result pages retrieved under the known cheating website, the access frequency of accessed pages, and the corresponding search terms can then be used to build a cheating detection model. Such a model can reflect the cheating rules of cheating websites and can therefore detect cheating on other websites more accurately. In addition, because the model is built from users' retrieval logs and access logs in the search engine, that is, from the user's perspective, it is more consistent and representative.
Based on this, this application provides a method for detecting cheating websites, which may include:
extracting, from the retrieval log and/or access log of a known cheating website, page features of the pages under the known cheating website;
building a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
detecting, according to the cheating detection model, whether a website to be detected cheats.
Extracting, from the retrieval log and/or access log of the known cheating website, the page features of the pages under the known cheating website may include:
obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the access count of each accessed page; and
extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
Extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features may include:
extracting body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
extracting body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
Building the cheating detection model according to the cheating rules indicated by the page features may include:
converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
Detecting, according to the cheating detection model, whether the website to be detected cheats may include:
obtaining the pages to be detected of the website to be detected;
extracting the page features of the pages to be detected, and converting them into a to-be-detected feature vector of the website to be detected; and
detecting whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain clustered classes of websites; and
determining the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
The method may further include:
demoting or deleting a website to be detected whose detection result is cheating.
This application also provides an apparatus to ensure the implementation and application of the above method in practice.
An apparatus for detecting cheating websites provided by the embodiments of this application includes:
an extraction unit, configured to extract, from the retrieval log and/or access log of a known cheating website, page features of the pages under the known cheating website;
a model construction unit, configured to build a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
a detection unit, configured to detect, according to the cheating detection model, whether a website to be detected cheats.
The extraction unit may include:
an obtaining subunit, configured to obtain the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the access count of each accessed page; and
an extraction subunit, configured to extract text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extraction subunit may include:
an information extraction subunit, configured to extract body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
a structure extraction subunit, configured to extract body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
The model construction unit may include:
a conversion subunit, configured to convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
a construction subunit, configured to build the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
The detection unit may include:
an obtaining subunit, configured to obtain the pages to be detected of the website to be detected;
an extraction subunit, configured to extract the page features of the pages to be detected and convert them into a to-be-detected feature vector of the website to be detected; and
a detection subunit, configured to detect whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain clustered classes of websites; and
determining the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
The apparatus may further include:
a cheating processing unit, configured to demote or delete a website to be detected whose detection result is cheating.
This application also provides a device for detecting cheating websites, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:
extracting, from the retrieval log and/or access log of a known cheating website, page features of the pages under the known cheating website;
building a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats; and
detecting, according to the cheating detection model, whether a website to be detected cheats.
This application also provides a computer-readable medium storing instructions that, when executed by one or more processors, cause a device to perform one or more of the cheating website detection methods described above.
Extracting, from the retrieval log and/or access log of the known cheating website, the page features of the pages under the known cheating website may specifically include:
obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the access count of each accessed page; and
extracting text features and/or structure features of the search result pages and/or accessed pages as the page features.
Extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features may specifically include:
extracting body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
extracting body structure features and title structure features of each page from the search result pages and/or accessed pages as the structure features.
Building the cheating detection model according to the cheating rules indicated by the page features may specifically include:
converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively; and
building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
Detecting, according to the cheating detection model, whether the website to be detected cheats may specifically include:
obtaining the pages to be detected of the website to be detected;
extracting the page features of the pages to be detected, and converting them into a to-be-detected feature vector of the website to be detected; and
detecting whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rules.
The known cheating website may be determined as follows:
obtaining a set of websites for which it is to be determined whether they cheat;
clustering the websites in the set to obtain clustered classes of websites; and
determining the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
The device may also be configured such that the one or more programs executed by the one or more processors include instructions for performing the following operation:
demoting or deleting a website to be detected whose detection result is cheating.
In the embodiments of this application, for a known cheating website, the page features of all or some of the pages under the known cheating website are extracted from the retrieval log and/or access log saved by the search engine, and a cheating detection model is built according to the cheating features indicated by the extracted page features. Therefore, when a website to be detected is checked for cheating based on the cheating detection model, the model reflects the cheating features that cheating websites exhibit on their pages, so other websites can be checked for cheating more accurately. In addition, because the cheating detection model is built from users' retrieval logs and access logs in the search engine, the model established from the user's perspective better captures the consistency and representativeness of cheating websites.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an exemplary flowchart of an embodiment of the cheating website detection method of this application;
Fig. 2 is an exemplary structural block diagram of an embodiment of the cheating website detection apparatus of this application;
Fig. 3 is a block diagram of a device 800 for detecting cheating websites according to an exemplary embodiment of this application;
Fig. 4 is a schematic structural diagram of a server in an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
This application can be used in numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor devices, and distributed computing environments that include any of the above devices.
This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media, including storage devices.
Referring to Fig. 1, a flowchart of an embodiment of the cheating website detection method of this application is shown. The detection flow in this embodiment may include a process of building the cheating model and a website detection process. The model building process includes steps 101 to 102 and the website detection process includes step 103; the overall flow of this embodiment includes the following steps 101 to 104:
Step 101: extract, from the retrieval log and/or access log of a known cheating website, page features of the pages under the known cheating website.
In this embodiment, when the cheating website is known, the retrieval log, access log, and so on of the cheating website stored in a database can be used to extract the page features of all or some of the pages under the cheating website, so that these page features can later be used to build the cheating detection model.
Specifically, assume that the database of a search engine stores all retrieval logs and access logs of the website "www.ABCD.com".
The retrieval log records how each page under the website was retrieved by users; it may include the search term entered in each user search and the corresponding search result pages.
The access log records the pages a user clicks to access after each search, or the pages a user accesses by other means such as page recommendations. The access log may include information on the pages accessed by users and the access count of each accessed page. For example, the page "www.ABCD.com/890/hty" under the website "www.ABCD.com" was accessed 10 times, the page "www.ABCD.com/855555/ef" under the same website was accessed 100 times, and so on.
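The two logs described above can be pictured as simple mappings. The following is only a minimal sketch: the class names `RetrievalLog` and `AccessLog`, their fields, and the sample search term are assumptions made for illustration, while the two access counts come from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievalLog:
    """Hypothetical container: search term -> search result pages returned for it."""
    results_by_term: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class AccessLog:
    """Hypothetical container: accessed page URL -> number of accesses."""
    access_counts: Dict[str, int] = field(default_factory=dict)

    def record(self, url: str, times: int = 1) -> None:
        # Accumulate the access count for one page.
        self.access_counts[url] = self.access_counts.get(url, 0) + times


# Access counts taken from the example in the text.
access_log = AccessLog()
access_log.record("www.ABCD.com/890/hty", 10)
access_log.record("www.ABCD.com/855555/ef", 100)

# The search term below is invented purely for illustration.
retrieval_log = RetrievalLog({"example term": ["www.ABCD.com/890/hty"]})
```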
After the retrieval log and access log of the known cheating website are obtained, each search result page and accessed page is obtained from them, and the page features of each search result page and each accessed page are extracted. The page features may include text features and structure features of the page: text features characterize the text information in the page, and structure features characterize how the parts of the page are distributed structurally. For example, text features may include whether the text in the page contains illegal words (such as "Falun Gong"), whether there is a large amount of repeated vocabulary, whether sentences are not fluent enough, whether the context is unrelated, and so on. Structure features may represent the distribution of the parts of the page, for example, where the title is located in the page, whether the parts of the page are spliced together, whether the page length exceeds a preset length threshold, whether the body frame that displays the text and the other function frames are distributed reasonably, or whether advertisements are distributed reasonably or cover the page body or title.
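A few of the text and structure features listed above can be sketched as follows. This is an illustrative sketch only: the word list, the repetition ratio, the length threshold, and the function names are assumptions, since the text gives no concrete values or extraction procedure.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical settings; the text gives no concrete word list or thresholds.
ILLEGAL_WORDS: Set[str] = {"bannedword"}
MAX_PAGE_LENGTH = 1024
REPEAT_RATIO = 0.3


def text_features(body: str) -> Dict[str, bool]:
    """Flag illegal words and heavy word repetition in the page body."""
    words = body.lower().split()
    top_count = Counter(words).most_common(1)[0][1] if words else 0
    return {
        "has_illegal_word": any(w in ILLEGAL_WORDS for w in words),
        "heavy_repetition": bool(words) and top_count / len(words) > REPEAT_RATIO,
    }


def structure_features(page_length: int, title_offset: int) -> Dict[str, object]:
    """Record the title position and whether the page exceeds the length threshold."""
    return {
        "title_offset": title_offset,
        "page_length": page_length,
        "too_long": page_length > MAX_PAGE_LENGTH,
    }


spam_text = text_features("cheap cheap cheap cheap pills")
long_page = structure_features(page_length=2000, title_offset=0)
```

Features such as sentence fluency or context relevance would need a language model or similar tooling and are left out of this sketch.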
Of course, in practical applications, only the text features, only the structure features, or both may be used as page features. In addition, in this embodiment, page features may be extracted only from the search result pages in the retrieval log, only from the accessed pages in the access log, or from both; a person skilled in the art can choose according to the needs of the actual scenario. Because a retrieval log or access log may contain very many search result pages or accessed pages, a subset of them may also be randomly selected for page feature extraction in practical applications; the embodiments of this application do not limit this.
In practical applications, the known cheating website may be determined, as one example, by the following steps A1 to A3:
Step A1: obtain a set of websites for which it is to be determined whether they cheat.
In this embodiment, to detect whether websites cheat, the websites in the set to be determined can first be clustered, and the resulting classes can then be manually annotated, that is, each class of websites is labeled as cheating websites or normal websites, thereby obtaining a result on whether each website cheats.
Step A2: cluster the websites in the set to obtain clustered classes of websites.
For each website obtained in step A1, the features of each web page under the website are extracted according to the web pages the website includes and the search terms corresponding to those web pages, and are converted into corresponding feature vectors. For the process of extracting web page features and converting them into feature vectors, refer to the detailed description of step 102 below.
After the feature vectors of the web pages under each website are obtained, these feature vectors can be clustered. Clustering is the process of dividing a set of physical or abstract objects into multiple classes of similar objects. For example, the k-means clustering algorithm can be used in this step. In practical applications, the feature vectors of the web pages under cheating websites generally fall into one class, and the feature vectors of the web pages under normal websites fall into another class.
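The clustering in step A2 can be sketched with a plain k-means loop. This is a minimal sketch under stated assumptions: the two-dimensional sample vectors are invented, and the first k points are used as initial centroids only to keep the example deterministic; the text does not specify initialization, distance measure, or iteration count.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster index for each point using plain k-means."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented feature vectors: two pages near (0, 0) and two near (5, 5).
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels = kmeans(points, k=2)
```

With k set to 2, the two resulting classes correspond to the "cheating" and "normal" groups that are then handed to manual annotation.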
Step A3: determine the websites whose manual annotation result is cheating as the known cheating websites, where the manual annotation result indicates whether each class of websites is a cheating website.
After the two classes of websites are obtained by clustering, the two classes can be annotated manually, that is, a person marks which class is cheating websites and which is normal websites. Each website is then determined to be a cheating website or a normal website according to the manual annotation, and the websites annotated as cheating are used as the known cheating websites in step 101.
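Step A3 amounts to propagating a per-class manual label to every website in that class. The sketch below assumes each website has already been assigned a cluster index; the function name, the site names, and the label mapping are hypothetical.

```python
from typing import Dict, List


def known_cheating_sites(
    site_clusters: Dict[str, int], manual_labels: Dict[int, bool]
) -> List[str]:
    """Return the websites whose cluster was manually annotated as cheating."""
    return sorted(
        site
        for site, cluster in site_clusters.items()
        if manual_labels.get(cluster, False)
    )


# Hypothetical clustering output and manual annotation (True means cheating).
site_clusters = {"site-a.example": 0, "site-b.example": 1, "site-c.example": 0}
manual_labels = {0: True, 1: False}
known_sites = known_cheating_sites(site_clusters, manual_labels)
```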
In addition to determining known cheating websites by clustering unknown websites as described above, in the embodiments of this application the cheating websites detected in the subsequent step 103 can also be used as known cheating websites, so that page features are extracted from the newly known cheating websites and the cheating detection model built in step 102 is further updated. Of course, the cheating websites detected in step 103 can also be verified by manual annotation; if the manual annotation of a website detected in step 103 is also cheating, that website is used as a known cheating website to further update the established cheating detection model.
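The feedback loop described above can be sketched as a set update, with manual verification as an optional filter. The function name and the site names are hypothetical, and retraining the model from the enlarged set is not shown.

```python
from typing import Iterable, Optional, Set


def update_known_sites(
    known: Iterable[str],
    detected: Iterable[str],
    manually_confirmed: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Add detected cheating sites to the known set.

    If a manually confirmed list is given, only detected sites that also
    appear in it are added; otherwise every detected site is added.
    """
    detected_set = set(detected)
    if manually_confirmed is not None:
        detected_set &= set(manually_confirmed)
    return set(known) | detected_set


known = {"site-a.example"}
detected = ["site-d.example", "site-e.example"]
without_review = update_known_sites(known, detected)
with_review = update_known_sites(known, detected, manually_confirmed=["site-d.example"])
```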
After the page features of the known cheating website are extracted in step 101, the flow proceeds to step 102:
Step 102: build a cheating detection model according to the cheating rules indicated by the page features, the cheating detection model being used to detect whether a website cheats.
Based on the cheating rules of the known cheating website indicated by the page features obtained in step 101, a supervised machine learning method can be used to train a classifier as the cheating detection model, which is used to detect whether other websites are cheating websites.
Specifically, step 102 may be implemented through the following steps B1 to B2:
Step B1: convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors, respectively.
In this step, the page features of the search result pages and the page features of the accessed pages can first be converted into the retrieval feature vector and the access feature vector of the known cheating website, respectively. In practical applications, the model may be built using only the retrieval feature vectors, only the access feature vectors, or both.
Specifically, when converting to feature vectors, the feature value of each page feature in each web page under the website can be obtained first, and statistics can then be gathered to find the range of feature values that each page feature takes over all pages under the known cheating website. For example, for the structure feature of page length, its value in web page A is taken as the page-length feature value of web page A, its value in web page B is taken as the page-length feature value of web page B, and so on, until the page-length feature value has been obtained for all web pages under the website.
The page-length feature values of the web pages are then aggregated to determine the range of feature values that page length takes over all web pages under the known cheating website. Suppose that, after the page-length feature value of each web page is obtained, statistics show a maximum of 1024 pixels and a minimum of 268 pixels. The page-length feature value range over the web pages under the known cheating website is then 268 pixels to 1024 pixels. This range is converted into binary values that a computer can recognize, which gives the corresponding vector values; for example, the corresponding vector value range may be 000100 to 111111.
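Finding the feature value range is a minimum and maximum over the per-page values. The sketch below uses the 268 and 1024 pixel bounds from the text; the third page length and the function name are invented. How the range maps to the example vector values 000100 to 111111 is not specified in the text, so that encoding is not reproduced here.

```python
from typing import Iterable, Tuple


def feature_value_range(values: Iterable[int]) -> Tuple[int, int]:
    """Return (minimum, maximum) of one page feature over all pages of a site."""
    collected = list(values)
    return min(collected), max(collected)


# Page lengths in pixels; 268 and 1024 come from the text, 640 is invented.
page_lengths = {"A": 268, "B": 640, "C": 1024}
low, high = feature_value_range(page_lengths.values())
```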
Suppose a total of 100 page features are found across all web pages under the known cheating website. For a numeric page feature, such as page length with a value of 268 or 1024, the values of that page feature in each web page under the known cheating website can be added together, and the sum is taken as the feature value of that page feature for the known cheating website. Converting this feature value into a binary value gives the n-th vector value of a 1*N-dimensional feature vector, where N is an integer greater than zero. For example, if the first vector value of the feature vector of the known cheating website corresponds to the page-length feature and the page lengths of all web pages sum to 8534, the first vector value is the binary value converted from "8534".
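The numeric case can be sketched as a sum followed by a binary conversion. The ten page lengths below are invented so that they total 8534, the sum used in the text; the function name is also an assumption.

```python
from typing import Iterable, Tuple


def summed_feature_bits(values: Iterable[int]) -> Tuple[int, str]:
    """Sum a numeric page feature over all pages and render the sum in binary."""
    total = sum(values)
    return total, format(total, "b")


# Invented page lengths between 268 and 1024 pixels that add up to 8534.
lengths = [268, 1024, 900, 950, 870, 1000, 980, 890, 852, 800]
total, bits = summed_feature_bits(lengths)
# total is 8534 and bits is "10000101010110"
```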
For a non-numeric page feature, such as image clarity with the feature values "clear", "common", and "unclear", a person skilled in the art can use the values "2", "1", and "0" to represent these three feature values, respectively. For example, suppose the vector values starting from the second position of the feature vector of the known cheating website represent the image clarity feature, and there are 5 web pages A to E under the known cheating website. The second to sixth vector values of the feature vector then represent the image clarity of these 5 web pages. If the 2nd to 6th vector values are {0, 2, 1, 1, 2} and the preset web page order is alphabetical from A to E, then the clarity of web page A is unclear, that of web page B is clear, that of web page C is common, that of web page D is common, and that of web page E is clear.
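The non-numeric case is a fixed code table applied in a preset page order. The sketch below reproduces the A to E example from the text; the function and constant names are assumptions.

```python
from typing import Dict, List

# Code table from the text: clear -> 2, common -> 1, unclear -> 0.
CLARITY_CODES: Dict[str, int] = {"unclear": 0, "common": 1, "clear": 2}


def encode_clarity(clarity_by_page: Dict[str, str]) -> List[int]:
    """Encode image clarity per page, with pages taken in alphabetical order."""
    return [CLARITY_CODES[clarity_by_page[page]] for page in sorted(clarity_by_page)]


clarity = {"A": "unclear", "B": "clear", "C": "common", "D": "common", "E": "clear"}
clarity_values = encode_clarity(clarity)
# clarity_values is [0, 2, 1, 1, 2], matching the 2nd to 6th vector values in the text
```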
And so on, whether the characteristic value according to the page feature of each webpage under known cheating website is numeric type, with And under known cheating website webpage quantity, to finally obtain the feature vector of the corresponding 1*N dimension of known cheating website. Certainly, aforesaid way is merely exemplary content, should not be construed as the restriction of the embodiment of the present application.
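The conversion described above can be illustrated with a minimal Python sketch. The function name `build_site_vector`, the dictionary keys, and the individual page lengths are assumptions chosen only so that the lengths sum to 8534 as in the example; the sketch keeps vector values as integers and leaves the binary conversion aside.

```python
from typing import Dict, List, Union

# Assumed coding of the non-numeric image-clarity feature,
# following the example values 2, 1 and 0 in the text.
CLARITY_CODES = {"clear": 2, "ordinary": 1, "unclear": 0}


def build_site_vector(pages: Dict[str, Dict[str, Union[int, str]]]) -> List[int]:
    """Build a 1*N vector for one site from its per-page features.

    The numeric feature (page_length) is summed over all pages; the
    non-numeric feature (image_clarity) contributes one coded value
    per page, in a preset alphabetical page order.
    """
    ordered = sorted(pages)  # preset page order: alphabetical by page name
    vector = [sum(int(pages[name]["page_length"]) for name in ordered)]
    vector.extend(CLARITY_CODES[str(pages[name]["image_clarity"])] for name in ordered)
    return vector


# Hypothetical pages A to E of one known cheating website.
pages = {
    "A": {"page_length": 268, "image_clarity": "unclear"},
    "B": {"page_length": 1024, "image_clarity": "clear"},
    "C": {"page_length": 3000, "image_clarity": "ordinary"},
    "D": {"page_length": 2242, "image_clarity": "ordinary"},
    "E": {"page_length": 2000, "image_clarity": "clear"},
}
site_vector = build_site_vector(pages)
```

In this sketch the first vector value is the summed page length and the second to sixth values are the clarity codes of pages A to E.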
In practice, the number of page features in the feature vector corresponding to each web page directly affects the accuracy and speed of model training. A feature vector generated by the above method can contain only the important page features, so its lower dimensionality can effectively improve the efficiency of subsequent training and retrieval.
Step B2: build the cheating detection model according to the retrieval feature vectors and/or the access feature vectors.
After the page features of each web page under the known cheating website are converted into feature vectors, two groups of training data are available: one is the retrieval feature vector set formed by the retrieval feature vectors of the known cheating website, and the other is the access feature vector set formed by the access feature vectors of the known cheating website. When building the cheating detection model, two cheating detection models may be built, one from the retrieval feature vector set and one from the access feature vector set; alternatively, a single cheating detection model may be built.
In the case of building a single cheating detection model, because the accessed pages are the pages that users clicked and viewed, the weight of each access feature vector in the access feature vector set can be set larger. Each access feature vector and each retrieval feature vector is used as an input object, the detection result is used as the desired output value (also called the supervisory signal), and a supervised machine learning method such as kNN (k-nearest neighbors) or a support vector machine (SVM) is used to train the cheating web page detection model. The desired output value may be, for example, "the detection result is cheating" or "the probability that the detection result is cheating is 100%".
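As a rough illustration of a single model in which access feature vectors carry a larger weight, the following Python sketch implements a small weighted kNN vote. It is only a sketch under stated assumptions: the weight values, the two-dimensional toy vectors, and the presence of non-cheating training examples (label 0) are all hypothetical and are not specified by the text.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, label, weight); label 1 = cheating, 0 = not cheating
Sample = Tuple[Sequence[float], int, float]

ACCESS_WEIGHT = 2.0     # assumed larger weight for access feature vectors
RETRIEVAL_WEIGHT = 1.0  # assumed weight for retrieval feature vectors


def knn_cheating_probability(samples: List[Sample], query: Sequence[float], k: int = 3) -> float:
    """Weighted k-nearest-neighbour estimate that `query` belongs to a cheating site."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    total = sum(weight for _, _, weight in nearest)
    cheating = sum(weight for _, label, weight in nearest if label == 1)
    return cheating / total


training = [
    ([9.0, 8.0], 1, ACCESS_WEIGHT),
    ([8.0, 9.0], 1, RETRIEVAL_WEIGHT),
    ([1.0, 2.0], 0, RETRIEVAL_WEIGHT),
    ([0.0, 0.0], 0, ACCESS_WEIGHT),
]
prob = knn_cheating_probability(training, [8.5, 8.5], k=3)
```

Here the three nearest neighbours of the query carry weights 2, 1 and 1, of which the first two are labelled as cheating, so the weighted vote is 3 out of 4.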
Steps 101 to 102 above are the process of building the cheating detection model in the embodiments of the present application. After the cheating detection model is built, the following step 103 is performed whenever it is necessary to detect whether other websites are cheating.
Step 103: detect whether a website to be detected is cheating according to the cheating detection model.
After the cheating web page detection model has been trained, it can be used to detect whether other websites to be detected are cheating.
Specifically, the process of detecting whether a website to be detected is cheating may include steps C1 to C3:
Step C1: obtain the pages to be detected of the website to be detected.
First, every page under the website to be detected is obtained, or some of the pages under the website to be detected are obtained at random as the pages to be detected. In practice, all pages under the website to be detected can be obtained for detection; if the number of pages is too large, a portion of them, for example 60%, can be extracted as the pages to be detected. The percentage of pages extracted can be set by those skilled in the art as they see fit.
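Step C1 can be sketched as follows. The function name `select_pages_to_detect`, the cut-off `max_full_scan` above which sampling starts, and the optional `seed` are assumptions for illustration; only the 60% example fraction comes from the text.

```python
import random
from typing import List, Optional


def select_pages_to_detect(pages: List[str], fraction: float = 0.6,
                           max_full_scan: int = 1000,
                           seed: Optional[int] = None) -> List[str]:
    """Return all pages of a small site, or a random fraction of a large one."""
    if len(pages) <= max_full_scan:
        return list(pages)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample_size = max(1, round(len(pages) * fraction))
    return rng.sample(pages, sample_size)


# Hypothetical site with 2000 pages: 60% of them are sampled.
all_pages = [f"/page/{i}" for i in range(2000)]
chosen = select_pages_to_detect(all_pages, fraction=0.6, seed=42)
```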
Step C2: extract the to-be-detected page features of the pages to be detected, and convert the to-be-detected page features into the to-be-detected feature vector of the website to be detected.
Next, the page features of the pages to be detected, which may include text features, structure features and so on, are extracted, and the extracted page features are converted into the to-be-detected feature vector of the website to be detected. For the extraction of page features, refer to the description of step 101; for the conversion into feature vectors, refer to the description of step B1. Details are not repeated here.
Step C3: detect whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rule.
The to-be-detected feature vector obtained in step C2 is then used as the input of the cheating detection model built in step 102, and the output result is obtained, namely that the website to be detected either is or is not a cheating website. In practice, depending on the supervised method used, the output of the cheating detection model may directly be the result of whether the website to be detected is a cheating website, or it may be a predicted probability that the website to be detected is a cheating website, for example 80%. In the latter case, those skilled in the art can preset a probability threshold, for example 70%; if the probability output by the cheating detection model is greater than the threshold, the website to be detected is confirmed to be a cheating website.
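The two kinds of model output described in step C3 can be handled with a small helper such as the one below. The function name `is_cheating_site` is hypothetical, and the default threshold of 0.7 simply mirrors the 70% example.

```python
from typing import Union


def is_cheating_site(model_output: Union[bool, float], threshold: float = 0.7) -> bool:
    """Turn a model output into a yes/no decision.

    A boolean output is returned unchanged; a probability counts as
    cheating only when it is strictly greater than the threshold.
    """
    if isinstance(model_output, bool):
        return model_output
    return float(model_output) > threshold


decision = is_cheating_site(0.8)  # 80% predicted probability against a 70% threshold
```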
After step 103 has detected whether other websites are cheating, the following step 104 may optionally be performed.
Step 104: demote or delete the website to be detected whose detection result is cheating.
In this embodiment, if the website to be detected is a cheating website, then to reduce the likelihood that pages under it are found by users through search, the website to be detected can be demoted (its ranking weight reduced), or every page under it can be deleted directly.
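Step 104 might look like the following sketch, which assumes a hypothetical in-memory index mapping page URLs to ranking weights and identifies the pages of a site by URL prefix. The demotion factor of 0.1 is an arbitrary illustrative value; a real search engine index would be organised differently.

```python
from typing import Dict


def penalize_site(index: Dict[str, float], site_prefix: str,
                  action: str = "demote", factor: float = 0.1) -> Dict[str, float]:
    """Return a new index in which pages under `site_prefix` are demoted or deleted."""
    if action not in ("demote", "delete"):
        raise ValueError("action must be 'demote' or 'delete'")
    result: Dict[str, float] = {}
    for url, weight in index.items():
        if not url.startswith(site_prefix):
            result[url] = weight           # pages of other sites are untouched
        elif action == "demote":
            result[url] = weight * factor  # reduce the ranking weight
        # action == "delete": the page is simply left out of the new index
    return result


index = {"bad.example/a": 1.0, "bad.example/b": 0.5, "good.example/": 0.9}
demoted = penalize_site(index, "bad.example/", action="demote")
deleted = penalize_site(index, "bad.example/", action="delete")
```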
In the embodiments of the present application, for a known cheating website, the features of all or some of the pages under the known cheating website are extracted from the retrieval log and/or access log saved by the search engine, and the cheating detection model is built according to the cheating features represented by the extracted page features. Consequently, when the cheating detection model is used to detect whether a website to be detected is cheating, more accurate cheating detection can be performed on other websites, because the model reflects the cheating features that cheating websites exhibit on their pages. Moreover, because the cheating detection model is built from users' retrieval logs and access logs in the search engine, a model established from the user's perspective better captures what cheating websites have in common and is more representative.
For brevity, the foregoing method embodiment is described as a series of combined actions. However, those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Corresponding to the method provided by the above embodiment of the detection method for cheating websites, and referring to Fig. 2, the present application further provides an embodiment of a detection device for cheating websites. In this embodiment, the device may include:
Extraction unit 201, configured to extract the page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website.
The extraction unit 201 may include:
An obtaining subunit, configured to obtain the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses of each accessed page; and an extraction subunit, configured to extract the text features and/or structure features of the search result pages and/or accessed pages as the page features.
The extraction subunit may include:
An information extraction subunit, configured to extract the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and a structure extraction subunit, configured to extract the body structure features and title syntax features of each page from the search result pages and/or accessed pages as the structure features.
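One possible reading of the information extraction and structure extraction subunits is sketched below with Python's standard `html.parser` module. The text does not define what the structure features are, so using per-tag counts as a stand-in is purely an assumption, as are the class and function names.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List


class PageFeatureParser(HTMLParser):
    """Collect title text, body text and tag counts from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.body_parts: List[str] = []
        self.tag_counts: Counter = Counter()
        self._open: List[str] = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to the matching tag, tolerating unclosed children such as <img>.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open:
            self.title_parts.append(text)
        elif "body" in self._open:
            self.body_parts.append(text)


def extract_page_features(html: str) -> Dict[str, object]:
    """Return text features (title, body) and an assumed structure feature (tag counts)."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "title_text": " ".join(parser.title_parts),
        "body_text": " ".join(parser.body_parts),
        "tag_counts": dict(parser.tag_counts),
    }


html = ("<html><head><title>Cheap pills</title></head><body><h1>Buy now</h1>"
        "<p>Best <a href='x'>deal</a></p><img src='a.png'></body></html>")
features = extract_page_features(html)
```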
The known cheating website may be determined in the following manner:
Obtain a set of websites for which it is to be determined whether they are cheating; cluster the websites in the set to obtain the clustered classes of websites; and determine the websites whose manual annotation result is cheating among the classes of websites as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
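The last part of this procedure, selecting known cheating websites from clusters that were manually annotated, can be sketched as follows. The clustering itself is assumed to have been done already by some method not specified here; the cluster ids, site names and function name are hypothetical.

```python
from typing import Dict, List


def known_cheating_sites(cluster_of: Dict[str, int],
                         cluster_is_cheating: Dict[int, bool]) -> List[str]:
    """Return the sites whose cluster was manually annotated as cheating.

    cluster_of maps each site to its cluster id (the output of an assumed
    earlier clustering step); cluster_is_cheating holds one manual
    annotation per cluster. Clusters without an annotation are skipped.
    """
    return sorted(site for site, cluster_id in cluster_of.items()
                  if cluster_is_cheating.get(cluster_id, False))


cluster_of = {"a.example": 0, "b.example": 0, "c.example": 1, "d.example": 2}
cluster_is_cheating = {0: True, 1: False}  # cluster 2 has not been annotated
known = known_cheating_sites(cluster_of, cluster_is_cheating)
```

Annotating one class at a time rather than one site at a time is what keeps the manual effort small in this reading of the procedure.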
Model construction unit 202, configured to build a cheating detection model according to the cheating rule represented by the page features, where the cheating detection model is used to detect whether a website is cheating.
The model construction unit 202 may include:
A conversion subunit, configured to convert the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively; and a construction subunit, configured to build the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
Detection unit 203, configured to detect whether a website to be detected is cheating according to the cheating detection model.
The detection unit 203 may include:
An obtaining subunit, configured to obtain the pages to be detected of the website to be detected; an extraction subunit, configured to extract the to-be-detected page features of the pages to be detected and convert them into the to-be-detected feature vector of the website to be detected; and a detection subunit, configured to detect whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rule.
The device may further include:
Cheating processing unit 204, configured to demote or delete the website to be detected whose detection result is cheating.
It can be seen that, in the embodiments of the present application, for a known cheating website, the features of all or some of the pages under the known cheating website are extracted from the retrieval log and/or access log saved by the search engine, and the cheating detection model is built according to the cheating rule represented by the extracted page features. Consequently, when the cheating detection model is used to detect whether a website to be detected is cheating, the model reflects the cheating rule of cheating websites, so that more accurate cheating detection can be performed on other websites. Moreover, because the cheating detection model is built from users' retrieval logs and access logs in the search engine, a model established from the user's perspective better captures what cheating websites have in common, is more representative, and can also accurately detect unknown types of cheating.
As for the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 3 is a block diagram of a detection device 800 for cheating websites according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to Fig. 3, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 provides power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and so on. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor component 814 can detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800. The sensor component 814 can also detect a change in position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, where the instructions can be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium is provided such that, when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is able to perform a detection method for cheating websites, the method comprising: extracting the page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website; building a cheating detection model according to the cheating rule represented by the page features, where the cheating detection model is used to detect whether a website is cheating; and detecting whether a website to be detected is cheating according to the cheating detection model.
Extracting the page features of the pages under the known cheating website from the retrieval log and/or access log of the known cheating website may specifically include:
Obtaining the retrieval log and/or access log of the known cheating website, where the retrieval log includes search terms and the search result pages corresponding to the search terms, and the access log includes the pages accessed by users and the number of accesses of each accessed page; and
Extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features.
Extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features may specifically include:
Extracting the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
Extracting the body structure features and title syntax features of each page from the search result pages and/or accessed pages as the structure features.
Building the cheating detection model according to the cheating rule represented by the page features may specifically include:
Converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively;
Building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
Detecting whether the website to be detected is cheating according to the cheating detection model may specifically include:
Obtaining the pages to be detected of the website to be detected;
Extracting the to-be-detected page features of the pages to be detected, and converting the to-be-detected page features into the to-be-detected feature vector of the website to be detected; and
Detecting whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rule.
The known cheating website may be determined in the following manner:
Obtaining a set of websites for which it is to be determined whether they are cheating;
Clustering the websites in the set to obtain the clustered classes of websites; and
Determining the websites whose manual annotation result is cheating among the classes of websites as the known cheating websites, where the manual annotation result indicates whether each class of websites consists of cheating websites.
The device 800 may also be configured such that the one or more programs, executed by one or more processors, include instructions for performing the following operation:
Demoting or deleting the website to be detected whose detection result is cheating.
Fig. 4 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A detection method for cheating websites, characterized by comprising:
Extracting the page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
Building a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating;
Detecting whether a website to be detected is cheating according to the cheating detection model.
2. The method according to claim 1, characterized in that extracting the page features of the pages under the known cheating website from the retrieval log and/or access log of the known cheating website comprises:
Obtaining the retrieval log and/or access log of the known cheating website, the retrieval log comprising search terms and the search result pages corresponding to the search terms, and the access log comprising the pages accessed by users and the number of accesses of each accessed page;
Extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features.
3. The method according to claim 2, characterized in that extracting the text features and/or structure features of the search result pages and/or accessed pages as the page features comprises:
Extracting the body text information and/or title text information of each page from the search result pages and/or accessed pages as the text features; and
Extracting the body structure features and title syntax features of each page from the search result pages and/or accessed pages as the structure features.
4. The method according to claim 3, characterized in that building the cheating detection model according to the cheating rule represented by the page features comprises:
Converting the page features of the search result pages and/or accessed pages into retrieval feature vectors and/or access feature vectors respectively;
Building the cheating detection model according to the retrieval feature vectors and/or access feature vectors.
5. The method according to claim 4, characterized in that detecting whether the website to be detected is cheating according to the cheating detection model comprises:
Obtaining the pages to be detected of the website to be detected;
Extracting the to-be-detected page features of the pages to be detected, and converting the to-be-detected page features into the to-be-detected feature vector of the website to be detected;
Detecting whether the website to be detected is a cheating website according to whether the to-be-detected feature vector conforms to the page cheating rule.
6. The method according to claim 1, characterized in that the known cheating website is determined in the following manner:
Obtaining a set of websites for which it is to be determined whether they are cheating;
Clustering the websites in the set to obtain the clustered classes of websites;
Determining the websites whose manual annotation result is cheating among the classes of websites as the known cheating websites, the manual annotation result indicating whether each class of websites consists of cheating websites.
7. The method according to claim 1, characterized by further comprising:
Demoting or deleting the website to be detected whose detection result is cheating.
8. A detection device for cheating websites, characterized by comprising:
An extraction unit, configured to extract the page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
A model construction unit, configured to build a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating;
A detection unit, configured to detect whether a website to be detected is cheating according to the cheating detection model.
9. A detection device for cheating websites, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:
Extracting the page features of the pages under a known cheating website from the retrieval log and/or access log of the known cheating website;
Building a cheating detection model according to the cheating rule represented by the page features, the cheating detection model being used to detect whether a website is cheating;
Detecting whether a website to be detected is cheating according to the cheating detection model.
10. A computer-readable medium having instructions stored thereon which, when executed by one or more processors, cause a device to perform the detection method for cheating websites according to one or more of claims 1 to 7.
CN201710576240.XA 2017-07-14 2017-07-14 Detection method and device for cheating sites and detection device for cheating sites Active CN110147472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710576240.XA CN110147472B (en) 2017-07-14 2017-07-14 Detection method and device for cheating sites and detection device for cheating sites

Publications (2)

Publication Number Publication Date
CN110147472A true CN110147472A (en) 2019-08-20
CN110147472B CN110147472B (en) 2021-10-15

Family

ID=67588038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710576240.XA Active CN110147472B (en) 2017-07-14 2017-07-14 Detection method and device for cheating sites and detection device for cheating sites

Country Status (1)

Country Link
CN (1) CN110147472B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093510A (en) * 2007-07-25 2007-12-26 北京搜狗科技发展有限公司 Anti cheating method and system for aiming at cheat on web page
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN101777053A (en) * 2009-01-08 2010-07-14 北京搜狗科技发展有限公司 Method and system for identifying cheating webpages
CN102243659A (en) * 2011-07-18 2011-11-16 南京邮电大学 Webpage junk detection method based on dynamic Bayesian model
CN103064984A (en) * 2013-01-25 2013-04-24 清华大学 Spam webpage identifying method and spam webpage identifying system
CN103150369A (en) * 2013-03-07 2013-06-12 人民搜索网络股份公司 Method and device for identifying cheat web-pages
WO2016101737A1 (en) * 2014-12-22 2016-06-30 北京奇虎科技有限公司 Search query method and apparatus
CN106326498A (en) * 2016-10-13 2017-01-11 合网络技术(北京)有限公司 Cheat video identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
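唐寿洪 (Tang Shouhong): "Research on Web Page Cheating Detection Technology Based on Ant Colony Optimization", China Master's Theses Full-text Database, Information Science and Technology Series *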

Also Published As

Publication number Publication date
CN110147472B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN108009521B (en) Face image matching method, device, terminal and storage medium
CN108875781B (en) Label classification method and device, electronic equipment and storage medium
CN105488025B (en) Template construction method and device, information identifying method and device
CN106708282B (en) A kind of recommended method and device, a kind of device for recommendation
CN109993125A (en) Model training method, face identification method, device, equipment and storage medium
CN109871896A (en) Data classification method, device, electronic equipment and storage medium
CN109389162B (en) Sample image screening technique and device, electronic equipment and storage medium
CN109493852A (en) A kind of evaluating method and device of speech recognition
KR20210131211A (en) Method for training a voiceprint extraction model and method for voiceprint recognition, and device and medium thereof, program
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN108121736A (en) A kind of descriptor determines the method for building up, device and electronic equipment of model
CN107330019A (en) Searching method and device
CN109359056A (en) A kind of applied program testing method and device
CN109933714A (en) A kind of calculation method, searching method and the relevant apparatus of entry weight
CN107666536A (en) A kind of method and apparatus for finding terminal, a kind of device for being used to find terminal
CN111984749A (en) Method and device for ordering interest points
CN109471919A (en) Empty anaphora resolution method and device
CN110069624A (en) Text handling method and device
CN112784142A (en) Information recommendation method and device
CN108733718A (en) Display methods, device and the display device for search result of search result
WO2023029397A1 (en) Training data acquisition method, abnormal behavior recognition network training method and apparatus, computer device, storage medium, computer program and computer program product
CN110110207A (en) A kind of information recommendation method, device and electronic equipment
CN109521888A (en) A kind of input method, device and medium
CN110377808A (en) Document processing method, device, electronic equipment and storage medium
CN113919361A (en) Text classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant