CN110069693B

CN110069693B - Method and device for determining target page

Info

Publication number: CN110069693B
Application number: CN201910352767.3A
Authority: CN
Inventors: 苏晓东; 刘广; 董晓康; 耿志峰; 杜昆; 杨皓; 段海新
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2019-04-29
Filing date: 2019-04-29
Publication date: 2021-12-24
Anticipated expiration: 2039-04-29
Also published as: CN110069693A

Abstract

The embodiments of the application disclose a method, a device, electronic equipment and a computer-readable medium for determining a target page. One embodiment of the method comprises: extracting, based on the domain names of the pages in a set of pages to be detected, candidate to-be-detected pages that meet a preset condition from the set, and adding them to a candidate to-be-detected page queue; and performing an operation of searching for a target page on the candidate to-be-detected pages in the queue. The operation of searching for a target page comprises: judging the category of a candidate to-be-detected page; determining a candidate to-be-detected page of a preset category to be a target page; extracting the corresponding primary domain name from the domain name of the target page; crawling associated pages of the target page based on that primary domain name; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the queue. This embodiment improves the efficiency of searching for target pages.

Description

Method and device for determining target page

Technical Field

The embodiments of the application relate to the technical field of computers, in particular to network information processing, and more particularly to a method and a device for determining a target page.

Background

A search engine is a system that collects information from the internet and provides a search service to users. SEO (Search Engine Optimization) is a technique that uses the rules of a search engine to improve the natural ranking of websites in that search engine.

At present, some websites promote their ranking through cheating means such as spam links and hidden web pages. One such means is to automatically generate a large number of interlinked pages containing promotional content so that a search engine continuously crawls the content of those pages. These websites typically do not provide content that meets the user's search needs, yet they are presented in the search results at a higher ranking.

Disclosure of Invention

The embodiment of the application provides a method, a device, electronic equipment and a computer readable medium for determining a target page.

In a first aspect, an embodiment of the present disclosure provides a method for determining a target page, including: extracting, based on the domain names of the pages in a set of pages to be detected, candidate to-be-detected pages that meet a preset condition from the set, and adding them to a candidate to-be-detected page queue; and performing an operation of searching for a target page on the candidate to-be-detected pages in the queue. The operation of searching for a target page comprises: judging the category of a candidate to-be-detected page; determining a candidate to-be-detected page of a preset category to be a target page; extracting the corresponding primary domain name from the domain name of the target page; crawling associated pages of the target page based on that primary domain name; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the queue so that the operation of searching for a target page is also performed on the associated page.

In some embodiments, the preset condition includes at least one of: the domain name of the page is not in a preset domain name white list; the confusion degree of the domain name of the page is larger than a preset confusion degree threshold value; the page does not belong to the intersection of the page set to be detected and the historical page set to be detected, and the collection time of the historical page set to be detected is earlier than that of the page set to be detected.

In some embodiments, the determining the category of the candidate to-be-detected page includes: randomly generating at least two sub-domain names based on the domain names of the candidate pages to be detected; acquiring webpage contents of sites corresponding to at least two sub domain names, and extracting characteristics of the acquired webpage contents; and determining the category of the candidate page to be detected as a preset category in response to determining that the difference between the characteristics of the web page contents of the sites corresponding to the at least two sub domain names is within a preset difference interval.

In some embodiments, the candidate to-be-detected page queue includes a distributed candidate to-be-detected page queue; and performing the operation of searching for a target page on the candidate to-be-detected pages in the queue includes: using a plurality of processes to respectively perform the operation of searching for a target page on the candidate to-be-detected pages in the distributed candidate to-be-detected page queue.

In some embodiments, the above method further comprises: storing the target page in a database.

In some embodiments, the above method further comprises: shielding the found target page.

In some embodiments, the shielding processing on the searched target page includes at least one of the following: adding the uniform resource locator of the target page into a shielding page list of a crawler program of a search engine; deleting the index of the target page in an index base of a search engine; and in response to detecting that the page in the search result contains the target page, pushing risk prompt information to the client sending the search request.

In a second aspect, an embodiment of the present disclosure provides an apparatus for determining a target page, including: the extraction unit is configured to extract candidate pages to be detected which meet preset conditions from the page set to be detected based on domain names of the pages in the page set to be detected, and add the candidate pages to be detected to a candidate page queue; the searching unit is configured to execute an operation of searching a target page on the candidate page to be detected in the candidate page to be detected queue, and the operation of searching the target page comprises the following steps: the method comprises the steps of judging the category of candidate pages to be detected, determining the candidate pages to be detected in a preset category as target pages, extracting corresponding primary domain names from the domain names of the target pages, crawling associated pages of the target pages based on the primary domain names corresponding to the target pages, and adding the associated pages of the target pages to the candidate pages to be detected queue in response to determining that the associated pages of the target pages are not in the candidate pages to be detected queue so as to perform the operation of searching the target pages for the associated pages of the target pages.

In some embodiments, the search unit is further configured to perform category determination on the candidate pages to be detected as follows: randomly generating at least two sub-domain names based on the domain names of the candidate pages to be detected; acquiring webpage contents of sites corresponding to at least two sub domain names, and extracting characteristics of the acquired webpage contents; and determining the category of the candidate page to be detected as a preset category in response to determining that the difference between the characteristics of the web page contents of the sites corresponding to the at least two sub domain names is within a preset difference interval.

In some embodiments, the candidate to-be-detected page queue includes a distributed candidate to-be-detected page queue; and the searching unit is further configured to: use a plurality of processes to respectively perform the operation of searching for a target page on the candidate to-be-detected pages in the distributed candidate to-be-detected page queue.

In some embodiments, the above apparatus further comprises: a storage unit configured to store the target page in a database.

In some embodiments, the above apparatus further comprises: a shielding unit configured to shield the found target page.

In some embodiments, the shielding unit is further configured to perform shielding processing on the searched target page according to at least one of the following manners: adding the uniform resource locator of the target page into a shielding page list of a crawler program of a search engine; deleting the index of the target page in an index base of a search engine; and in response to detecting that the page in the search result contains the target page, pushing risk prompt information to the client sending the search request.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out the method for determining a target page as provided in the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method for determining a target page provided in the first aspect.

According to the method and the device for determining the target page, the electronic device and the computer readable medium of the embodiments of the present disclosure, based on the domain name of the page in the to-be-detected page set, the candidate to-be-detected pages satisfying the preset condition are extracted from the to-be-detected page set, and are added to the candidate to-be-detected page queue, and the operation of searching for the target page is performed on the candidate to-be-detected pages in the candidate to-be-detected page queue, where the operation of searching for the target page includes: the method comprises the steps of judging the category of candidate pages to be detected, determining the candidate pages to be detected in a preset category as target pages, extracting corresponding first-level domain names from the domain names of the target pages, crawling associated pages of the target pages based on the first-level domain names corresponding to the target pages, adding the associated pages of the target pages into the candidate page queues to be detected in response to determining that the associated pages of the target pages are not in the candidate page queues to be detected, executing operation of searching the target pages on the associated pages of the target pages, finding more pages in the preset category based on domain name association by using limited search engine resources, and improving the searching efficiency of the pages in the preset category.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram to which embodiments of the present application may be applied;

FIG. 2 is a flow diagram of one embodiment of a method for determining a target page according to the present application;

FIG. 3 is a flow diagram of another embodiment of a method for determining a target page according to the present application;

FIG. 4 is a schematic block diagram of the method for determining a target page shown in FIG. 3;

FIG. 5 is a flow diagram of yet another embodiment of a method for determining a target page according to the present application;

FIG. 6 is a block diagram illustrating an embodiment of an apparatus for determining a target page according to the present application;

FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 shows an exemplary system architecture to which the method for determining a target page or the apparatus for determining a target page of the present application may be applied.

As shown in FIG. 1, a system architecture 100 may include a client 101, a network 102, a search engine server 103, and at least one website server 104. The network 102 serves as a medium for providing communication links between the client 101, the search engine server 103, and the website server 104. The network 102 may include various connection types, such as wired or wireless communication links or fiber optic cables.

The client 101 may have a user interface through which a user may access network resources. The client 101 may be embodied as various electronic devices including, but not limited to, a smart phone, a notebook computer, a desktop computer, a tablet computer, a smart watch, and the like.

The search engine server 103 is a server for providing a search service, and a search engine for collecting and organizing information from the internet is run on the search engine server 103.

The website servers 104 may be servers providing various resource information on the internet, and different website servers 104 may provide resource information of different categories or from different sources. For example, the website server 104 may be a video resource server, a server of an enterprise website, a server of a knowledge sharing website, and so on.

The client 101 may establish a connection with the search engine server 103 through the network 102. A browser may be installed on the client 101, and a user may initiate a search request through the client 101. After receiving the search request, the search engine server 103 captures the content of each website server 104 through the network 102 by using a crawler program, performs analysis processing, finds out a website matching the search request, and feeds back information of the found website as a search result to the client 101. The client 101 may receive search results returned by the search engine server 103.

In the application scenario of the present disclosure, the search engine server 103 may crawl a page provided by the website server 104, detect whether the page provided by the website server 104 is a preset category page, and obtain a detection result of a preset category target page.

The search engine server 103 and the website server 104 may be hardware or software. When the search engine server 103 and the website server 104 are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the search engine server 103 and the website server 104 are software, they may be implemented as a plurality of pieces of software or software modules (for example, a plurality of pieces of software or software modules for providing distributed services), or may be implemented as a single piece of software or software module. No particular limitation is imposed here.

It should be noted that the method for determining the target page provided by the embodiment of the present disclosure may be executed by the search engine server 103, and accordingly, the apparatus for determining the target page may be disposed at the search engine server 103.

It should be understood that the numbers of clients, networks, search engine servers, and website servers in FIG. 1 are merely illustrative. There may be any number of clients, networks, search engine servers, and website servers, as required by the implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for determining a target page in accordance with the present application is shown. The method for determining the target page comprises the following steps:

Step 201, extracting candidate pages to be detected that meet preset conditions from the page set to be detected based on the domain names of the pages in the page set to be detected, and adding the candidate pages to be detected to a candidate page queue to be detected.

In this embodiment, the execution body of the method for determining the target page may obtain the set of pages to be detected. The set of pages to be detected may be a set of pages collected from the internet, for example, a set of pages crawled by a search engine whose generation time or update time is within a preset time period (for example, the last week). In some optional implementations, the pages in the set of pages to be detected may also be acquired by the execution body from other electronic devices; for example, the execution body may receive problem pages reported by other electronic devices.

Whether the pages in the page set to be detected meet the preset conditions or not can be judged based on the domain name of each page. If the page meets the preset condition, the page can be added to the candidate page queue to be detected.

Here, the purpose of detecting pages is to find target pages. A target page may be a page that obtains a higher search ranking by exploiting the web page crawling rules of the search engine through cheating means. Specifically, the target page may be a page having a particular SEO behavior characteristic, and more specifically, the target page may be a page exhibiting black-hat SEO behavior, such as a spider pool page.

Black-hat SEO is a search engine optimization behavior that improves the ranking of web pages by cheating means such as junk links, hidden web pages, keyword stacking and the like. A spider pool is a means of automatically generating a large number of interlinked pages and attracting search engine crawlers to those pages by updating massive amounts of data, thereby improving the search ranking of the pages. The interlinked pages in a spider pool contain little valuable content, but once a search engine crawls to a page in the spider pool, it easily becomes trapped crawling a large number of low-value pages, wasting a large amount of resources.

In some optional implementations of this embodiment, the preset condition may be determined based on a feature of the target page, for example, an SEO behavior feature of the target page. For example, when the target page is a spider pool page, it is interlinked with a large number of pages whose domain names are similar to one another, and the preset condition may be set based on this similarity of domain names. The pages in the set of pages to be detected that meet the preset condition are then taken as candidate pages to be detected and added to the candidate page queue to be detected.

In some optional implementation manners of this embodiment, a domain name list of the trusted page may be obtained, and other pages except the trusted page are added to the candidate page queue to be detected. The preset conditions may include: the domain name of the page is not in the list of trusted domain names.

Under the condition that the number of pages in the page set to be detected is large, candidate pages to be detected are screened through preset conditions, a large number of pages which do not need to be detected can be filtered, and therefore the number of the pages which need to be detected is reduced.

Optionally, the preset condition may include at least one of the following: the domain name of the page is not in a preset domain name white list; the confusion degree of the domain name of the page is larger than a preset confusion degree threshold value; the page does not belong to the intersection of the page set to be detected and the historical page set to be detected, and the collection time of the historical page set to be detected is earlier than that of the page set to be detected.

The preset domain name white list may be a domain name set of websites with access times exceeding a threshold value, which are obtained based on a search engine, and/or a set of trusted domain names set by a user.

In natural language processing, perplexity (rendered in this text as the degree of confusion, or simply confusion) measures how well a language probability model describes the probability distribution of the tokens in a sentence. The domain name confusion of a page may serve as an indicator of how random the domain name is: the higher the randomness of a domain name, the greater its confusion.

Specifically, the confusion of the domain name can be calculated according to the following formula (1):

wherein U represents the domain name character string of the page to be detected; P(U) represents the confusion of the domain name string U; N is the length of the domain name string U, i.e., the total number of characters contained in U. P(ω_0) is the probability of the 0th character appearing in the domain name string U, where the 0th character is a placeholder representing the beginning of the string. P(ω_i | U_i) is the probability of the i-th character occurring given the prefix U_i, where the prefix U_i is the prefix string composed of the 0th through (i-1)-th characters of the domain name string U. P(ω_i | ω_{i-1}) represents the probability of the i-th character occurring given that the (i-1)-th character has occurred.
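The body of formula (1) does not survive in this text. A form consistent with the symbol definitions above, assuming the standard definition of perplexity together with a bigram approximation of the prefix-conditioned probabilities, would be the following. This is a reconstruction under that assumption, not the formula as filed.

```latex
P(U) = \left( P(\omega_0) \prod_{i=1}^{N} P(\omega_i \mid U_i) \right)^{-\frac{1}{N}}
     \approx \left( P(\omega_0) \prod_{i=1}^{N} P(\omega_i \mid \omega_{i-1}) \right)^{-\frac{1}{N}}
```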

When the domain name confusion of a page is greater than a preset confusion threshold, the domain name of the page is strongly random. A page whose domain name is strongly random can be taken as a candidate page to be detected and added to the candidate page queue to be detected. Because the domain names of pages that engage in search engine optimization cheating, such as spider pool pages, tend to be highly random, setting a confusion threshold for domain names makes it possible to screen out pages that may be engaged in such cheating.
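The confusion-based screening can be illustrated with a minimal Python sketch of a character-level bigram model. The names `train_bigram` and `perplexity`, the add-alpha smoothing, the choice to treat the start placeholder as having probability 1, and the tiny training corpus in the usage note are all assumptions made for illustration; the source does not specify how the character probabilities are estimated.

```python
import math
from collections import Counter

START = "^"  # placeholder for the 0th character (start of the string)


def train_bigram(domains):
    """Count character bigrams over a corpus of known domain strings."""
    pair_counts = Counter()
    prev_counts = Counter()
    alphabet = set()
    for d in domains:
        chars = START + d.lower()
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return pair_counts, prev_counts, alphabet


def perplexity(domain, model, alpha=1.0):
    """Bigram perplexity of a domain string, with add-alpha smoothing."""
    if not domain:
        raise ValueError("domain must be a non-empty string")
    pair_counts, prev_counts, alphabet = model
    chars = START + domain.lower()
    vocab = len(alphabet | set(chars))
    # Assumption: the start placeholder always occurs, so its log-probability is 0.
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (pair_counts[(prev, cur)] + alpha) / (prev_counts[prev] + alpha * vocab)
        log_prob += math.log(p)
    # N-th root of the inverse probability, computed in log space.
    return math.exp(-log_prob / len(domain))
```

For example, with `model = train_bigram(["baidu", "google", "example", "wikipedia", "search"])`, a string such as `"xqzjkvw"` scores a higher perplexity than `"google"`, so a threshold placed between the two would select only the random-looking name.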

The historical set of pages to be detected may be a set of pages collected before the current set of pages to be detected was collected. For example, the search engine may collect pages to be detected at a fixed interval and determine target pages from the collected pages after each collection. For each page in the current set of pages to be detected, it can be judged whether the page is in the intersection of the current set and the historical set. If it is, the page was already detected as part of the historical set and need not be detected again. If the page does not belong to the intersection of the current set and the historical set, the page can be determined to be a new page and added to the candidate page queue to be detected. In this way, incremental calculation can be carried out against the historical set of pages to be detected, and only recently added pages are extracted as candidate pages to be detected.

Optionally, other page filtering methods may be used to set the preset condition, for example, the preset condition is set based on a trusted domain name list or a domain name detection result provided by a third-party search engine or a third-party page analysis service, so as to extract an untrusted page from the set of pages to be detected.
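The screening in step 201 can be sketched as a single filter over the set of pages to be detected. This is a minimal sketch: the function names, the `score` callable (standing in for a domain name confusion function) and the decision to require all three conditions at once are assumptions; the source only requires at least one of the conditions.

```python
from collections import deque
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL."""
    return (urlsplit(url).hostname or "").lower()


def extract_candidates(pages, whitelist, history, score, threshold):
    """Queue the pages that pass all three screening conditions.

    Assumption: the conditions are combined with AND; the source allows
    any one of them to be used on its own.
    """
    queue = deque()
    for url in pages:
        host = domain_of(url)
        if host in whitelist:         # trusted domain name: skip
            continue
        if url in history:            # already covered by the historical set
            continue
        if score(host) <= threshold:  # domain name does not look random enough
            continue
        queue.append(url)
    return queue
```

Checking the whitelist and the historical set before scoring keeps the cheaper set lookups ahead of the more expensive confusion computation.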

Step 202, performing an operation of searching for a target page on the candidate page to be detected in the candidate page to be detected queue.

The operation of searching for a target page comprises: judging the category of a candidate page to be detected; determining a candidate page to be detected of a preset category to be a target page; extracting the corresponding primary domain name from the domain name of the target page; crawling associated pages of the target page based on that primary domain name; and, in response to determining that an associated page of the target page is not in the candidate page queue to be detected, adding the associated page to the queue so that the operation of searching for a target page is also performed on the associated page.

Specifically, for each candidate page to be detected in the candidate page to be detected queue, the category of the candidate page to be detected may be determined first. Here, the categories of the pages may be divided according to the features of the target pages desired to be found, and the pages having the features of the target pages may be classified into one category, and the pages not having the features of the target pages may be classified into another category. Specifically, when the target page is a page having an action of promoting the search ranking by the abnormal cheating means, the category of the page may represent whether the page has an action of promoting the search ranking by the abnormal cheating means, and may include a cheating page and a normal page. Optionally, when the target page is a spider pool page, the category of the page may characterize whether the page is a spider pool page, including spider pool pages and non-spider pool pages.

In this embodiment, the features of the candidate page content to be detected may be extracted, and/or the link behavior features of the page to be detected may be extracted. The link behavior feature can be obtained by performing an operation of crawling candidate pages to be detected and capturing behavior features of a crawler when crawling candidate pages to be detected. And then determining whether the candidate page to be detected is a preset category page or not based on the extracted features. Here, the preset category may be a category of a page having an action of raising a search ranking by an abnormal cheating means, or further, the preset category may be a category of a spider pool page.

Optionally, the category of the candidate to-be-detected page may be determined as follows:

At least two sub-domain names are randomly generated based on the domain name of the candidate page to be detected. The web page contents of the sites corresponding to the at least two sub-domain names are then acquired, and features are extracted from the acquired web page contents. Suppose that the sites corresponding to two generated sub-domain names are Pa and Pb, respectively, and that the features of the web page contents of these two sites are denoted LSet(Pa) and LSet(Pb), respectively. Then, the difference between the features of the web page contents of the sites corresponding to the at least two sub-domain names may be calculated; for example, the difference diff between the features of the web page contents of the sites corresponding to the two sub-domain names is calculated by using formula (2):

wherein |LSet(Pa) - LSet(Pb)| represents the size of the difference between the two sets LSet(Pa) and LSet(Pb), i.e., the number of features in the difference between LSet(Pa) and LSet(Pb); LSet(Pmin) is the smaller of LSet(Pa) and LSet(Pb); and |LSet(Pmin)| is the size of the smaller of LSet(Pa) and LSet(Pb), i.e., the number of features it contains.
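The body of formula (2) does not survive in this text. Given the two quantities defined above, the most natural reading is a ratio of the size of the set difference to the size of the smaller set; the following is a reconstruction under that assumption.

```latex
\mathit{diff} = \frac{\lvert LSet(P_a) - LSet(P_b) \rvert}{\lvert LSet(P_{\min}) \rvert}
```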

By calculating the difference between the characteristics of the web page contents of the sites corresponding to the at least two sub-domain names, the difference between the sites corresponding to the two sub-domain names generated by the candidate page to be detected can be obtained. And then, in response to the fact that the difference between the characteristics of the webpage contents of the sites corresponding to the at least two sub domain names is within a preset difference interval, determining the category of the candidate to-be-detected page as a preset category. When the difference between the web page contents of the sites corresponding to the two sub-domain names generated by one candidate page to be detected is small, it can be determined that the category of the candidate page to be detected is the preset category.
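The category judgment described above can be sketched in Python as follows. The helper names, the sub-domain label length, the interval bounds `low` and `high`, and the reading of the difference as a ratio of set sizes are all assumptions. Fetching the pages and extracting their features is left to the caller, since the source does not fix a feature type.

```python
import random
import string


def random_subdomains(domain, count=2, length=8, rng=random):
    """Generate `count` distinct random sub-domain names under `domain`."""
    names = set()
    while len(names) < count:
        label = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        names.add(f"{label}.{domain}")
    return sorted(names)


def diff(features_a, features_b):
    """Size of the set difference divided by the size of the smaller set."""
    smaller = min(len(features_a), len(features_b))
    if smaller == 0:
        raise ValueError("both feature sets must be non-empty")
    return len(features_a - features_b) / smaller


def is_preset_category(features_a, features_b, low=0.0, high=0.2):
    """True when the difference falls inside the preset interval [low, high]."""
    return low <= diff(features_a, features_b) <= high
```

The intuition is that randomly generated sub-domain names would normally not resolve to real sites at all, whereas a spider pool tends to answer any sub-domain with near-identical generated content, so a small difference between two random sub-domains points to the preset category.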

The candidate page to be detected with the preset category as the category judgment result can be taken as the target page. Then, more pages of the preset category can be further searched for based on the domain name of the searched target page, so as to find more target pages. Specifically, a first-level domain name (i.e., a top-level domain name) may be extracted from the domain names of the determined target pages, and then related pages may be crawled according to the first-level domain name. The URL (Uniform Resource Locator) of the crawled page can be analyzed, more related pages are crawled for the URL by adopting a breadth-first traversal algorithm, and the crawled page is used as a related page of the target page. And then, judging whether the associated page of the target page is in the candidate page queue to be detected, if not, adding the crawled associated page of the target page into the candidate page queue to be detected, and implementing the operation of searching the target page on the candidate page queue to be detected so as to continuously search a new target page based on the associated page of the target page. In the process of sequentially searching for the target page for the pages in the candidate page queue to be detected, by continuously crawling the associated pages as the candidate pages to be detected and adding the candidate pages to be detected to the candidate page queue to be detected, more target pages can be found.
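The queue-driven search of step 202 can be sketched as a breadth-first loop. In this minimal sketch, `classify` and `crawl_associated` are hypothetical callables supplied by the caller, and `primary_domain` assumes that the primary domain name is the last two labels of the host (multi-part suffixes such as `com.cn` are not handled).

```python
from collections import deque
from urllib.parse import urlsplit


def primary_domain(url):
    """Return the last two labels of the host, e.g. 'example.com'.

    Assumption: multi-part public suffixes such as 'com.cn' are not handled.
    """
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])


def find_targets(candidates, classify, crawl_associated):
    """Process the candidate queue, expanding it from each target page found.

    classify(url) -> bool and crawl_associated(domain) -> iterable of URLs
    are hypothetical callables, not part of the source.
    """
    queue = deque(candidates)
    seen = set(queue)  # every URL that has ever been queued
    targets = []
    while queue:
        url = queue.popleft()
        if not classify(url):
            continue
        targets.append(url)
        for linked in crawl_associated(primary_domain(url)):
            if linked not in seen:  # add only pages that were never queued
                seen.add(linked)
                queue.append(linked)
    return targets
```

Tracking every URL ever queued, rather than only the URLs currently waiting, is a design choice of this sketch: it prevents already processed pages from being queued again, so the loop terminates even when the associated pages link back to one another.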

In step 202, associated pages are crawled based on the domain name of the target page, and the crawled pages are added to the candidate page queue to be detected for category judgment to determine whether each of them is a target page. In this way, the resources of the search engine are used effectively to find more target pages, and the efficiency of searching for target pages is improved.

According to the method for determining a target page in the embodiment of the disclosure, candidate pages to be detected that meet preset conditions are extracted from the set of pages to be detected based on the domain names of the pages in the set, and are added to the candidate to-be-detected page queue. The following operation of searching for the target page is then performed on the candidate pages to be detected in the queue: the category of each candidate page to be detected is judged; candidate pages to be detected of the preset category are determined as target pages; the corresponding first-level domain names are extracted from the domain names of the target pages; associated pages of the target pages are crawled based on those first-level domain names; and, in response to determining that an associated page of a target page is not in the candidate to-be-detected page queue, the associated page is added to the queue so that the operation of searching for the target page is also performed on it. In this way, more target pages of the preset category that are associated by domain name can be found with limited search engine resources, the efficiency of searching for target pages of the preset category is improved, and the method can be applied to large-scale searches for pages of the preset category.

Optionally, after the target page is determined, the process 200 of the method for determining a target page may further include: storing the target page to a database. The target page may be recorded by saving the URL of the target page and/or the page content of the target page in the database. When a search service is subsequently provided based on a search engine, the ranking of the target page can then be corrected according to the URL and/or page content saved in the database, or the target pages saved in the database can be culled from the search results.
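By way of a hedged example, recording target pages and later culling them from search results may be sketched with Python's built-in `sqlite3` module. The table name `target_pages`, its columns, and the function names are hypothetical; the embodiment does not prescribe a particular database or schema.

```python
import sqlite3


def open_target_store(path=":memory:"):
    """Open (or create) a small store for target pages; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages ("
        "url TEXT PRIMARY KEY, content TEXT)"
    )
    return conn


def save_target_page(conn, url, content=None):
    """Record the URL and, optionally, the page content of a target page."""
    conn.execute(
        "INSERT OR REPLACE INTO target_pages (url, content) VALUES (?, ?)",
        (url, content),
    )
    conn.commit()


def cull_targets(conn, result_urls):
    """Return the search result URLs that are not recorded as target pages."""
    kept = []
    for url in result_urls:
        row = conn.execute(
            "SELECT 1 FROM target_pages WHERE url = ?", (url,)
        ).fetchone()
        if row is None:
            kept.append(url)
    return kept
```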

Referring to FIG. 3, shown is a flow diagram of another embodiment of a method for determining a target page in accordance with the present application. As shown in FIG. 3, a flow 300 of a method for determining a target page includes the steps of:

step 301, extracting candidate pages to be detected which meet preset conditions from the page set to be detected based on the domain name of the pages in the page set to be detected, and adding the candidate pages to be detected to the distributed page queue to be detected.

The execution subject of the method for determining the target page may obtain a set of pages to be detected. The set of pages to be detected may be a set of pages collected from the internet, for example, a set of pages crawled by a search engine whose generation time or update time is within a preset time period (for example, the last week). In some optional implementations, the pages in the set of pages to be detected may also be acquired by the execution subject from other electronic devices; for example, problem pages reported by other electronic devices may be received.

Whether the pages in the page set to be detected meet the preset conditions or not can be judged based on the domain names of the pages. If the page meets the preset condition, the page can be added to the distributed candidate page queue to be detected.

The preset condition is a condition for performing preliminary filtering on the set of pages to be detected, and may be a condition set based on the characteristics of the target page desired to be determined. For example, when the target page is a spider pool page, the preset condition may be set according to a feature of similarity between domain names. For another example, the target page may be a virus page, and the preset condition may be set according to a characteristic of the virus page.
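One possible way to turn domain-name similarity into a preset condition is sketched below for illustration only. The use of `difflib.SequenceMatcher` as the similarity measure, the `threshold` and `min_similar` parameters, and the function names are assumptions of this sketch rather than features of the embodiment.

```python
from difflib import SequenceMatcher


def domain_similarity(a, b):
    """Similarity ratio in [0, 1] between two domain name strings."""
    return SequenceMatcher(None, a, b).ratio()


def similar_domain_candidates(domains, threshold=0.8, min_similar=2):
    """Keep domains that resemble at least min_similar other domains in the set."""
    candidates = []
    for i, domain in enumerate(domains):
        similar = sum(
            1
            for j, other in enumerate(domains)
            if i != j and domain_similarity(domain, other) >= threshold
        )
        if similar >= min_similar:
            candidates.append(domain)
    return candidates
```

The pairwise comparison is quadratic in the number of domains, so a large set of pages to be detected would need grouping or indexing before a comparison of this kind is practical.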

The operation of extracting the candidate pages to be detected satisfying the preset condition from the page set to be detected in step 301 of this embodiment is consistent with the operation of extracting the candidate pages to be detected satisfying the preset condition from the page set to be detected in step 201 of the foregoing embodiment, and details are not repeated here.

In this embodiment, the candidate to-be-detected page queue is a distributed candidate to-be-detected page queue. A plurality of candidate to-be-detected page queues may be created in advance, and when it is determined that a page to be detected satisfies the preset condition, the page may be added to one of the candidate to-be-detected page queues. Here, the candidate to-be-detected page queue may be a distributed queue implemented with, for example, Kafka or Redis.

Step 302, executing the operation of searching for the target page on the candidate pages to be detected in the distributed candidate to-be-detected page queue by using a plurality of processes.

The candidate pages to be detected in the distributed candidate to-be-detected page queue can be subjected to category judgment by a plurality of processes. For example, M processes are used to process N candidate to-be-detected page queues, where M and N are positive integers. Each process may perform the operation of searching for the target page on at least one candidate to-be-detected page queue, or multiple processes may perform the operation of searching for the target page on different candidate pages to be detected in the same candidate to-be-detected page queue. The correspondence between a process and the candidate to-be-detected page queues it processes can be determined according to a preset resource scheduling policy, which is not particularly limited herein. Each process handles one candidate page to be detected in one operation of searching for the target page. The detection efficiency of the candidate to-be-detected page queue can thus be improved in an asynchronous multi-process manner, which further increases the speed of searching for target pages.
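Since the resource scheduling policy is left open, the following Python sketch shows only one hypothetical policy: a round-robin assignment of N queues to M processes. The function name and the policy itself are assumptions made for illustration.

```python
def assign_queues(num_processes, num_queues):
    """Map each process index to the queue indices it would consume.

    With fewer processes than queues, a process receives several queues;
    with more processes than queues, several processes share one queue.
    """
    if num_processes <= 0 or num_queues <= 0:
        raise ValueError("M and N must be positive integers")
    assignment = {p: [] for p in range(num_processes)}
    for i in range(max(num_processes, num_queues)):
        assignment[i % num_processes].append(i % num_queues)
    return assignment
```

Each worker process would then repeatedly take one candidate page from one of its assigned queues and run the operation of searching for the target page on it.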

The operation of searching for the target page performed on a candidate page to be detected may include: performing category judgment on the candidate page to be detected in the distributed candidate to-be-detected page queue; determining a candidate page to be detected of the preset category as a target page; extracting the corresponding first-level domain name from the domain name of the target page; crawling associated pages of the target page based on the first-level domain name corresponding to the target page; and, in response to determining that an associated page of the target page is not in the distributed candidate to-be-detected page queue, adding the associated page to the distributed candidate to-be-detected page queue so that the operation of searching for the target page is performed on it.

The operation of searching for the target page in the present embodiment is the same as the operation of searching for the target page in step 202 in the previous embodiment. The specific implementation manner of searching for the target page in step 202 is also applicable to the operation of searching for the target page in step 302 in this embodiment, and details are not described here.

Optionally, in the operation of searching for the target page, after a candidate page to be detected of the preset category is determined as a target page, the extraction of the first-level domain name of the target page and the crawling of the associated pages of the target page based on the first-level domain name may be performed by a plurality of threads. This helps further improve the efficiency of crawling the pages associated with the target page, and thus the speed at which other associated target pages are found from one target page.
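A minimal sketch of multi-threaded crawling, using Python's standard `concurrent.futures.ThreadPoolExecutor`, is given below. The `crawl_one` callable, which stands for the per-target crawling routine, as well as the function name and the `max_workers` default, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_targets_concurrently(target_urls, crawl_one, max_workers=4):
    """Run crawl_one(url) -> iterable of associated URLs for each target in a thread pool."""
    targets = list(target_urls)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results can be zipped back to targets.
        for url, pages in zip(targets, pool.map(crawl_one, targets)):
            results[url] = list(pages)
    return results
```

Threads suit this step because crawling is dominated by network waits rather than computation.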

In the process 300 of the method for determining a target page according to this embodiment, the candidate pages to be detected are added to the distributed candidate to-be-detected page queue, and a plurality of processes respectively perform the operation of searching for the target page on the candidate pages to be detected in the distributed candidate to-be-detected page queue, so that the speed of finding target pages can be effectively increased.

Optionally, the process 300 of the method for determining a target page may further include: storing the target page to a database. In this way, when the search engine crawls pages while providing the search service, a corresponding filtering operation can be performed according to the target pages stored in the database.

With continued reference to FIG. 4, there is shown a schematic architectural diagram of the method for determining a target page shown in FIG. 3. As shown in FIG. 4, the URLs in the set of pages to be detected are first input to a filter. The filter excludes trusted pages by means of a domain name white list and the domain name confusion degree, and extracts the recent incremental pages as candidate pages to be detected by comparison with the cached historical sets of pages to be detected. The candidate pages to be detected are then distributed to the distributed candidate to-be-detected page queue, and a plurality of target page searching processes (Checkers) respectively process the pages in that queue. After a target page searching process has judged the category of a candidate page to be detected, a plurality of crawling threads (finders) crawl the associated pages of each target page found. If a crawled associated page is not yet in the distributed candidate to-be-detected page queue, it is added to the queue, and the search for target pages continues in the manner described above. The target pages found can be synchronized to the database.
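The filter stage of FIG. 4 may be illustrated by the following Python sketch. Measuring the confusion degree as the Shannon entropy of the characters in the host name is an assumption made only for this example, as are the function names, the `confusion_threshold` default, and the choice to apply all three conditions together.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def confusion_degree(domain):
    """Shannon entropy in bits per character; an assumed stand-in for the confusion degree."""
    total = len(domain)
    if total == 0:
        return 0.0
    counts = Counter(domain)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def filter_candidates(urls, whitelist, historical_urls, confusion_threshold=3.8):
    """Keep URLs that are not white-listed, look sufficiently confused, and are new."""
    candidates = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host in whitelist:
            continue  # trusted domain
        if confusion_degree(host) <= confusion_threshold:
            continue  # domain name does not look confused enough
        if url in historical_urls:
            continue  # already present in a historical set, not incremental
        candidates.append(url)
    return candidates
```

Character entropy is a crude measure, and ordinary domain names can score close to randomly generated ones, so any threshold would need tuning on real data.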

Referring to FIG. 5, shown is a flow diagram of yet another embodiment of a method for determining a target page in accordance with the present application. As shown in fig. 5, a flow 500 of a method for determining a target page includes the following steps:

step 501, extracting candidate pages to be detected which meet preset conditions from a page set to be detected based on domain names of the pages in the page set to be detected, and adding the candidate pages to be detected to a candidate page queue to be detected;

step 502, performing an operation of searching for a target page on the candidate page to be detected in the candidate page to be detected queue.

Step 501 and step 502 of this embodiment are respectively the same as step 201 and step 202 of the foregoing embodiment, and specific implementation manners of step 501 and step 502 may refer to descriptions of step 201 and step 202 in the foregoing embodiment, which are not described herein again.

Furthermore, optionally, the candidate to-be-detected page queue may be a distributed candidate to-be-detected page queue. When the operation of searching for the target page in step 502 is executed, the candidate pages to be detected in the distributed candidate to-be-detected page queue may be processed by a plurality of processes, respectively.

Step 503, shielding the searched target page.

In this embodiment, the target page may be a page that obtains a higher ranking by cheating against the page crawling policy and ranking method of a search engine, for example, a spider pool page. The target pages found can be shielded so that they do not affect the search results. Specifically, the search engine may be controlled to skip the target page when crawling pages, or to filter the target page out of the search results.

Optionally, the target page found may be shielded in at least one of the following manners: adding the uniform resource locator of the target page to a shielding page list of a crawler program of a search engine; deleting the index of the target page from an index base of the search engine; and, in response to detecting that the pages in a search result contain the target page, pushing risk prompt information to the client that sent the search request.

With the above shielding manners, the crawler program of the search engine can skip the pages in the shielding page list when crawling pages. After the index of the target page is deleted from the index base of the search engine, the crawler program of the search engine cannot crawl the target page that the deleted index pointed to.

When the crawler program of the search engine has not handled the target page and a search result contains the target page, risk prompt information indicating that the target page is a risky page can be pushed to the client that sent the search request. The client may decide, based on the risk prompt information, whether to present the page to the user. For example, a browser plug-in may filter the content of the page or shield the entire page according to the risk prompt information.
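The three shielding manners described above may be sketched in Python as follows. The in-memory set and dictionary merely stand in for the crawler's shielding page list and the search engine's index base; these data structures, the function names, and the wording of the prompt are assumptions for illustration.

```python
def shield_target_page(url, crawler_blocklist, index_base):
    """Add the URL to the shielding page list and delete its index entry, if any."""
    crawler_blocklist.add(url)
    index_base.pop(url, None)


def crawlable(urls, crawler_blocklist):
    """Return the URLs the crawler may still visit, skipping shielded pages."""
    return [u for u in urls if u not in crawler_blocklist]


def risk_prompts(result_urls, known_targets):
    """Build risk prompt information for search results that are known target pages."""
    return {
        u: "This page may be risky."
        for u in result_urls
        if u in known_targets
    }
```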

Optionally, after step 502 and before step 503, the process 500 of the method for determining a target page may further include: storing the target page to a database. The target page may be recorded by saving the URL of the target page and/or the page content of the target page in the database. In this way, the shielding in step 503 may be performed based on the target pages stored in the database.

According to the method for determining the target page, the target page is shielded, so that the target page can be effectively, quickly and comprehensively blocked from being transmitted through a search engine, and the influence on the search result of a user is reduced.

With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for determining a target page, where the embodiment of the apparatus corresponds to the embodiments of the methods shown in fig. 2, fig. 3, and fig. 5, and the apparatus may be applied to various electronic devices.

As shown in FIG. 6, the apparatus 600 for determining a target page of the present embodiment includes an extraction unit 601 and a searching unit 602. The extraction unit 601 may be configured to extract candidate pages to be detected from the set of pages to be detected based on the domain names of the pages in the set of pages to be detected, and add the candidate pages to be detected to the candidate to-be-detected page queue. The searching unit 602 may be configured to perform an operation of searching for a target page on the candidate pages to be detected in the candidate to-be-detected page queue, where the operation of searching for the target page includes: judging the category of a candidate page to be detected; determining a candidate page to be detected of a preset category as a target page; extracting the corresponding first-level domain name from the domain name of the target page; crawling associated pages of the target page based on the first-level domain name corresponding to the target page; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the candidate to-be-detected page queue so that the operation of searching for the target page is performed on it.

In some embodiments, the searching unit 602 may be further configured to perform category judgment on the candidate pages to be detected as follows: randomly generating at least two sub-domain names based on the domain name of the candidate page to be detected; acquiring the web page contents of the sites corresponding to the at least two sub-domain names, and extracting features of the acquired web page contents; and determining the category of the candidate page to be detected as the preset category in response to determining that the difference between the features of the web page contents of the sites corresponding to the at least two sub-domain names is within a preset difference interval.
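The first step of this category judgment, randomly generating sub-domain names, may be sketched in Python as follows. The label alphabet, the `label_length` default, the optional `rng` argument, and the function name are assumptions of this sketch; fetching the pages and comparing their features are left to the caller.

```python
import random
import string


def random_subdomains(domain, count=2, label_length=8, rng=None):
    """Generate `count` distinct random sub-domain names under `domain`."""
    if count < 2:
        raise ValueError("at least two sub-domain names are required")
    rng = rng or random.Random()
    alphabet = string.ascii_lowercase + string.digits
    names = set()
    while len(names) < count:
        label = "".join(rng.choice(alphabet) for _ in range(label_length))
        names.add(label + "." + domain)
    return sorted(names)
```

Passing a seeded `random.Random` instance as `rng` makes the generated names reproducible, which is convenient for testing.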

In some embodiments, the candidate to-be-detected page queue includes a distributed candidate to-be-detected page queue; and the searching unit 602 may be further configured to: execute the operation of searching for the target page on the candidate pages to be detected in the distributed candidate to-be-detected page queue by using a plurality of processes.

In some embodiments, the apparatus 600 may further include: a storage unit configured to store the target page to a database.

In some embodiments, the apparatus 600 may further include: a shielding unit configured to shield the target page found.

In some embodiments, the shielding unit may be further configured to perform shielding processing on the searched target page according to at least one of the following manners: adding the uniform resource locator of the target page into a shielding page list of a crawler program of a search engine; deleting the index of the target page in an index base of a search engine; and in response to detecting that the page in the search result contains the target page, pushing risk prompt information to the client sending the search request.

It should be understood that the units described in the apparatus 600 correspond to the respective steps in the methods described with reference to FIG. 2, FIG. 3, and FIG. 5. Thus, the operations and features described above for the methods are equally applicable to the apparatus 600 and the units included therein, and are not described in detail here.

The apparatus 600 for determining a target page according to the embodiment of the present application extracts candidate pages to be detected that satisfy preset conditions from the set of pages to be detected based on the domain names of the pages in the set, adds the candidate pages to be detected to the candidate to-be-detected page queue, and performs the following operation of searching for the target page on the candidate pages to be detected in the queue: judging the category of a candidate page to be detected; determining a candidate page to be detected of a preset category as a target page; extracting the corresponding first-level domain name from the domain name of the target page; crawling associated pages of the target page based on the first-level domain name corresponding to the target page; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the candidate to-be-detected page queue. In this way, more pages of the preset category that are associated by domain name can be found with limited search engine resources, and the detection efficiency for pages of the preset category is improved.

Referring now to FIG. 7, a schematic diagram of an electronic device (e.g., the search engine server of FIG. 1) 700 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in FIG. 7, the electronic device 700 may include a processing device (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 708 including, for example, a hard disk; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. 
In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extract candidate pages to be detected meeting preset conditions from the set of pages to be detected based on the domain names of the pages in the set, and add the candidate pages to be detected to a candidate to-be-detected page queue; and execute an operation of searching for a target page on the candidate pages to be detected in the candidate to-be-detected page queue, wherein the operation of searching for the target page comprises: judging the category of a candidate page to be detected; determining a candidate page to be detected of a preset category as a target page; extracting the corresponding first-level domain name from the domain name of the target page; crawling associated pages of the target page based on the first-level domain name corresponding to the target page; and, in response to determining that an associated page of the target page is not in the candidate to-be-detected page queue, adding the associated page to the candidate to-be-detected page queue so that the operation of searching for the target page is performed on it.

Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an extraction unit and a searching unit. The names of these units do not limit the units themselves; for example, the extraction unit may also be described as a unit that extracts candidate pages to be detected that satisfy preset conditions from the set of pages to be detected based on the domain names of the pages in the set, and adds the candidate pages to be detected to the candidate to-be-detected page queue.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method for determining a target page, comprising:

extracting candidate pages to be detected meeting preset conditions from the page set to be detected based on the domain name of the pages in the page set to be detected, and adding the candidate pages to be detected to a candidate page queue;

executing an operation of searching a target page for the candidate to-be-detected page in the candidate to-be-detected page queue, wherein the operation of searching the target page comprises the following steps:

performing category judgment on the candidate pages to be detected, determining the candidate pages to be detected in a preset category as target pages, extracting corresponding primary domain names from the domain names of the target pages, crawling associated pages of the target pages based on the primary domain names corresponding to the target pages, and adding the associated pages of the target pages to the candidate pages to be detected queue in response to determining that the associated pages of the target pages are not in the candidate pages to be detected queue so as to execute the operation of searching the target pages for the associated pages of the target pages;

the method further comprising: setting the preset condition based on a similarity feature between the domain name of one page in the set of pages to be detected and the domain names of other pages in the set of pages to be detected; wherein a spider pool page has the characteristic of interlinking with a large number of pages, and the domain names of the interlinked pages are similar.

2. The method of claim 1, wherein the preset conditions further comprise at least one of:

the domain name of the page is not in a preset domain name white list;

the confusion degree of the domain name of the page is larger than a preset confusion degree threshold value;

and the page does not belong to the intersection of the page set to be detected and the historical page set to be detected, and the collection time of the historical page set to be detected is earlier than that of the page set to be detected.

3. The method according to claim 1, wherein the determining the category of the candidate page to be detected includes:

randomly generating at least two sub-domain names based on the domain names of the candidate pages to be detected;

acquiring the webpage contents of the sites corresponding to the at least two sub domain names, and extracting the characteristics of the acquired webpage contents;

and determining the category of the candidate page to be detected as a preset category in response to determining that the difference between the characteristics of the web page contents of the sites corresponding to the at least two sub domain names is within a preset difference interval.

4. The method of claim 1, wherein the queue of pages to be detected comprises a distributed queue of pages to be detected; and

the operation of searching for the target page is executed on the candidate page to be detected in the candidate page to be detected queue, and the operation comprises the following steps:

and respectively executing the operation of searching the target page on the candidate page to be detected in the distributed candidate page to be detected queue by adopting a plurality of processes.

5. The method of claim 1, wherein the method further comprises:

storing the target page in a database.

6. The method of any of claims 1-5, wherein the method further comprises:

shielding the target page that has been found.

7. The method of claim 6, wherein shielding the target page that has been found comprises at least one of the following:

adding the uniform resource locator of the target page into a shielding page list of a crawler program of a search engine;

deleting the index of the target page in an index database of a search engine;

and in response to detecting that the pages in a search result include the target page, pushing risk prompt information to the client that sent the search request.
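
The three shielding options of claim 7 can be illustrated with in-memory stand-ins. The class, its attributes, and the wording of the prompt are hypothetical; a real search engine would act on its own crawler configuration and index database.

```python
class Shield:
    """Toy model of the three shielding options (all names are assumptions)."""

    def __init__(self, index=None):
        self.crawler_blocklist = set()   # URLs the crawler must skip
        self.index = dict(index or {})   # URL -> indexed document

    def block_for_crawler(self, url):
        """Add the target page's URL to the crawler's shielded page list."""
        self.crawler_blocklist.add(url)

    def delete_from_index(self, url):
        """Remove the target page's entry from the index, if present."""
        self.index.pop(url, None)

    def risk_prompt(self, result_urls, target_urls):
        """Return a warning if any search result is a known target page."""
        flagged = [u for u in result_urls if u in target_urls]
        if not flagged:
            return None
        return "Warning: %d result(s) may be unsafe" % len(flagged)
```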

8. An apparatus for determining a target page, comprising:

the extraction unit is configured to extract candidate pages to be detected meeting preset conditions from the page set to be detected based on domain names of the pages in the page set to be detected, and add the candidate pages to be detected to a candidate page queue;

a searching unit configured to perform an operation of searching for a target page on the candidate pages to be detected in the candidate page queue, wherein the operation of searching for the target page includes:

a preset condition setting unit configured to set the preset condition based on a similarity characteristic of the domain names between one page in the set of pages to be detected and the other pages in the set of pages to be detected; wherein a spider pool page is characterized by interlinking a large number of pages, and the domain names of the interlinked pages are similar to one another.

9. The apparatus of claim 8, wherein the preset conditions further comprise at least one of:

the domain name of the page is not in a preset domain name white list;

10. The apparatus according to claim 8, wherein the searching unit is further configured to determine the category of the candidate page to be detected as follows:

11. The apparatus of claim 8, wherein the candidate page queue comprises a distributed candidate page queue; and

the searching unit is further configured to:

12. The apparatus of claim 8, wherein the apparatus further comprises:

a storage unit configured to store the target page in a database.

13. The apparatus of any one of claims 8-12, wherein the apparatus further comprises:

a shielding unit configured to shield the target page that has been found.

14. The apparatus of claim 13, wherein the shielding unit is further configured to shield the target page that has been found in at least one of the following manners:

deleting the index of the target page in an index database of a search engine;

15. An electronic device, comprising:

one or more processors;

a storage device storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.