CN113742625A - Page data processing method, device, equipment and medium - Google Patents

Page data processing method, device, equipment and medium

Info

Publication number
CN113742625A
Authority
CN
China
Prior art keywords
page
pages
sprocket
quality information
spider
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111043631.8A
Other languages
Chinese (zh)
Inventor
刘伟
陈由之
张博
林赛群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111043631.8A
Publication of CN113742625A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a page data processing method, apparatus, device, and medium, relating to the field of computers and, in particular, to computer network technology, search engine technology, and software application technology. The method comprises: determining a sprocket having a flow guiding relationship, wherein the sprocket comprises a plurality of first pages and, starting from any one of the first pages, traffic can be guided through the plurality of first pages in sequence according to the flow guiding relationship and then back to that first page; determining at least one second page different from the plurality of first pages, together with page quality information in one-to-one correspondence with the at least one second page, wherein each second page receives guided traffic from at least some of the first pages; determining sprocket quality information of the sprocket based on the page quality information of each second page; and, in response to determining that the sprocket quality information meets a first preset condition, deleting the at least one second page and performing first processing on each first page that guides traffic to any of the second pages.

Description

Page data processing method, device, equipment and medium
Technical Field
The present disclosure relates to the field of computers, in particular to computer network technology, search engine technology, and software application technology, and more specifically to a page data processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
A search engine crawls a large number of web pages, filters them, and records the filtered pages in an index library. After a user sends a query request to the search engine, the search engine selects the relevant pages according to the request, ranks them by various means, and displays all or some of the relevant pages to the user based on the ranking result.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The disclosure provides a page data processing method, a page data processing device, an electronic device, a computer readable storage medium and a computer program product.
According to an aspect of the present disclosure, a page data processing method is provided. The page data processing method comprises: determining a sprocket having a flow guiding relationship, wherein the sprocket comprises a plurality of first pages and, for each first page in the plurality of first pages, traffic starting from that first page can be guided through the plurality of first pages in sequence according to the flow guiding relationship and then back to that first page; determining at least one second page different from the plurality of first pages, together with page quality information in one-to-one correspondence with the at least one second page, wherein each second page in the at least one second page receives guided traffic from at least some of the plurality of first pages; determining sprocket quality information of the sprocket based on the respective page quality information of the at least one second page; and, in response to determining that the sprocket quality information meets a first preset condition, deleting the at least one second page and performing first processing on each first page in the plurality of first pages that guides traffic to any one of the at least one second page.
According to another aspect of the present disclosure, a page data processing apparatus is provided. The page data processing apparatus includes: a first determination unit configured to determine a sprocket having a flow guiding relationship, the sprocket comprising a plurality of first pages, wherein, for each of the plurality of first pages, traffic starting from that first page can be guided through the plurality of first pages in sequence according to the flow guiding relationship and then back to that first page; a second determination unit configured to determine at least one second page different from the plurality of first pages, together with page quality information in one-to-one correspondence with the at least one second page, wherein each of the at least one second page receives guided traffic from at least some of the plurality of first pages; a third determination unit configured to determine sprocket quality information of the sprocket based on the respective page quality information of the at least one second page; and a first processing unit configured to, in response to determining that the sprocket quality information satisfies a first preset condition, delete the at least one second page and perform first processing on each first page in the plurality of first pages that guides traffic to any one of the at least one second page.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the page data processing method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the above page data processing method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the above-mentioned page data processing method when being executed by a processor.
According to one or more embodiments of the disclosure, a sprocket having a flow guiding relationship is identified, and quality information is determined for each target page (second page) to which some of the pages in the sprocket jointly guide traffic. The quality information of the sprocket is then determined based on the quality information of all the target pages. When the quality of the sprocket meets a preset condition, the target pages are deleted, and the pages in the sprocket that guide traffic to any target page are processed according to the quality information of the sprocket. Sprockets are thereby identified and evaluated, and processing poor-quality sprockets improves the overall quality of the large number of pages acquired by a crawler, reduces the number of low-quality pages in the database, and improves the user's experience of the search engine. In addition, because the quality information of all target pages is taken into account when evaluating the sprocket, pages in the sprocket that should not be processed are not processed merely because the quality of one target page was misjudged or some other anomaly occurred, which further improves the accuracy and robustness of the page data processing method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a flowchart of a page data processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 shows a schematic view of a sprocket according to an exemplary embodiment of the present disclosure;
FIG. 3 shows a flowchart of a page data processing method according to an exemplary embodiment of the present disclosure;
FIG. 4 shows a flowchart of a page data processing method according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a spider pool according to an exemplary embodiment of the present disclosure;
FIG. 6 shows a block diagram of the structure of a page data processing apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a block diagram of the structure of a page data processing apparatus according to an exemplary embodiment of the present disclosure; and
FIG. 8 shows a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the related art, some existing page data processing methods, after processing a low-quality page node or the website on which it is located, also process other page nodes or websites that guide traffic to that page node. However, these methods handle the guiding nodes in only a single way and are not very effective.
To solve the above problems, the present disclosure identifies a sprocket having a flow guiding relationship and determines quality information for each target page (second page) to which some of the pages in the sprocket jointly guide traffic. The quality information of the sprocket is then determined based on the quality information of all the target pages. When the quality of the sprocket meets a preset condition, the target pages are deleted, and the pages in the sprocket that guide traffic to any target page are processed according to the quality information of the sprocket. Sprockets are thereby identified and evaluated, and processing poor-quality sprockets improves the overall quality of the large number of pages acquired by a crawler, reduces the number of low-quality pages in the database, and improves the user's experience of the search engine. In addition, because the quality information of all target pages is taken into account when evaluating the sprocket, pages in the sprocket that should not be processed are not processed merely because the quality of one target page was misjudged or some other anomaly occurred, which further improves the accuracy and robustness of the page data processing method.
According to an aspect of the present disclosure, a page data processing method is provided. As shown in FIG. 1, the page data processing method includes: step S101, determining a sprocket having a flow guiding relationship, wherein the sprocket comprises a plurality of first pages and, for each first page in the plurality of first pages, traffic starting from that first page can be guided through the plurality of first pages in sequence according to the flow guiding relationship and then back to that first page; step S102, determining at least one second page different from the plurality of first pages, together with page quality information in one-to-one correspondence with the at least one second page, wherein each second page in the at least one second page receives guided traffic from at least some of the plurality of first pages; step S103, determining sprocket quality information of the sprocket based on the respective page quality information of the at least one second page; and step S104, in response to determining that the sprocket quality information meets a first preset condition, deleting the at least one second page and performing first processing on each first page in the plurality of first pages that guides traffic to any one of the at least one second page.
Thus, a sprocket having a flow guiding relationship is identified, and quality information is determined for each target page (second page) to which some of the pages in the sprocket jointly guide traffic. The quality information of the sprocket is determined based on the quality information of all the target pages. When the quality of the sprocket meets the preset condition, the target pages are deleted, and the pages in the sprocket that guide traffic to any target page are processed according to the quality information of the sprocket. Sprockets are thereby identified and evaluated, and processing poor-quality sprockets improves the overall quality of the large number of pages acquired by the crawler, reduces the number of low-quality pages in the database, and improves the user's experience of the search engine. In addition, because the quality information of all target pages is taken into account when evaluating the sprocket, pages in the sprocket that should not be processed are not processed merely because the quality of one target page was misjudged or some other anomaly occurred, which further improves the accuracy and robustness of the page data processing method.
According to some embodiments, the first page and the second page may be, for example, web pages that are crawled by a search engine. The page data processing method described in the present disclosure may be applied to page data that is not stored in the index repository after being captured, and may also be applied to page data that already exists in the index repository, which is not limited herein.
According to some embodiments, a web page may contain links that guide traffic to other web pages, so that when a search engine crawls a page it can follow the links on that page to crawl and process the linked pages. The web pages, together with the link and flow guiding relationships between them, form a page relationship network. As the number of pages covered increases, the page relationship network becomes more complex.
According to some embodiments, any two of the plurality of first pages may belong to different websites, and any one of the plurality of first pages and any one of the at least one second page may belong to different websites. Because links between web pages within the same website contribute little to guiding traffic, only external links between websites need to be mined.
According to some embodiments, step S101 of determining the sprocket having the flow guiding relationship may be, for example, determining the sprocket in a page relationship network. The sprocket may, for example, be formed from a plurality of first pages having a ring-shaped flow guiding structure. For each of the plurality of first pages, traffic starting from that first page can be guided through each of the plurality of first pages in sequence and finally back to that first page. In an exemplary embodiment, as shown in FIG. 2, first pages 202, 204, 206, 208, 210, and 212 constitute a sprocket, in which 202 guides traffic to 204, 204 to 206, 206 to 208, 208 to 210, 210 to 212, and 212 to 202, so that traffic starting from any first page returns to that first page after being guided clockwise six times. Each of the first pages 202 and 212 guides traffic to the second page 214.
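As an illustration only, the ring structure described above can be treated as a cycle in a directed graph of pages. The following is a minimal sketch, not the method claimed in this disclosure: the function name `find_cycle`, the dictionary representation of the page relationship network, and the depth-first search are all assumptions introduced here, and the page labels simply reuse the reference numbers of FIG. 2.

```python
from typing import Dict, List, Optional


def find_cycle(links: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return one simple cycle start -> ... -> start, or None if there is none.

    `links` maps each page to the pages it guides traffic to (hypothetical
    representation of the page relationship network).
    """
    stack = [(start, [start])]
    while stack:
        page, path = stack.pop()
        for target in links.get(page, []):
            if target == start and len(path) > 1:
                return path  # traffic has flowed back to the starting page
            if target not in path:
                stack.append((target, path + [target]))
    return None


# Page relationship network of FIG. 2 (214 is the second page).
links = {
    "202": ["204", "214"],
    "204": ["206"],
    "206": ["208"],
    "208": ["210"],
    "210": ["212"],
    "212": ["202", "214"],
    "214": [],
}
wheel = find_cycle(links, "202")
```

This exhaustive search is only suitable for small graphs; a production system would likely need a scalable cycle-mining approach, which the source does not specify.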
According to some embodiments, each second page determined in step S102 may be a page to which all the first pages in the sprocket guide traffic, or a page to which at least some of the first pages in the sprocket guide traffic, which is not limited herein. Likewise, the disclosure does not limit the number of second pages.
According to some embodiments, the dimensions used to determine the page quality information of the second page may include, for example, whether there is cheating behavior, the richness of the content, the browsing experience of the user, the authority of the page, and the like. In some embodiments, the page quality information of the second page may be determined from a single dimension. In one exemplary embodiment, in response to determining that a second page exhibits cheating behavior, its page quality information is directly set to "low quality". In another exemplary embodiment, in response to determining that the authority of a second page is high, its page quality information is directly set to "high quality". In other embodiments, the quality of the page may be determined comprehensively from multiple dimensions, which is not limited herein. It is understood that the present disclosure does not limit the specific manner of representing the page quality information; for example, it may be represented by a page quality score, by one of several quality grades, or in other ways.
According to some embodiments, step S103 of determining sprocket quality information of the sprocket based on the respective page quality information of the at least one second page may include, for example: determining the sprocket quality information based on the respective page quality scores of the at least one second page. In some embodiments, the sprocket quality information may be determined based on at least one of the average, weighted average, maximum, minimum, and median of the respective page quality scores of the at least one second page. In one exemplary embodiment, the average of the respective page quality scores of the at least one second page may be used directly as the sprocket quality information. Specifically, letting the page quality score of a second page be single_target_quality, the sprocket quality information can be calculated by the following formula:
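One possible way to collect the second pages of a sprocket is sketched below, under the assumption that the page relationship network is available as a dictionary of outgoing links. The function name `second_pages` and the data layout are hypothetical and are not taken from the disclosure; the example reuses the page numbers of FIG. 2.

```python
from typing import Dict, List


def second_pages(links: Dict[str, List[str]], wheel: List[str]) -> Dict[str, List[str]]:
    """Map each page outside the sprocket to the first pages that guide traffic to it."""
    wheel_set = set(wheel)
    guides: Dict[str, List[str]] = {}
    for page in wheel:
        for target in links.get(page, []):
            if target not in wheel_set:  # a second page differs from every first page
                guides.setdefault(target, []).append(page)
    return guides


links = {
    "202": ["204", "214"],
    "204": ["206"],
    "206": ["208"],
    "208": ["210"],
    "210": ["212"],
    "212": ["202", "214"],
}
wheel = ["202", "204", "206", "208", "210", "212"]
guides = second_pages(links, wheel)
```

A second page is kept here as soon as one first page guides traffic to it; a stricter rule (for example, requiring a minimum number of guiding first pages) would be an equally valid reading of "at least some".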
target_quality = (∑ single_target_quality) / n
where n is the number of second pages.
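The averaging formula above can be sketched as follows. The function name `sprocket_quality` is a hypothetical helper introduced for illustration, and the sample scores are made up; the source does not fix a score range.

```python
from typing import Sequence


def sprocket_quality(single_target_qualities: Sequence[float]) -> float:
    """target_quality = sum(single_target_quality) / n, with n the number of second pages."""
    n = len(single_target_qualities)
    if n == 0:
        raise ValueError("a sprocket needs at least one second page to be scored")
    return sum(single_target_qualities) / n


# Illustrative scores for three second pages (assumed values).
target_quality = sprocket_quality([0.2, 0.4, 0.6])
```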
It is understood that the skilled person can set rules to determine the sprocket quality information based on the respective page quality information of at least one second page, so that the sprocket quality information can comprehensively represent the quality of the second pages, and the method is not limited herein.
According to some embodiments, in step S104, in response to determining that the sprocket quality information satisfies the first preset condition, the at least one second page may be deleted, and the first processing may be performed on each first page in the plurality of first pages that guides traffic to any one of the at least one second page. When the sprocket quality is poor, all the second pages can be deleted and the first processing can be performed on every first page that guides traffic to these second pages. This prevents the crawler from fetching further low-quality pages through first pages that guide traffic to low-quality pages, and improves the overall quality of the pages obtained by the crawler.
It is understood that the first preset condition can be set by a person skilled in the art as required to accurately identify low-quality sprockets, and is not limited herein.
According to some embodiments, the first processing may comprise, for example, at least one of suppression, weight reduction, and page deletion. In one exemplary embodiment, the first processing may include deleting each first page in the plurality of first pages that guides traffic to any one of the at least one second page and does not belong to a white list. Deleting the first pages that guide traffic to a second page and do not belong to the white list cuts off part of the flow guiding relationships between the first pages on a poor-quality sprocket, so that the crawler does not fetch the problematic content to which nodes on the poor-quality sprocket guide traffic, which further improves the overall quality of the pages obtained by the crawler.
According to some embodiments, deleting the first page may mean, for example, deleting the first page from the sprocket, deleting it from every sprocket that includes it, or removing it from the crawled pages, which is not limited herein.
According to some embodiments, the white list may comprise at least one of: user original content pages, official website pages, and large website pages. In some embodiments, a malicious party may add a link to a bad page to a user original content page, for example through a reply. If such a page were deleted directly, some good user original content pages would be unreasonably suppressed. This type of user original content page may therefore be added to the white list so that it is not processed. Official website pages, large website pages, and pages of other websites of assured quality may have similar problems, so pages on these websites may also be added to the white list. It will be appreciated that a person skilled in the art may define the white list as needed to avoid processing trusted pages that have been maliciously attacked.
According to some embodiments, as shown in FIG. 3, the page data processing method may further include: step S305, in response to determining that the sprocket quality information meets a second preset condition, performing the first processing on each of the plurality of first pages. Steps S301 to S304 in FIG. 3 are similar to steps S101 to S104 in FIG. 1 and are not described again here. Thus, when the quality of the sprocket is seriously poor, every first page in the sprocket can be processed, so that further crawling of other pages to which the first pages in the sprocket guide traffic is avoided or reduced. It is understood that the second preset condition can be set by a person skilled in the art as required to accurately identify sprockets of this second, more severe degree of low quality, which is not limited herein.
In the above embodiments, the sprocket is identified and its quality information is determined in order to process the plurality of first pages it includes, which improves the quality of the web pages crawled by the search engine at the page scale. The method can be extended to the website scale so as to improve the quality of the crawled web pages at the website scale as well.
According to some embodiments, as shown in FIG. 4, the page data processing method may further include: step S405, determining a spider pool, wherein the spider pool comprises at least one sprocket having a flow guiding relationship and the websites to which the plurality of first pages included in each sprocket belong; step S406, determining spider pool quality information of the spider pool based at least on the respective sprocket quality information of the at least one sprocket; and step S407, in response to determining that the spider pool quality information meets a third preset condition, performing second processing on at least one website corresponding to the at least one sprocket. Thus, the spider pool is determined from the websites on which the pages of each sprocket are located, the overall score of the spider pool is calculated from the scores of all the sprockets in the spider pool, and whether the websites need to be processed is decided according to the score of the spider pool. Malicious traffic guiding is thereby monitored and evaluated at the website scale, and processing poor-quality spider pools further improves the overall quality of the pages and websites obtained by the crawler, reduces the number of low-quality pages and websites in the database, and improves the user's experience of the search engine.
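The exemplary deletion rule with a white list can be sketched as below. This is a minimal illustration rather than the disclosed implementation: the function name `pages_to_delete`, the `guides` mapping (second page to the first pages that guide traffic to it), and the sample white list are all assumptions.

```python
from typing import Dict, List, Set


def pages_to_delete(guides: Dict[str, List[str]], whitelist: Set[str]) -> Set[str]:
    """Select first pages that guide traffic to any second page and are not whitelisted."""
    doomed: Set[str] = set()
    for sources in guides.values():
        doomed.update(page for page in sources if page not in whitelist)
    return doomed


# Pages 202 and 212 both guide traffic to second page 214; 212 is assumed to be
# a trusted page (for example, a user original content page) on the white list.
guides = {"214": ["202", "212"]}
whitelist = {"212"}
to_delete = pages_to_delete(guides, whitelist)
```

Keeping the white list as a separate input means trusted pages can be protected without changing how sprockets are scored.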
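The two preset conditions can be pictured as two tiers of severity. The sketch below assumes, purely for illustration, that quality is a numeric score where lower means worse, that each condition is a simple threshold, and that the second threshold is stricter than the first; none of these choices, nor the names `plan_processing`, `first_threshold`, and `second_threshold`, come from the disclosure.

```python
from typing import Dict, List, Set


def plan_processing(
    quality: float,
    wheel: List[str],
    guides: Dict[str, List[str]],
    first_threshold: float = 0.5,   # assumed first preset condition: quality < 0.5
    second_threshold: float = 0.2,  # assumed second preset condition: quality < 0.2
) -> Dict[str, Set[str]]:
    """Decide which second pages to delete and which first pages to process."""
    plan: Dict[str, Set[str]] = {"delete_second_pages": set(), "process_first_pages": set()}
    if quality < first_threshold:
        # Step S104: delete the second pages, process the first pages guiding to them.
        plan["delete_second_pages"] = set(guides)
        for sources in guides.values():
            plan["process_first_pages"].update(sources)
    if quality < second_threshold:
        # Step S305: process every first page in the sprocket.
        plan["process_first_pages"].update(wheel)
    return plan


wheel = ["202", "204", "206"]
guides = {"214": ["202"]}
mild = plan_processing(0.4, wheel, guides)
severe = plan_processing(0.1, wheel, guides)
```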
It is understood that, a person skilled in the art may set the third preset condition according to needs to achieve accurate identification of the low-quality spider pool, which is not limited herein.
According to some embodiments, a spider pool may be, for example, a collection of websites and the web pages they include. A spider pool may include one sprocket, more than one sprocket, or no sprocket; the present disclosure concerns only spider pools that include sprockets. As shown in FIG. 5, the first pages 502, 504, 506, 508, 510, and 512 constitute a sprocket, and each of the first pages 502 and 512 guides traffic to a second page 514. The first pages 502-512 correspond one to one to the websites 516-526, and the websites 516-526 form a spider pool that includes the sprocket formed by the first pages 502-512.
According to some embodiments, step S406 of determining spider pool quality information of the spider pool based at least on the respective sprocket quality information of the at least one sprocket may include, for example: determining the sprocket quality information of each of the at least one sprocket based on the plurality of pages included in that sprocket; and determining the spider pool quality information based on the respective sprocket quality information of the at least one sprocket. In some embodiments, the sprocket quality information of each sprocket can be determined using the methods described above, which are not described again here.
According to some embodiments, determining spider pool quality information based on the respective sprocket quality information of the at least one sprocket may be, for example, determining the spider pool quality information based on at least one of the highest, lowest, average, median, and weighted sprocket quality information. In one exemplary embodiment, the average of the respective sprocket quality scores of the at least one sprocket may be used as the spider pool quality information. It is understood that a person skilled in the art can define their own rules for determining spider pool quality information from the respective sprocket quality information of the at least one sprocket, so that the spider pool quality information comprehensively represents the quality of the sprockets, which is not limited herein.
According to some embodiments, as shown in FIG. 4, the page data processing method may further include: step S408, determining a first scale parameter for each of the at least one sprocket. The first scale parameter may be positively correlated with the number of pages included in the corresponding sprocket. Step S406 of determining spider pool quality information of the spider pool may include, for example: determining the spider pool quality information based at least on the respective sprocket quality information and first scale parameter of the at least one sprocket. Because a larger sprocket does more harm, taking the sprocket scale into account when calculating the spider pool quality information yields more reasonable spider pool quality information.
According to some embodiments, determining spider pool quality information of the spider pool based at least on the respective sprocket quality information and first scale parameter of the at least one sprocket may comprise: multiplying the respective sprocket quality information of the at least one sprocket by the respective first scale parameter to obtain respective weighted quality information of the at least one sprocket; and determining the spider pool quality information based on the respective weighted quality information of the at least one sprocket. In some embodiments, the respective sprocket quality score of the at least one sprocket may be multiplied by the respective first scale parameter to obtain a respective weighted quality score of the at least one sprocket. Specifically, letting the sprocket quality information be wheel_quality, the weighted quality score can be calculated by the following formula:
weighted_quality = w1 * wheel_quality
where w1 is the first scale parameter. It is understood that the first scale parameter can be set by a person skilled in the art so that the weighted quality score is positively correlated with the scale of the corresponding sprocket, which is not limited herein.
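The relationship between sprockets and the websites of a spider pool can be sketched as follows, assuming pages are identified by URL and a website is identified by the URL's host name. The function name `spider_pool_sites` and the example URLs are hypothetical; the disclosure does not specify how a page is mapped to its website.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def spider_pool_sites(wheels: Iterable[List[str]]) -> Set[str]:
    """Collect the websites (host names) to which the first pages of each sprocket belong."""
    return {urlsplit(url).hostname for wheel in wheels for url in wheel}


wheels = [
    ["https://a.example/p1", "https://b.example/p2"],
    ["https://b.example/p3", "https://c.example/p4"],
]
sites = spider_pool_sites(wheels)
```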
In some embodiments, spider pool quality information may be determined based on at least one of an average, a weighted average, a maximum, a minimum, a median of the respective weighted quality scores of the at least one sprocket. In one exemplary embodiment, the average of the respective weighted quality scores of at least one sprocket may be directly used as spider pool quality information.
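The weighting step and the direct averaging described above can be sketched as follows. The disclosure only requires that the first scale parameter be positively correlated with the number of pages in the sprocket; the logarithmic form used in `first_scale_parameter` is an assumption chosen for illustration, as are all function names.

```python
import math
from statistics import mean


def first_scale_parameter(num_pages: int) -> float:
    # Hypothetical choice of w1: log(1 + n) grows with the page count,
    # which satisfies the positive-correlation requirement.
    return math.log1p(num_pages)


def weighted_quality(wheel_quality: float, num_pages: int) -> float:
    # weighted_quality = w1 * wheel_quality
    return first_scale_parameter(num_pages) * wheel_quality


def pool_quality_from_weighted(sprockets) -> float:
    """Average the weighted scores of (wheel_quality, num_pages) pairs."""
    return mean(weighted_quality(q, n) for q, n in sprockets)
```

A logarithm dampens the influence of very large sprockets; a linear or capped parameter would also meet the stated requirement.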
According to some embodiments, as shown in fig. 4, the page processing method may further include: step S409, determining a second scale parameter of the spider pool. The second scale parameter is positively correlated with the number of the at least one sprocket. Step S406, determining spider pool quality information of the spider pool, may include, for example: determining the spider pool quality information of the spider pool based on the respective sprocket quality information and first scale parameter of the at least one sprocket and on the second scale parameter of the spider pool. Because a larger spider pool causes greater harm, also taking the scale of the spider pool into account when calculating the spider pool quality information yields more reasonable spider pool quality information.
According to some embodiments, determining spider pool quality information for a spider pool based on the respective sprocket quality information and the first scale parameter for the at least one sprocket and the second scale parameter for the spider pool may include: multiplying the respective sprocket quality information of the at least one sprocket by the respective first scale parameter to obtain respective weighted quality information of the at least one sprocket; and determining spider pool quality information based on the respective weighted quality information of the at least one sprocket and the second scale parameter. In some embodiments, the average of the respective weighted quality scores of the at least one sprocket may be multiplied by the second scale parameter to obtain the spider pool quality information. Specifically, the spider pool quality information may be calculated by the following formula:
pool_quality = w2 * (∑ weighted_quality) / m
where w2 is the second scale parameter, i.e., the spider pool scale parameter, and m is the number of sprockets. It is understood that a person skilled in the art can set the second scale parameter as needed so that the spider pool quality information is positively correlated with the spider pool scale; no limitation is imposed herein.
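The formula above can be sketched in code as follows. The form of `second_scale_parameter` is an assumption: the disclosure only requires that w2 be positively correlated with the number of sprockets, and the `1 + log(m)` choice here is purely illustrative.

```python
import math


def second_scale_parameter(num_sprockets: int) -> float:
    # Hypothetical choice of w2: equals 1 for a single sprocket and
    # increases with the number of sprockets in the spider pool.
    return 1.0 + math.log(num_sprockets)


def pool_quality(weighted_qualities) -> float:
    """pool_quality = w2 * sum(weighted_quality) / m."""
    m = len(weighted_qualities)
    if m == 0:
        raise ValueError("a spider pool contains at least one sprocket")
    w2 = second_scale_parameter(m)
    return w2 * sum(weighted_qualities) / m
```

With this choice, two pools whose sprockets have the same average weighted score are distinguished by size: the pool with more sprockets receives the larger value.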
According to some embodiments, the second processing of the at least one website may include performing at least one of the following for each of the at least one website: deleting the website, adjusting the weight of the website, and masking the website. In this way, malicious websites in low-quality spider pools are processed, which improves the overall quality of the pages and websites acquired by the crawler, reduces the number of low-quality pages and websites in the database, and improves the user experience of the search engine.
According to some embodiments, a whitelist may be set up for websites in the spider pool so that certain websites whose quality is assured are exempted from processing. In some embodiments, the whitelist may include, for example, at least one of: user original content websites, official websites, and large websites.
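The second processing with a whitelist exemption might be planned as in the following sketch. The `SiteAction` enum, the function name, and the default action are hypothetical; the disclosure names the three actions but does not prescribe how one is selected.

```python
from enum import Enum


class SiteAction(Enum):
    DELETE = "delete"                # delete the website
    ADJUST_WEIGHT = "adjust_weight"  # adjust the weight of the website
    MASK = "mask"                    # mask the website


def plan_second_processing(sites, whitelist, action=SiteAction.MASK):
    """Map every non-whitelisted website of the pool to the chosen action."""
    return {site: action for site in sites if site not in whitelist}
```

For example, `plan_second_processing(["a.example", "gov.example"], {"gov.example"})` yields an entry for `a.example` only, leaving the whitelisted site untouched.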
According to some embodiments, the quality of the web pages crawled by the search engine can also be improved at the main domain level, which is above the website level. In one exemplary embodiment, in response to determining that the number of malicious websites included in a target main domain exceeds a preset threshold, the target main domain may be processed accordingly. It is understood that a person skilled in the art can set their own discrimination dimensions and criteria for malicious main domains, as well as the corresponding processing, so as to further improve the overall quality of the pages and websites acquired by the crawler; no limitation is imposed herein.
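The threshold check at the main domain level can be sketched as follows. The helper `main_domain` takes the last two labels of a host name, which is a deliberate simplification assumed for this example; a real system would need a public suffix list to find the registrable domain. All names are hypothetical.

```python
from collections import Counter


def main_domain(host: str) -> str:
    # Naive assumption: the main domain is the last two labels of the host.
    return ".".join(host.lower().split(".")[-2:])


def flag_main_domains(malicious_sites, threshold: int):
    """Return main domains whose count of malicious websites exceeds threshold."""
    counts = Counter(main_domain(h) for h in set(malicious_sites))
    return {domain for domain, n in counts.items() if n > threshold}
```

Deduplicating with `set` ensures that a website reported several times is counted once toward its main domain.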
According to another aspect of the disclosure, a page data processing apparatus is also provided. As shown in fig. 6, the page data processing apparatus 600 includes: a first determining unit 610 configured to determine a sprocket having a flow guiding relationship, the sprocket comprising a plurality of first pages, wherein each first page of the plurality of first pages can, according to the flow guiding relationship, pass sequentially through the plurality of first pages and be guided back to itself; a second determining unit 620 configured to determine at least one second page different from the plurality of first pages and page quality information in one-to-one correspondence with the at least one second page, wherein each of the at least one second page is guided to by at least a portion of the corresponding first pages among the plurality of first pages; a third determining unit 630 configured to determine sprocket quality information of the sprocket based on the respective page quality information of the at least one second page; and a first processing unit 640 configured to, in response to determining that the sprocket quality information satisfies a first preset condition, delete the at least one second page and perform first processing on each first page, among the plurality of first pages, that guides flow to any one of the at least one second page.
The operations of the units 610-640 of the page data processing apparatus 600 are similar to the operations of the steps S101-S104 of the page data processing method, and are not described herein again.
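The cooperation of units 610 to 640 can be sketched as follows, under stated assumptions: the flow guiding relationship is modeled as a dictionary `links` from a page to the set of pages it guides to, the sprocket quality is taken as the mean of the second-page quality scores, and the first preset condition is modeled as that mean falling below a threshold. None of these specifics is fixed by the text above; they are illustrative only.

```python
from statistics import mean


def is_sprocket(cycle, links) -> bool:
    """True if each page guides to the next one and the last guides to the first."""
    n = len(cycle)
    return n > 1 and all(
        cycle[(i + 1) % n] in links.get(cycle[i], ()) for i in range(n)
    )


def second_pages(cycle, links):
    """Pages outside the sprocket that some first page guides flow to."""
    members = set(cycle)
    return {t for p in cycle for t in links.get(p, ()) if t not in members}


def process_sprocket(cycle, links, page_quality, threshold):
    """Return (second pages to delete, first pages needing first processing)."""
    if not is_sprocket(cycle, links):
        raise ValueError("pages do not form a closed flow guiding loop")
    targets = second_pages(cycle, links)
    if not targets:
        return set(), set()
    # Assumed aggregation: mean quality of the second pages.
    quality = mean(page_quality[t] for t in targets)
    # Assumed first preset condition: quality below the threshold.
    if quality >= threshold:
        return set(), set()
    feeders = {p for p in cycle if targets & set(links.get(p, ()))}
    return targets, feeders
```

Only the first pages that actually guide flow to a second page are returned for first processing, which matches the behavior described for the first processing unit 640.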
According to some embodiments, the first processing unit 640 may be further configured to: delete each first page of the plurality of first pages that guides flow to any one of the at least one second page and that does not belong to the whitelist.
According to some embodiments, the first processing unit 640 may be further configured to: perform the first processing on each of the plurality of first pages in response to determining that the sprocket quality information satisfies a second preset condition.
According to some embodiments, any two of the plurality of first pages belong to different websites, and any one of the plurality of first pages and any one of the at least one second page belong to different websites.
According to some embodiments, as shown in fig. 7, the page data processing apparatus 700 includes: a fourth determination unit 750 configured to determine a spider pool including at least one sprocket having a diversion relationship and websites to which a plurality of first pages included in each of the sprockets belong; a fifth determining unit 760 configured to determine spider pool quality information of the spider pool based on at least sprocket quality information of each of the at least one sprocket; and a second processing unit 770 configured to perform a second processing on at least one website corresponding to the at least one sprocket in response to determining that the spider pool quality information satisfies a third preset condition. The operations of the units 710-740 in fig. 7 are similar to the operations of the units 610-640 in fig. 6, and are not repeated herein.
According to some embodiments, as shown in fig. 7, the page data processing apparatus 700 may further include: a sixth determining unit 780 configured to determine a first scale parameter of each of the plurality of sprockets. The first scale parameter is positively correlated with the number of pages included by the corresponding sprocket. The fifth determining unit 760 is further configured to determine spider pool quality information of the spider pool based on at least the sprocket quality information and the first scale parameter of each of the at least one sprocket.
According to some embodiments, as shown in fig. 7, the page data processing apparatus 700 may further include: a seventh determining unit 790 configured to determine a second scale parameter of the spider pool. The second scale parameter is positively correlated to the number of at least one sprocket. The fifth determining unit 760 is further configured to determine spider pool quality information of the spider pool based on the sprocket quality information and the first scale parameter of each of the at least one sprocket and the second scale parameter of the spider pool.
According to some embodiments, the second processing unit 770 is further configured to perform, for each of the at least one web site, at least one of: deleting the website, adjusting the weight of the website, and masking the website.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 8, a block diagram of an electronic device 800 will now be described. The electronic device 800 may be a server or a client of the present disclosure and is an example of a hardware device to which aspects of the present disclosure may be applied. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 807 can be any type of device capable of presenting information and can include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 808 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as the page data processing method. For example, in some embodiments, the page data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the page data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the page data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (20)

1. A page data processing method, comprising:
determining a sprocket having a flow guiding relationship, wherein the sprocket comprises a plurality of first pages, and each first page of the plurality of first pages can, according to the flow guiding relationship, pass sequentially through the plurality of first pages and be guided back to itself;
determining at least one second page different from the plurality of first pages and page quality information in one-to-one correspondence with the at least one second page, wherein each second page of the at least one second page is guided to by at least a portion of the corresponding first pages among the plurality of first pages;
determining sprocket quality information for the sprocket based on the respective page quality information for the at least one second page; and
in response to determining that the sprocket quality information meets a first preset condition, deleting the at least one second page, and performing first processing on each first page, among the plurality of first pages, that guides flow to any one of the at least one second page.
2. The method of claim 1, wherein performing the first processing on each first page, among the plurality of first pages, that guides flow to any one of the at least one second page comprises:
deleting each first page of the plurality of first pages that guides flow to any of the at least one second page and that does not belong to a whitelist.
3. The method of claim 2, wherein the whitelist comprises at least one of:
user original content pages, official website pages, and large website pages.
4. The method of claim 1 or 2, further comprising:
performing the first processing on each of the plurality of first pages in response to determining that the sprocket quality information satisfies a second preset condition.
5. The method of claim 1, wherein any two first pages of the plurality of first pages belong to different websites, and any one first page of the plurality of first pages and any one second page of the at least one second page belong to different websites.
6. The method of claim 5, further comprising:
determining a spider pool, wherein the spider pool comprises at least one sprocket having a diversion relationship and the websites to which the plurality of first pages included in each sprocket respectively belong;
determining spider pool quality information for the spider pool based at least on the respective sprocket quality information for the at least one sprocket; and
performing second processing on at least one website corresponding to the at least one sprocket in response to determining that the spider pool quality information meets a third preset condition.
7. The method of claim 6, further comprising:
determining a first scale parameter for each of the plurality of sprockets, wherein the first scale parameter is positively correlated with the number of pages included by the corresponding sprocket,
wherein determining the spider pool quality information for the spider pool comprises:
determining spider pool quality information for the spider pool based on at least the respective sprocket quality information and a first scale parameter for the at least one sprocket.
8. The method of claim 7, further comprising:
determining a second scale parameter of the spider pool, wherein the second scale parameter is positively correlated with the number of the at least one sprocket,
wherein determining the spider pool quality information for the spider pool comprises:
determining spider pool quality information for the spider pool based on the respective sprocket quality information and the first scale parameter for the at least one sprocket and the second scale parameter for the spider pool.
9. The method of any of claims 6 to 8, wherein the second processing of the at least one website comprises performing at least one of the following for each of the at least one website:
deleting the website, adjusting the weight of the website, and masking the website.
10. A page data processing apparatus comprising:
a first determination unit configured to determine a sprocket having a flow guiding relationship, the sprocket comprising a plurality of first pages, wherein each first page of the plurality of first pages can, according to the flow guiding relationship, pass sequentially through the plurality of first pages and be guided back to itself;
a second determining unit configured to determine at least one second page different from the plurality of first pages and page quality information in one-to-one correspondence with the at least one second page, wherein each of the at least one second page is guided to by at least a portion of the corresponding first pages among the plurality of first pages;
a third determination unit configured to determine sprocket quality information of the sprocket based on the respective page quality information of the at least one second page; and
a first processing unit configured to, in response to determining that the sprocket quality information satisfies a first preset condition, delete the at least one second page and perform first processing on each first page, among the plurality of first pages, that guides flow to any one of the at least one second page.
11. The apparatus of claim 10, wherein the first processing unit is further configured to:
delete each first page of the plurality of first pages that guides flow to any of the at least one second page and that does not belong to a whitelist.
12. The apparatus of claim 10 or 11, wherein the first processing unit is further configured to:
perform the first processing on each of the plurality of first pages in response to determining that the sprocket quality information satisfies a second preset condition.
13. The apparatus of claim 10, wherein any two of the plurality of first pages belong to different websites, and any one of the plurality of first pages and any one of the at least one second page belong to different websites.
14. The apparatus of claim 11, further comprising:
a fourth determination unit configured to determine a spider pool including at least one sprocket having a diversion relationship and websites to which a plurality of first pages included in each sprocket respectively belong;
a fifth determination unit configured to determine spider pool quality information of the spider pool based on at least sprocket quality information of each of the at least one sprocket; and
a second processing unit configured to perform a second processing on at least one website corresponding to the at least one sprocket in response to determining that the spider pool quality information satisfies a third preset condition.
15. The apparatus of claim 14, further comprising:
a sixth determination unit configured to determine a first scale parameter of each of the plurality of sprockets, wherein the first scale parameter is positively correlated with the number of pages included in the corresponding sprocket,
wherein the fifth determination unit is further configured to determine spider pool quality information of the spider pool based on at least the sprocket quality information of each of the at least one sprocket and the first scale parameter.
16. The apparatus of claim 15, further comprising:
a seventh determination unit configured to determine a second scale parameter of the spider pool, wherein the second scale parameter is positively correlated with the number of the at least one sprocket,
wherein the fifth determination unit is further configured to determine spider pool quality information of the spider pool based on the respective sprocket quality information and the first scale parameter of the at least one sprocket and the second scale parameter of the spider pool.
17. The apparatus of any of claims 14 to 16, wherein the second processing unit is further configured to perform, for each of the at least one website, at least one of:
deleting the website, adjusting the weight of the website, and masking the website.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
20. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-9 when executed by a processor.
CN202111043631.8A 2021-09-07 2021-09-07 Page data processing method, device, equipment and medium Pending CN113742625A (en)


Publications (1)

Publication Number Publication Date
CN113742625A true CN113742625A (en) 2021-12-03

Family

ID=78736491






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination