CN112948731A - Cache analysis method and system for website domain name resource and computer storage medium - Google Patents

Info

Publication number
CN112948731A
Authority
CN
China
Prior art keywords
resource
domain name
cache
website domain
analyzed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911269923.6A
Other languages
Chinese (zh)
Inventor
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201911269923.6A
Publication of CN112948731A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention provides a cache analysis method and system for website domain name resources, and a computer storage medium. The website domain name resources to be analyzed are collected; each resource content of those resources is then captured to obtain the corresponding basic header information; and the basic header information of each resource content is analyzed to determine whether to cache the website domain name resources. This addresses the problems in the related art, in which the cacheability of internet website content resources is analyzed manually, which greatly reduces operation and maintenance efficiency, increases labor cost, and prevents optimization of performance and service. In other words, the method, system and computer storage medium provided by the embodiments of the present invention implement intelligent capture and cacheability analysis of internet website content resources, greatly improve operation and maintenance efficiency, reduce labor cost, and enable optimization of performance and service.

Description

Cache analysis method and system for website domain name resource and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of communication, in particular to a cache analysis method, a cache analysis system and a computer storage medium for website domain name resources.
Background
With the rapid development of the internet, the number of internet cache products keeps growing. At present, operation and maintenance personnel analyze the cacheability of internet website content resources one by one. Because these resources are diverse and difficult to analyze and judge, analyzing their cacheability manually greatly reduces operation and maintenance efficiency, increases labor cost, and prevents optimization of performance and service.
Disclosure of Invention
The cache analysis method, the cache analysis system and the computer storage medium for the website domain name resources provided by the embodiment of the invention mainly solve the technical problem that the cacheability of internet website content resources is analyzed in a manual mode in the related technology, thereby greatly reducing the operation and maintenance efficiency, increasing the labor cost and being incapable of realizing the optimization of performance and service.
In order to solve the above technical problem, an embodiment of the present invention provides a cache analysis method for a website domain name resource, including:
collecting website domain name resources to be analyzed;
capturing each resource content of the website domain name resource to be analyzed, and acquiring basic header information corresponding to each resource content;
and analyzing the basic header information of the content of each resource to determine whether to cache the website domain name resource.
To solve the above technical problem, an embodiment of the present invention provides a system, including a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the cache analysis method for website domain name resources as described above.
To solve the above technical problem, an embodiment of the present invention provides a computer storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the steps of the cache analysis method for website domain name resources as described above.
The invention has the beneficial effects that:
According to the method, system and computer storage medium for cache analysis of website domain name resources provided by the embodiments of the present invention, the website domain name resources to be analyzed are collected; each resource content of those resources is then captured to obtain the corresponding basic header information; and the basic header information of each resource content is analyzed to determine whether to cache the website domain name resources. This addresses the problems in the related art, in which the cacheability of internet website content resources is analyzed manually, which greatly reduces operation and maintenance efficiency, increases labor cost, and prevents optimization of performance and service. In other words, the embodiments of the present invention realize intelligent capture and cacheability analysis of internet website content resources, avoid the one-by-one manual analysis of the related art, bring considerable convenience to operation and maintenance personnel, greatly improve operation and maintenance efficiency, reduce labor cost, provide an intelligent operation and maintenance means for internet cache products that change day by day and for diverse user demands, and enable optimization of performance and service.
Additional features and corresponding advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
Fig. 1 is a schematic basic flowchart of a cache analysis method for website domain name resources according to the first embodiment of the present invention;
Fig. 2 is a schematic basic flowchart of a cache analysis method for website domain name resources according to the second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a system according to the third embodiment of the present invention;
Fig. 4 is a schematic interface diagram of a grabbing task according to the third embodiment of the present invention;
Fig. 5 is a schematic interface diagram of the cacheability analysis of each resource content of a website domain name resource according to the third embodiment of the present invention;
Fig. 6 is a schematic interface diagram of the distribution of the resource contents of a website domain name resource according to the third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The first embodiment is as follows:
In the related art, the cacheability of internet website content resources is analyzed manually, which greatly reduces operation and maintenance efficiency, increases labor cost, and prevents optimization of performance and service. To solve these problems, the embodiment of the present invention provides a cache analysis method for website domain name resources: the website domain name resources to be analyzed are collected; each resource content of those resources is captured to obtain the corresponding basic header information; and the basic header information of each resource content is analyzed to determine whether to cache the website domain name resources. Fig. 1 is a schematic flowchart of the cache analysis method for website domain name resources according to this embodiment.
S101: Collect the website domain name resources to be analyzed.
It should be noted that, in the embodiment of the present invention, a website domain name resource includes resource contents, where the resource contents are the sub domain name resources and the uniform resource locator (URL) resources included under each sub domain name; it should be understood that a sub domain name resource may also be represented by a URL. For ease of understanding, in the embodiment of the present invention the website domain name resource to be analyzed is also referred to as the main domain name resource, and the resource contents it includes are referred to as sub domain name resources.
Optionally, the embodiment of the present invention includes, but is not limited to, collecting the website domain name resources to be analyzed in the following three ways:
Method one: Collect Domain Name System (DNS) logs, obtain the DNS click rate ranking and the distribution of resolution results, and use these as the website domain name resources to be analyzed.
Method two: Collect cache system logs, obtain the requests and resource types accessed by users, and use these as the website domain name resources to be analyzed.
Method three: Analyze top-ranked (Top) websites or content provider (CP) websites, selected on the basis of national website rankings, to obtain the website domain name resources to be analyzed.
It should be noted that only three methods of collecting the website domain name resources to be analyzed are listed here; in practical applications, the method can be adjusted flexibly according to the specific application scenario.
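As an illustration of method one, the following minimal Python sketch ranks domains by how often they appear in DNS log lines. The log layout (timestamp, queried name, record type, separated by whitespace), the function name `rank_domains` and the sample data are assumptions made for this example; the patent does not specify a log format.

```python
from collections import Counter


def rank_domains(log_lines, top_n=3):
    """Return the top_n most queried domains as (domain, count) pairs."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        # Assumed layout: "<timestamp> <queried-name> <record-type>"
        counts[fields[1].lower().rstrip(".")] += 1
    return counts.most_common(top_n)


# Hypothetical sample log, for illustration only
sample_log = [
    "2019-12-11T10:00:00 www.example.com A",
    "2019-12-11T10:00:01 cdn.example.net A",
    "2019-12-11T10:00:02 www.example.com AAAA",
]
print(rank_domains(sample_log))
```

The highest-ranked domains would then serve as the website domain name resources to be analyzed.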
S102: Capture each resource content of the website domain name resource to be analyzed, and acquire the basic header information corresponding to each resource content.
Optionally, in the embodiment of the present invention, capturing each resource content of the website domain name resource to be analyzed may be implemented based on the Scrapy framework of the cross-platform programming language Python, so as to obtain the basic header information corresponding to each resource content. Scrapy is a fast, high-level screen scraping and web crawling framework written in Python; it is used to crawl websites and extract structured data from pages, and it provides base classes for various crawlers, such as BaseSpider and sitemap crawlers.
Optionally, after collecting the website domain name resource to be analyzed and before capturing each resource content of the website domain name resource to be analyzed, the method in the embodiment of the present invention includes:
establishing a capturing task according to the website domain name resource to be analyzed.
Capturing each resource content of the website domain name resource to be analyzed then includes: controlling the grabbing state, the grabbing depth and the grabbing time of the grabbing task.
Optionally, the grabbing state in the embodiment of the present invention includes five states: to be crawled, crawling, crawl completed, to be stopped, and stop completed. For example, 0 represents to be crawled, which can transition to 1; 1 represents crawling, which can transition to 2 or 3; 2 represents crawl completed, which can transition to 0; 3 represents to be stopped, which can transition to 4; 4 represents stop completed, which can transition to 0. It should be understood that when a grabbing task is stopped, there is a short buffer period corresponding to the to-be-stopped state; once the buffer period has elapsed, the task enters the stop-completed state.
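The five states and their permitted transitions can be sketched as a small state machine. The numeric codes 0 to 4 and the transitions come from the description above; the names `CrawlState`, `ALLOWED` and `transition` are hypothetical and chosen only for this illustration.

```python
from enum import IntEnum


class CrawlState(IntEnum):
    TO_CRAWL = 0    # to be crawled
    CRAWLING = 1    # crawling in progress
    COMPLETED = 2   # crawl completed
    TO_STOP = 3     # to be stopped (short buffer period)
    STOPPED = 4     # stop completed


# Transitions as listed in the description: 0->1, 1->2 or 3, 2->0, 3->4, 4->0
ALLOWED = {
    CrawlState.TO_CRAWL: {CrawlState.CRAWLING},
    CrawlState.CRAWLING: {CrawlState.COMPLETED, CrawlState.TO_STOP},
    CrawlState.COMPLETED: {CrawlState.TO_CRAWL},
    CrawlState.TO_STOP: {CrawlState.STOPPED},
    CrawlState.STOPPED: {CrawlState.TO_CRAWL},
}


def transition(current, target):
    """Return target if the move is permitted, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = CrawlState.TO_CRAWL
state = transition(state, CrawlState.CRAWLING)
state = transition(state, CrawlState.TO_STOP)
print(state.name)
```

Keeping the transition table in one place makes it straightforward to reject a request such as stopping a task that has already completed.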
Optionally, in the embodiment of the present invention, controlling the grabbing depth includes:
acquiring depth information from an attribute of the request object of the grabbing task, where the resource contents of the website domain name resource grow level by level and the value of the depth information increases with the grabbing level; when the acquired value of the depth information is greater than a preset depth threshold, grabbing of the next-level resource content of the current resource content of the website domain name resource is stopped.
In the embodiment of the present invention, the grabbing depth of the grabbing task can be controlled in order to make the grabbing of the resource contents of the website domain name resource more flexible. It should be understood that the resource contents of a website domain name resource grow level by level: the main domain name resource includes the sub domain name resources of the first layer, each sub domain name resource of the first layer includes sub domain name resources of the second layer, and so on. For better understanding, take a main domain name resource with two layers of sub domain name resources as an example: the main domain name resource is 1; the sub domain name resources of the first layer are 11, 12 and 13; sub domain name resource 11 of the first layer includes sub domain name resources 111 and 112 of the second layer; sub domain name resource 12 includes 121, 122 and 123; and sub domain name resource 13 includes 131, 132 and 133. The structure resembles a pyramid and is not described further. The depth information can therefore be obtained from the attribute of the request object of the grabbing task; because its value increases with the number of grabbed layers, the current layer in the pyramid is known, and when the value of the depth information is greater than the preset depth threshold, the next-level resource content of the current resource content of the website domain name resource is no longer grabbed.
It is worth noting that the preset depth threshold is flexibly set by operation and maintenance personnel according to actual requirements, so that the flexibility of capturing the content of each resource of the website domain name resource is improved.
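The depth control can be sketched with a plain breadth-first traversal over the pyramid example (1; 11, 12, 13; 111, ...). This is an illustrative sketch rather than the patent's implementation: the `CHILDREN` mapping stands in for links discovered on real pages, and `crawl` and `max_depth` are hypothetical names. It adopts one reasonable reading of the rule, in which a child whose depth would exceed the threshold is simply not scheduled. In Scrapy, a comparable effect is available through the `DEPTH_LIMIT` setting, with the current depth exposed in `response.meta["depth"]`.

```python
from collections import deque

# Hypothetical link structure mirroring the pyramid example in the text
CHILDREN = {
    "1": ["11", "12", "13"],
    "11": ["111", "112"],
    "12": ["121", "122", "123"],
    "13": ["131", "132", "133"],
}


def crawl(root, max_depth):
    """Visit resources breadth-first, never scheduling depth > max_depth."""
    visited = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        visited.append((node, depth))
        child_depth = depth + 1
        if child_depth > max_depth:
            continue  # depth threshold reached: do not grab the next level
        for child in CHILDREN.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, child_depth))
    return visited


print([node for node, _ in crawl("1", max_depth=1)])
```

With `max_depth=1` only the main domain name resource and the first layer are visited; raising the threshold to 2 also reaches the second layer.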
Optionally, in the embodiment of the present invention, controlling the capturing time includes:
setting a grabbing time for the grabbing task, and stopping the grabbing of the resource contents of the website domain name resource when the grabbing time reaches a preset grabbing time threshold.
In the embodiment of the present invention, the grabbing time of a grabbing task can be controlled so that grabbing does not fall into an endless loop; that is, when the grabbing time of the grabbing task reaches the preset grabbing time threshold, the resource contents of the website domain name resource are no longer grabbed. It is worth noting that the preset grabbing time threshold is set flexibly by operation and maintenance personnel according to actual requirements, which prevents the grabbing of the resource contents of the website domain name resource from looping endlessly.
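A time budget of this kind can be sketched as a deadline check before each fetch. The function `crawl_with_deadline`, its parameters and the injectable `clock` are assumptions made for illustration; Scrapy offers a comparable control through its `CLOSESPIDER_TIMEOUT` setting.

```python
import time


def crawl_with_deadline(urls, fetch, max_seconds, clock=time.monotonic):
    """Fetch URLs in order, stopping once max_seconds have elapsed."""
    deadline = clock() + max_seconds
    fetched = []
    for url in urls:
        if clock() >= deadline:
            break  # grabbing time threshold reached: stop grabbing
        fetched.append(fetch(url))
    return fetched


# Usage with a stand-in fetch function (no network access)
pages = crawl_with_deadline(["a", "b"], fetch=str.upper, max_seconds=5)
print(pages)
```

Using a monotonic clock avoids surprises if the system wall clock is adjusted while a task is running, and passing the clock in as a parameter keeps the cut-off easy to exercise in isolation.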
It should be clear that one capture task may be established for one website domain name resource to be analyzed, or one capture task may be established for a plurality of website domain name resources to be analyzed; that is, the correspondence between the number of website domain name resources to be analyzed and the number of capture tasks is one-to-one or many-to-one, and in practical applications this can be adjusted flexibly according to the specific application scenario.
For better understanding, the embodiment of the present invention uses a one-to-one correspondence between the number of website domain name resources to be analyzed and the number of crawling tasks as an example. Optionally, multiple website domain name resources (i.e., different crawling tasks) may or may not be crawled concurrently; optionally, the resource contents of the same website domain name resource (i.e., the same crawling task) may or may not be crawled concurrently, and crawling them concurrently can improve the crawling efficiency of the resource contents to a certain extent.
Optionally, after acquiring the website domain name resource to be analyzed and before capturing each resource content of the website domain name resource to be analyzed in the embodiment of the present invention, the method further includes:
acquiring an IP address of a website domain name resource to be analyzed;
when the IP address uses port 80, capturing each resource content of the website domain name resource to be analyzed.
It should be understood that, in the embodiment of the present invention, the resource contents of the website domain name resource to be analyzed are captured only when the IP address corresponding to that resource uses port 80; otherwise, they are not captured.
Optionally, in the embodiment of the present invention, capturing each resource content of the website domain name resource to be analyzed, and acquiring basic header information corresponding to each resource content, where the basic header information includes but is not limited to:
capturing the links in the pages corresponding to the resource contents of the website domain name resource by GET and/or HEAD requests. Page hyperlinks in the page HTML can be captured by GET, for example <a href="http://www.sina.com.cn/">; links to referenced picture, script, video or audio resources in the page HTML can be captured by HEAD, for example <img src="http://n.sinaimg.cn/photo/20161215/JiEo-fxypunk66994.jpg"/>. The basic header information corresponding to each resource content is then acquired.
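The split between links to follow with GET and links to probe with HEAD can be sketched with Python's standard `html.parser` module. The class name `LinkCollector` and the choice of tags are assumptions for this illustration; the sketch only sorts links into the two groups and does not issue any requests.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect page hyperlinks (for GET) and referenced resources (for HEAD)."""

    # Tags whose src attribute points at picture/script/video/audio resources
    HEAD_TAGS = {"img", "script", "video", "audio"}

    def __init__(self):
        super().__init__()
        self.get_links = []
        self.head_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.get_links.append(attributes["href"])
        elif tag in self.HEAD_TAGS and attributes.get("src"):
            self.head_links.append(attributes["src"])


collector = LinkCollector()
collector.feed(
    '<a href="http://www.sina.com.cn/">news</a>'
    '<img src="http://example.com/photo.jpg"/>'
)
print(collector.get_links, collector.head_links)
```

A HEAD request returns only the response headers, which is sufficient here because the later analysis uses the header fields rather than the resource body.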
Optionally, after capturing each resource content of the website domain name resource to be analyzed and acquiring the basic header information corresponding to each resource content in the embodiment of the present invention, the method further includes:
acquiring an IP address corresponding to each resource content;
analyzing each IP address, and filtering out the resource contents whose IP addresses do not use port 80.
It should be understood that, in the embodiment of the present invention, the resource contents of the website domain name resource to be analyzed are screened first, so that only the resource contents whose IP addresses use port 80 remain; the basic header information of those resource contents is then analyzed to determine whether to cache the website domain name resource.
S103: Analyze the basic header information of each resource content to determine whether to cache the website domain name resource.
Optionally, the basic header information in the embodiment of the present invention includes field information, where the field information includes, but is not limited to, a uniform resource locator URL, a status code, a cache object, and a cache duration; wherein:
analyzing the basic header information of each resource content to determine whether to cache the website domain name resource includes: determining to cache the website domain name resource when each item of field information meets its corresponding preset condition; wherein:
When the field information includes URLs, the URLs in the field information are analyzed, the number of dynamic resources is counted, and the proportion of dynamic resources among the total URLs of the website domain name resource (that is, the sum of dynamic and static resources) is calculated; when the proportion is smaller than a first preset threshold, the URLs are determined to meet the preset condition. It should be understood that a URL is determined to be a dynamic resource when it contains any of the following character strings: "?", "/cgi-bin/", ".pl", ".asp", ".cgi", ".jsp", ".php". For better understanding, consider a specific example: assume the total number of URLs corresponding to the resource contents of the resource to be analyzed is A, of which A1 belong to dynamic resources; the proportion of dynamic resources among the total URLs of the website domain name resource is then t1 = A1/A. It is then determined whether t1 is smaller than the first preset threshold T1; if so, the URLs are determined to meet the preset condition.
When the field information includes status codes, the status codes in the field information are analyzed, the number of cacheable status codes is counted, and the proportion of cacheable status codes among the total status codes of the static resources is calculated; when the proportion is greater than a second preset threshold, the status codes are determined to meet the preset condition. It should be understood that a status code is determined to be cacheable when it is 200, 301, etc. For better understanding, consider a specific example: assume the total number of status codes of the static resource contents of the resource to be analyzed is B, of which B1 are cacheable; the proportion of cacheable status codes among the total status codes is then t2 = B1/B. It is then determined whether t2 is greater than the second preset threshold T2; if so, the status codes are determined to meet the preset condition.
When the field information includes cache objects, the cache objects in the field information are analyzed, the number of cache objects that are non-cacheable objects is counted, and the proportion of non-cacheable objects among the total cache objects of the static resources is calculated; when the proportion is greater than a third preset threshold, the cache objects are determined to meet the preset condition. It should be understood that a cache object is determined to be a non-cacheable object when it is one of the following fields: a) the last modification time "Last-Modified"; b) "Set-Cookie": HTTP 1.0 no-cache, HTTP 1.1 "Cache-Control: no-cache, private"; c) "Pragma: no-cache"; d) no "Authorization"; e) "Cache-Control: no-cache, no-store, private". For better understanding, consider a specific example: assume the total number of cache objects of the static resource contents of the resource to be analyzed is C, of which C1 are non-cacheable objects; the proportion of non-cacheable objects among the total cache objects is then t3 = C1/C. It is then determined whether t3 is greater than the third preset threshold T3; if so, the cache objects are determined to meet the preset condition.
When the field information includes cache durations, the cache durations in the field information are analyzed, the number of cache durations greater than a preset duration threshold is counted, and the proportion of such cache durations among the total cache durations of the static resources is calculated; when the proportion is greater than a fourth preset threshold, the cache durations are determined to meet the preset condition. For better understanding, consider a specific example: assume the total number of cache durations of the static resource contents of the resource to be analyzed is D, of which D1 are greater than the preset duration threshold; the proportion of cache durations greater than the preset duration threshold among the total cache durations is then t4 = D1/D. It is then determined whether t4 is greater than the fourth preset threshold T4; if so, the cache durations are determined to meet the preset condition.
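The four proportions t1 to t4 can be sketched in Python as follows. Several points are assumptions made for this illustration and are not specified by the patent: each captured resource is a dictionary with `url`, `status` and lower-cased `headers`; the cache duration is read from the `max-age` directive of `Cache-Control`; and the non-cacheable test is a simplified reading of the field list above (a `Set-Cookie` header, `Pragma: no-cache`, or a `Cache-Control` value containing `no-cache`, `no-store` or `private`).

```python
import re

DYNAMIC_MARKERS = ("?", "/cgi-bin/", ".pl", ".asp", ".cgi", ".jsp", ".php")
CACHEABLE_STATUS = {200, 301}
NO_CACHE_DIRECTIVES = {"no-cache", "no-store", "private"}


def is_dynamic(url):
    """A URL counts as dynamic if it contains any of the listed markers."""
    return any(marker in url for marker in DYNAMIC_MARKERS)


def directives(headers):
    value = headers.get("cache-control", "")
    return [d.strip().lower() for d in value.split(",") if d.strip()]


def is_non_cacheable(headers):
    # Simplified assumption, see the lead-in paragraph
    if "set-cookie" in headers:
        return True
    if "no-cache" in headers.get("pragma", "").lower():
        return True
    return any(d in NO_CACHE_DIRECTIVES for d in directives(headers))


def max_age(headers):
    """Cache duration in seconds taken from max-age, or 0 if absent."""
    for directive in directives(headers):
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))
    return 0


def ratios(records, min_duration=3600):
    """Return (t1, t2, t3, t4) for a list of captured resource records."""
    total = len(records)
    static = [r for r in records if not is_dynamic(r["url"])]
    n = len(static) or 1  # avoid division by zero when nothing is static
    t1 = (total - len(static)) / total if total else 0.0
    t2 = sum(r["status"] in CACHEABLE_STATUS for r in static) / n
    t3 = sum(is_non_cacheable(r["headers"]) for r in static) / n
    t4 = sum(max_age(r["headers"]) > min_duration for r in static) / n
    return t1, t2, t3, t4


# Hypothetical captured records, for illustration only
records = [
    {"url": "http://example.com/a.jpg", "status": 200,
     "headers": {"cache-control": "max-age=86400"}},
    {"url": "http://example.com/b.css", "status": 404,
     "headers": {"cache-control": "no-store"}},
    {"url": "http://example.com/search.php?q=1", "status": 200, "headers": {}},
    {"url": "http://example.com/c.js", "status": 301,
     "headers": {"set-cookie": "id=1"}},
]
print(ratios(records))
```

Note that t1 is computed over all URLs, while t2, t3 and t4 are computed over the static resources only, as in the worked examples for A, B, C and D.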
Optionally, in the embodiment of the present invention, when each field information meets a preset condition, the method includes:
setting corresponding weighted values for all the proportions;
multiplying the proportions by the corresponding weighted values, adding the multiplied proportions to obtain the resource cacheability of the website domain name resources, determining to cache the website domain name resources when the resource cacheability value is greater than a fifth preset threshold, and determining not to cache the website domain name resources when the resource cacheability value is less than or equal to the fifth preset threshold.
For better understanding, continuing the above example: if the weighted values corresponding to the URL, status code, cache object and cache duration proportions are w1, w2, w3 and w4 respectively, then the resource cacheability of the website domain name resource is S = w1 × t1 + w2 × t2 + w3 × t3 + w4 × t4. When S is greater than the fifth preset threshold S1, it is determined to cache the website domain name resource; when S is less than or equal to S1, it is determined not to cache the website domain name resource.
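The weighted decision can be written directly from the formula S = w1 × t1 + w2 × t2 + w3 × t3 + w4 × t4. In the sketch below, the function names and every numeric value (proportions, weights and threshold) are hypothetical and serve only to show the arithmetic.

```python
def cacheability(ratios, weights):
    """Weighted sum S of the proportions t1..t4 with weights w1..w4."""
    if len(ratios) != len(weights):
        raise ValueError("ratios and weights must have the same length")
    return sum(w * t for w, t in zip(weights, ratios))


def should_cache(ratios, weights, threshold):
    """Cache only when S is strictly greater than the fifth threshold S1."""
    return cacheability(ratios, weights) > threshold


# Hypothetical values, for illustration only
t = (0.1, 0.9, 0.2, 0.8)   # t1, t2, t3, t4
w = (0.1, 0.3, 0.2, 0.4)   # w1, w2, w3, w4
print(cacheability(t, w), should_cache(t, w, threshold=0.6))
```

Because the comparison is strict, a score exactly equal to the threshold results in a decision not to cache, matching the description.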
It should be understood that the above describes the case in which the field information includes the URL, status code, cache object and cache duration: when all four meet their corresponding preset conditions, it is determined to cache the website domain name resource, and when at least one of them does not meet its corresponding preset condition, it is determined not to cache the website domain name resource. When the URL is determined not to meet its preset condition, it is directly determined not to cache the website domain name resource; checking whether the other field information meets its preset conditions is skipped, which reduces power consumption.
Optionally, in some examples, when the field information includes the URL, status code, cache object and cache duration, it may also be determined to cache the website domain name resource when any one or any combination of them meets its corresponding preset condition; it should be noted that, in practical applications, this can be adjusted flexibly according to the specific application scenario.
It should be noted that, in practical applications, the first, second, third, fourth and fifth preset thresholds, the preset duration threshold, the weighted values corresponding to the URL, status code, cache object and cache duration proportions, and the field information can all be adjusted flexibly according to the specific application scenario.
Optionally, after analyzing the basic header information of each resource content to determine whether to cache the website domain name resource, the method in the embodiment of the present invention further includes: outputting and displaying the determination result for operation and maintenance personnel to check.
Optionally, in the embodiment of the present invention, the field information (URL, status code, cache object, cache duration) of the acquired basic header information of each resource content is recorded and stored, and can later be output and displayed so that operation and maintenance personnel can view the details, which makes management more convenient.
In the cache analysis method for website domain name resources provided by the embodiment of the present invention, the website domain name resource to be analyzed is collected; each resource content of that resource is captured to obtain the corresponding basic header information; and the basic header information of each resource content is analyzed to determine whether to cache the website domain name resource. Compared with the related art, the method has the following advantages:
First, intelligent capture and cacheability analysis of the resource contents contained in website domain name resources is realized, which avoids the one-by-one manual analysis of the cacheability of internet website content resources in the related art, brings great convenience to operation and maintenance personnel, and reduces labor cost.
Second, because each resource content contained in the website domain name resource is analyzed, the analysis covers a wider range, and combining it with the weighting coefficients makes the cache analysis of the website domain name resource more accurate, which greatly improves operation and maintenance efficiency.
Third, after the analysis is finished, the determination result and the detailed analysis of each resource content are output, so that operation and maintenance personnel can check and manage them more conveniently.
Example two:
On the basis of the first embodiment, the embodiment of the present invention is described with reference to a specific cache analysis process; please refer to fig. 2.
S201: Collecting website domain name resources to be analyzed.
The DNS logs can be collected to obtain the domain name click rate ranking and the resolution result distribution, which are taken as the website domain name resources to be analyzed; or the cache system logs can be collected to obtain the requests and the resource types accessed by users, which are taken as the website domain name resources to be analyzed; or Top websites or content provider (CP) websites can be analyzed to obtain the website domain name resources to be analyzed.
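The first collection option can be sketched as follows. This is a minimal illustration only: the description does not specify a DNS log format, so the whitespace-separated layout (`<timestamp> <client_ip> <domain> <resolved_ip>`) and the function name `rank_domains` are assumptions.

```python
from collections import Counter

def rank_domains(log_lines, top_n=10):
    """Rank queried domains by hit count.

    Assumes (hypothetically) whitespace-separated log lines of the form:
    <timestamp> <client_ip> <domain> <resolved_ip>
    """
    hits = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip malformed lines
        # Normalise case and drop a trailing root dot so variants count together.
        domain = parts[2].lower().rstrip(".")
        hits[domain] += 1
    # most_common returns (domain, count) pairs, highest count first.
    return hits.most_common(top_n)
```

The highest-ranked domains would then be handed to the capture step as the resources to be analyzed.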
S202: Acquiring the IP address of the website domain name resource to be analyzed, and judging whether the IP address uses port 80; if so, executing S203; if not, executing S211.
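One way to picture the port check in S202 is the sketch below. It assumes the resource is identified by a URL or host string and only derives the port that would be used; resolving the IP address itself is omitted. The helper names `effective_port` and `is_port_80` are hypothetical.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes considered in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the port a URL would be fetched on, or None if unknown."""
    # Bare hosts such as "example.com:8080" are treated as plain HTTP (assumption).
    parts = urlsplit(url if "://" in url else "http://" + url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

def is_port_80(url):
    """True when the resource would be served on port 80."""
    return effective_port(url) == 80
```

Resources for which `is_port_80` is false would follow the "not cache" branch (S211) in this reading.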
S203: Establishing a capturing task according to the website domain name resource to be analyzed.
The capturing task is used for capturing each resource content of the website domain name resource to be analyzed; in the process of capturing each resource content, the capturing state, the capturing depth and the capturing time of the capturing task are controlled.
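The depth and time control of a capturing task can be illustrated with a small breadth-first sketch. It is not the framework used in the embodiment: `fetch_links` is a hypothetical caller-supplied function returning the links found on a page, and `max_depth` and `max_seconds` stand in for the preset depth and capture-time thresholds.

```python
import time
from collections import deque

def crawl(start_url, fetch_links, max_depth=2, max_seconds=5.0, clock=time.monotonic):
    """Breadth-first capture bounded by depth and elapsed time.

    fetch_links(url) -> iterable of linked URLs (supplied by the caller).
    Returns the list of (url, depth) pairs that were visited.
    """
    deadline = clock() + max_seconds
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        if clock() >= deadline:
            break  # capture-time threshold reached: stop the whole task
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth threshold reached: do not descend further
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing the clock as a parameter keeps the time limit easy to exercise in tests without waiting.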
S204: Capturing each resource content of the website domain name resource to be analyzed based on the Scrapy framework, and acquiring the basic header information corresponding to each resource content, wherein the basic header information comprises the following field information: URL, status code, cache object and cache duration.
Links in the pages corresponding to the resource contents of the website domain name resource can be captured in a GET mode and/or a HEAD mode to obtain the basic header information corresponding to each resource content.
While each resource content of the website domain name resource to be analyzed is captured with the Scrapy framework and the corresponding basic header information is acquired, the IP address corresponding to each resource content can also be acquired and analyzed, and the resource contents whose IP addresses do not use port 80 are filtered out.
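To make the four fields concrete, the sketch below reduces one HTTP response to a small record. Mapping "cache object" to the `Cache-Control` header and "cache duration" to its `max-age` directive is an assumption made for illustration; the function name `extract_fields` is hypothetical, and the actual request (GET or HEAD) is left out.

```python
import re

def extract_fields(url, status, headers):
    """Reduce one response to the four fields used by the analysis."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    cache_control = lowered.get("cache-control", "")
    # Assumption: the cache duration is taken from the max-age directive.
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return {
        "url": url,
        "status": status,
        "cache_control": cache_control.lower(),
        "max_age": int(match.group(1)) if match else None,
    }
```

A record with `max_age` set to `None` simply means the response carried no explicit duration.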
S205: Analyzing the URLs in the field information, counting the number of dynamic resources, calculating the proportion of dynamic resources to the total number of URLs of the website domain name resource, and judging whether the proportion is smaller than a first preset threshold; if so, determining that the URLs meet the preset condition and executing S206; otherwise, executing S211.
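The URL check in S205 can be sketched as below. How a resource is classified as dynamic is not stated in this step, so the rule used here (a query string or one of a few script suffixes) and the default threshold are assumptions.

```python
from urllib.parse import urlsplit

# Assumed markers of dynamic content; adjust to the deployment.
DYNAMIC_SUFFIXES = (".php", ".jsp", ".asp", ".aspx", ".cgi")

def is_dynamic(url):
    """Treat a URL as dynamic if it has a query string or a script suffix."""
    parts = urlsplit(url)
    return bool(parts.query) or parts.path.lower().endswith(DYNAMIC_SUFFIXES)

def url_condition(urls, first_threshold=0.5):
    """Return (dynamic_ratio, condition_met) for the URL check."""
    if not urls:
        return 0.0, False  # nothing captured: treat the condition as not met
    ratio = sum(is_dynamic(u) for u in urls) / len(urls)
    # The condition is met when the dynamic share is strictly below the threshold.
    return ratio, ratio < first_threshold
```

Treating an empty capture as "not met" is a design choice of this sketch, so that a failed crawl never leads to a caching decision.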
S206: Analyzing the status codes in each field information, counting the number of status codes that are cacheable status codes, calculating the proportion of cacheable status codes to the total number of status codes of the static resources, and judging whether the proportion is greater than a second preset threshold; if so, determining that the status codes meet the preset condition and executing S207; if not, executing S211.
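A sketch of the status code check follows. The set of cacheable codes is an assumption: it uses the status codes that HTTP defines as heuristically cacheable by default, whereas an actual deployment may use a different list. The default threshold is likewise illustrative.

```python
# Assumed set: status codes that HTTP treats as cacheable by default.
CACHEABLE_STATUS = {200, 203, 204, 206, 300, 301, 308, 404, 405, 410, 414, 501}

def status_condition(static_statuses, second_threshold=0.8):
    """Return (cacheable_ratio, condition_met) for the status code check.

    static_statuses: status codes of the static resources only.
    """
    if not static_statuses:
        return 0.0, False
    cacheable = sum(code in CACHEABLE_STATUS for code in static_statuses)
    ratio = cacheable / len(static_statuses)
    # The condition is met when the cacheable share strictly exceeds the threshold.
    return ratio, ratio > second_threshold
```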
S207: analyzing the cache objects in each field information, counting the number of the non-cacheable objects, calculating the proportion of the non-cacheable objects to the total cache objects of the static resources, judging whether the proportion is greater than a third preset threshold value, if so, determining that the cache objects meet preset conditions, and executing S208, otherwise, executing S211.
S208: analyzing the cache duration in each field information, counting the number of the cache durations larger than a preset duration threshold, calculating the proportion of the cache durations larger than the preset duration threshold to the total cache duration of the static resource, judging whether the proportion is larger than a fourth preset threshold, if so, determining that the cache durations meet preset conditions, and executing S209, and if not, executing S211.
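The cache duration check can be sketched in the same way. Durations are assumed to be in seconds, with `None` standing for a resource that declared no duration; both thresholds are placeholder values.

```python
def duration_condition(max_ages, duration_threshold=3600, fourth_threshold=0.5):
    """Return (long_lived_ratio, condition_met) for the cache duration check.

    max_ages: cache durations in seconds for the static resources,
    with None for resources that declared no duration.
    """
    if not max_ages:
        return 0.0, False
    long_lived = sum(
        1 for age in max_ages if age is not None and age > duration_threshold
    )
    ratio = long_lived / len(max_ages)
    # The condition is met when the long-lived share strictly exceeds the threshold.
    return ratio, ratio > fourth_threshold
```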
Wherein, S205, S206, S207, S208 may also be executed in parallel.
S209: multiplying the proportions by the corresponding weighted values, adding the multiplied proportions to obtain the resource cacheability of the website domain name resource, and judging whether the resource cacheability value is greater than a fifth preset threshold value, if so, executing S210, and if not, executing S211.
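The weighted sum in S209 amounts to a few lines of arithmetic. In the sketch below the proportions are used exactly as computed in the preceding steps, since the text does not say whether any of them is inverted first; the dictionary keys and the example weights are hypothetical.

```python
def cacheability(ratios, weights):
    """Weighted sum of the per-field proportions."""
    if ratios.keys() != weights.keys():
        raise ValueError("ratios and weights must cover the same fields")
    return sum(ratios[name] * weights[name] for name in ratios)

def decide(ratios, weights, fifth_threshold):
    """True means 'cache the website domain name resource'."""
    # Strictly greater than the threshold, matching the wording of the step.
    return cacheability(ratios, weights) > fifth_threshold

# Example with equal weights (illustrative values only).
example_ratios = {"url": 0.5, "status": 1.0, "object": 0.5, "duration": 0.25}
example_weights = {"url": 0.25, "status": 0.25, "object": 0.25, "duration": 0.25}
```

Because the weights are plain parameters, they can be tuned per application scenario without touching the decision logic.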
S210: Determining to cache the website domain name resource.
S211: determining not to cache the website domain name resource.
S212: Outputting and displaying the determination result.
The method for analyzing the cache of the website domain name resource provided by the embodiment of the invention comprises the steps of acquiring the website domain name resource to be analyzed, further capturing each resource content of the website domain name resource to be analyzed to obtain the basic header information corresponding to each resource content, further analyzing the basic header information of each resource content to determine whether the website domain name resource is cached; the problems that the cacheability of internet website content resources is analyzed manually in the related technology, the operation and maintenance efficiency is greatly reduced, the labor cost is increased, and the optimization of performance and service cannot be realized are solved.
Example three:
referring to fig. 3, the system provided in the embodiment of the present invention includes a processor 301, a memory 302, and a communication bus 303;
the communication bus 303 is used for realizing connection communication between the processor 301 and the memory 302;
the processor 301 is configured to execute one or more programs stored in the memory 302 to implement the steps of the cache analysis method for the website domain name resource according to the first embodiment to the second embodiment. Wherein:
Fig. 4 is a schematic diagram of a possible interface for capturing tasks, in which information such as task batch, task name, task status, progress, creation time, and task source can be displayed. The capturing task supports operations such as execution, stop, addition and deletion; the task states include waiting to be grabbed, grabbing in progress, grabbing completed, waiting to stop, and stopping completed.
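The five task states can be modelled as a small state machine. The description names the states but not the transitions between them, so the transition table below is purely an assumption for illustration.

```python
from enum import Enum

class TaskState(Enum):
    WAITING = "waiting to be grabbed"
    GRABBING = "grabbing in progress"
    COMPLETED = "grabbing completed"
    STOPPING = "waiting to stop"
    STOPPED = "stopping completed"

# Assumed transition table; only the state names come from the description.
ALLOWED = {
    TaskState.WAITING: {TaskState.GRABBING, TaskState.STOPPING},
    TaskState.GRABBING: {TaskState.COMPLETED, TaskState.STOPPING},
    TaskState.STOPPING: {TaskState.STOPPED},
    TaskState.COMPLETED: set(),
    TaskState.STOPPED: set(),
}

def transition(current, target):
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

A stop request on a running task would then move it to `STOPPING`, and to `STOPPED` once the capture has actually halted.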
Fig. 5 is a schematic diagram of a feasible interface for cache analysis of resource contents of website domain name resources.
Please refer to fig. 6, which is a schematic view of an interface for content distribution of each resource of a website domain name resource, where the content distribution of each resource visually shows the resource distribution under the domain name, including: data type, request times ratio (request times/total times), size ratio (size/total size), traffic ratio (traffic/total traffic); the request times ratio, the size ratio and the flow ratio can be displayed in a pie chart mode and the like, so that operation and maintenance personnel can conveniently check the requests.
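The three ratios shown in the distribution view can be computed as in the sketch below. The input layout, one `(data_type, size_bytes, traffic_bytes)` tuple per request, is an assumption, as is the function name `distribution`.

```python
from collections import defaultdict

def distribution(records):
    """Per data type: share of requests, of size, and of traffic.

    records: iterable of (data_type, size_bytes, traffic_bytes), one per request.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [requests, size, traffic]
    for data_type, size, traffic in records:
        entry = totals[data_type]
        entry[0] += 1
        entry[1] += size
        entry[2] += traffic
    grand = [sum(entry[i] for entry in totals.values()) for i in range(3)]

    def share(value, total):
        # Guard against division by zero when a total is empty.
        return value / total if total else 0.0

    return {
        data_type: {
            "request_ratio": share(entry[0], grand[0]),
            "size_ratio": share(entry[1], grand[1]),
            "traffic_ratio": share(entry[2], grand[2]),
        }
        for data_type, entry in totals.items()
    }
```

The resulting dictionary maps directly onto pie-chart slices, one chart per ratio.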
It should be noted that, in order to avoid redundant description, all of the examples in the first to second embodiments are not fully described in the embodiments of the present invention, and it should be clear that all of the examples in the first to second embodiments are applicable to the present embodiment.
Embodiments of the present invention further provide a computer storage medium (i.e., a computer-readable storage medium), where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the steps of the cache analysis method for a website domain name resource in the first to second embodiments.
The computer storage media includes volatile or nonvolatile, removable or non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, computer program modules or other data. Computer storage media includes, but is not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other Memory technology, CD-ROM (Compact disk Read-Only Memory), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed over computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media), executed by a computing device, and in some cases may perform the steps shown or described in a different order than here. The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art.
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (12)

1. A cache analysis method for website domain name resources comprises the following steps:
collecting website domain name resources to be analyzed;
capturing each resource content of the website domain name resource to be analyzed, and acquiring basic header information corresponding to each resource content;
and analyzing the basic header information of the content of each resource to determine whether to cache the website domain name resource.
2. The method for analyzing the cache of the website domain name resource according to claim 1, wherein after the website domain name resource to be analyzed is collected, capturing each resource content of the website domain name resource to be analyzed, and before the basic header information corresponding to each resource content is acquired, the method comprises:
establishing a capturing task according to the website domain name resource to be analyzed;
in the process of capturing the resource content of the website domain name resource to be analyzed, the method comprises the following steps:
and controlling the grabbing state, the grabbing depth and the grabbing time of the grabbing task.
3. The cache analysis method for website domain name resources according to claim 2, wherein the grabbing state comprises five states: waiting to be grabbed, grabbing in progress, grabbing completed, waiting to stop, and stopping completed;
controlling the grabbing depth, comprising:
acquiring depth information from the attribute of a request object of the grabbing task, wherein the resource content of the website domain name resource is in a multi-stage increasing form, the corresponding value of the depth information increases with the grabbing stage number, and when the acquired corresponding value of the depth information is greater than a preset depth threshold value, the grabbing of the next-stage resource content of the current resource content of the website domain name resource is stopped;
controlling the grabbing time, comprising:
and setting grabbing time for the grabbing task, and stopping grabbing the content of each resource of the website domain name resource when the grabbing time reaches a preset grabbing time threshold.
4. The cache analysis method for website domain name resources according to any one of claims 1 to 3, wherein the capturing each resource content of the website domain name resources to be analyzed to obtain basic header information corresponding to each resource content comprises:
and capturing links in pages corresponding to the resource contents of the website domain name resources in a GET mode and/or an HEAD mode, and acquiring basic header information corresponding to the resource contents, wherein the basic header information comprises field information.
5. The method for cache analysis of website domain name resources according to claim 4, wherein said field information comprises: uniform resource locator URL, status code, cache object, cache duration;
analyzing the basic header information of each resource content to determine whether to cache the website domain name resource, including:
when the information of each field meets corresponding preset conditions, determining to cache the website domain name resource;
and when any one of the field information does not meet its corresponding preset condition, determining not to cache the website domain name resource.
6. The method for analyzing the cache of the website domain name resource according to claim 5, wherein the determining to cache the website domain name resource when each field information respectively satisfies a corresponding preset condition comprises:
when the field information comprises URLs, analyzing the URLs in the field information, counting the number of dynamic resources, calculating the proportion of the dynamic resources in the total URLs of the website domain name resources, and determining that the URLs meet preset conditions when the proportion is smaller than a first preset threshold;
when the field information comprises the status codes, analyzing the status codes in the field information, counting the number of the cacheable status codes, calculating the proportion of the cacheable status codes to the total status codes of the static resources, and determining that the status codes meet preset conditions when the proportion is greater than a second preset threshold;
when the field information comprises cache objects, analyzing the cache objects in the field information, counting the number of the non-cacheable objects, calculating the proportion of the non-cacheable objects to the total cache objects of the static resources, and when the proportion is greater than a third preset threshold value, determining that the cache objects meet preset conditions;
when the field information comprises cache duration, analyzing the cache duration in each field information, counting the number of the cache durations which are greater than a preset duration threshold, calculating the proportion of the cache duration which is greater than the preset duration threshold to the total cache duration of the static resource, and when the proportion is greater than a fourth preset threshold, determining that the cache duration meets a preset condition.
7. The method for analyzing the cache of the website domain name resource according to claim 6, wherein when each field information respectively satisfies a preset condition, the method further comprises:
setting corresponding weighted values for all the proportions;
multiplying the proportions by the corresponding weighted values and then adding the proportions to obtain the resource cacheability of the website domain name resource, determining to cache the website domain name resource when the resource cacheability value is larger than a fifth preset threshold value, and determining not to cache the website domain name resource when the resource cacheability value is smaller than or equal to the fifth preset threshold value.
8. The method for analyzing the cache of the website domain name resource according to any one of claims 1 to 3, wherein before capturing each resource content of the website domain name resource to be analyzed, the method comprises:
acquiring the IP address of the website domain name resource to be analyzed;
and when the IP address uses port 80, capturing the content of each resource of the website domain name resource to be analyzed.
9. The cache analysis method for website domain name resources according to any one of claims 1 to 3, wherein the collecting the website domain name resources to be analyzed comprises:
collecting DNS logs, acquiring the domain name click rate ranking and the resolution result distribution, and taking the domain name click rate ranking and the resolution result distribution as website domain name resources to be analyzed;
or, collecting the cache system logs, acquiring the request and the resource type accessed by the user, and taking the request and the resource type as the website domain name resource to be analyzed;
or, the websites or the content provider websites with the top ranking are selected and analyzed based on the national website ranking, and the website domain name resources to be analyzed are obtained.
10. The method for analyzing the cache of the website domain name resource according to any one of claims 1 to 3, wherein after analyzing the basic header information of the content of each resource and determining whether to cache the website domain name resource, the method comprises: and outputting and displaying the determination result.
11. A system comprising a processor, a memory, and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the method for cache analysis of website domain name resources according to any one of claims 1-10.
12. A computer storage medium, characterized in that the computer storage medium stores one or more programs executable by one or more processors to implement the steps of the cache analysis method of website domain name resources according to any one of claims 1-10.
CN201911269923.6A 2019-12-11 2019-12-11 Cache analysis method and system for website domain name resource and computer storage medium Pending CN112948731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911269923.6A CN112948731A (en) 2019-12-11 2019-12-11 Cache analysis method and system for website domain name resource and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911269923.6A CN112948731A (en) 2019-12-11 2019-12-11 Cache analysis method and system for website domain name resource and computer storage medium

Publications (1)

Publication Number Publication Date
CN112948731A true CN112948731A (en) 2021-06-11

Family

ID=76234295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911269923.6A Pending CN112948731A (en) 2019-12-11 2019-12-11 Cache analysis method and system for website domain name resource and computer storage medium

Country Status (1)

Country Link
CN (1) CN112948731A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023279744A1 (en) * 2021-07-05 2023-01-12 北京百度网讯科技有限公司 Method and apparatus for grabbing pressure, electronic device and readable storage medium

Similar Documents

Publication Publication Date Title
US11647096B2 (en) Method and apparatus for automatically optimizing the loading of images in a cloud-based proxy service
US11609839B2 (en) Distributed code tracing system
US8918602B2 (en) Dynamically altering time to live values in a data cache
CN107480277B (en) Method and device for collecting website logs
US20160065684A1 (en) Method and apparatus for automatically optimizing the loading of images in a cloud-based proxy service
US8935798B1 (en) Automatically enabling private browsing of a web page, and applications thereof
US10291738B1 (en) Speculative prefetch of resources across page loads
US8577827B1 (en) Network page latency reduction using gamma distribution
CN104426985B (en) Show the method, apparatus and system of webpage
US20090085921A1 (en) Populate Web-Based Content Based on Space Availability
CN102662600A (en) Method for mutually dragging files at different domain names
CN111367596A (en) Method and device for realizing service data processing and client
CN109359231A (en) A kind of information crawler method, server and the storage medium of distributed network crawler
CN112948731A (en) Cache analysis method and system for website domain name resource and computer storage medium
US8935285B2 (en) Searchable and size-constrained local log repositories for tracking visitors' access to web content
EP3863252A1 (en) Advertisement anti-shielding method and device
CN109284428A (en) Data processing method, device and storage medium
CN111581553B (en) Network image display method, system, electronic equipment and storage medium
CN104468740B (en) A kind of webpage transmission intelligent processing system and its method
CN115563423A (en) Data acquisition method and device, computer equipment and storage medium
CA2788100A1 (en) Crawling of generated server-side content
CN111339388A (en) Information crawling system
US9940311B2 (en) Optimized read/write access to a document object model
CN110858238B (en) Data processing method and device
CN111753231B (en) Method and device for loading third-party H5 page and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination