CN108255868B - Method and device for checking links in website - Google Patents

Method and device for checking links in website

Info

Publication number
CN108255868B
Authority
CN
China
Prior art keywords
data
link
page
website
log data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611248666.4A
Other languages
Chinese (zh)
Other versions
CN108255868A (en)
Inventor
郑继攀
冯鸳鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201611248666.4A
Publication of CN108255868A
Application granted
Publication of CN108255868B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for checking links in a website. The method comprises: acquiring log data of a website in a preset time period, wherein the log data at least comprises click data, page jump data and heartbeat data; acquiring, from the click data, link data containing website links; querying whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain a query result; and determining whether the link is a broken link according to the query result. The invention solves the technical problems of high resource occupancy and poor timeliness that arise in the prior art when broken links in a website are checked by means of a crawler.

Description

Method and device for checking links in website
Technical Field
The invention relates to the field of website detection, in particular to a method and a device for checking links in a website.
Background
With the development of internet technology, websites have become a main tool for people to obtain information from the internet. While accessing websites, users often encounter pages that cannot be opened, which greatly degrades the user experience. Being able to judge whether a given link on a web page can be accessed normally is therefore important to both the maintainers and the decision makers of a website. In particular, for government websites, a superior authority often assesses the performance of its subordinate websites, and the number of broken links is usually used as an important index. Therefore, in website quality detection, the number of broken links existing in a website is an important measure of the quality of the website.
At present, the usual method for identifying broken links is as follows: an entry page is given, every link of the website is then crawled recursively, level by level, starting from that entry page, and each link is requested; if a normal status code is not returned, the link is marked as a broken link. Checking broken links by level-by-level crawling has the following shortcomings:
first, it occupies a large amount of resources such as network bandwidth and servers, and these resources are expensive;
second, its timeliness is poor: because so many resources are needed, a website is often crawled only once every few weeks or even every few months, and new broken links cannot be found in time during that period;
third, continuously crawling the target website puts operating pressure on the target website;
fourth, for security reasons the target website sometimes applies anti-crawling techniques, so that crawling cannot proceed smoothly and the results may even be inaccurate.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a method and a device for checking links in a website, which at least solve the technical problems of high resource occupancy and poor timeliness that arise in the prior art when broken links in a website are checked by means of a crawler.
According to an aspect of an embodiment of the present invention, there is provided a method for checking a link in a website, including: acquiring log data of a website in a preset time period, wherein the log data at least comprises: click data, page jump data and heartbeat data; acquiring link data containing website links in click data; inquiring whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain an inquiry result; and determining whether the link is broken according to the query result.
Further, querying whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain a query result comprises: querying whether page jump data containing the link data exists in the log data; if page jump data containing the link data exists in the log data, acquiring a page identifier of the page jump data containing the link data; querying whether heartbeat data containing the page identifier exists in the log data; and if heartbeat data containing the page identifier exists in the log data, determining that page jump data and heartbeat data corresponding to the link data exist in the log data.
Further, determining whether the link is broken according to the query result includes: if page jump data and heartbeat data corresponding to the link data exist in the log data, determining that the link is a normal link; and if the page jump data and the heartbeat data corresponding to the link data do not exist in the log data, determining whether the link is broken according to the page request result of the link data.
Further, determining whether the link is broken according to the page request result of the link data includes: sending a page request for the link data; receiving a page status code returned in response to the page request; if the page status code is equal to a preset value, determining that the link is a normal link, wherein the preset value indicates that the page request succeeded; and if the page status code is not equal to the preset value, determining that the link is a broken link.
Further, after determining that the link is broken, the method further comprises: marking the link data; and storing the marked link data into a broken link data table.
Further, acquiring log data of the website in a preset time period includes: collecting log data of the website; dividing the log data of the website into a plurality of data groups according to the types of users and the user configuration data; selecting the log data of any one data group; dividing the log data of that data group into log data of a plurality of preset time periods; and selecting the log data of any one preset time period.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for checking a link in a website, including: the first acquisition module is used for acquiring log data of a website in a preset time period, wherein the log data at least comprises: click data, page jump data and heartbeat data; the second acquisition module is used for acquiring link data containing website links in the click data; the first query module is used for querying whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain a query result; and the first determining module is used for determining whether the link is broken according to the query result.
Further, the first query module includes: the second query module is used for querying whether page jump data containing link data exists in the log data; the third acquisition module is used for acquiring the page identifier of the page jump data containing the link data if the page jump data containing the link data exists in the log data; the third query module is used for querying whether heartbeat data containing the page identification exists in the log data; and the second determining module is used for determining that page jump data and heartbeat data corresponding to the link data exist in the log data if the heartbeat data containing the page identifier exists in the log data.
Further, the first determining module includes: the third determining module is used for determining that the link is a normal link if page jump data and heartbeat data corresponding to the link data exist in the log data; and the fourth determining module is used for determining whether the link is broken according to the page request result of the link data if the page jump data and the heartbeat data corresponding to the link data do not exist in the log data.
Further, the fourth determining module includes: the sending module is used for sending a page request for the link data; the receiving module is used for receiving the page status code returned in response to the page request; the fifth determining module is used for determining that the link is a normal link if the page status code is equal to a preset value, wherein the preset value indicates that the page request succeeded; and the sixth determining module is used for determining that the link is a broken link if the page status code is not equal to the preset value.
Further, the apparatus further comprises: a marking module for marking the link data; and the storage module is used for storing the marked link data into the broken link data table.
Further, the first acquisition module comprises: the collecting module is used for collecting log data of the website; the first dividing module is used for dividing the log data of the website into a plurality of data groups according to the types of users and the user configuration data; the first selecting module is used for selecting the log data of any one data group; the second dividing module is used for dividing the log data of that data group into log data of a plurality of preset time periods; and the second selecting module is used for selecting the log data of any one preset time period.
In the embodiment of the present invention, log data of a website in a preset time period is obtained, where the log data at least includes click data, page jump data and heartbeat data; link data containing website links is obtained from the click data; the log data is queried for page jump data and heartbeat data corresponding to the link data to obtain a query result; and whether the link is a broken link is determined according to the query result. Normal links and broken links in the website are thus determined by analyzing the user behavior log data of the website. This improves checking efficiency and reduces the occupation of network and server resources, and thereby solves the technical problems of high resource occupancy and poor timeliness that arise in the prior art when broken links in a website are checked by means of a crawler.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method for checking links in a web site, according to an embodiment of the invention;
FIG. 2 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 4 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 5 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 6 is a flow diagram of an alternative method for checking links in a web site, according to an embodiment of the present invention;
FIG. 7 is a flow diagram of a preferred method for checking links in a web site, according to an embodiment of the present invention; and
FIG. 8 is a diagram illustrating an apparatus for checking links in a website according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns and terms appearing in the description of the embodiments of the present application are explained as follows:
broken link: a link on a web page that cannot be opened normally after being clicked, where failure to open normally means that the requested address returns a status code such as 404 or 500;
pv data: PageView data, i.e., page data collected by the js tracker, including page jump data recorded when a user jumps from one page to another through a link;
mc data: Mouse Click data, i.e., log data recorded when a user clicks on a page;
hb data: heartbeat data, i.e., the send and response data used to verify whether a link between two pages is up.
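The three kinds of log data defined above can be pictured as simple record types. The following is only an illustrative sketch: the field names `linkurl` and `pvid` appear later in this description, while `timestamp`, `page_url` and `url` are hypothetical names introduced here and are not specified by the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ClickRecord:
    """mc data: one mouse click on a page."""
    timestamp: datetime
    page_url: str            # hypothetical: page on which the click happened
    linkurl: Optional[str]   # target URL if the clicked element was a link


@dataclass(frozen=True)
class PageViewRecord:
    """pv data: one page view, including views reached by following a link."""
    timestamp: datetime
    pvid: str                # page view identifier
    url: str                 # hypothetical: URL of the page that was viewed


@dataclass(frozen=True)
class HeartbeatRecord:
    """hb data: one heartbeat tied to a page view through its pvid."""
    timestamp: datetime
    pvid: str
```

Under these assumptions, a click is tied to a page view through the URL, and a page view is tied to a heartbeat through the pvid.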
Example 1
In accordance with an embodiment of the present invention, an embodiment of a method for checking links in a website is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that described herein.
Fig. 1 is a flowchart of a method for checking links in a website according to an embodiment of the present invention. As shown in fig. 1, the method includes the following steps:
step S102, obtaining log data of the website in a preset time period, wherein the log data at least comprises: click data, page jump data, and heartbeat data.
Specifically, in the above step, the website may be the target website whose links are to be checked. The log data may be user behavior log data collected on the website by a data acquisition system deployed on the website; collection of the user behavior log data starts after the user enters a URL of the target website and sends an HTTP request to the server of the target website. After the log data of the website to be checked is collected, the user log data is analyzed and processed to obtain the log data of the website to be checked in a preset time period, where the obtained log data at least includes click data, page jump data and heartbeat data. The click data may be click log data recorded when a user clicks on a page with the mouse. The page jump data may be page jump log data recorded when a user jumps to another page through a link on the current web page; it should be noted that each jump is recorded, together with the page identifier of the jump. The heartbeat data may be the send and response data used to verify whether a link between two pages is up.
In an optional embodiment, the user behavior log data of the user on the target website can be acquired through a js tracker data acquisition code deployed on the target website by the GWD product.
As an alternative implementation, the daily log data of the website to be checked may be collected, and the collected log data may be divided into log data of a plurality of preset time periods according to a preset time interval. In a preferred embodiment, the preset time interval may be 30 minutes; for example, if the daily log data of the website to be checked is collected, the log data is divided into 48 preset time periods at intervals of 30 minutes, namely 00:00 to 00:30, 00:30 to 01:00, …, and 23:30 to 00:00.
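The division of one day into 30-minute periods can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names `day_windows` and `window_index` are hypothetical, and the windows are assumed to be half-open intervals aligned to midnight.

```python
from datetime import date, datetime, timedelta


def day_windows(day, minutes=30):
    """Return the (start, end) pairs that split one day into equal windows."""
    if (24 * 60) % minutes != 0:
        raise ValueError("interval must divide a day evenly")
    step = timedelta(minutes=minutes)
    start = datetime(day.year, day.month, day.day)
    count = (24 * 60) // minutes
    return [(start + i * step, start + (i + 1) * step) for i in range(count)]


def window_index(timestamp, minutes=30):
    """Return the index of the window that a log timestamp falls into."""
    return (timestamp.hour * 60 + timestamp.minute) // minutes
```

With the default 30-minute interval this yields 48 windows, the last of which starts at 23:30 and ends at 00:00 of the following day.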
It should be noted that, after receiving a page request from a user, the server of the website to be checked adds a record in the Log file, where the record content may include: remote host name (or IP address), login name, login full name, date of request, time of request, details of request (including method, address, protocol of request), status of request return, size of request document, etc.
Step S104, link data containing website links in the click data is obtained.
Specifically, in the above step, a website link may be the URL information used to access a page of the website to be checked. After the log data of the website to be checked within the preset time period is obtained, the click log data is searched for data containing a linkurl field, and the link data containing the linkurl field is obtained.
It should be noted that, by collecting the click log data generated when users access the pages of the website to be checked and finding all link data containing URL information in that click log data, all the links of the website to be checked can be obtained, so that the links of the website can subsequently be checked.
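Extracting link data from the click log can be sketched as below. The sketch assumes each click record is a dictionary and that a link click carries a non-empty `linkurl` field, as named in the text; the function name `extract_link_urls` and the `page` key in the usage are hypothetical.

```python
def extract_link_urls(click_records):
    """Collect the distinct link URLs found in click (mc) records.

    Records without a linkurl field, or with an empty one, are discarded.
    The order of first appearance is preserved.
    """
    seen = set()
    urls = []
    for record in click_records:
        url = record.get("linkurl")
        if not url:
            continue  # click on something that is not a link
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating here is a design choice of the sketch: each link then needs to be looked up, and if necessary requested, only once per time period.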
Step S106, querying whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain a query result.
Specifically, in the above step, after the link data of the website to be checked in the preset time period is acquired, the log data of the website to be checked is queried for page jump data and heartbeat data that correspond to the link data within the preset time period, so as to obtain a corresponding query result.
It should be noted that the heartbeat lookup depends on the page jump lookup: only after page jump data corresponding to the link data has been found in the log data of the website to be checked can the target page identifier pointed to by the link be read from that page jump data and used to look up whether response data for that page identifier exists. In other words, the log data is searched for pv data and hb data corresponding to the link; only if both the pv data and the hb data exist is the link determined to be a normal link, and all other links are suspected broken links.
And step S108, determining whether the link is broken according to the query result.
Specifically, in the above step, a broken link may be a link on a web page of the website to be checked that cannot be opened normally after being clicked. Normal links in the website to be checked can be found by querying the log data for page jump data and heartbeat data corresponding to the link data: if page jump data and heartbeat data corresponding to the link data exist in the log data, the link corresponding to the link data is considered a normal link; if they do not exist, the link is considered a suspected broken link. When the links in the website are checked, only the suspected broken links need to be requested, which reduces the number of accesses to the website to be checked and relieves the operating pressure on the website.
As can be seen from the above, in the above embodiments of the present application, the user behavior log data of the website to be checked (the target website) is acquired, analyzed and processed; the link data contained in the click log data is found; the log data is searched for page jump data and heartbeat data corresponding to the link data; and normal links and suspected broken links in the website to be checked are determined according to the query result, so that only the suspected broken links need to be requested afterwards. Normal links and broken links in the website are thus determined by analyzing the user behavior log data of the website. This improves checking efficiency and reduces the occupation of network and server resources, and thereby solves the technical problems of high resource occupancy and poor timeliness that arise in the prior art when broken links in a website are checked by means of a crawler.
In an alternative embodiment, as shown in fig. 2, querying whether page jump data and heartbeat data corresponding to link data exist in log data to obtain a query result may include the following steps:
step S202, inquiring whether page jump data containing link data exists in log data;
step S204, if the log data contains page jump data containing link data, acquiring a page identifier of the page jump data containing the link data;
step S206, inquiring whether heartbeat data containing page identification exists in the log data;
step S208, if the heartbeat data containing the page identifier exists in the log data, determining that page jump data and heartbeat data corresponding to the link data exist in the log data.
Specifically, in the embodiment, after the click log data in the preset time period is acquired to obtain the link data including the link of the website, whether page jump data including the link data exists is searched for in the page jump log data, after the page jump data including the link data is found, the page identifier of the target page jumped to in the page jump data including the link data is acquired, whether heartbeat data including the page identifier exists is searched for in the heartbeat log data, and if heartbeat data including the page identifier exists in the log data, it is determined that page jump data and heartbeat data corresponding to the link data exist in the log data.
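Steps S202 to S208 can be sketched as a two-stage lookup. This is an illustration under assumptions: records are dictionaries, a pv record stores the visited URL in a hypothetical `url` field, and both pv and hb records carry the `pvid` named later in the text. When several page views match the same link, the sketch treats a heartbeat for any of them as sufficient, which the patent does not specify.

```python
def has_jump_and_heartbeat(link_url, pv_records, hb_records):
    """Return True if the log holds pv data for the link and hb data for its page."""
    # S202/S204: find page jump data containing the link and take its identifiers.
    pvids = {pv["pvid"] for pv in pv_records if pv.get("url") == link_url}
    if not pvids:
        return False  # nobody reached a page through this link
    # S206/S208: look for heartbeat data that carries one of those identifiers.
    hb_pvids = {hb["pvid"] for hb in hb_records}
    return not pvids.isdisjoint(hb_pvids)
```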
In an alternative embodiment, as shown in fig. 3, determining whether the link is broken according to the query result may include the following steps:
step S302, if page jump data and heartbeat data corresponding to the link data exist in the log data, determining that the link is a normal link;
step S304, if the page jump data and the heartbeat data corresponding to the link data do not exist in the log data, determining whether the link is broken according to the page request result of the link data.
Specifically, in the above embodiment, the page jump data at least includes the page identifier of the target page pointed to by the link data; if response data for that page identifier also exists in the heartbeat data, the link corresponding to the link data is a normal link. In an optional implementation, if no page jump data corresponding to the link data exists in the log data, no user has made a page request through the link corresponding to the link data, and it cannot be determined whether the link is a normal link or a broken link; a page request is therefore sent for the link data, and whether the link is a normal link or a broken link is determined according to the returned page request result. In another optional embodiment, if page jump data corresponding to the link data exists in the log data but no heartbeat data for the corresponding page is found, a user has made a page request through the link data but no response data has been returned for the target page identifier to which the link data points, so it cannot be determined whether the requested page was returned successfully; in this case a page request also needs to be sent for the link data, and whether the link is a normal link or a broken link is determined according to the returned page request result.
Through the above embodiment, whether a link in the website is a normal link is determined by searching for the page jump data and the heartbeat data corresponding to the link data, so that broken links can then be determined.
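The split into normal links and links that still need a page request can be sketched as follows. As before, the dictionary layout and the `url` field of pv records are assumptions made for illustration, and `classify_links` is a hypothetical name. Both cases described above (no page jump data, or page jump data without heartbeat data) end up in the same "suspected" list.

```python
def classify_links(link_urls, pv_records, hb_records):
    """Split links into (normal, suspected) using pv and hb log data only."""
    hb_pvids = {hb["pvid"] for hb in hb_records}
    # URLs for which at least one page view was confirmed by a heartbeat.
    confirmed = {pv["url"] for pv in pv_records if pv["pvid"] in hb_pvids}
    normal = [url for url in link_urls if url in confirmed]
    suspected = [url for url in link_urls if url not in confirmed]
    return normal, suspected
```

Building the two sets once keeps the classification linear in the size of the log, rather than rescanning the log for every link.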
In an alternative embodiment, as shown in fig. 4, determining whether the link is broken according to the page request result of the link data may include the following steps:
step S402, sending a page request of link data;
step S404, receiving a page status code returned in response to the page request;
step S406, if the page status code is equal to a preset value, determining that the link is a normal link, wherein the preset value indicates that the page request succeeded;
step S408, if the page status code is not equal to the preset value, determining that the link is a broken link.
Specifically, in the above embodiment, when no page jump data corresponding to the link data exists in the log data of the website to be checked, or when no heartbeat data containing the page identifier of that page jump data exists in the log data, the link is a suspected broken link. A page request for the link data then needs to be sent to the server, and the page status code returned in response to the page request is received. If the returned page status code is equal to the preset value, the link is determined to be a normal link; if the returned page status code is not equal to the preset value, the link is determined to be a broken link.
In an alternative embodiment, the preset value may be 200. A status code of 200 normally indicates that the page request succeeded; if the returned page status code is not 200, the link corresponding to the link data is determined to be a broken link.
Through the above embodiment, suspected broken links are confirmed by sending a page request for the link data, which further verifies whether a suspected broken link is actually broken.
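Steps S402 to S408 can be sketched with the Python standard library. The function names `fetch_status` and `is_broken` are hypothetical. Two behaviors of this sketch go beyond what the patent states: `urllib` follows redirects, so the status code seen is that of the final response, and a request that fails without any HTTP response (DNS failure, timeout, malformed URL) is also treated as a broken link.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Request the page and return its HTTP status code, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 500
    except (urllib.error.URLError, OSError, ValueError):
        return None      # no HTTP response at all


def is_broken(url, fetch=fetch_status, ok_status=200):
    """S406/S408: the link is broken unless the status equals the preset value."""
    return fetch(url) != ok_status
```

The `fetch` parameter is injectable so that the decision rule can be exercised without network access, for example `is_broken(url, fetch=lambda u: 404)`.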
In an alternative embodiment, as shown in fig. 5, after determining that the link is broken, the method may further include the following steps:
step S502, marking the link data;
and step S504, storing the marked link data into a broken link data table.
Specifically, in the above embodiment, after determining that a certain link in the website to be checked is a broken link, marking the link data of the link, and storing the marked link data into the broken link data table.
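Storing marked links in a broken link data table can be sketched with SQLite. The table name `broken_links`, its columns, and the use of SQLite are all assumptions made for illustration; the patent does not specify the storage engine or the schema.

```python
import sqlite3
from datetime import datetime, timezone


def open_broken_link_table(path=":memory:"):
    """Open (and create if needed) a hypothetical broken link data table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS broken_links ("
        "url TEXT PRIMARY KEY, status_code INTEGER, checked_at TEXT NOT NULL)"
    )
    return conn


def store_broken_link(conn, url, status_code):
    """Mark a link as broken and store it, replacing any earlier entry."""
    checked_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO broken_links (url, status_code, checked_at) "
        "VALUES (?, ?, ?)",
        (url, status_code, checked_at),
    )
    conn.commit()
```

Keying the table on the URL means a link that is found broken in several time periods occupies a single row holding the most recent status.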
In an alternative embodiment, as shown in fig. 6, acquiring log data of a website in a preset time period includes:
step S602, collecting log data of a website;
step S604, dividing the log data of the website into a plurality of data groups according to the types of users and the user configuration data;
step S606, selecting the log data of any one data group;
step S608, dividing the log data of that data group into log data of a plurality of preset time periods;
step S610, selecting log data in any preset time period.
Specifically, in the above embodiment, before the log data of the website in a preset time period is obtained, the log data of the website to be checked over a certain period is first collected. The collected log data is classified into a plurality of data groups according to the users accessing the website and the user configuration data, the log data of one data group is selected and divided into log data of a plurality of preset time periods according to a preset time interval, and the log data in any one preset time period is selected for processing.
Through the above embodiment, the user behavior log data is analyzed and the log data of the website to be checked is classified, which reduces resource occupation and improves the efficiency of checking the website.
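Steps S602 to S610 can be sketched as a two-level grouping. The record keys `user`, `config` and `timestamp` are hypothetical, since the patent does not define how the user and the user configuration data are represented, and the sketch assumes that all records belong to a single day.

```python
from collections import defaultdict
from datetime import datetime


def group_and_slice(records, minutes=30):
    """Group log records by (user, config), then by time slot within the day."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        group_key = (record["user"], record["config"])
        ts = record["timestamp"]
        slot = (ts.hour * 60 + ts.minute) // minutes
        grouped[group_key][slot].append(record)
    # Convert to plain dicts so that missing keys raise instead of being created.
    return {key: dict(slots) for key, slots in grouped.items()}
```

Each `(group, slot)` bucket can then be processed on its own, which is what allows the check to work on a small slice of the log at a time.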
As a preferred embodiment, the above embodiments of the present application can be described with reference to fig. 7. Fig. 7 is a flowchart of a preferred method for checking links in a website according to an embodiment of the present invention. As shown in fig. 7, the method includes the following steps:
step S702, collecting daily log data of the website, and grouping according to the user and the user configuration.
Specifically, in the above step, daily log data of the website to be inspected is collected, and the collected log data is classified into a plurality of data groups according to the user who accesses the website and user configuration data.
Step S704, a group of log data is selected.
Specifically, in the above step, the log data of one data group is selected.
Step S706, slicing the group of data into time intervals according to a preset time interval.
Specifically, in the above step, the log data of the selected data group is divided into log data of a plurality of time intervals according to the preset time interval.
Step S708, selecting the mc click data of each time interval.
Specifically, in the above step, the mouse click (mc) log data of any one time interval is selected for processing.
Step S710 determines whether or not there is link data including a linkurl field.
Specifically, in the above step, it is determined whether a field containing a website link exists in the obtained click log data of the time interval, and if yes, step S714 is executed; if not, step S712 is performed.
In step S712, the data not containing the linkurl field is discarded.
Specifically, in the above step, the data in the click log data in the time interval, which does not include the linkurl field, is discarded, and only the data including the linkurl field is retained.
In step S714, the link URL is selected.
Specifically, in the above step, link data, that is, a link URL, including a website link in the click log data of the time interval is acquired.
Step S716, searching whether the pv log data in the time interval includes the link URL.
Specifically, in the above step, the pv log data in the time interval is searched for page data containing the link URL; if such data exists, step S718 is executed; if not, step S724 is executed.
In step S718, the pvid (i.e., pageview id) of the pv data containing the link URL is acquired.
Specifically, in the above step, after the page data including the link URL is found in the pv log data in the time interval, the pvid of that pv data is acquired.
In step S720, it is searched whether heartbeat data containing the pvid exists in the time interval.
Specifically, in the above step, the heartbeat log data in the time interval is searched for heartbeat data containing the pvid; if such data exists, step S722 is executed; if not, step S724 is executed.
In step S722, the link is determined to be a normal link.
Specifically, in the above step, when there is heartbeat data including the pvid, it is determined that the link corresponding to the pvid is a normal link.
In step S724, the link URL is requested.
Specifically, in the above step, a page request of the link URL is transmitted.
Step S726, determining whether the returned status code is 200.
Specifically, in the above step, the status code returned in response to the page request for the link URL is received, and it is determined whether the status code is 200; if yes, step S722 is executed; if not, step S728 is executed.
In step S728, the link URL is marked as a broken link.
Specifically, in the above steps, after determining that a certain link in the website to be checked is a broken link, the link URL of the link is marked.
Step S730, store the marked URL in the broken link data table.
Specifically, in the above step, the marked link data (link URL) is stored into the broken link data table.
In this embodiment, the user behavior log data of the website to be checked is analyzed to find the link URLs of normal links and the link URLs that may be broken links. A page request is sent only to the link URLs that may be broken, and the broken links in the website are then determined according to the status returned in response to the page request. Broken link checking is thus performed on the target website without occupying a large amount of resources and without affecting the target website, which improves checking efficiency and reduces resource occupation.
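The decision flow of steps S708 to S728 for one time interval can be sketched as follows. This is only an illustration under stated assumptions: the field names `linkurl` and `pvid` come from the description, while the `url` field on pv records, the dict-based record layout, and the caller-supplied `fetch_status` function (URL to HTTP status code) are hypothetical.

```python
def check_links(click_log, pv_log, heartbeat_log, fetch_status):
    """Classify the links clicked in one time interval.

    Returns a pair of sets: (normal_links, broken_links).
    fetch_status is a hypothetical callable that requests a URL and
    returns its HTTP status code.
    """
    # Index pv records by URL, and collect every pvid that sent a heartbeat.
    pvids_by_url = {}
    for pv in pv_log:
        pvids_by_url.setdefault(pv["url"], set()).add(pv["pvid"])
    live_pvids = {hb["pvid"] for hb in heartbeat_log}

    normal, broken = set(), set()
    for click in click_log:
        url = click.get("linkurl")
        # S710/S712: discard click records without a linkurl field;
        # also skip URLs that were already classified.
        if not url or url in normal or url in broken:
            continue
        # S716-S720: a page view of the URL whose pvid also appears in the
        # heartbeat log means the target page was loaded and stayed alive.
        if pvids_by_url.get(url, set()) & live_pvids:
            normal.add(url)          # S722
        # S724-S726: otherwise request the URL and inspect the status code.
        elif fetch_status(url) == 200:
            normal.add(url)          # S722
        else:
            broken.add(url)          # S728
    return normal, broken
```

Passing the request function in as a parameter keeps the log analysis free of network access, so only the suspected broken links trigger a request.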
Example 2
According to an embodiment of the present invention, an embodiment of an apparatus for checking links in a website is also provided. The method of checking links in a website in embodiment 1 of the present invention may be performed by the apparatus in embodiment 2 of the present invention.
Fig. 8 is a schematic diagram of an apparatus for checking links in a website according to an embodiment of the present invention, as shown in fig. 8, the apparatus includes: a first obtaining module 801, a second obtaining module 803, a first querying module 805 and a first determining module 807.
The first obtaining module 801 is configured to obtain log data of a website in a preset time period, where the log data at least includes: click data, page jump data and heartbeat data; a second obtaining module 803, configured to obtain link data that includes a website link in the click data; a first query module 805, configured to query whether page jump data and heartbeat data corresponding to link data exist in log data, to obtain a query result; a first determining module 807 for determining whether the link is broken according to the query result.
As can be seen from the above, in the above embodiments of the present application, the user behavior log data of the website to be checked (the target website) is acquired and analyzed, the link data contained in the click log data is found, and the log data is searched for page jump data and heartbeat data corresponding to the link data. Normal links and suspected broken links in the website to be checked are determined according to the query result, so that only the suspected broken links need to be requested afterwards. Normal links and broken links in the website are thereby determined by analyzing the user behavior log data of the website, which achieves the technical effects of improving checking efficiency and reducing the occupation of network and server resources, and solves the technical problem in the prior art that detecting broken links in a website with a crawler causes high resource occupancy and poor timeliness.
In an optional embodiment, the first query module includes: the second query module is used for querying whether page jump data containing link data exists in the log data; the third acquisition module is used for acquiring the page identifier of the page jump data containing the link data if the page jump data containing the link data exists in the log data; the third query module is used for querying whether heartbeat data containing the page identification exists in the log data; and the second determining module is used for determining that page jump data and heartbeat data corresponding to the link data exist in the log data if the heartbeat data containing the page identifier exists in the log data.
In an optional embodiment, the first determining module includes: the third determining module is used for determining that the link is a normal link if page jump data and heartbeat data corresponding to the link data exist in the log data; and the fourth determining module is used for determining whether the link is broken according to the page request result of the link data if the page jump data and the heartbeat data corresponding to the link data do not exist in the log data.
In an optional embodiment, the fourth determining module includes: the sending module is used for sending a page request of the link data; the receiving module is used for receiving the page state code returned by the response page request; the fifth determining module is used for determining that the link is a normal link if the page status code is equal to a preset value, wherein the preset value is used for representing that the request result of the page request is that the request is successful; and the sixth determining module is used for determining that the link is broken if the page state code is not equal to the preset value.
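The sending, receiving and determining modules above amount to a status-code check on a page request. A minimal sketch using Python's standard `urllib` is given below; the function names and the default timeout are assumptions, and the preset value is taken to be 200 as in step S726. Note that `urlopen` follows redirects, so a redirect that ends at a 200 page is reported as 200.

```python
import urllib.error
import urllib.request


def page_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, method="GET")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def is_broken(url, expected=200):
    """Treat any result other than the preset status code as a broken link."""
    return page_status(url) != expected
```

Treating "no response at all" as broken is a design choice of this sketch; the description only specifies the comparison against the preset value.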
In an optional embodiment, the apparatus further comprises: a marking module for marking the link data; and the storage module is used for storing the marked link data into the broken link data table.
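Storing marked links in a broken link data table could look like the following sketch. The description does not specify a storage engine or schema, so the use of SQLite, the table name `broken_links`, and the `checked_at` column are all assumptions made for illustration.

```python
import sqlite3


def store_broken_links(conn, urls, checked_at):
    """Record broken link URLs in a (hypothetical) broken_links table.

    A URL that is already present has its check time refreshed instead
    of being inserted a second time.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS broken_links ("
        "url TEXT PRIMARY KEY, checked_at TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO broken_links (url, checked_at) VALUES (?, ?)",
        [(url, checked_at) for url in urls],
    )
    conn.commit()
```

For example, calling `store_broken_links(sqlite3.connect("links.db"), broken, "2016-12-29T00:00:00")` would persist the broken URLs found for one time interval; the file name here is likewise only illustrative.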
In an optional embodiment, the first obtaining module includes: the acquisition module is used for acquiring log data of the website; the first division module is used for dividing the log data of the website into a plurality of data groups according to the types of the user and the user configuration data; the first selecting module is used for selecting the log data of any one data group; the second dividing module is used for dividing the log data of the selected data group into the log data of a plurality of preset time periods; and the second selection module is used for selecting the log data of any one preset time period.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for checking links in a website, comprising:
acquiring log data of a website in a preset time period, wherein the log data at least comprises: click data, page jump data and heartbeat data;
acquiring link data containing website links in the click data;
inquiring whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain an inquiry result;
determining whether the link is broken according to the query result;
inquiring whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain an inquiry result, wherein the inquiring comprises the following steps:
inquiring whether page jump data containing the link data exists from the log data;
if the log data contains page jump data containing the link data, acquiring a page identifier of the page jump data containing the link data;
inquiring whether heartbeat data containing the page identifier exists in the log data;
if the log data contains heartbeat data containing the page identifier, determining that page jump data and heartbeat data corresponding to the link data exist in the log data;
determining whether the link is broken according to the query result, wherein the determining comprises:
if page jump data and heartbeat data corresponding to the link data exist in the log data, determining that the link is a normal link;
and if the log data does not contain page jump data and heartbeat data corresponding to the link data, determining whether the link is broken according to a page request result of the link data.
2. The method of claim 1, wherein determining whether the link is broken according to the page request result of the link data comprises:
sending a page request of the link data;
receiving a page state code returned in response to the page request;
if the page state code is equal to a preset value, determining that the link is a normal link, wherein the preset value is used for representing that the request result of the page request is a request success;
and if the page state code is not equal to the preset value, determining that the link is broken.
3. The method of claim 2, wherein after determining that the link is broken, the method further comprises:
marking the link data;
and storing the marked link data into a broken link data table.
4. The method of claim 1, wherein obtaining log data of the website within a preset time period comprises:
collecting log data of the website;
dividing the log data of the website into a plurality of data groups according to the types of users and user configuration data;
selecting log data of any one data group;
dividing the log data of the selected data group into log data of a plurality of preset time periods;
and selecting log data of any one preset time period.
5. An apparatus for checking links in a website, comprising:
a first acquisition module, configured to acquire log data of a website in a preset time period, wherein the log data at least comprises: click data, page jump data and heartbeat data;
the second acquisition module is used for acquiring link data containing website links in the click data;
the first query module is used for querying whether page jump data and heartbeat data corresponding to the link data exist in the log data to obtain a query result;
the first determining module is used for determining whether the link is broken according to the query result;
wherein the first query module comprises:
the second query module is used for querying whether page jump data containing the link data exists in the log data;
a third obtaining module, configured to obtain a page identifier of the page jump data including the link data if the page jump data including the link data exists in the log data;
the third query module is used for querying whether heartbeat data containing the page identifier exists in the log data;
a second determining module, configured to determine that page jump data and heartbeat data corresponding to the link data exist in the log data if the log data includes the heartbeat data including the page identifier;
wherein the first determining module comprises:
a third determining module, configured to determine that the link is a normal link if page jump data and heartbeat data corresponding to the link data exist in the log data;
and a fourth determining module, configured to determine whether the link is broken according to a page request result of the link data if the log data does not include page jump data and heartbeat data corresponding to the link data.
6. The apparatus of claim 5, wherein the fourth determining module comprises:
the sending module is used for sending the page request of the link data;
the receiving module is used for receiving the page state code returned by responding to the page request;
a fifth determining module, configured to determine that the link is a normal link if the page status code is equal to a preset value, where the preset value is used to represent that a request result of the page request is a request success;
and the sixth determining module is used for determining that the link is broken if the page state code is not equal to the preset value.
CN201611248666.4A 2016-12-29 2016-12-29 Method and device for checking links in website Active CN108255868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611248666.4A CN108255868B (en) 2016-12-29 2016-12-29 Method and device for checking links in website

Publications (2)

Publication Number Publication Date
CN108255868A CN108255868A (en) 2018-07-06
CN108255868B true CN108255868B (en) 2020-11-24

Family

ID=62721411

Country Status (1)

Country Link
CN (1) CN108255868B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
GR01 Patent grant