CN108880921B - Webpage monitoring method and device, storage medium and server - Google Patents
- Publication number
- CN108880921B CN201710329418.0A
- Authority
- CN
- China
- Prior art keywords
- webpage content
- website
- content
- picture
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/50—Testing arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Environmental & Geological Engineering (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The application discloses a webpage monitoring method, which comprises the following steps: acquiring a website; acquiring and storing first content through the website at a first moment; acquiring second content through the website at a second moment; judging whether the difference between the first content and the second content is larger than a preset first threshold value or not; and when the difference between the first content and the second content is larger than the first threshold value, judging that the webpage content corresponding to the website is changed. The technical scheme provided by the application needs less workload when monitoring the change of the webpage, has high monitoring efficiency and can save system resources.
Description
Technical Field
The application relates to the technical field of internet, in particular to a webpage monitoring method.
Background
A web page is the basic element of a website and the platform that carries various website applications. A web page is a plain text file containing HyperText Markup Language (HTML) tags; it may be stored on a computer anywhere in the world and is one "page" of the World Wide Web. A web page is in HTML format, with the file extension .html or .htm, and is typically read by a web browser.
Disclosure of Invention
The application provides a webpage monitoring method, which is used for reducing the workload required by monitoring webpage changes, improving the monitoring efficiency and saving system resources.
The application provides a webpage monitoring device, which is used for reducing the workload required for monitoring webpage changes, improving the monitoring efficiency and saving system resources.
The application provides a computer-readable storage medium for reducing the workload required for monitoring the change of a webpage, improving the monitoring efficiency and saving system resources.
The embodiment of the application provides a webpage monitoring method, which comprises the following steps:
acquiring a website;
acquiring and storing first content through the website at a first moment;
acquiring second content through the website at a second moment;
judging whether the difference between the first content and the second content is larger than a preset first threshold value or not;
and when the difference between the first content and the second content is larger than the first threshold value, judging that the webpage content corresponding to the website is changed.
The embodiment of the application provides a webpage monitoring device, which comprises:
the website acquisition module is used for acquiring a website;
the content acquisition module is used for acquiring and storing first content through the website at a first moment and acquiring second content through the website at a second moment;
and the first judging module is used for judging whether the difference between the first content and the second content is greater than a preset first threshold value, and when the difference between the first content and the second content is greater than the first threshold value, judging that the webpage content corresponding to the website is changed.
An embodiment of the present application provides a computer-readable storage medium storing computer-readable instructions for execution by at least one processor to perform any one of the method embodiments provided by the embodiments of the present application.
In the embodiment of the application, a webpage address is obtained, a first content is obtained and stored through the website at a first moment, a second content is obtained through the website at a second moment, whether the difference between the first content and the second content is larger than a preset first threshold value or not is judged, and when the difference between the first content and the second content is larger than the first threshold value, the webpage content corresponding to the website is judged to be changed. By using the technical scheme provided by the embodiment of the application, the webpage content corresponding to a website can be obtained at different moments, the webpage contents obtained at the two moments are compared, and whether the webpage content corresponding to the website is changed or not is judged. By using the embodiment of the application, whether the webpage content corresponding to the website is changed or not can be judged by judging whether the difference of the webpage content corresponding to the same website at different moments is larger than a preset threshold value, less workload is needed for monitoring the webpage change by using the scheme, the monitoring efficiency is high, and system resources can be saved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic block diagram of an implementation environment to which embodiments of the present application relate;
Fig. 2 is a schematic flowchart of a web page monitoring method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a web page monitoring method according to an embodiment of the present application;
Fig. 3A shows a schematic diagram of a page for publishing advertisements;
Fig. 3B shows a schematic diagram of a landing page;
Fig. 4 is a schematic flowchart of a method for monitoring a web page by using a screenshot mode according to an embodiment of the present application;
Fig. 4A is a schematic flowchart of a method for determining whether web page content is completely loaded according to an embodiment of the present application;
Fig. 4B shows a screenshot of a landing page that is not fully loaded;
Fig. 4C illustrates a screenshot of a landing page containing an indication that loading is incomplete;
Fig. 5 is a schematic structural diagram of a web page monitoring device according to an embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of a web page monitoring device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic structural diagram of an implementation environment according to embodiments of the present application. As shown in fig. 1, the implementation environment includes: a web monitoring server 110, a web address storage server 120, a web content storage server 130, client devices 140-1, 140-2, and 140-3, and an IM server 150.
The web monitoring server 110, the web address storage server 120, and the web content storage server 130 may be a server, a server cluster composed of several servers, or a cloud computing service center.
The client devices 140-1, 140-2, and 140-3 may be PCs, notebook computers, mobile phones, tablet computers, smart televisions, or the like.
The web page monitoring server 110 may run a headless browser and cooperate with the web page address storage server 120 and the web page content storage server 130 to monitor changes in the web page content published by the client devices 140-1, 140-2, and 140-3. A headless browser is a browser without a user interface: it has the structure and functions of an ordinary browser but no physical display window, and its display window is a virtual browsing window. The virtual browsing window can simulate the display function of a window, but its contents are not shown on a display.
The web address storage server 120 is used for storing web addresses.
The web content storage server 130 is used for storing the web content corresponding to the web addresses stored by the web address storage server 120.
When the web page monitoring server 110 detects that the web page content corresponding to a certain web address has changed, it records the change event, generates a change notification, and sends the change notification to the IM server 150 through a wired or wireless network. The IM server 150 receives the change notification, obtains the website from the change notification, finds the client device corresponding to the website, for example, the client device 140-1, and sends a warning message to the client device 140-1 to notify the client device 140-1 that the web page content corresponding to the website, as monitored by the web page monitoring server 110, has been changed.
Fig. 2 is a schematic flowchart of a web page monitoring method according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps.
Step 203, acquiring second content through the website at a second moment.
In the embodiment of the application, a webpage address is obtained, a first content is obtained and stored through the website at a first moment, a second content is obtained through the website at a second moment, whether the difference between the first content and the second content is larger than a preset first threshold value or not is judged, and when the difference between the first content and the second content is larger than the first threshold value, the webpage content corresponding to the website is judged to be changed. By using the technical scheme provided by the embodiment of the application, the webpage content corresponding to a website can be acquired by using the headless browser running in the background at different moments, the webpage contents acquired at the two moments are compared, and whether the webpage content corresponding to the website is changed or not is judged. By using the embodiment of the application, whether the webpage content corresponding to the website is changed or not can be judged by judging whether the difference of the webpage content corresponding to the same website at different moments is larger than a preset threshold value, less workload is needed for monitoring the webpage change by using the scheme, the monitoring efficiency is high, and system resources can be saved.
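The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function names, the use of Python's `difflib.SequenceMatcher` as the difference measure, and the default threshold of 0.1 are all assumptions chosen for the example.

```python
import difflib


def content_difference(first: str, second: str) -> float:
    """Return a dissimilarity score in [0, 1]; 0.0 means the contents are identical.

    Assumption: SequenceMatcher's similarity ratio stands in for the
    unspecified "difference" between the first and second content.
    """
    return 1.0 - difflib.SequenceMatcher(None, first, second).ratio()


def page_changed(first: str, second: str, first_threshold: float = 0.1) -> bool:
    """Judge the page as changed when the difference exceeds the first threshold.

    The default threshold value is hypothetical.
    """
    return content_difference(first, second) > first_threshold
```

Here `first` would be the content stored at the first moment and `second` the content fetched at the second moment; any other difference measure could be substituted without changing the structure.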
Fig. 3 is a schematic flowchart of a web page monitoring method according to an embodiment of the present application. As shown in fig. 3, the method includes the following steps.
In step 301, the web page monitoring server may run a headless browser to obtain a web address.
In this step, the web page corresponding to the website may be a common web page, or a landing page for publishing multimedia information, such as advertisement information. A landing page is the first page a user reaches by clicking on information published in media, as shown in fig. 3B. FIG. 3A shows a schematic diagram of a page for publishing advertisements. In FIG. 3A, a web page containing a picture of a beverage is displayed. When a user clicks any area of the beverage picture on the client interface, the client device looks up the landing page address corresponding to the beverage picture according to a preset correspondence between the beverage picture and the landing page address, and sends the landing page address to a server. The server acquires the landing page content according to the landing page address and sends it to the client, i.e. the landing page content shown in fig. 3B. FIG. 3B shows a schematic diagram of a landing page. FIG. 3B contains an introduction to the beverage, provides interfaces for the user to communicate with the manufacturer, such as a telephone number and online consultation, and provides an official two-dimensional code.
In this embodiment, the address of the landing page is named as a landing page address, the address for displaying the multimedia information is named as a link address, and when the multimedia information is triggered, the web page content corresponding to the landing page address is pulled according to the pre-stored correspondence between the multimedia information and the landing page address, and the web page content corresponding to the landing page address is displayed. In this embodiment, the monitored address may be a link address or a landing page address. The multimedia information may be text information, such as web links, pictures, or videos. The multimedia information may be displayed in a browser, for example, a web page content corresponding to the link address is displayed in the browser, and the multimedia information is displayed in the web page content. The multimedia information can also be displayed in other application programs, for example, in a floating window form on a playing interface of a video player, when a user clicks the multimedia information, a browser is started, and the browser is utilized to pull and display the web page content corresponding to the landing page.
In the embodiment of the present application, the landing page may be monitored, and a web page corresponding to a link address containing multimedia information for linking to the landing page may also be monitored.
In this step, the headless browser may be a PhantomJS component, which uses the WebKit kernel. The PhantomJS component can read the code corresponding to the website, such as HTML code, and determine whether the HTML code includes information of a dynamic picture to be displayed, such as description information of the dynamic picture. For example, it may be determined whether the name formats of the pictures in the code include the name format of a dynamic picture; if so, it is determined that the code includes the information of the dynamic picture, and step 303 is performed; otherwise, step 304 is performed.
In this embodiment, it may be determined whether the code includes the information of a dynamic picture. When the code includes the information of a dynamic picture, the code is used to determine whether the web page content corresponding to the website changes. When the code does not include the information of a dynamic picture, whether the web page content changes is determined in a screenshot mode. In this way, a judgment method suited to the display attribute of the picture can be adopted, which improves the judgment efficiency and the judgment accuracy.
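One way to sketch the dynamic-picture check is to scan the `src` attributes of `<img>` tags for file name formats that usually denote animated images. The set of extensions, the function names, and the mode labels below are assumptions made for illustration; the application does not list concrete name formats.

```python
from html.parser import HTMLParser

# Assumption: these file name formats are treated as dynamic (animated) pictures.
ANIMATED_EXTENSIONS = (".gif", ".apng")


class _ImageSourceCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in the HTML code."""

    def __init__(self) -> None:
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def contains_dynamic_picture(html: str) -> bool:
    """Return True if any image name in the code has a dynamic-picture format."""
    collector = _ImageSourceCollector()
    collector.feed(html)
    for src in collector.sources:
        # Ignore query strings and fragments before checking the extension.
        path = src.split("?", 1)[0].split("#", 1)[0].lower()
        if path.endswith(ANIMATED_EXTENSIONS):
            return True
    return False


def choose_comparison_mode(html: str) -> str:
    """Compare by code when a dynamic picture is present, otherwise by screenshot."""
    return "code" if contains_dynamic_picture(html) else "screenshot"
```

The rationale for the branch is that an animated picture makes two screenshots of an unchanged page differ, so comparing the code is the more reliable choice in that case.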
In an embodiment of the present invention, it is also possible to directly determine whether the web content corresponding to the website changes by using a screenshot, or a code, or a screenshot plus a code without determining whether the code corresponding to the website contains information of a dynamic picture.
Fig. 4 is a schematic flowchart of a method for monitoring a web page by using a screenshot mode according to an embodiment of the present application. As shown in fig. 4, the method includes the following steps. In this embodiment, as an example, the headless browser is a PhantomJS component running on the web page monitoring server, and the web page is a landing page.
Step 401, the web page monitoring server runs a PhantomJS component, and obtains the landing page address from the web page address storage server by using the PhantomJS component.
Before this step, a first website, that is, the web page content corresponding to the link address, may be displayed, a click operation of a user on multimedia information in the web page content is received, a web page content acquisition request including the landing page address is generated according to a correspondence between pre-stored multimedia information and the landing page address, the web page content acquisition request including the landing page address is sent to a web page content storage server, and the web page content corresponding to the landing page address is received and displayed from the web page content storage server.
Step 402, the PhantomJS component obtains the first landing page content from the web content storage server according to the landing page address.
In this step, the PhantomJS component sets the size of the virtual browser window, and acquires the corresponding first landing page content from the web content storage server according to the size.
In this step, the set size of the virtual browser includes the size and aspect ratio of the virtual browser. And acquiring first landing page content matched with the size from the webpage content storage server according to the size of the virtual browser.
In step 403, the PhantomJS component determines whether the content of the first landing page is completely loaded. When the first landing page content is judged to be completely loaded, step 404 is executed, otherwise, the process is ended.
In this embodiment, whether the web page content changes is determined by using a screenshot mode. The precondition for the screenshot is that the web page is loaded completely. Only a screenshot of a completely loaded web page is meaningful for the subsequent judgment.
In this step, whether the first landing page content is completely loaded may be determined in any one of the following three ways, or in any combination.
The first mode is as follows: in the embodiment of the present application, the first landing page website is composed of a plurality of sub-websites. The first landing page content corresponding to the landing page address is equal to the sum of the landing page contents corresponding to the sub-websites respectively. The sub landing page content corresponding to each sub website is a part of the first landing page content. The judging method comprises the following steps:
when the PhantomJS component generates a webpage content acquisition request containing a sub-website, calling a counter to execute an operation of adding 1;
the PhantomJS component receives the webpage content corresponding to each sub-website from the webpage content storage server;
calling the counter to execute an operation of subtracting 1 each time the PhantomJS component receives the webpage content corresponding to one sub-website;
the PhantomJS component judges whether the webpage content acquisition request has been generated for each sub-website;
judging the count value of the counter after judging that the webpage content acquisition request has been generated for each sub-website;
and when the count value of the counter is the initial value, judging that the first landing page content is completely loaded.
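The counter logic of the first mode can be sketched as below. The class and method names are hypothetical, and the sketch models only the bookkeeping (add 1 per generated request, subtract 1 per received response), not the network requests themselves.

```python
class LoadCounter:
    """Tracks outstanding sub-website requests (hypothetical helper)."""

    def __init__(self, initial_value: int = 0) -> None:
        self.initial_value = initial_value
        self.value = initial_value
        self.requested = set()

    def on_request_generated(self, sub_website: str) -> None:
        """Add 1 when an acquisition request containing a sub-website is generated."""
        self.requested.add(sub_website)
        self.value += 1

    def on_content_received(self, sub_website: str) -> None:
        """Subtract 1 when the content for one sub-website has been received."""
        self.value -= 1

    def is_completely_loaded(self, all_sub_websites) -> bool:
        """Loaded when every sub-website was requested and the counter is back
        at its initial value."""
        all_requested = set(all_sub_websites) <= self.requested
        return all_requested and self.value == self.initial_value
```

A counter that returns to its initial value means every request that was sent has been answered; checking that all sub-websites were requested first avoids reporting success before the requests have even been issued.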
The PhantomJS component utilizes this first manner to determine whether the first landing page content has been completely loaded, as described below in one embodiment. Fig. 4A is a flowchart illustrating a method for determining whether web page content is completely loaded according to an embodiment of the present application. As shown in fig. 4A, the method includes the following steps.
In step 401A, the PhantomJS component obtains a landing page address including a plurality of sub-web addresses from the web address storage server.
In this step, the landing page address is stored in the web address storage server in the form of a list.
In step 403A, the PhantomJS component sends the first web content acquisition request to the web content storage server, and calls the counter to perform a subtraction operation of 1 when the first sub-web content returned by the web content storage server according to the sub-web address is received.
In this step, the PhantomJS component determines whether the End identifier of the first sub-web page content has been received, and if the End identifier is received, it determines that the first sub-web page content has been received.
The code that performs some of the functions in this flow is shown as follows:
If the landing page address also comprises other sub-websites besides this sub-website, webpage content acquisition requests continue to be generated for the other sub-websites.
In step 406A, it is determined that the first landing page content is completely loaded.
In step 407A, it is determined that the first landing page content is not completely loaded.
In an embodiment of the present application, after web content acquisition requests have been generated by using the sub-addresses, it is also possible not to determine whether a web content acquisition request including the sub-address has been generated for each sub-address. Instead, after a predetermined time, for example 30 seconds, the loading operation is ended regardless of whether a request has been generated for each sub-address, and it is directly determined whether the value of the counter is the initial value.
The second mode is as follows: the histogram of a picture contains three channels, namely the R, G and B channels. A color level is a value from 0 to 255, so the three RGB channels have 256 × 3 = 768 color levels in total. The pixel count of a color level is the number of pixels of the picture at that color level. The PhantomJS component performs a screenshot operation on the first landing page content at a first moment to generate a first picture, obtains a histogram of the first picture, and judges whether the number of color levels whose pixel count equals a first set value, for example 0, 1 or 2, exceeds a predetermined first threshold, or whether the number of color levels whose pixel count exceeds a second set value exceeds a predetermined second threshold, for example 3. For example, if the number of color levels whose pixel count is 0 exceeds the predetermined first threshold, or the number of color levels whose pixel count exceeds the second set value is greater than 3, it is determined that the first landing page content is not completely loaded. FIG. 4B shows a screenshot of a landing page that is not fully loaded. In FIG. 4B, the web page contains too many gray regions, so it is determined that the landing page is not fully loaded.
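A minimal sketch of the histogram check follows, assuming the screenshot is already available as a list of `(R, G, B)` tuples. The function names and all default threshold values are assumptions for illustration; the application gives only 3 as an example of the second threshold.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def rgb_histogram(pixels: Sequence[Pixel]) -> List[int]:
    """Return 768 pixel counts: levels 0-255 for R, then G, then B."""
    hist = [0] * 768
    for r, g, b in pixels:
        hist[r] += 1
        hist[256 + g] += 1
        hist[512 + b] += 1
    return hist


def looks_incompletely_loaded(
    pixels: Sequence[Pixel],
    empty_level_threshold: int = 700,   # hypothetical first threshold
    dominant_pixel_count: int = 1000,   # hypothetical second set value
    dominant_level_threshold: int = 3,  # example second threshold from the text
) -> bool:
    """Flag the screenshot when too many color levels are empty, or when too
    many color levels each hold more than dominant_pixel_count pixels."""
    hist = rgb_histogram(pixels)
    empty_levels = sum(1 for count in hist if count == 0)
    dominant_levels = sum(1 for count in hist if count > dominant_pixel_count)
    return (empty_levels > empty_level_threshold
            or dominant_levels > dominant_level_threshold)
```

A screenshot that is mostly one flat gray uses only a handful of the 768 color levels, which is why a large number of empty levels suggests the page did not render fully.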
The third mode is as follows: the PhantomJS component performs a screenshot operation on the first landing page content at a second moment to generate a second picture. The PhantomJS component traverses the second picture and judges whether it contains an identifier indicating incomplete loading; if so, it is determined that the first landing page content is not completely loaded. In the process of loading the first landing page content in the virtual browser window, if the first landing page fails to load or contains a part that fails to load, the part that fails to load is replaced by an identifier indicating incomplete loading, such as a question mark pattern, which functions as a placeholder. FIG. 4C illustrates a screenshot of a landing page containing such an identifier. Therefore, when the PhantomJS component takes a screenshot of a landing page containing the identifier indicating incomplete loading, the generated second picture also contains that identifier.
Step 404, when the PhantomJS component determines that the first landing page content is completely loaded, executing a screenshot operation on the first landing page content at a third moment, and obtaining and storing a third picture.
The third moment may be a certain time after a content push agreement is signed with the content pushing party and the pushing of content by using the first landing page has started.
In step 405, after a predetermined time, the PhantomJS component acquires the landing page address from the web page address storage server.
In step 406, the PhantomJS component obtains second landing page content from the web content storage server according to the landing page address.
In this step, the PhantomJS component obtains the second landing page content from the web content storage server, also according to the size of the virtual browser window. The size of the virtual browser window used in this step is the same as that used in step 402, i.e., has the same size and aspect ratio. Therefore, the sizes and the aspect ratios of the pictures obtained by executing the screenshot operation on the first landing page content and the second landing page content can be ensured to be consistent.
In step 407, the PhantomJS component determines whether the second landing page content is completely loaded. When the second landing page is judged to be completely loaded, step 408 is executed, otherwise, the process is ended.
In this step, the method for determining whether the second landing page content is completely loaded is the same as the method for determining whether the first landing page content is completely loaded in step 403.
In step 409, the PhantomJS component acquires the stored third picture and the fourth picture, and determines whether the difference between the third picture and the fourth picture is greater than a set threshold. If the difference is greater than the set threshold, step 410 is executed; otherwise, it is concluded that the web page content corresponding to the web address has not changed.
In this step, the PhantomJS component may respectively calculate a first hash value and a second hash value of the third picture and the fourth picture by using a perceptual hash algorithm, calculate a hamming distance between the first hash value and the second hash value, determine whether the hamming distance is greater than the preset threshold, for example, 20, and determine that a difference between the third picture and the fourth picture is greater than the preset threshold when the hamming distance is greater than the preset threshold.
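The hash comparison can be sketched as below. Note the substitution: this example uses a simple average hash (one common, simplified member of the perceptual hash family) rather than any specific algorithm named by the application, and it assumes each picture is already a grayscale grid (a list of rows of 0-255 values) at least 8 × 8 in size. The function names are hypothetical; the threshold of 20 is the example value from the text.

```python
from typing import List

Gray = List[List[int]]


def shrink(image: Gray, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging blocks (needs dimensions >= size)."""
    height, width = len(image), len(image[0])
    out = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(image: Gray, size: int = 8) -> int:
    """Return a size*size-bit hash: one bit per cell, set when the cell is
    brighter than the mean of all cells."""
    values = [v for row in shrink(image, size) for v in row]
    mean = sum(values) / len(values)
    bits = 0
    for v in values:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def pictures_differ(first: Gray, second: Gray, threshold: int = 20) -> bool:
    """True when the Hamming distance between the hashes exceeds the threshold."""
    return hamming_distance(average_hash(first), average_hash(second)) > threshold
```

Because both screenshots are taken with the same virtual window size, the two grids have the same dimensions, so their hashes are directly comparable bit by bit.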
In an embodiment of the present application, the PhantomJS component may also respectively calculate a first histogram and a second histogram of the third picture and the fourth picture, calculate a mean square error of the first histogram and the second histogram, and determine whether the mean square error is greater than a preset threshold, for example, 0.2. And if the mean square error is larger than the preset threshold, judging that the difference between the third picture and the fourth picture is larger than the set threshold.
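The histogram comparison can be sketched as follows, assuming each picture is a list of `(R, G, B)` tuples. The normalization (dividing each count by the number of pixels) is an assumption: the application does not say how the histograms are scaled, and the scale determines which threshold values are meaningful, so the threshold is left as a required parameter rather than defaulted to the example value of 0.2.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def normalized_histogram(pixels: Sequence[Pixel]) -> List[float]:
    """768-bin RGB histogram with each bin divided by the pixel count
    (assumed normalization; requires at least one pixel)."""
    hist = [0] * 768
    for r, g, b in pixels:
        hist[r] += 1
        hist[256 + g] += 1
        hist[512 + b] += 1
    total = len(pixels)
    return [count / total for count in hist]


def histogram_mse(first: Sequence[float], second: Sequence[float]) -> float:
    """Mean square error between two histograms of equal length."""
    return sum((a - b) ** 2 for a, b in zip(first, second)) / len(first)


def histograms_differ(first_pixels: Sequence[Pixel],
                      second_pixels: Sequence[Pixel],
                      threshold: float) -> bool:
    """True when the mean square error exceeds the caller-supplied threshold."""
    mse = histogram_mse(normalized_histogram(first_pixels),
                        normalized_histogram(second_pixels))
    return mse > threshold
```

With this normalization, two pictures of different flat colors give a mean square error of only 6/768, so a threshold would need to be calibrated against the chosen scaling before use.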
Step 410, the PhantomJS component acquires and stores the first HTML code corresponding to the landing page address from the web content storage server at the fifth moment.
Step 411, the PhantomJS component acquires the second HTML code corresponding to the landing page address from the web content storage server at the sixth time.
Step 412, comparing the first HTML code with the second HTML code; when the difference between the first HTML code and the second HTML code is greater than the set threshold, determining that the landing page content corresponding to the landing page address is changed; otherwise, concluding that the web page content corresponding to the website is not changed.
In an embodiment of the present application, a code comparison tool, for example, a diff tool in the Linux system, may be used to compare the line number difference between the first HTML code and the second HTML code, obtain an absolute value of the line number difference, and when the absolute value of the line number difference is determined to be greater than a predetermined line number threshold, determine that the difference between the first HTML code and the second HTML code is greater than a set threshold.
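A line-based comparison can be sketched with Python's standard `difflib` module standing in for the Linux diff tool. The text can be read in two ways, so both are shown: the absolute difference between the two line counts, and the number of lines a diff reports as added or removed. The function names and which reading to apply are assumptions.

```python
import difflib


def line_count_delta(first_html: str, second_html: str) -> int:
    """Absolute difference between the number of lines in the two codes."""
    return abs(len(first_html.splitlines()) - len(second_html.splitlines()))


def changed_line_count(first_html: str, second_html: str) -> int:
    """Number of lines that a line-by-line diff marks as added or removed."""
    diff = difflib.ndiff(first_html.splitlines(), second_html.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def html_differs(first_html: str, second_html: str, line_threshold: int) -> bool:
    """True when the changed-line count exceeds the line number threshold.

    Assumption: the changed-line reading is used here; swap in
    line_count_delta for the other reading.
    """
    return changed_line_count(first_html, second_html) > line_threshold
```

The changed-line reading also catches edits that leave the total number of lines unchanged, which the line-count reading would miss.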
In an embodiment of the application, a hash value of the first HTML code may be calculated by using a hash processing tool, for example, a simhash algorithm, to obtain a third hash value, a hash value of the second HTML code is calculated to obtain a fourth hash value, a hamming distance between the third hash value and the fourth hash value is compared, whether the hamming distance is greater than a predetermined threshold value is determined, and when the hamming distance is determined to be greater than the predetermined threshold value, it is determined that a difference between the first HTML code and the second HTML code is greater than a set threshold value.
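A simhash comparison can be sketched as below. This is a simplified, unweighted variant written for illustration: the tokenization by word characters, the use of MD5 as the per-token hash, the 64-bit fingerprint width, and the function names are all assumptions, and the threshold is left to the caller because the application gives no example value for it.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute an unweighted simhash fingerprint of the text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


def html_differs(first_html: str, second_html: str, threshold: int) -> bool:
    """True when the fingerprints differ in more than threshold bits."""
    return hamming_distance(simhash(first_html), simhash(second_html)) > threshold
```

Unlike a cryptographic hash of the whole document, a simhash changes only slightly when the text changes slightly, which is what makes a Hamming-distance threshold meaningful.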
Fig. 5 is a schematic structural diagram of a web page monitoring device according to an embodiment of the present application. As shown in fig. 5, the apparatus includes: a website address obtaining module 501, a content obtaining module 502 and a first judging module 503.
The website address obtaining module 501 is configured to obtain a website address.
The content obtaining module 502 is configured to obtain and store a first content through the website at a first time, and obtain a second content through the website at a second time.
The first determining module 503 is configured to determine whether a difference between the first content and the second content is greater than a preset first threshold, and determine that the web content corresponding to the website is changed when the difference between the first content and the second content is greater than the first threshold.
In an embodiment of the present application, the content obtaining module 502 is further configured to:
set the size of a virtual browsing window, load the first webpage content in the virtual browsing window according to the size at the first time, determine whether the first webpage content is completely loaded, and, when it is determined to be completely loaded, perform a screenshot operation on the first webpage content to obtain and store a first picture; and
load the second webpage content in the virtual browsing window according to the size at the second time, determine whether the second webpage content is completely loaded, and, when it is determined to be completely loaded, perform a screenshot operation on the second webpage content to obtain a second picture.
The first determining module 503 is further configured to determine whether a difference between the first picture and the second picture is greater than the first threshold.
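The picture comparison can be sketched as follows. The text does not say how the difference between two pictures is measured, so this sketch assumes one simple metric: the fraction of pixels that differ between two same-sized grayscale screenshots, represented here as nested lists. The names `picture_difference` and `page_changed` are hypothetical.

```python
from typing import Sequence

Picture = Sequence[Sequence[int]]  # rows of grayscale pixel values (assumed format)


def picture_difference(first: Picture, second: Picture, tolerance: int = 0) -> float:
    """Fraction of pixels whose values differ by more than `tolerance`."""
    if len(first) != len(second) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(first, second)
    ):
        raise ValueError("pictures must have the same size")
    total = sum(len(row) for row in first)
    if total == 0:
        return 0.0
    differing = sum(
        1
        for row_a, row_b in zip(first, second)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return differing / total


def page_changed(first: Picture, second: Picture, first_threshold: float) -> bool:
    """True when the picture difference exceeds the first threshold."""
    return picture_difference(first, second) > first_threshold


before = [[0, 0], [255, 255]]
after = [[0, 0], [255, 0]]
```

Loading both pages in a virtual browsing window of the same size is what makes a pixel-by-pixel comparison like this meaningful, since both screenshots then have identical dimensions.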
In an embodiment of the present application, the website includes a plurality of sub-websites. The content obtaining module 502 is further configured to: generate, for each sub-website, a web content obtaining request including the sub-website, and send the request to a web content storage server; call a counter to add 1 each time a web content obtaining request including a sub-website is generated; receive the web content corresponding to each sub-website from the web content storage server; call the counter to subtract 1 each time the web content corresponding to one sub-website is received; determine whether a web content obtaining request has been generated for every sub-website; and, once a request has been generated for every sub-website, check the count value of the counter. When the count value of the counter equals the initial value, the first webpage content is determined to be completely loaded.
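The counter-based load check can be sketched as follows. This is a simplified, single-threaded illustration: the class name, the helper function, and the example URLs are hypothetical, and real network requests are replaced by plain loops.

```python
class PendingRequestCounter:
    """Counter that goes up per generated request and down per received response."""

    def __init__(self, initial_value: int = 0) -> None:
        self.initial_value = initial_value
        self.value = initial_value

    def request_generated(self) -> None:
        self.value += 1

    def content_received(self) -> None:
        self.value -= 1

    def at_initial_value(self) -> bool:
        return self.value == self.initial_value


def is_fully_loaded(sub_urls, requested_urls, counter) -> bool:
    """Loaded only if every sub-website was requested and no response is pending."""
    all_requested = set(sub_urls) <= set(requested_urls)
    return all_requested and counter.at_initial_value()


sub_urls = ["https://example.com/a.css", "https://example.com/b.png"]
counter = PendingRequestCounter()
requested = []

for url in sub_urls:  # a request is generated for each sub-website
    requested.append(url)
    counter.request_generated()
still_loading = is_fully_loaded(sub_urls, requested, counter)

for _ in sub_urls:  # the content for each sub-website arrives
    counter.content_received()
done = is_fully_loaded(sub_urls, requested, counter)
```

Checking that every request has been generated before reading the counter matters: otherwise the counter could momentarily return to its initial value between two requests and the page would be reported as loaded too early.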
In an embodiment of the application, the content obtaining module 502 is further configured to perform a screenshot operation on the first webpage content at a third time to generate a third picture, obtain a histogram of the third picture, and determine that the first webpage content is not completely loaded when the number of color levels in the histogram whose pixel count equals a first set value exceeds a predetermined second threshold, or the number of color levels in the histogram whose pixel count exceeds a second set value exceeds a predetermined third threshold.
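The histogram check can be sketched as follows. The wording of the condition is ambiguous, so this sketch rests on one reading: count the color levels whose pixel count equals the first set value, count the color levels whose pixel count exceeds the second set value, and compare each count with its threshold. The function names, the 256-level grayscale representation, and the sample values are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], levels: int = 256) -> List[int]:
    """Pixel count per color level for a flat sequence of grayscale values."""
    counts = Counter(pixels)
    return [counts.get(level, 0) for level in range(levels)]


def looks_incompletely_loaded(histogram: Sequence[int],
                              first_set_value: int, second_threshold: int,
                              second_set_value: int, third_threshold: int) -> bool:
    """Apply the two histogram conditions under the reading stated above."""
    levels_at_first_value = sum(1 for n in histogram if n == first_set_value)
    levels_above_second_value = sum(1 for n in histogram if n > second_set_value)
    return (levels_at_first_value > second_threshold
            or levels_above_second_value > third_threshold)


# A blank white screenshot: every pixel sits on a single color level.
blank = [255] * 10000
# A varied screenshot: pixels spread evenly over all 256 levels.
varied = list(range(256)) * 40
```

With a first set value of 0, the first condition flags a screenshot in which most color levels are unused, which is typical of a blank or partly rendered page.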
In an embodiment of the present application, the content obtaining module 502 is further configured to perform a screenshot operation on the first webpage content at a fourth time, generate a fourth picture, traverse the fourth picture, determine whether the fourth picture includes an identifier indicating that loading is incomplete, and determine that loading of the first webpage content is not completed when it is determined that the fourth picture includes the identifier indicating that loading is incomplete.
In an embodiment of the present application, the apparatus further comprises a second determining module 504, configured to determine whether the difference between the first picture and the second picture is greater than a fourth threshold. When the difference between the first picture and the second picture is determined to be greater than the fourth threshold, the module obtains and stores a first HTML code corresponding to the website at a fifth time, obtains a second HTML code corresponding to the website at a sixth time, and compares the first HTML code with the second HTML code. When the difference between the first HTML code and the second HTML code is greater than a fifth threshold, the module determines that the content corresponding to the website is changed.
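The two-stage check can be sketched as follows. The picture difference is taken as an already computed number, and the HTML fetcher and HTML difference metric are passed in as callables, since the text leaves both open; all names here are hypothetical.

```python
from typing import Callable


def confirm_change(picture_diff: float, fourth_threshold: float,
                   fetch_html: Callable[[], str],
                   html_diff: Callable[[str, str], float],
                   fifth_threshold: float) -> bool:
    """Confirm a screenshot-level difference with an HTML-level comparison."""
    if picture_diff <= fourth_threshold:
        return False  # pictures are close enough; no HTML check needed
    first_html = fetch_html()   # snapshot at the fifth time
    second_html = fetch_html()  # snapshot at the sixth time
    return html_diff(first_html, second_html) > fifth_threshold


# Stand-ins for real fetches and for a real difference metric.
snapshots = iter(["<p>a</p>", "<p>a</p>\n<p>b</p>\n<p>c</p>"])


def line_gap(a: str, b: str) -> int:
    return abs(len(a.splitlines()) - len(b.splitlines()))


result = confirm_change(0.4, 0.2, lambda: next(snapshots), line_gap, 1)
```

Running the HTML comparison only after the pictures differ keeps the second fetch off the common path where nothing has changed.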
In an embodiment of the present application, the apparatus further comprises: a third determining module 505, configured to determine whether the first content corresponding to the website includes information of a dynamic picture, and instruct the content obtaining module to perform a screenshot operation on the first webpage content when it is determined that the first content does not include the information of the dynamic picture.
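The dynamic-picture check can be sketched as follows. The text does not define what counts as information of a dynamic picture, so this sketch assumes two hypothetical signals: an `img` tag whose `src` ends in `.gif`, or a `video` tag. The choice of the HTML comparison for pages with dynamic pictures follows claim 1.

```python
from html.parser import HTMLParser


class DynamicPictureFinder(HTMLParser):
    """Flags HTML that references animated content (assumed: GIF images or video)."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.found = True
        elif tag == "img":
            src = dict(attrs).get("src") or ""
            # Ignore any query string before checking the file extension.
            if src.lower().split("?")[0].endswith(".gif"):
                self.found = True


def contains_dynamic_picture(html: str) -> bool:
    finder = DynamicPictureFinder()
    finder.feed(html)
    finder.close()
    return finder.found


def choose_comparison(html: str) -> str:
    """Pick the HTML comparison for animated pages, screenshots otherwise."""
    return "html" if contains_dynamic_picture(html) else "screenshot"
```

A screenshot of a page with animated content can differ between two captures even when the page itself is unchanged, which is why such pages are routed to the HTML comparison.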
In an embodiment of the present application, the content obtaining module 502 is further configured to obtain and store a third HTML code corresponding to the website at the first time, and obtain a fourth HTML code corresponding to the website at the second time. The first determining module 503 is further configured to determine whether a difference between the third HTML code and the fourth HTML code is greater than the first threshold.
In an embodiment of the present application, the apparatus further comprises a web page content display module 506, configured to display the web page content corresponding to the first website.
In an embodiment of the present application, the apparatus further comprises a web content pull module 507, configured to receive a click operation of a user on multimedia information in the web content, generate a web content acquisition request including a landing page address according to a pre-stored correspondence between the multimedia information and the landing page address, send the web content acquisition request including the landing page address to a web content storage server, receive the web content corresponding to the landing page address from the web content storage server, and send the web content corresponding to the landing page address to the web content display module 506 for display.
Fig. 6 is a schematic structural diagram of a web page monitoring device according to an embodiment of the present disclosure. As shown in fig. 6, the web page monitoring apparatus may include: a processor 601, a non-volatile computer-readable memory 602, a display unit 603, a network communication interface 604. These components communicate over a bus 605.
In this embodiment, the memory 602 stores a plurality of program modules, including: an application 606, a network communication module 607, and an operating system 608.
The processor 601 may read the modules (not shown in the figure) included in the application program in the memory 602 to execute the various functional applications and data processing of the web page monitoring device. There may be one or more processors 601 in this embodiment, and each may be a CPU, a processing unit/module, an ASIC, a logic module, a programmable gate array, or the like.
The application programs 606 may include: an application installed and running on the mobile terminal.
In this embodiment, the network communication interface 604 and the network communication module 607 cooperate to complete the transceiving of various network signals of the web page monitoring device, such as the communication with the web page address storage server 120 and the web page content storage server 130.
The display unit 603 has a display panel for inputting and displaying related information.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The functional modules of the embodiments may be located in one terminal or network node, or may be distributed over a plurality of terminals or network nodes.
In addition, each of the embodiments of the present application can be realized by a data processing program executed by, for example, a computer. It is clear that a data processing program constitutes the present application. Further, the data processing program, which is generally stored in one storage medium, is executed by directly reading the program out of the storage medium or by installing or copying the program into a storage device (such as a hard disk and/or a memory) of the data processing device. Such a storage medium therefore also constitutes the present application. The storage medium may use any type of recording means, such as a paper storage medium (e.g., paper tape, etc.), a magnetic storage medium (e.g., a flexible disk, a hard disk, a flash memory, etc.), an optical storage medium (e.g., a CD-ROM, etc.), a magneto-optical storage medium (e.g., an MO, etc.), and the like.
The present application further provides a computer-readable storage medium having computer-readable instructions stored thereon for execution by at least one processor to perform any one of the embodiments of the methods described herein.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.
Claims (16)
1. A webpage monitoring method is applied to a webpage monitoring server, and the method comprises the following steps:
operating a headless browser to acquire a website; setting the size of a virtual browsing window through the headless browser;
acquiring first webpage content matched with the size and a third hypertext markup language (HTML) code corresponding to the website through the website at a first moment;
when the third HTML code is judged to contain the information of the dynamic picture to be displayed, a fourth HTML code corresponding to the website is obtained through the website at a second moment, and whether the webpage content corresponding to the website is changed or not is judged by using whether the difference between the third HTML code and the fourth HTML code is larger than a first threshold value or not;
and when the third HTML code is judged not to contain the information of the dynamic picture to be displayed, acquiring second webpage content adaptive to the size through the website at a second moment, and judging whether the webpage content corresponding to the website is changed or not by using a screenshot mode.
2. The method of claim 1,
the determining whether the difference between the first web page content and the second web page content is greater than the first threshold value comprises:
loading the first webpage content in the virtual browsing window, judging whether the first webpage content is completely loaded, and executing screenshot operation on the first webpage content to obtain and store a first picture when the first webpage content is completely loaded;
loading the second webpage content in the virtual browsing window, judging whether the second webpage content is completely loaded, and executing screenshot operation on the second webpage content to obtain a second picture when the second webpage content is completely loaded;
and judging whether the difference between the first picture and the second picture is larger than the first threshold value.
3. The method of claim 2, wherein the web address comprises a plurality of sub-websites, and the step of judging whether the first webpage content is completely loaded comprises the following steps:
generating a webpage content acquisition request containing the sub-websites for each sub-website, and sending the webpage content acquisition request to a webpage content storage server;
calling a counter to execute an operation of adding 1 when generating a webpage content acquisition request containing the sub-website;
receiving the webpage content corresponding to each sub-website from the webpage content storage server;
calling the counter to execute the operation of subtracting 1 after receiving the webpage content corresponding to one sub-website;
judging whether the webpage content acquisition request is generated for each sub-website or not;
judging the count value of the counter after judging that the webpage content acquisition request is generated for each sub-website;
and when the counting value of the counter is the initial value, judging that the first webpage content is completely loaded.
4. The method of claim 2, wherein determining whether the first web content is fully loaded comprises:
executing screenshot operation on the first webpage content at a third moment to generate a third picture;
acquiring a histogram of the third picture;
and when the number of color levels in the histogram whose pixel count is a first set value is judged to exceed a preset second threshold value, or the number of color levels in the histogram whose pixel count is a second set value is judged to exceed a preset third threshold value, judging that the first webpage content is not completely loaded.
5. The method of claim 2, wherein determining whether the first web content is fully loaded comprises:
executing screenshot operation on the first webpage content at a fourth time to generate a fourth picture;
traversing the fourth picture;
judging whether the fourth picture contains an identifier indicating incomplete loading;
and when the fourth picture is judged to contain the identification indicating incomplete loading, judging that the first webpage content is not completely loaded.
6. The method of claim 2, further comprising:
judging whether the difference between the first picture and the second picture is larger than a fourth threshold value or not;
when the difference between the first picture and the second picture is judged to be larger than the fourth threshold value, acquiring and storing a first HTML code corresponding to the website at a fifth moment, and acquiring a second HTML code corresponding to the website at a sixth moment;
comparing differences of the first HTML code and the second HTML code;
and when the difference between the first HTML code and the second HTML code is larger than a fifth threshold value, judging that the content corresponding to the website is changed.
7. The method of claim 1, further comprising:
and judging whether the third HTML code contains the information of the dynamic picture to be displayed.
8. The method according to claim 7, wherein the information of the dynamic picture to be displayed refers to description information of the dynamic picture.
9. The method of claim 1, wherein the headless browser is a PhantomJS component.
10. The method of claim 1, wherein the web address comprises: a landing page address; the method further comprises the following steps:
displaying the webpage content corresponding to the first website;
responding to the operation of the user on the multimedia information in the webpage content;
generating a webpage content acquisition request containing the landing page address according to the pre-stored corresponding relation between the multimedia information and the landing page address;
sending the webpage content acquisition request containing the landing page address to a webpage content storage server;
and receiving and displaying the webpage content corresponding to the landing page address from the webpage content storage server.
11. A web page monitoring device, comprising:
the website acquisition module is used for operating a headless browser to acquire a website;
the content acquisition module is used for setting the size of a virtual browsing window through the headless browser; acquiring first webpage content matched with the size and a third hypertext markup language (HTML) code corresponding to the website through the website at a first moment; when the third HTML code is judged to contain the information of the dynamic picture to be displayed, a fourth HTML code corresponding to the website is obtained through the website at a second moment; when the third HTML code does not contain the information of the dynamic picture to be displayed, acquiring second webpage content matched with the size through the website at a second moment;
the first judging module is used for judging whether the webpage content corresponding to the website is changed or not by utilizing whether the difference between the third HTML code and the fourth HTML code is larger than a first threshold value or not when the third HTML code is judged to contain the information of the dynamic picture to be displayed; and when the third HTML code is judged not to contain the information of the dynamic picture to be displayed, judging whether the webpage content corresponding to the website is changed or not by using a screenshot mode.
12. The apparatus of claim 11,
the content acquisition module is used for loading the first webpage content in the virtual browsing window, judging whether the first webpage content is completely loaded or not, and executing screenshot operation on the first webpage content to obtain and store a first picture when the first webpage content is judged to be completely loaded; loading the second webpage content in the virtual browsing window, judging whether the second webpage content is completely loaded, and executing screenshot operation on the second webpage content to obtain a second picture when the second webpage content is completely loaded;
the first judging module is used for judging whether the difference between the first picture and the second picture is larger than the first threshold value.
13. The apparatus of claim 12, wherein the web address comprises: a plurality of child web addresses;
the content acquisition module is used for generating a webpage content acquisition request containing the sub-websites for each sub-website, sending the webpage content acquisition request to a webpage content storage server, calling a counter to execute an operation of adding 1 when generating a webpage content acquisition request containing the sub-websites, receiving webpage content corresponding to each sub-website from the webpage content storage server, after receiving the web page content corresponding to each sub-website, calling the counter to execute the operation of subtracting 1, judging whether the web page content acquisition request is generated for each sub-website, judging the counting value of the counter after judging that the webpage content acquisition request is generated for each sub-website, and when the counting value of the counter is the initial value, judging that the first webpage content is completely loaded.
14. The apparatus of claim 11, wherein the web address comprises: a landing page address; the apparatus further comprises:
the webpage content display module is used for displaying webpage content corresponding to the first website;
the webpage content pulling module is used for responding to the operation of a user on the multimedia information in the webpage content, generating a webpage content obtaining request containing the landing page address according to the pre-stored corresponding relation between the multimedia information and the landing page address, sending the webpage content obtaining request containing the landing page address to a webpage content storage server, receiving the webpage content corresponding to the landing page address from the webpage content storage server, and sending the webpage content corresponding to the landing page address to the webpage content display module for displaying.
15. A computer-readable storage medium having computer-readable instructions stored thereon for execution by at least one processor to perform the method of any one of claims 1 to 10.
16. A server comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710329418.0A CN108880921B (en) | 2017-05-11 | 2017-05-11 | Webpage monitoring method and device, storage medium and server |
PCT/CN2018/085961 WO2018205918A1 (en) | 2017-05-11 | 2018-05-08 | Webpage monitoring method and apparatus, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710329418.0A CN108880921B (en) | 2017-05-11 | 2017-05-11 | Webpage monitoring method and device, storage medium and server |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108880921A (en) | 2018-11-23 |
CN108880921B (en) | 2021-07-02 |
Family
ID=64104333
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710329418.0A Active CN108880921B (en) | 2017-05-11 | 2017-05-11 | Webpage monitoring method and device, storage medium and server |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108880921B (en) |
WO (1) | WO2018205918A1 (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109753790A (en) * | 2018-11-29 | 2019-05-14 | 武汉极意网络科技有限公司 | A kind of landing page monitoring method and system |
CN109740094A (en) * | 2018-12-27 | 2019-05-10 | 上海掌门科技有限公司 | Page monitoring method, equipment and computer storage medium |
CN109933739A (en) * | 2019-03-01 | 2019-06-25 | 重庆邮电大学移通学院 | A kind of Web page sequencing method and system based on transition probability |
CN109978626A (en) * | 2019-03-29 | 2019-07-05 | 上海幻电信息科技有限公司 | Web advertisement change monitoring method, apparatus and storage medium |
US10984067B2 (en) | 2019-06-26 | 2021-04-20 | Wangsu Science & Technology Co., Ltd. | Video generating method, apparatus, server, and storage medium |
CN110457624A (en) * | 2019-06-26 | 2019-11-15 | 网宿科技股份有限公司 | Video generation method, device, server and storage medium |
CN110798377B (en) * | 2019-10-17 | 2021-07-16 | 东软集团股份有限公司 | Monitoring image sending method and device, storage medium and electronic equipment |
CN110795676A (en) * | 2019-10-31 | 2020-02-14 | 北京知道创宇信息技术股份有限公司 | Website monitoring method and device, electronic equipment and storage medium |
CN113743970A (en) * | 2020-05-29 | 2021-12-03 | 北京达佳互联信息技术有限公司 | Method and device for detecting landing page |
CN112182452A (en) * | 2020-09-27 | 2021-01-05 | 中国平安财产保险股份有限公司 | Page component rendering processing method, device, equipment and computer readable medium |
CN113269587A (en) * | 2021-05-24 | 2021-08-17 | 上海妙契科技有限公司 | Method, device, storage medium and server for monitoring illegal advertisements |
CN114124487B (en) * | 2021-11-10 | 2023-12-01 | 恒安嘉新(北京)科技股份公司 | Webpage access realization method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591963A (en) * | 2011-12-30 | 2012-07-18 | 奇智软件(北京)有限公司 | Method and device for controlling webpage content loading |
CN103455603A (en) * | 2013-09-03 | 2013-12-18 | 小米科技有限责任公司 | Method and device for caching webpage content and loading webpage and terminal device |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW523697B (en) * | 2001-08-29 | 2003-03-11 | Synq Technology Inc | Automatic advertisement transaction system and method therefor |
US8886660B2 (en) * | 2008-02-07 | 2014-11-11 | Siemens Enterprise Communications Gmbh & Co. Kg | Method and apparatus for tracking a change in a collection of web documents |
CN102073654B (en) * | 2009-11-20 | 2012-12-19 | 富士通株式会社 | Methods and equipment for generating and maintaining web content extraction template |
CN102339290B (en) * | 2010-07-22 | 2013-12-11 | 北大方正集团有限公司 | Method and device for directionally acquiring webpage data information |
CN104077708A (en) * | 2013-03-28 | 2014-10-01 | 北京齐尔布莱特科技有限公司 | Advertisement putting screen capture method |
CN103678628B (en) * | 2013-12-19 | 2018-01-19 | 贝壳网际(北京)安全技术有限公司 | Information-pushing method and system |
CN104142987A (en) * | 2014-07-24 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Page content management method and device and terminal device |
CN105630843B (en) * | 2014-11-17 | 2019-04-12 | 广州市动景计算机科技有限公司 | Web evolution monitoring method and device |
CN105677658B (en) * | 2014-11-19 | 2020-07-28 | 阿里巴巴集团控股有限公司 | Page display method and device |
CN106407218B (en) * | 2015-07-31 | 2020-03-03 | 北京国双科技有限公司 | Navigation webpage detection method and device |
CN106547774B (en) * | 2015-09-21 | 2020-02-28 | 北京国双科技有限公司 | Website content detection method and device |
CN106547776B (en) * | 2015-09-21 | 2019-12-03 | 北京国双科技有限公司 | The detection method and device of web site contents |
- 2017-05-11: CN, application CN201710329418.0A, publication CN108880921B, status: Active
- 2018-05-08: WO, application PCT/CN2018/085961, publication WO2018205918A1, status: Application Filing
Non-Patent Citations (3)
Title |
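---|
"An enhanced model for effective navigation of a website using clustering technique"; S. Renuka; International Conference on Information Communication and Embedded Systems (ICICES2014); 20150209; full text * |
"Design and Implementation of a Cloud Monitoring System Based on Server Clusters"; 赵代梅; China Master's Theses Full-text Database, Information Science and Technology; 20160415; I140-359, full text * |
"Data Analysis System Based on an E-commerce Platform"; 廖静欣; China Master's Theses Full-text Database, Economics and Management Science; 20170415; J157-184, full text * |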
Also Published As
Publication number | Publication date |
---|---|
CN108880921A (en) | 2018-11-23 |
WO2018205918A1 (en) | 2018-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108880921B (en) | Webpage monitoring method and device, storage medium and server | |
US11128662B2 (en) | Method, client, and server for preventing web page hijacking | |
US20160335680A1 (en) | Securing expandable display advertisements in a display advertising environment | |
CN104426925B (en) | Web page resources acquisition methods and device | |
US20100083098A1 (en) | Streaming Information that Describes a Webpage | |
CN112948035A (en) | Method and device for controlling micro front-end page, terminal equipment and storage medium | |
US20090085921A1 (en) | Populate Web-Based Content Based on Space Availability | |
CN110457632B (en) | Webpage loading processing method and device | |
EP3528474B1 (en) | Webpage advertisement anti-shielding methods and content distribution network | |
CN110555179A (en) | Dynamic website script evidence obtaining method, terminal equipment and storage medium | |
CN112437318A (en) | Content display method, device and system and storage medium | |
CN109359260B (en) | Network page change monitoring method, device, equipment and medium | |
CN104881452B (en) | Resource address sniffing method, device and system | |
CN108933947B (en) | Bullet screen display method and device | |
CN113641924B (en) | Webpage interactive time point detection method and device, electronic equipment and storage medium | |
CN110866208A (en) | Responsive layout method, device and equipment for page | |
CN111783010B (en) | Webpage blank page monitoring method, device, terminal and storage medium | |
CN110334301B (en) | Page restoration method and device | |
JP2003536140A5 (en) | ||
CN108415746B (en) | Application interface display method and device, storage medium and electronic equipment | |
CN111083145A (en) | Message sending method and device and electronic equipment | |
CN113655977B (en) | Material display method and device, electronic equipment and storage medium | |
CN111310135A (en) | Watermark adding method and device based on virtual desktop | |
CN114640876A (en) | Multimedia service video display method and device, computer equipment and storage medium | |
CN111163138B (en) | Method, device and server for reducing network load during game |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||