CN112579851A - Page content crawling method and device, storage medium and equipment - Google Patents

Page content crawling method and device, storage medium and equipment

Info

Publication number
CN112579851A
Authority
CN
China
Prior art keywords
page
target page
target
data
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910935533.1A
Other languages
Chinese (zh)
Other versions
CN112579851B (en)
Inventor
满悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910935533.1A priority Critical patent/CN112579851B/en
Publication of CN112579851A publication Critical patent/CN112579851A/en
Application granted granted Critical
Publication of CN112579851B publication Critical patent/CN112579851B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure relates to a method, a device, a storage medium and a device for crawling page content, wherein the method comprises the following steps: determining the display height of a visual area and the current page length of a loaded target page; according to the current page length of the target page, controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to enable the target page to scroll to a target position, and displaying the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page; crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set. Therefore, the content data of the target page can be accurately crawled.

Description

Page content crawling method and device, storage medium and equipment
Technical Field
The present disclosure relates to the field of page content processing, and in particular, to a page content crawling method, device, storage medium, and apparatus.
Background
With the development of web technologies and crawler technologies, many long web pages must be scrolled multiple times before they can be fully displayed in the visible area of a display device. Websites have therefore begun to display long web pages using lazy loading of page data: when the long web page is first loaded, the page data is cached, and when a part of the page is displayed in the visible area, the data for that part is loaded from the cached page data (that is, the data is loaded on demand) and then displayed.
In the prior art, when crawling the page content data of a long web page displayed in this manner, the current page length of the long web page is usually determined first, and the page content data is then obtained by scrolling the long web page once, from the head of the page to the bottom. When data is crawled by scrolling in this way, there is no guarantee that on-demand data loading succeeds during scrolling, so the crawled page data may be incomplete.
Disclosure of Invention
The purpose of the present disclosure is to provide a page content crawling method, device, storage medium, and device that can accurately crawl content data of a page.
In order to achieve the above object, according to a first aspect of the present disclosure, there is provided a page content crawling method, including:
determining the display height of a visual area and the current page length of a loaded target page;
according to the current page length of the target page, controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to enable the target page to scroll to a target position, and displaying the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page;
crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set.
Optionally, after the step of crawling data, which is displayed in the visible area and is not crawled, in the target page and storing the crawled data in a page data set, the method further includes:
controlling the target page to scroll, and determining whether the target page has loaded new data when the target page is completely displayed in the visible area;
if the target page has not loaded new data, crawling the display data corresponding to the portion of the target page from the target position to the position at which the target page is completely displayed, and storing the display data in the page data set.
Optionally, after the step of determining whether the target page is loaded with new data, the method further includes:
if it is determined that the target page has loaded new data, re-detecting the current page length of the target page and, based on the re-detected current page length, returning to the step of controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to scroll the target page to a target position, and displaying the target page in the visible area.
Optionally, the method further comprises:
recording the number of data loads of the target page, wherein the number of data loads is initially zero and is incremented each time it is determined that the target page has loaded new data;
determining whether the number of data loads is smaller than a preset threshold;
if the number of data loads is smaller than the preset threshold, executing the step of determining whether the target page has loaded new data when the target page is completely displayed in the visible area;
if the number of data loads is not less than the preset threshold, when the target page is completely displayed in the visible area, crawling the display data corresponding to the portion of the target page from the target position to the position at which the target page is completely displayed, and storing the display data in the page data set.
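The load-count limit described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: `count_loads`, `loads_new_data` and `endless_feed` are hypothetical names, and the callback stands in for "fully display the page and determine whether new data was loaded".

```python
from typing import Callable


def count_loads(loads_new_data: Callable[[], bool], threshold: int) -> int:
    """Return how many data loads are accepted before the final crawl."""
    load_count = 0                     # the count is initially zero
    while load_count < threshold:      # below the threshold: keep checking
        if not loads_new_data():       # page fully shown, nothing new loaded
            break
        load_count += 1                # new data loaded: increment the count
    return load_count


def endless_feed() -> bool:
    """Hypothetical page that always has more data to load."""
    return True


accepted = count_loads(endless_feed, threshold=5)
```

With a page that never stops loading, the threshold bounds the number of scroll-and-load rounds, after which the remaining displayed data is crawled and the process ends.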
Optionally, the controlling, according to the current page length of the target page, the target page to sequentially scroll with the display height as a scroll distance, so that the target page is scrolled to a target position, and the target page is displayed in the visible area includes:
controlling the target page to scroll by taking the display height as a scroll distance so as to load data corresponding to the visual area when the target page is displayed in the visual area;
re-detecting the current page length of the target page, and determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page;
and if the target page is not scrolled to the target position, after a preset time period, returning to the step of controlling the target page to scroll by taking the display height as a scrolling distance until the target page is scrolled to the target position.
Optionally, the determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page includes:
determining whether a target length is greater than the display height, the target length being obtained by subtracting both the scrolled distance of the target page and the display height from the re-detected current page length of the target page;
and if the target length is not larger than the display height, determining that the target page is scrolled to the target position.
According to a second aspect of the present disclosure, there is provided a page content crawling apparatus, the apparatus comprising:
the first determining module is used for determining the display height of the visible area and the current page length of the loaded target page;
the scrolling module is used for controlling the target page to sequentially scroll by taking the display height as a scrolling distance according to the current page length of the target page so as to enable the target page to scroll to a target position and display the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page;
the first crawling module is used for crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set.
Optionally, the apparatus further comprises:
the second determining module is used for controlling the target page to scroll and determining whether the target page has loaded new data when the target page is completely displayed in the visible area;
and the second crawling module is used for crawling, when the target page has not loaded new data, the display data corresponding to the portion of the target page from the target position to the position at which the target page is completely displayed, and storing the display data into the page data set.
Optionally, the apparatus further comprises:
the detection module is used for re-detecting the current page length of the target page when it is determined that the target page has loaded new data, and triggering the scrolling module to control, according to the re-detected current page length, the target page to sequentially scroll by taking the display height as a scroll distance, so that the target page is scrolled to a target position and displayed in the visible area.
Optionally, the apparatus further comprises:
the recording module is used for recording the number of data loads of the target page, wherein the number of data loads is initially zero and is incremented each time it is determined that the target page has loaded new data;
the third determining module is used for determining whether the number of data loads is smaller than a preset threshold;
the second determining module is configured to determine, when the number of data loads is smaller than the preset threshold, whether the target page has loaded new data when the target page is completely displayed in the visible area;
and the third crawling module is used for crawling, when the number of data loads is not less than the preset threshold and the target page is completely displayed in the visible area, the display data corresponding to the portion of the target page from the target position to the position at which the target page is completely displayed, and storing the display data into the page data set.
Optionally, the scrolling module comprises:
the control submodule is used for controlling the target page to scroll by taking the display height as a scroll distance so as to load data corresponding to the visual area when the target page is displayed in the visual area;
the first determining submodule is used for re-detecting the current page length of the target page and determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page;
and the triggering sub-module is used for triggering the control sub-module to control the target page to scroll by taking the display height as a scrolling distance after a preset time period under the condition that the target page is not scrolled to the target position until the target page is scrolled to the target position.
Optionally, the first determining sub-module includes:
a second determining submodule, configured to determine whether a target length is greater than the display height, the target length being obtained by subtracting both the scrolled distance of the target page and the display height from the re-detected current page length of the target page;
and the third determining sub-module is used for determining that the target page is scrolled to the target position under the condition that the target length is not greater than the display height.
According to a third aspect of the present disclosure, there is provided a storage medium having stored thereon a program which, when executed by a processor, performs the steps of the method of any one of the above-mentioned first aspects.
According to a fourth aspect of the present disclosure, there is provided an apparatus comprising:
at least one processor, and at least one memory and a bus connected to the processor;
the processor and the memory complete mutual communication through the bus;
the processor is configured to call program instructions in the memory to perform the steps of any of the above methods of the first aspect.
In the above technical solution, the display height of the visible area and the current page length of the loaded target page are determined, and the target page is controlled, according to its current page length, to scroll sequentially by taking the display height as the scroll distance, so that the target page is scrolled to the target position and displayed in the visible area. This helps ensure that data is loaded on demand correctly while the page scrolls. Moreover, the page data is crawled once the target page has been scrolled to the target position, which ensures the completeness and ordering of the crawled page data, effectively reduces the number of crawls, and improves the efficiency of page content crawling.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
FIG. 1 is a flow diagram of a method for crawling page content provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a flowchart of an exemplary implementation of controlling, according to the current page length of the target page, the target page to sequentially scroll by taking the display height as the scroll distance, so that the target page is scrolled to the target position and displayed in the visible area, according to an embodiment of the present disclosure;
FIG. 3 is a display schematic of a target page provided in accordance with one embodiment of the present disclosure;
FIG. 4 is a display schematic of a target page provided in accordance with another embodiment of the present disclosure;
FIG. 5 is a display schematic of a target page provided in accordance with another embodiment of the present disclosure;
FIG. 6 is a display schematic of a target page provided in accordance with another embodiment of the present disclosure;
FIG. 7 is a block diagram of a page content crawling apparatus provided in accordance with one embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus provided in accordance with one embodiment of the present disclosure.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
In order to make the technical solutions provided by the embodiments of the present invention easier to understand for those skilled in the art, first, related technologies will be briefly described below.
In the related art, for a long page that can be completely displayed only by scrolling multiple times, the multi-screen data corresponding to the long page is cached when the long page is first loaded, so as to avoid repeated interactions with the server while the page is displayed. As the scroll position of the long page changes, the data for the current screen is loaded and displayed in the visible area at each scroll position, so that data is loaded on demand and server resources are not excessively occupied. Thus, when the long page is first displayed, its first-screen data is displayed; when the long page is scrolled, for example to the second screen, the data for displaying the second screen is loaded on demand from the data cached for the long page, and during on-demand loading no other new data is loaded on the page. When all the data cached for the long page has been displayed, the page generates request traffic. If the long page still has data to be loaded, the generated request traffic is non-zero, and new data is loaded on the long page so that it can continue to be scrolled and displayed; this kind of loading is referred to as effective data loading. In this way, the response time can be effectively shortened and the resource usage on the server side can be reduced.
As described above, both on-demand data loading and effective data loading can increase the current page length. Therefore, in the prior art, when page content data is acquired by scrolling a long page once (from the head of the page to the bottom), there is no guarantee that on-demand data loading succeeds during scrolling, so the crawled page data may be incomplete.
In order to solve the above problem, the present disclosure provides a page content crawling method. Fig. 1 is a flowchart illustrating a method for crawling page content according to an embodiment of the present disclosure, where as illustrated in fig. 1, the method may include:
In S11, the display height of the visible area and the current page length of the loaded target page are determined. When the page length of the target page is greater than the display height, the target page can be completely displayed in the visible area only by scrolling multiple times. As noted above, while the target page is scrolled, on-demand data loading and effective data loading may cause the current page length to differ at different display positions.
The visible area may be a display area in the device, for example, a display interface of a display, and the display height is a height of the corresponding display area, and the display height may be determined by obtaining a hardware parameter of the device. The current page length of the loaded target page is the page length determined when the target page is loaded for the first time, wherein the mode for determining the current page length is the prior art, and is not described herein again.
In S12, according to the current page length of the target page, the target page is controlled to sequentially scroll with the display height as the scroll distance, so that the target page is scrolled to the target position, and the target page is displayed in the visible area, where new data is not loaded on the target page when the target page is scrolled to the target position.
In this embodiment, when the target page is controlled to scroll, the display height is taken as the scroll distance, so that the target page advances by one screen of data on each scroll. This avoids the incomplete on-demand data loading that results in the prior art from scrolling once from the head of the page to the bottom. In addition, while the target page is being scrolled to the target position, no new data is loaded on the target page when it is displayed in the visible area; the data for the page content displayed in the visible area is loaded by on-demand data loading.
It should be noted that when the target page is scrolled to the target position, the target page does not load new data, but when the target page is scrolled, new data may be loaded or may not be loaded.
In S13, the data that has been displayed in the visible region and has not been crawled in the target page is crawled, and the crawled data is stored in the page data set, so that the data in the page data set is the page content crawled from the target page.
Crawling the data that has been displayed in the visible area and has not yet been crawled can be implemented as follows. When the data of the target page is crawled for the first time, the data displayed in the visible area can be crawled directly, and the crawling position corresponding to the crawled data can be recorded. When the data of the target page is crawled again, the data that has been displayed in the visible area and lies after the crawling position can be crawled, and the crawling position is updated. On the one hand, the page data can thus be crawled quickly; on the other hand, crawling redundant page data is effectively avoided, which reduces both the amount of data crawled and the amount of data stored.
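The crawl-position bookkeeping can be illustrated with a short Python sketch. It is an assumption-laden model rather than the disclosed implementation: the page is represented as a list of (offset, content) items, and `crawl_visible`, `displayed_until` and `crawl_pos` are hypothetical names.

```python
from typing import List, Tuple

Item = Tuple[int, str]  # (vertical offset of the item on the page, item content)


def crawl_visible(items: List[Item], displayed_until: int,
                  crawl_pos: int) -> Tuple[List[Item], int]:
    """Return items already displayed but not yet crawled, plus the new crawl position.

    displayed_until: bottom edge of the region that has been shown so far.
    crawl_pos: offset up to which data has already been crawled (0 on the first crawl).
    """
    fresh = [(pos, text) for pos, text in items
             if crawl_pos <= pos < displayed_until]
    return fresh, max(crawl_pos, displayed_until)


page = [(0, "a"), (400, "b"), (900, "c"), (1500, "d")]
first, pos = crawl_visible(page, displayed_until=800, crawl_pos=0)
second, pos = crawl_visible(page, displayed_until=1600, crawl_pos=pos)
```

Because each call starts from the recorded position, the second crawl returns only the items that became visible after the first one, so nothing is crawled twice.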
In the above technical solution, the display height of the visible area and the current page length of the loaded target page are determined, and the target page is controlled, according to its current page length, to scroll sequentially by taking the display height as the scroll distance, so that the target page is scrolled to the target position and displayed in the visible area. This helps ensure that data is loaded on demand correctly while the page scrolls. Moreover, the page data is crawled once the target page has been scrolled to the target position, which ensures the completeness and ordering of the crawled page data, effectively reduces the number of crawls, and improves the efficiency of page content crawling.
Optionally, after step S13 of crawling data that has been displayed in the visible area and has not been crawled in the target page, and storing the crawled data into a page data set, the method further includes:
controlling the target page to scroll, and determining whether the target page has loaded new data when the target page is completely displayed in the visible area.
Illustratively, after the target page is completely displayed in the visible area, data crawling is performed on the target page once the request traffic generated by the target page has completed. If new data is crawled at this point, it is determined that the target page has loaded new data.
If the target page has not loaded new data, the display data corresponding to the portion of the target page from the target position to the position at which the target page is completely displayed is crawled, and the display data is stored in the page data set.
If the target page has not loaded new data, no new data remains to be displayed on the target page. In this case, crawling the display data corresponding to the portion from the target position to the position at which the target page is completely displayed means crawling the data in the target page that has been displayed but not yet crawled. For example, the data from the recorded crawling position to the bottommost position of the target page can be crawled.
For example, data in the page data set may be merged to determine content data for the target page. For example, the data may be merged from early to late according to the crawl time of the data, or from front to back according to the crawl position of the data. Wherein the closer the crawling position is to the head of the page, the more forward the crawling position is.
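The two merge orders mentioned above can be sketched as follows. The `Chunk` record and both function names are hypothetical, and joining strings is only a stand-in for whatever merge the stored data actually requires.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    crawl_time: float   # when the chunk was crawled (illustrative unit: seconds)
    crawl_pos: int      # offset of the chunk from the head of the page
    data: str


def merge_by_time(chunks: List[Chunk]) -> str:
    """Concatenate chunks from the earliest to the latest crawl time."""
    return "".join(c.data for c in sorted(chunks, key=lambda c: c.crawl_time))


def merge_by_position(chunks: List[Chunk]) -> str:
    """Concatenate chunks from the head of the page toward the bottom."""
    return "".join(c.data for c in sorted(chunks, key=lambda c: c.crawl_pos))


page_data_set = [Chunk(2.0, 1600, "tail"), Chunk(1.0, 0, "head-")]
by_time = merge_by_time(page_data_set)
by_position = merge_by_position(page_data_set)
```

When the page is crawled strictly top to bottom, as in this method, both orders give the same result.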
According to the technical scheme, when the target page data are completely displayed and new data are not loaded, data crawling is performed again, and therefore the residual content data of the target page are crawled. Through the technical scheme, on one hand, accurate crawling on the content data of the target page can be facilitated. On the other hand, the integrity of the content data crawling of the target page can be ensured, the application range of the page content crawling method is improved, and meanwhile, accurate data support is provided for subsequent analysis based on the content data.
Optionally, an exemplary implementation of S12, in which the target page is controlled according to its current page length to sequentially scroll by taking the display height as the scroll distance, so that the target page is scrolled to the target position and displayed in the visible area, is as follows. As shown in FIG. 2, this step may include:
In S21, the target page is controlled to scroll by a scroll distance equal to the display height, so that the data corresponding to the visible area is loaded when the target page is displayed in the visible area.
In this embodiment, when the target page is controlled to scroll, the display height is taken as the scroll distance, so that the target page can be scrolled by one screen of data every time the target page is scrolled, and the data corresponding to the visible area is the data required by the page portion of the target page displayed in the visible area.
In S22, the current page length of the target page is redetected, and it is determined whether the target page has been scrolled to the target position based on the redetected current page length of the target page and the scrolled distance of the target page.
After the target page is loaded, it is first displayed as shown in FIG. 3, where D is the scroll bar corresponding to the visible area. Since the current page length L1 of the target page is greater than the display height H, portion A of the target page in FIG. 3 can be displayed only after scrolling. After the target page is controlled to scroll by taking the display height as the scroll distance, the target page is displayed as shown in FIG. 4, where portion B of the target page is the part that has already been scrolled past and displayed, that is, the part located in the visible area C in FIG. 3.
As described above, in the process of data on-demand loading, if a picture is displayed in the visible area C, the current page length may increase after the picture is loaded on demand, and at this time, the current page length needs to be redetected, as shown in fig. 4, where the current page length of the redetected target page is L2.
Optionally, an exemplary implementation manner of determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page is as follows:
determining whether a target length is greater than the display height, the target length being obtained by subtracting both the scrolled distance of the target page and the display height from the re-detected current page length of the target page;
and if the target length is not larger than the display height, determining that the target page is scrolled to the target position.
As shown in fig. 4, when the target page is scrolled once, that is, when the target page is scrolled from the interface shown in fig. 3 to the interface shown in fig. 4, the scrolled distance of the target page is M, where M is equal to H, and the target length N is used to indicate a length corresponding to the content of the page that is not yet displayed in the target page.
When the target length is greater than the display height, the target page still cannot be completely displayed after one more scroll; when the target length is not greater than the display height, the target page can be completely displayed after one more scroll. The scroll position of the target page can therefore be controlled accurately: scrolling stops at the position from which one more scroll would completely display the target page, and data crawling is performed there. The data of the on-demand-loaded part of the target page can thus be crawled in one pass, which improves data crawling efficiency.
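The target-position test can be written as a small Python sketch, using the symbols from the description: L is the re-detected current page length, M the scrolled distance, H the display height, and N = L - M - H the target length. The function names and the numeric values are illustrative assumptions.

```python
def target_length(page_length: int, scrolled: int, display_height: int) -> int:
    """N = L - M - H: length of the page content not yet displayed."""
    return page_length - scrolled - display_height


def at_target_position(page_length: int, scrolled: int, display_height: int) -> bool:
    """True when N is not greater than H, i.e. one more scroll shows the rest."""
    return target_length(page_length, scrolled, display_height) <= display_height


H = 800
# After one scroll (M = H) on a page of re-detected length 3000: N = 1400 > H.
after_one = at_target_position(3000, 1 * H, H)
# After two scrolls (M = 2H): N = 600 <= H, so the target position is reached.
after_two = at_target_position(3000, 2 * H, H)
```

The comparison uses "not greater than", matching the condition stated above, so a remaining length exactly equal to one screen also counts as the target position.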
If the target page has not been scrolled to the target position, then after a preset time period the method returns to step S21 of controlling the target page to scroll by taking the display height as the scroll distance, and scrolling stops once the target page has been scrolled to the target position.
The preset time period can be set according to the actual use scenario, provided that the data corresponding to the page content currently displayed in the visible area can be loaded on demand within the preset time period. Illustratively, the preset period may be set to 50 microseconds. For example, when the target page is scrolled to the state shown in FIG. 4 and has not yet reached the target position, the current state is maintained for 50 microseconds, and the target page is then controlled to scroll again with the display height H as the scroll distance; the scrolled target page is as shown in FIG. 5.
As shown in fig. 5, the scrolled distance M of the target page is 2H, and the determined target length N is smaller than the display height H at this time, that is, it is determined that the target page is scrolled to the target position at this time, and the scrolling may be stopped at this time.
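The S21/S22 loop can be sketched against a simulated page, as below. `FakePage` is a hypothetical stand-in for a browser page (it is not a real driver API), its `growth` table models on-demand loading that lengthens the page at a given scroll offset, and the numbers are chosen only to mirror the progression from FIG. 3 to FIG. 5.

```python
import time
from typing import Dict


class FakePage:
    """Hypothetical stand-in for a browser page; not a real driver API."""

    def __init__(self, length: int, growth: Dict[int, int]):
        self.length = length      # current page length L
        self.scrolled = 0         # scrolled distance M
        self._growth = growth     # scroll offset -> extra length loaded on demand

    def scroll_by(self, distance: int) -> None:
        self.scrolled += distance
        # On-demand loading (for example a picture) may lengthen the page.
        self.length += self._growth.pop(self.scrolled, 0)


def scroll_to_target(page: FakePage, display_height: int,
                     wait_seconds: float = 0.0) -> int:
    """Repeat S21/S22 until N = L - M - H is not greater than H; return the scroll count."""
    scrolls = 0
    while True:
        page.scroll_by(display_height)                            # S21
        scrolls += 1
        remaining = page.length - page.scrolled - display_height  # S22, re-detected L
        if remaining <= display_height:
            return scrolls
        time.sleep(wait_seconds)                                  # preset time period


page = FakePage(length=2800, growth={800: 200})
count = scroll_to_target(page, display_height=800)
```

In this run the page grows from 2800 to 3000 after the first scroll, which is why the length is read again on every pass instead of being fixed up front.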
In the technical scheme, the target page is controlled to roll by taking the display height as the rolling distance, so that one screen of data can be rolled when the target page rolls once, the current page length of the target page can be detected in real time, the accuracy of the data in the process of controlling the target page to roll is ensured, and the crawling efficiency of the page content is improved.
Optionally, after the step of determining whether the target page is loaded with new data, the method further includes:
If it is determined that the target page has loaded new data, the current page length of the target page is re-detected and the method returns to the step of controlling the target page to sequentially scroll by taking the display height as a scroll distance so that the target page is scrolled to a target position and displayed in the visible area; that is, the target page is controlled to sequentially scroll by taking the display height as the scroll distance according to the re-detected current page length of the target page.
In this embodiment, if it is determined that the target page has loaded new data, this loading is a valid data loading.
In the above example, when the target page scrolls to the state shown in fig. 5, data that is displayed in the visible area and is not crawled in the target page, that is, data corresponding to areas B and C in the target page (that is, three-screen data that has been scroll-displayed) is crawled.
After the data crawling is performed, the target page is controlled to scroll, and at this point the target page may be completely displayed. If it is then determined that the target page has loaded new data, the current page length of the target page may increase; as shown in fig. 6, the re-detected current page length of the target page is L3. Thereafter, S12 is re-entered and the subsequent steps are executed again, so that the target page continues to scroll and data crawling continues. The execution of S12 and its subsequent steps is described in detail above and is not repeated here.
In this technical scheme, whether the target page has loaded new data is determined when the target page is completely displayed, and when new data has been loaded the target page is again scrolled sequentially with the display height as the scroll distance, so that the corresponding on-demand data loading is performed as each subsequent part of the page is displayed. This avoids the data loading pressure caused by valid data loading and on-demand data loading occurring at the same time. The on-demand data loading of a long page during scrolling can also be effectively distinguished, which avoids the risk of the crawl being interrupted by too much data loading at once and ensures the safety and fluency of the page content crawling method.
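The interplay between screen-by-screen scrolling, crawling of not-yet-crawled data, and re-scrolling after a valid load can be sketched roughly as follows. This is a simulation under stated assumptions rather than the patented implementation: `SimulatedPage`, `crawl`, and the `(start, end)` ranges that stand in for crawled data are all hypothetical, and the scroll offset of a completely displayed page is assumed to be its length minus the display height.

```python
from typing import List, Tuple


class SimulatedPage:
    """Hypothetical stand-in for a long page that may append data when fully displayed."""

    def __init__(self, length: int, pending_loads: List[int]) -> None:
        self.length = length
        self._pending = list(pending_loads)

    def scroll_to_end_and_check_load(self) -> bool:
        """Scroll until fully displayed; report whether new data was loaded."""
        if not self._pending:
            return False
        self.length += self._pending.pop(0)
        return True


def crawl(page: SimulatedPage, display_height: int) -> List[Tuple[int, int]]:
    """Return crawled (start, end) ranges, standing in for the page data set."""
    data_set: List[Tuple[int, int]] = []
    crawled_until = 0
    scrolled = 0
    while True:
        # Scroll one screen at a time until the target position is reached.
        while True:
            scrolled += display_height
            if page.length - scrolled - display_height <= display_height:
                break
        # Crawl what has been displayed but not yet crawled.
        visible_end = min(scrolled + display_height, page.length)
        if visible_end > crawled_until:
            data_set.append((crawled_until, visible_end))
            crawled_until = visible_end
        # Scroll to the end; a valid load restarts the screen-by-screen scrolling.
        old_length = page.length
        if not page.scroll_to_end_and_check_load():
            if page.length > crawled_until:
                data_set.append((crawled_until, page.length))
            return data_set
        # Assumed scroll offset when the old page was fully displayed.
        scrolled = max(0, old_length - display_height)


print(crawl(SimulatedPage(350, [300]), display_height=100))
# [(0, 300), (300, 550), (550, 650)]
```

In this run the page grows once from 350 to 650 units, so the scrolling step is entered a second time with the re-detected length before the remaining tail is crawled.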
Optionally, the method further comprises:
recording the data loading times of the target page, wherein the data loading times are initially zero and are incremented by one each time it is determined that the target page has loaded new data; that is, the data loading times represent the number of valid data loads of the target page.
And determining whether the data loading times is less than a preset threshold, wherein the preset threshold can be set according to an actual use scene, which is not limited by the present disclosure.
If the data loading times are smaller than a preset threshold value, executing the step of determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
if the data loading times are not less than the preset threshold, when the target page is completely displayed in the visible area, crawling the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data to the page data set.
Illustratively, the preset threshold is 2. When, after the state shown in fig. 5, the target page continues to scroll until it is completely displayed, the data loading times are zero, which is less than the preset threshold; that is, the number of valid data loads has not yet reached its upper limit. Whether the target page has loaded new data can therefore be determined, so that data crawling continues after the new data is loaded. If it is determined that the target page has loaded new data, the display state of the target page is as shown in fig. 6, and the data loading times are incremented by one, giving a value of 1.
The target page is then controlled to continue scrolling until it is completely displayed again. Because the data loading times, now 1, are less than the preset threshold, whether the target page has loaded new data is determined again, and data crawling is performed after the new data is loaded. If it is determined that the target page has loaded new data, the data loading times are incremented by one again, giving a value of 2.
These steps are repeated until the target page is completely displayed again. The data loading times, now 2, are not less than the preset threshold, so the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling is crawled directly and stored in the page data set, which yields the content data of the target page. The specific implementation of crawling the display data to determine the content data of the target page is described in detail above and is not repeated here.
With this technical scheme, recording the data loading times of the target page allows the amount of page data crawled to be controlled accurately when the content data of the page is crawled. This meets the user's requirements, avoids the burden that excessive crawled data would place on subsequent data analysis, and reduces the amount of data to be processed.
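The counting logic in the example with a preset threshold of 2 can be traced with a short sketch. The function name `crawl_rounds` and the representation of the page as a sequence of booleans (one per moment at which the page is completely displayed, true if new data was loaded) are assumptions made purely for illustration.

```python
from itertools import repeat
from typing import Iterable, Tuple


def crawl_rounds(load_results: Iterable[bool], threshold: int) -> Tuple[int, int]:
    """Return (data loading times, number of new-data checks performed).

    A check for new data is made only while the data loading times are
    below the threshold; each valid load increments the count by one.
    """
    load_count = 0  # data loading times, initially zero
    checks = 0
    results = iter(load_results)
    while load_count < threshold:
        checks += 1
        if not next(results, False):  # no new data: crawl the rest and stop
            break
        load_count += 1               # valid load: increment by one
    return load_count, checks


# A page that always loads more data is cut off after two valid loads.
print(crawl_rounds(repeat(True), threshold=2))   # (2, 2)
# A page that stops loading on its own ends before the threshold is reached.
print(crawl_rounds([True, False], threshold=5))  # (1, 2)
```

The first call mirrors the example above: the count goes from 0 to 1 to 2, after which no further check for new data is made and the remaining displayed data is crawled.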
The present disclosure also provides a page content crawling apparatus, as shown in fig. 7, the apparatus 10 includes:
a first determining module 100, configured to determine a display height of the visible area and a current page length of the loaded target page;
a scrolling module 200, configured to control the target page to sequentially scroll by using the display height as a scrolling distance according to a current page length of the target page, so that the target page is scrolled to a target position, and the target page is displayed in the visible area, where when the target page is scrolled to the target position, new data is not loaded on the target page;
the first crawling module 300 is configured to crawl data, which is displayed in the visible area and is not crawled, in the target page, and store the crawled data in a page data set.
Optionally, the apparatus further comprises:
the second determining module is used for controlling the target page to roll and determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
and the second crawling module is used for crawling, under the condition that new data are not loaded on the target page, the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data into the page data set.
Optionally, the apparatus further comprises:
the detection module is used for detecting the current page length of the target page again under the condition that the target page is determined to be loaded with new data, triggering the rolling module to roll the target page in sequence by taking the display height as a rolling distance according to the detected current page length of the target page, so that the target page is rolled to a target position, and displaying the target page in the visible area.
Optionally, the apparatus further comprises:
the recording module is used for recording the data loading times of the target page, wherein the data loading times are initially zero, and an adding operation is executed when the target page is determined to be loaded with new data;
the third determining module is used for determining whether the data loading times is smaller than a preset threshold value;
the second determining module is configured to determine whether the target page loads new data when the target page is completely displayed in the visible area when the number of data loads is smaller than a preset threshold;
and the third crawling module is used for crawling, when the target page is completely displayed in the visible area under the condition that the data loading times are not less than the preset threshold value, the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data into the page data set.
Optionally, the scrolling module comprises:
the control submodule is used for controlling the target page to scroll by taking the display height as a scroll distance so as to load data corresponding to the visual area when the target page is displayed in the visual area;
the first determining submodule is used for re-detecting the current page length of the target page and determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page;
and the triggering sub-module is used for triggering the control sub-module to control the target page to scroll by taking the display height as a scrolling distance after a preset time period under the condition that the target page is not scrolled to the target position until the target page is scrolled to the target position.
Optionally, the first determining sub-module includes:
a second determining submodule, configured to determine whether a target length, obtained by subtracting the scrolled distance of the target page and the display height from the re-detected current page length of the target page, is greater than the display height;
and the third determining sub-module is used for determining that the target page is scrolled to the target position under the condition that the target length is not greater than the display height.
The page content crawling device comprises a processor and a memory, wherein the first determining module, the rolling module, the first crawling module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and the content data of the target page is crawled by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium, on which a program is stored, and when the program is executed by a processor, the method for crawling page content is implemented.
The embodiment of the invention provides a processor, which is used for running a program, wherein the page content crawling method is executed when the program runs.
An embodiment of the present invention provides an apparatus, as shown in fig. 8, an apparatus 70 includes at least one processor 701, and at least one memory 702 and a bus 703 that are connected to the processor 701; the processor 701 and the memory 702 complete mutual communication through a bus 703; the processor 701 is configured to call program instructions in the memory 702 to execute the page content crawling method described above. The device herein may be a server, a PC, etc.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
determining the display height of a visual area and the current page length of a loaded target page;
according to the current page length of the target page, controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to enable the target page to scroll to a target position, and displaying the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page;
crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set.
Optionally, after the step of crawling data, which is displayed in the visible area and is not crawled, in the target page and storing the crawled data in a page data set, the method further includes:
controlling the target page to roll, and determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
if the target page is not loaded with new data, crawling the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data to the page data set.
Optionally, after the step of determining whether the target page is loaded with new data, the method further includes:
if it is determined that the target page is loaded with new data, re-detecting the current page length of the target page and, according to the re-detected current page length of the target page, returning to the step of controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to scroll the target page to a target position and display the target page in the visible area.
Optionally, the method further comprises:
recording the data loading times of the target page, wherein the data loading times are initially zero, and executing an adding operation when determining that the target page is loaded with new data;
determining whether the data loading times are smaller than a preset threshold value;
if the data loading times are smaller than a preset threshold value, executing the step of determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
if the data loading times are not less than the preset threshold, when the target page is completely displayed in the visible area, crawling the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data to the page data set.
Optionally, the controlling, according to the current page length of the target page, the target page to sequentially scroll with the display height as a scroll distance, so that the target page is scrolled to a target position, and the target page is displayed in the visible area includes:
controlling the target page to scroll by taking the display height as a scroll distance so as to load data corresponding to the visual area when the target page is displayed in the visual area;
re-detecting the current page length of the target page, and determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page;
and if the target page is not scrolled to the target position, after a preset time period, returning to the step of controlling the target page to scroll by taking the display height as a scrolling distance until the target page is scrolled to the target position.
Optionally, the determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page includes:
determining whether the target length obtained by subtracting the scrolled distance of the target page from the current page length of the re-detected target page and subtracting the display height is greater than the display height;
and if the target length is not larger than the display height, determining that the target page is scrolled to the target position.
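The scrolling steps listed above (scroll by one display height, re-detect the page length, test the target length, and wait for the preset period before repeating) can be sketched as a single loop. This is a minimal illustration under stated assumptions: `scroll_to_target`, `detect_length`, and `pause_s` are hypothetical names, and the page is represented only by a callback that reports its current length.

```python
import time
from typing import Callable


def scroll_to_target(detect_length: Callable[[], int],
                     display_height: int,
                     pause_s: float = 0.0) -> int:
    """Scroll one screen at a time and return the scrolled distance M
    at which the target position is reached."""
    scrolled = 0
    while True:
        scrolled += display_height           # scroll by one display height H
        page_length = detect_length()        # re-detect the current page length L
        target_length = page_length - scrolled - display_height  # N = L - M - H
        if target_length <= display_height:  # N not greater than H: target reached
            return scrolled
        time.sleep(pause_s)                  # hold the state for the preset period


# Hypothetical page of fixed length 350 with a visible area 100 high.
final_scrolled = scroll_to_target(lambda: 350, display_height=100)
print(final_scrolled)  # 200
```

Passing the length as a callback rather than a fixed number reflects the requirement that the page length be re-detected after every scroll, since it may change while the page is being scrolled.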
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for crawling page content, the method comprising:
determining the display height of a visual area and the current page length of a loaded target page;
according to the current page length of the target page, controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to enable the target page to scroll to a target position, and displaying the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page;
crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set.
2. The method of claim 1, wherein after the step of crawling data in the target page that has been displayed in the viewable area and that has not been crawled, and storing the crawled data to a set of page data, the method further comprises:
controlling the target page to roll, and determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
if the target page is not loaded with new data, crawling the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data to the page data set.
3. The method of claim 2, wherein after the step of determining whether the target page loads new data, the method further comprises:
if it is determined that the target page is loaded with new data, re-detecting the current page length of the target page and, according to the re-detected current page length of the target page, returning to the step of controlling the target page to sequentially scroll by taking the display height as a scroll distance so as to scroll the target page to a target position and display the target page in the visible area.
4. The method of claim 3, further comprising:
recording the data loading times of the target page, wherein the data loading times are initially zero, and executing an adding operation when determining that the target page is loaded with new data;
determining whether the data loading times are smaller than a preset threshold value;
if the data loading times are smaller than a preset threshold value, executing the step of determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
if the data loading times are not less than the preset threshold, when the target page is completely displayed in the visible area, crawling the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data to the page data set.
5. The method according to claim 1, wherein said controlling, according to a current page length of the target page, the target page to sequentially scroll by using the display height as a scroll distance, so that the target page is scrolled to a target position and displayed in the visible area, comprises:
controlling the target page to scroll by taking the display height as a scroll distance so as to load data corresponding to the visual area when the target page is displayed in the visual area;
re-detecting the current page length of the target page, and determining whether the target page is scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page;
and if the target page is not scrolled to the target position, after a preset time period, returning to the step of controlling the target page to scroll by taking the display height as a scrolling distance until the target page is scrolled to the target position.
6. The method of claim 5, wherein determining whether the target page has scrolled to the target position according to the re-detected current page length of the target page and the scrolled distance of the target page comprises:
determining whether the target length obtained by subtracting the scrolled distance of the target page from the current page length of the re-detected target page and subtracting the display height is greater than the display height;
and if the target length is not larger than the display height, determining that the target page is scrolled to the target position.
7. An apparatus for crawling page content, the apparatus comprising:
the first determining module is used for determining the display height of the visible area and the current page length of the loaded target page;
the scrolling module is used for controlling the target page to sequentially scroll by taking the display height as a scrolling distance according to the current page length of the target page so as to enable the target page to scroll to a target position and display the target page in the visible area, wherein when the target page scrolls to the target position, new data are not loaded on the target page;
the first crawling module is used for crawling data which are displayed in the visual area and are not crawled in the target page, and storing the crawled data into a page data set.
8. The apparatus of claim 7, further comprising:
the second determining module is used for controlling the target page to roll and determining whether the target page loads new data or not when the target page is completely displayed in the visible area;
and the second crawling module is used for crawling, under the condition that new data are not loaded on the target page, the display data corresponding to the part of the target page between the target position and the position at which the target page is completely displayed after scrolling, and storing the display data into the page data set.
9. A storage medium having a program stored thereon, the program being characterized in that it realizes the steps of the method according to any one of claims 1-6 when executed by a processor.
10. An apparatus, characterized in that the apparatus comprises:
at least one processor, and at least one memory, bus connected with the processor;
the processor and the memory complete mutual communication through the bus;
the processor is configured to invoke program instructions in the memory to perform the steps of the method of any of claims 1-6.
CN201910935533.1A 2019-09-29 2019-09-29 Page content crawling method and device, storage medium and equipment Active CN112579851B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910935533.1A CN112579851B (en) 2019-09-29 2019-09-29 Page content crawling method and device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910935533.1A CN112579851B (en) 2019-09-29 2019-09-29 Page content crawling method and device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN112579851A (en) 2021-03-30
CN112579851B (en) 2024-07-26

Family

ID=75110759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910935533.1A Active CN112579851B (en) 2019-09-29 2019-09-29 Page content crawling method and device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112579851B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6741268B1 (en) * 1999-07-26 2004-05-25 Nec Corporation Page information display method and apparatus, and storage medium for storing program or data for display page
CN103853729A (en) * 2012-11-29 2014-06-11 腾讯科技(深圳)有限公司 Page loading method and system
CN103885965A (en) * 2012-12-21 2014-06-25 鸿富锦精密工业(深圳)有限公司 Page loading management method and page loading management system
CN104965659A (en) * 2015-07-06 2015-10-07 无锡天脉聚源传媒科技有限公司 Page information preloading method and apparatus
CN105354062A (en) * 2015-11-06 2016-02-24 深圳市金立通信设备有限公司 Method for displaying loaded page and mobile terminal
CN109375973A (en) * 2018-09-20 2019-02-22 北京城市网邻信息技术有限公司 Page display method, device, computer equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112579851B (en) 2024-07-26

Similar Documents

Publication Publication Date Title
CN106547420B (en) Page processing method and device
CN108089856B (en) Page element monitoring method and device
CN113393294B (en) Page display method and device, equipment and storage medium
CN107526592B (en) Page adaptation method and device
CN105373593B (en) The method and device of object element in a kind of displayed web page
CN106155524A (en) Page control method and device
CN111767002A (en) Page display method, device, equipment and storage medium
CN107390982B (en) Screenshot method, screenshot equipment and terminal equipment
CN107391534B (en) Page display method, page file return method, page display device, page file return device and computer storage medium
CN113190321A (en) Method and equipment for application program page pull-up refreshing
CN114428657B (en) Sliding method and equipment based on Taro framework at H5 end
CN110069194B (en) Page blockage determining method and device, electronic equipment and readable storage medium
CN111427637B (en) Page rendering method and device
CN109582188B (en) Method, device and related equipment for realizing element positioning in popup window
CN110968811A (en) Display control method and device
CN110489023A (en) Implementation method, device, equipment, medium and the system of windows display
CN112579851B (en) Page content crawling method and device, storage medium and equipment
CN111381745B (en) Page switching method, device and equipment
CN110020264B (en) Method and device for determining invalid hyperlinks
CN111414123B (en) Information processing method and device
CN105204724A (en) Information display method and device
CN112578963B (en) Menu processing method and device, storage medium and electronic equipment
CN110215702B (en) Method and device for controlling grouping in game
CN108984247B (en) Information display method, terminal equipment and network equipment thereof
CN114168027B (en) Information display method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant