WO2019200783A1

WO2019200783A1 - Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium

Info

Publication number: WO2019200783A1
Application number: PCT/CN2018/100159
Authority: WO
Inventors: 阮晓雯; 徐亮; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-04-18
Filing date: 2018-08-13
Publication date: 2019-10-24
Also published as: CN108595583B; CN108595583A

Abstract

A method for data crawling in a page containing a dynamic image or table, the method comprising: launching a browser by means of an automatic testing tool, and inputting a link of a given website; crawling the given website for page information related to a crawling keyword input by a user; rendering and parsing a crawled page; capturing a screenshot of the parsed page by means of the automatic testing tool and storing the screenshot image; performing identification on the screenshot image according to a pre-trained image identification model, and obtaining the content of the screenshot image; determining whether traversal of the given website and pages corresponding to the crawling keyword is completed; if so, terminating the procedure; and if not, continuing the above procedure. The present application further provides a device for data crawling in a page containing a dynamic image or table, a terminal, and a storage medium. The present application enables automatic crawling of dynamically loaded data in an image or table and identifies content in an image.

Description

Dynamic chart class page data crawling method, device, terminal and storage medium

This application claims priority to Chinese Patent Application No. 201810349975.3, entitled "Dynamic Charts Page Data Crawling Method, Device, Terminal and Storage Medium", filed on April 18, 2018, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of web crawler technology, and in particular, to a dynamic chart type page data crawling method, device, terminal and storage medium.

Background technique

With the popularity of modern web technologies such as Asynchronous JavaScript and XML (Ajax), which creates interactive web applications without sacrificing browser compatibility, the form of web page data has undergone profound changes. More and more web content on the Internet is generated dynamically by Ajax. Users often encounter web pages that prompt "click to load more" or that automatically load more content as the mouse scrolls. These new forms of web pages require user interaction to trigger the generation and display of content. This improves the browsing experience to a certain extent, but poses a serious challenge to traditional data collection methods based on fetching HTML files.

Dynamically loaded chart data in web pages is a particular problem. It is generally displayed only after asynchronous loading, which makes it difficult for a traditional crawler to crawl; some text data is encrypted and then displayed in the form of a chart, and the chart cannot be downloaded directly; the crawling process often runs into prompts that require input; and interference information is sometimes added to a chart, making the real data in the chart difficult to obtain. At this stage, a large amount of manual work is generally required to obtain dynamic chart class data.

Summary of the invention

In view of the above, it is necessary to propose a dynamic chart class page data crawling method, device, terminal and storage medium that can automatically crawl dynamically loaded chart class data, take a screenshot of the crawled chart class data, and input the screenshot into a pre-trained picture recognition model to recognize the content in the picture. Compared with traditional web crawler products, the approach offers good compatibility, fast speed, and accurate data capture.

A first aspect of the present application provides a dynamic chart class page data crawling method, the method comprising:

a) Start the browser with an automated test tool and enter a link to the website where the data is to be crawled;

b) crawling the page information related to the crawling keyword input by the user from the website to be crawled;

c) rendering and parsing the crawled page;

d) taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image;

e) identifying the screenshot picture according to a pre-trained picture recognition model, and obtaining content in the screenshot picture;

f) determining whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed; and

When it is determined that the website to be crawled and the pages corresponding to the crawling keyword have been traversed, the process ends; or

When it is determined that the website to be crawled and the pages corresponding to the crawling keyword have not been traversed, the above b) to f) are continued.

A second aspect of the present application provides a dynamic chart class page data crawling device, the device comprising:

a startup module for launching a browser with an automated testing tool and entering a link to a website to be crawled;

a crawling module, configured to crawl, from the website to be crawled, page information related to the crawling keyword input by the user;

a parsing module for rendering and parsing the crawled page;

a screenshot module, configured to take a screenshot of the parsed page by using the automated test tool to obtain a screenshot image and save the screenshot image;

An identification module, configured to identify the screenshot image according to a pre-trained picture recognition model, to obtain content in the screenshot picture.

A third aspect of the present application provides a terminal, the terminal comprising a processor and a memory, the processor implementing the dynamic chart class page data crawling method when the computer readable instructions stored in the memory are executed.

A fourth aspect of the present application provides a non-volatile readable storage medium having stored thereon computer readable instructions that, when executed by a processor, implement the dynamic chart class page data crawling method.

The dynamic chart class page data crawling method, device, terminal and storage medium described in the present application use Selenium technology to simulate a user's browser login, dynamic loading, and screenshot downloading operations, and combine this with web crawling technology to automatically crawl dynamically loaded chart class data. The crawled information is exactly the same as the graphic information seen by a real user. A screenshot of the crawled chart data is taken and input into the pre-trained picture recognition model to identify the content in the picture. Compared with traditional web crawler products, the approach has good compatibility, fast speed and accurate data capture.

Secondly, in the training process of the picture recognition model, the number of training samples participating in the training is increased gradually. While the recognition rate of the picture recognition model is maintained, fewer samples participate in the training, which shortens the training time as far as possible and improves the training efficiency of the picture recognition model. In other words, the method looks for a number of training samples that balances the accuracy and the efficiency of the picture recognition model.

DRAWINGS

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings to be used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without any creative work.

FIG. 1 is a flowchart of a dynamic chart class page data crawling method provided in Embodiment 1 of the present application.

FIG. 2 is a flowchart of a method for taking a screenshot of a parsed page and obtaining a screenshot image and saving the screenshot image according to the second embodiment of the present application.

FIG. 3 is a flowchart of a training method of a picture recognition model according to Embodiment 3 of the present application.

FIG. 4 is a structural diagram of a dynamic chart class page data crawling device provided in Embodiment 4 of the present application.

FIG. 5 is a schematic diagram of sub-function modules of the de-duplication module provided in Embodiment 5 of the present application.

FIG. 6 is a sub-function block diagram of a training module provided in Embodiment 6 of the present application.

FIG. 7 is a structural diagram of a terminal provided in Embodiment 7 of the present application.

The present application will be further described in conjunction with the above drawings in the following detailed description.

Detailed description

The above described objects, features, and advantages of the present invention will be more clearly understood from the following detailed description. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative work fall within the scope of the present application.

All technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention applies, unless otherwise defined. The terminology used herein is for the purpose of describing particular embodiments, and is not intended to be limiting.

The dynamic chart class page data crawling method of the embodiment of the present application is applied to one or more terminals. The dynamic chart class page data crawling method can also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network. Networks include, but are not limited to, wide area networks, metropolitan area networks, or local area networks. The dynamic chart class page data crawling method of the embodiment of the present application may be executed by a server or by a terminal; or may be performed by a server and a terminal together.

For a terminal that needs to perform the dynamic chart class page data crawling method, the dynamic chart class page data crawling function provided by the method of the present application may be directly integrated on the terminal, or a client for implementing the method of the present application may be installed. As another example, the method provided by the present application can also run on a server or similar device in the form of a software development kit (SDK), which provides an interface to the dynamic chart class page data crawling function; a terminal or other device can then crawl dynamic chart class page data through the provided interface.

Embodiment 1

FIG. 1 is a flowchart of a dynamic chart class page data crawling method provided in Embodiment 1 of the present application. The order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.

S11. Start the browser with an automated test tool and enter a link to the website to be crawled.

The computer software automated testing technology Selenium Web Driver (hereinafter referred to as Selenium) has a strong visual automatic interaction function, which simulates the interaction between people and web pages through programming, thereby triggering dynamic data loading and obtaining dynamically generated data. Selenium technology can realistically simulate the actions users perform on the website's webpage, such as simulating users clicking "View More", "Auto Login", "Click Link", "Fill Form", "Roll Mouse", "Mouse Drag" , "Scroll down after the page is loaded", "Click to page", "Screen save" and other operations.

In this embodiment, the browser is opened by the Selenium tool, the link (Uniform Resource Locator, URL) of the website to be crawled is entered in the browser, and the Selenium tool calls the get() method to open the web page of the website that the user wants to crawl.

For example, if the user needs to crawl the "Face Recognition Books" data on the "Dangdang" website, the browser (for example, Google Chrome) is opened through the Selenium tool and the URL "www.dangdang.com" of the "Dangdang" website is entered. This launches the "Dangdang" website and displays its web page.

In this embodiment, if the user needs to crawl data from multiple websites, the links of the websites to be crawled may be entered together into a queue of the browser opened by the Selenium tool, and the crawler program crawls the data in those websites in sequence.
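The opening step can be sketched with the Selenium Python bindings. This is a minimal illustration under stated assumptions, not the applicant's implementation: the helper names `with_scheme` and `open_sites` are hypothetical, Chrome with a matching driver is assumed to be installed, and prefixing bare hosts with "https://" is an assumption made because get() expects a full URL.

```python
from collections import deque


def with_scheme(url: str) -> str:
    """Prefix a bare host such as 'www.dangdang.com' with https:// (assumed scheme)."""
    if url.startswith(("http://", "https://")):
        return url
    return "https://" + url


def open_sites(urls, handle_page):
    """Open each queued URL in one Chrome session and pass the driver to handle_page."""
    # Imported lazily so with_scheme() works even where Selenium is not installed.
    from selenium import webdriver

    pending = deque(with_scheme(u) for u in urls)  # queue of sites to crawl in order
    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
    try:
        while pending:
            driver.get(pending.popleft())  # get() loads the page in the browser
            handle_page(driver)            # hypothetical callback for S12 to S15
    finally:
        driver.quit()
```

A call such as `open_sites(["www.dangdang.com"], handle_page)` would visit the queued sites one after another in the same browser session.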

S12. Crawl the page information related to the crawling keyword input by the user from the website to be crawled.

After the website to be crawled is opened by the Selenium tool, the user inputs a crawling keyword, for example, "face recognition", and the Selenium tool simulates the user browsing the website to be crawled in order to obtain the page information of all pages related to "face recognition".

S13. Render and parse the crawled page.

When the Selenium tool crawls the page, it triggers Ajax to request data asynchronously from the server. After the raw data of the reply is received, it is formatted into new HTML nodes and inserted into the initial HTML file, and the dynamic content is finally displayed by the rendering engine of the browser kernel. The Selenium service sends the page request through the wire protocol and then operates the browser API to obtain the original page loaded by the browser. The page is returned to the Selenium service through the wire protocol, and when the Selenium service receives the page, it hands the page to the parsing module for parsing.

S14. Perform a screenshot of the parsed page by using the automated test tool to obtain a screenshot image and save the screenshot image.

The driver of the Selenium tool instructs the browser to execute the command, and finally the browser saves the screenshot in the kernel. The result is exactly the same as a user capturing the image on the page with the mouse and saving it.

Preferably, taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image may further include: de-duplicating the charts in the parsed page according to the perceptual hash value.

For a further refinement of the process of taking a screenshot of the parsed page by the automated test tool to obtain a screenshot picture and saving the screenshot picture, refer to FIG. 2 and its corresponding description.

S15. Identify the screenshot picture according to a pre-trained picture recognition model, and obtain content in the screenshot picture.

In this embodiment, for the training method of the picture recognition model, refer to FIG. 3 and its corresponding description.

S16. Determine whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed.

When it is determined that the website to be crawled and the pages corresponding to the crawling keyword have been traversed, the process ends; otherwise, when it is determined that they have not been traversed, the above S12 to S15 are continued.
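The traversal check in S16 can be illustrated as a queue-driven loop. The sketch below is only one possible reading: `fetch` and `recognize` are hypothetical callables standing in for S12 to S14 and S15, and tracking visited pages in a set is an assumption about how "traversed" is decided.

```python
from collections import deque


def crawl(start_pages, fetch, recognize):
    """Visit every reachable page once; return {page: recognized content}.

    fetch(page) -> (screenshot, linked_pages) stands in for S12 to S14.
    recognize(screenshot) -> content stands in for S15.
    """
    unique_starts = list(dict.fromkeys(start_pages))  # drop duplicate start pages
    pending = deque(unique_starts)
    seen = set(unique_starts)
    results = {}
    while pending:  # S16: an empty queue means traversal is complete
        page = pending.popleft()
        screenshot, links = fetch(page)
        results[page] = recognize(screenshot)
        for link in links:
            if link not in seen:  # queue each page only once
                seen.add(link)
                pending.append(link)
    return results
```

Keeping the visited set separate from the result dictionary means a page that links back to an earlier page does not cause an endless loop.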

In summary, the dynamic chart class page data crawling method described in the present application uses Selenium technology to simulate a user's browser login, dynamic loading, and screenshot downloading operations, and combines this with web crawling technology to automatically crawl dynamically loaded chart class data. The crawled information is exactly the same as the graphic information seen by a real user. A screenshot of the crawled chart data is taken and input into the pre-trained picture recognition model to identify the content in the picture. Compared with traditional web crawler products, the method has good compatibility, fast speed and accurate data capture.

Embodiment 2

FIG. 2 is a flowchart of a method for taking a screenshot of a parsed page and obtaining a screenshot image and saving the screenshot image according to the second embodiment of the present application. The order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.

S21. Determine, by the automated testing tool, whether a chart exists in the parsed page.

In this embodiment, the automated testing tool determines whether a chart exists in the parsed page by identifying whether the parsed page has a tag related to the chart display and control.

When the automated test tool recognizes that a tag related to chart display and control exists in the parsed page, it is determined that a chart exists in the parsed page; when the automated test tool recognizes that no tag related to chart display and control exists in the parsed page, it is determined that no chart exists in the parsed page.

The tags related to the chart display and control include: img, table, tr, td, colspan, and the like.

Because the charts in a web page are written in HTML, the page contains many DIV, CSS, and HTML tags related to the chart that control the display format of the page. Whether a chart exists in the parsed page can therefore be judged by determining whether a tag attribute related to the chart is present: when a tag attribute related to the chart is recognized, it is determined that a chart exists in the parsed page, and when no tag attribute related to the chart is recognized, it is determined that no chart exists in the parsed page.

When it is determined that there is no chart in the parsed page, step S22 is performed; otherwise, when it is determined that there is a chart in the parsed page, step S23 is performed.
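As an illustration of the check in S21, the following sketch scans HTML with the Python standard library parser. The names `ChartDetector` and `has_chart` are hypothetical, and the sketch treats colspan as an attribute (which is how HTML defines it) while treating img, table, tr and td as tags; the real tool may use a different rule set.

```python
from html.parser import HTMLParser

CHART_TAGS = {"img", "table", "tr", "td"}
CHART_ATTRS = {"colspan"}  # colspan is an attribute of table cells, not a tag


class ChartDetector(HTMLParser):
    """Sets found=True when a chart-related tag or attribute is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in CHART_TAGS or any(name in CHART_ATTRS for name, _ in attrs):
            self.found = True


def has_chart(html: str) -> bool:
    """Return True when the parsed page contains a chart-related tag (S21)."""
    detector = ChartDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

A page for which `has_chart` returns False would go to S22, and any other page to S23.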

S22. Crawl the information in the parsed page, and save the crawled information according to a preset data format.

When it is determined that there is no chart in the parsed page, no screenshot of the parsed page is taken, and the crawler program directly crawls the information in the parsed page and stores it according to a preset data format.

In this embodiment, different operations are performed depending on whether a chart exists in the parsed page. When a chart exists in the parsed page, a screenshot of the chart in the page is taken; when no chart exists in the parsed page, no screenshot operation is performed. This avoids taking screenshots of all parsed pages and thus saves network resources. In addition, skipping the screenshot operation when there is no chart simplifies the operation process and helps improve crawling efficiency.

S23. Perform a screenshot of the chart in the parsed page to obtain a screenshot image.

In this embodiment, taking a screenshot of the chart in the parsed page by the Selenium tool simulating the user further includes downloading the chart in the parsed page.

S24. Calculate a perceptual hash value of the screenshot picture.

In this embodiment, the perceptual hash algorithm is used to calculate the perceptual hash value of the screenshot image. The specific process includes:

1) Perform grayscale processing on the screenshot image;

2) Calculate the grayscale average value of the screenshot image after the grayscale processing;

3) Compare the grayscale value of each pixel of the screenshot image after the grayscale processing with the grayscale average value;

4) Record a pixel of the screenshot image after the grayscale processing as 1 when its grayscale value is greater than or equal to the grayscale average value, and as 0 when its grayscale value is smaller than the grayscale average value;

5) Connect the comparison result of each pixel obtained in 4) according to a preset connection rule to obtain a perceptual hash value of the screenshot picture.
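Steps 1) to 5) can be sketched in plain Python as follows. The function names are hypothetical, the integer luma weights used for grayscale conversion are a common convention rather than something the application specifies, and the image is assumed to be already scaled down (for example to 8*8 pixels) and supplied as rows of pixel values.

```python
def to_gray(rgb_rows):
    """Step 1: convert rows of (r, g, b) pixels to integer grayscale values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in rgb_rows]


def perceptual_hash(gray_rows):
    """Steps 2 to 5: threshold each pixel at the mean and join the bits."""
    pixels = [p for row in gray_rows for p in row]  # left to right, top to bottom
    mean = sum(pixels) / len(pixels)                # step 2: grayscale average
    # Steps 3 and 4: 1 when the pixel is >= the average, otherwise 0.
    # Step 5: concatenate the bits in reading order (the assumed connection rule).
    return "".join("1" if p >= mean else "0" for p in pixels)
```

For an 8*8 input the result is a 64-character string of 0s and 1s, matching the 64-bit value described for the perceptual hash.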

S25. Determine whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured picture is greater than a preset similarity threshold.

In this embodiment, determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is greater than the preset similarity threshold specifically includes: counting the number of bit positions that hold the same value in the perceptual hash value of the screenshot picture and in the perceptual hash value of the previously captured picture; and determining whether that number is greater than the preset similarity threshold.

For example, suppose the screenshot image after the grayscale processing is 8*8 pixels and the grayscale average value is 45. When the grayscale value of the pixel in the first row and first column is greater than or equal to 45, the comparison result is recorded as 1, otherwise it is recorded as 0; the pixel in the first row and second column is handled in the same way, then the pixel in the first row and third column, and so on. The comparison results are then combined from left to right and from top to bottom into a 64-bit number, which is the perceptual hash value of the screenshot picture. When the number of bits with the same value (for example, 61) between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is greater than the preset similarity threshold (for example, 60), the screenshot picture is considered the same as the previously captured picture.

When it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is greater than the preset similarity threshold, step S26 is performed; otherwise, when the similarity is less than or equal to the preset similarity threshold, step S27 is performed.
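The comparison in S25 amounts to counting matching bit positions. A minimal sketch, assuming hashes are stored as equal-length strings of 0s and 1s and using the example threshold of 60 from the text (the function names are hypothetical):

```python
def matching_bits(hash_a: str, hash_b: str) -> int:
    """Count the positions at which two equal-length bit strings agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a == b for a, b in zip(hash_a, hash_b))


def is_duplicate(new_hash: str, saved_hashes, threshold: int = 60) -> bool:
    """True when any saved hash agrees with new_hash in more than threshold bits."""
    return any(matching_bits(new_hash, old) > threshold for old in saved_hashes)
```

With 64-bit hashes, 61 matching bits would lead to S26 (delete), while exactly 60 would lead to S27 (keep), because the comparison is strictly "greater than".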

S26. Delete the screenshot picture.

S27. Store the screenshot picture in association with the corresponding parsed page in a preset specific location.

In this embodiment, the preset specific location is dedicated to storing the screenshot picture and the corresponding parsed page. The specific location can be a specific folder or a folder with a specific name. Each time, the screenshot picture is stored in association with the corresponding parsed page, so that the page where the chart is located can be found quickly afterwards, and the content of the chart in the page can be further analyzed, using a method based on contextual semantic analysis, according to the position of the chart in the page.
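One simple way to keep the association described in S27 is a small index file next to the screenshots. This is only a sketch under assumptions: the application does not specify a storage format, and the file name `index.json`, the function name `store_association`, and the optional position field are all hypothetical.

```python
import json
from pathlib import Path


def store_association(folder, screenshot_name, page_url, chart_position=None):
    """Record which parsed page a saved screenshot came from; return the index."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    index_path = folder / "index.json"  # hypothetical index file name
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    # Keep the page link and, optionally, where the chart sits in the page.
    index[screenshot_name] = {"page": page_url, "position": chart_position}
    index_path.write_text(json.dumps(index, ensure_ascii=False, indent=2),
                          encoding="utf-8")
    return index
```

Looking up a screenshot name in the index then gives the page it came from, which is the property the later analysis relies on.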

In summary, the screenshot picture de-duplication method provided by the present application determines, according to the perceptual hash value, whether the screenshot picture is the same as a previously captured picture in order to achieve de-duplication. The perceptual hash calculation is accurate, and deleting downloads with the same content removes redundant screenshot pictures and effectively saves storage space. In addition, the screenshot picture and the corresponding parsed page are stored in association, which facilitates later management and analysis.

Embodiment 3

FIG. 3 is a flowchart of a training method of a picture recognition model according to Embodiment 3 of the present application. The order of execution in the flowchart can be changed according to different requirements, and some steps can be omitted.

S31. Acquire multiple pictures.

In this embodiment, multiple small crawlers can automatically obtain multiple pictures from various websites on the Internet, or multiple pictures can be manually downloaded from various search engines (for example, Baidu, Google, 360), to form a picture data set that is saved in a local database. The content in the pictures can include, but is not limited to, numbers, characters, letters, images, tables, etc., and the letters can be case sensitive.

S32. Perform pre-processing on the multiple pictures to obtain a data set for training the picture recognition model.

In this embodiment, each picture in the picture data set is preprocessed separately. The preprocessing includes background removal, segmentation, scaling, cropping, flipping, and/or warping, etc., so that the training pictures have the same size and the same viewing angle. Training the picture recognition model on such pictures effectively improves the reliability and accuracy of the picture recognition model.

In this embodiment, the background removal may be performed by using a binarization method. If the value of a pixel in the picture is greater than a preset threshold, the pixel is set to white; otherwise it is set to black. That is, the original picture is converted into a picture containing only black and white, which effectively removes interfering elements in the picture background.
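The binarization rule can be written in a few lines. In this sketch the picture is assumed to be given as rows of grayscale values from 0 to 255, and the default threshold of 128 is an illustrative choice, not a value from the application.

```python
def binarize(gray_rows, threshold=128):
    """Map each pixel to white (255) if it exceeds the threshold, else black (0)."""
    return [[255 if p > threshold else 0 for p in row] for row in gray_rows]
```

A pixel exactly equal to the threshold becomes black, following the "greater than" wording of the text.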

In this embodiment, each picture in the picture data set may be segmented using a segmentation function, and each number or each character in the picture is divided into a single number or character.

S33. The data set is divided into a training set and a test set by using a cross-validation method.

The training set is used to train a picture recognition model, and the test set is used to test the performance of the trained picture recognition model. If the accuracy of the test is higher, it indicates that the performance of the trained picture recognition model is better; if the accuracy of the test is low, it indicates that the performance of the trained picture recognition model is poor.

The data set can be divided in an appropriate proportion (for example, 3 to 2) to obtain a training set and a test set.
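A 3-to-2 division can be sketched as a shuffled hold-out split. Note the assumption: this shows a single random split in the stated proportion, not a full k-fold cross-validation procedure, and the function name and `seed` parameter are hypothetical.

```python
import random


def split_dataset(samples, train_parts=3, test_parts=2, seed=None):
    """Shuffle the samples and split them train:test in the given proportion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded generator for repeatable splits
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```

With 100 samples and the default 3-to-2 proportion, the training set holds 60 samples and the test set 40.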

S34. Randomly select a first preset number of training samples from the training set to train the picture recognition model.

In this embodiment, not all the pictures in the original training set need to participate in training the picture recognition model. Instead, a first preset number of training samples is selected from the original training set to participate in the training, which reduces the number of samples involved in the training and saves training time for the picture recognition model.

In addition, the random number generation algorithm is used for random selection, which can increase the randomness of the training set participating in the training and improve the robustness of the picture recognition model.

In a first implementation, the first preset number may be a preset fixed value, for example, 60; that is, 60 samples are randomly selected from the original training set to participate in the training of the picture recognition model.

In a second implementation, the first preset number may be a preset ratio, for example, 1/10; that is, 1/10 of the samples in the original training set are randomly selected to participate in the training of the picture recognition model.

S35. Test the accuracy of the trained picture recognition model by using the test set. If the accuracy is greater than or equal to a preset accuracy threshold, the training ends; if the accuracy is less than the preset accuracy threshold, the picture recognition model is retrained.

Preferably, retraining the picture recognition model comprises: adding a second preset number of training samples, taken from the part of the training set outside the first preset number of training samples, to the first preset number of training samples, and re-executing the above steps S32 to S35 until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.

In a first implementation, the second preset number may be a preset fixed value, for example, 20; that is, 20 pictures are randomly selected from the part of the training set outside the first preset number of training samples to participate in the training of the picture recognition model.

In a second implementation, the second preset number may be a preset ratio, for example, 1/20; that is, 1/20 of the pictures in the part of the training set outside the first preset number of training samples are randomly selected to participate in the training of the picture recognition model.

In a third implementation, the second preset number may be a preset ratio of the first preset number, for example, 1/5; that is, a number of pictures equal to 1/5 of the first preset number is randomly selected from the part of the training set outside the first preset number of training samples to participate in the training of the picture recognition model.
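The incremental scheme of S34 and S35 can be sketched as a loop that grows the selected subset until the accuracy target is met. Everything model-specific is an assumption here: `train` and `evaluate` are hypothetical callables supplied by the caller, the default accuracy threshold of 0.95 is illustrative, and the fixed-value variants (60 and 20) are used for the two preset numbers. The loop also stops when no unused samples remain, which the text does not address.

```python
import random


def train_incrementally(training_set, train, evaluate,
                        first_count=60, extra_count=20,
                        accuracy_threshold=0.95, seed=None):
    """Train on a growing random subset until the accuracy threshold is reached.

    train(samples) -> model and evaluate(model) -> accuracy are caller-supplied.
    Returns (model, accuracy, number of samples used).
    """
    rng = random.Random(seed)
    remaining = list(training_set)
    rng.shuffle(remaining)  # a random order makes every slice a random selection
    selected, remaining = remaining[:first_count], remaining[first_count:]
    while True:
        model = train(selected)
        accuracy = evaluate(model)  # S35: test against the held-out test set
        if accuracy >= accuracy_threshold or not remaining:
            return model, accuracy, len(selected)
        # Retraining: add the second preset number of unused samples.
        selected = selected + remaining[:extra_count]
        remaining = remaining[extra_count:]
```

Because samples are only added when the threshold is missed, the loop ends with the smallest subset size, in steps of `extra_count`, that reaches the target accuracy.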

The picture recognition model training method provided by the present application keeps the number of training samples participating in the training as small as possible. While the recognition rate of the picture recognition model is maintained, fewer samples participate in the training, which shortens the training time as far as possible and improves the training efficiency of the picture recognition model. In other words, the method looks for a number of training samples that balances the accuracy and the efficiency of the picture recognition model.

The above description is only a specific embodiment of the present application, but the scope of protection of the present application is not limited thereto. Those skilled in the art can make improvements without departing from the concept of the present application, and these improvements all fall within the scope of protection of the present application.

The function modules and hardware structure of the terminal for implementing the above dynamic chart class page data crawling method are described below with reference to FIGS. 4 to 7.

Embodiment 4

FIG. 4 is a functional block diagram of a dynamic chart class page data crawling device according to Embodiment 4 of the present application.

In some embodiments, the dynamic chart class page data crawling device 40 operates in a terminal. The dynamic chart class page data crawling device 40 can include a plurality of functional modules consisting of program code segments. The program code of each program segment in the dynamic chart class page data crawling device 40 may be stored in a memory and executed by at least one processor to perform the crawling of dynamic chart class page data (see FIG. 1 and its related description).

In this embodiment, the dynamic chart class page data crawling device 40 of the terminal may be divided into a plurality of functional modules according to the functions performed by the terminal. The function module may include: a startup module 401, a crawl module 402, a parsing module 403, a screenshot module 404, a deduplication module 405, a training module 406, an identification module 407, and a determination module 408. A module as referred to in this application refers to a series of computer readable instruction segments that are executable by at least one processor and capable of performing a fixed function, which are stored in the memory. In some embodiments, the functionality of each module will be detailed in subsequent embodiments.

The startup module 401 is configured to start a browser by using an automated testing tool, and input a link of a website to be crawled.

The crawling module 402 is configured to crawl page information related to the crawling keyword input by the user from the website to be crawled.

The parsing module 403 is configured to render and parse the crawled page.

The screenshot module 404 is configured to take a screenshot of the parsed page by the automated testing tool to obtain a screenshot image and save the screenshot image.

The de-duplication module 405 is configured to de-duplicate the charts in the parsed page according to perceptual hash values.

The training module 406 is configured to train a picture recognition model.

The identification module 407 is configured to identify the screenshot picture according to a pre-trained picture recognition model, and obtain content in the screenshot picture.

The determining module 408 is configured to determine whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed. When the determining module 408 determines that the website to be crawled and the pages corresponding to the crawling keyword have not been traversed, the modules 401, 402, 403, 404, 405 and 407 are executed again.
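The way the modules cooperate can be sketched as a simple loop. The following is a minimal illustration only, not the implementation of the present application: the function `crawl_site`, the `CrawlResult` type and all callable parameters are hypothetical names, and each callable stands in for one module. In practice these callables would wrap the automated testing tool (for example Selenium) and the trained picture recognition model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class CrawlResult:
    """Recognized content collected for one traversed page (hypothetical type)."""
    url: str
    contents: List[str] = field(default_factory=list)


def crawl_site(
    page_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], str],
    screenshot: Callable[[str], List[bytes]],
    deduplicate: Callable[[List[bytes]], List[bytes]],
    recognize: Callable[[bytes], str],
) -> List[CrawlResult]:
    """Run the crawl, parse, screenshot, de-duplicate and identify steps per page."""
    results: List[CrawlResult] = []
    # The loop ends once every page has been traversed, which plays the
    # role of the determining module 408.
    for url in page_urls:
        raw_page = fetch(url)                 # crawling module 402
        parsed_page = parse(raw_page)         # parsing module 403
        shots = screenshot(parsed_page)       # screenshot module 404
        unique_shots = deduplicate(shots)     # de-duplication module 405
        contents = [recognize(shot) for shot in unique_shots]  # identification module 407
        results.append(CrawlResult(url=url, contents=contents))
    return results
```

Passing the modules in as callables keeps the control flow independent of any particular browser driver or recognition model, so each step can be replaced or tested in isolation.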

In summary, the dynamic chart class page data crawling device described in the present application uses Selenium to simulate a user's browser login, dynamic loading, and screenshot download operations, and combines this with web crawling technology to automatically crawl dynamically loaded chart data. The crawled information is exactly the same as the graphic information seen by a real user. Screenshots of the captured chart data are input into the pre-trained picture recognition model to identify the content in the pictures. Compared with traditional web crawler products, the device offers good compatibility, high speed and accurate data capture.

Embodiment 5

FIG. 5 is a schematic diagram of the sub-function modules of the de-duplication module provided in Embodiment 5 of the present application. The de-duplication module 405 includes: a first determining sub-module 4051, a saving sub-module 4052, a screenshot sub-module 4053, a calculation sub-module 4054, a second determining sub-module 4055, a deleting sub-module 4056, and an association sub-module 4057.

The first determining sub-module 4051 is configured to determine, by the automated testing tool, whether a chart exists in the parsed page.

The saving submodule 4052 is configured to: when the first determining submodule 4051 determines that there is no chart in the parsed page, crawl the information in the parsed page, and save the crawled information according to a preset data format.

The screenshot sub-module 4053 is configured to: when the first determining sub-module 4051 determines that a chart exists in the parsed page, take a screenshot of the chart in the parsed page to obtain a screenshot picture.

The calculation sub-module 4054 is configured to calculate a perceptual hash value of the screenshot picture.

In this embodiment, the specific process of the calculation submodule 4054 includes:

1) Perform grayscale processing on the screenshot image;

The second determining sub-module 4055 is configured to determine whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold.

The deleting sub-module 4056 is configured to delete the screenshot picture when the second determining sub-module 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold.

The association sub-module 4057 is configured to: when the second determining sub-module 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, associate the screenshot picture with the corresponding parsed page and store them in a preset location.
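The de-duplication flow can be illustrated with a small sketch. The text above names only the grayscale step of the hash calculation, so everything after that step is an assumption: the sketch uses the common "average hash" variant (block-average down to an 8 x 8 grid, then threshold each cell against the mean) and a similarity defined as one minus the normalized Hamming distance. The function names, the 8 x 8 grid size and the 0.9 threshold are hypothetical. Images are plain nested lists of RGB tuples so that the example needs no third-party library.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
HASH_SIDE = 8  # assumed grid size, giving a 64-bit hash


def to_gray(image: Sequence[Sequence[Pixel]]) -> List[List[float]]:
    """Grayscale step named in the text (standard luma weights assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in image]


def average_hash(image: Sequence[Sequence[Pixel]]) -> int:
    """Assumed remaining steps: shrink to 8 x 8 by block averaging, compare to the mean."""
    gray = to_gray(image)
    height, width = len(gray), len(gray[0])
    cells: List[float] = []
    for i in range(HASH_SIDE):
        r0 = i * height // HASH_SIDE
        r1 = max((i + 1) * height // HASH_SIDE, r0 + 1)
        for j in range(HASH_SIDE):
            c0 = j * width // HASH_SIDE
            c1 = max((j + 1) * width // HASH_SIDE, c0 + 1)
            block = [gray[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def similarity(hash_a: int, hash_b: int) -> float:
    """1.0 for identical hashes, 0.0 when every bit differs."""
    distance = bin(hash_a ^ hash_b).count("1")
    return 1.0 - distance / (HASH_SIDE * HASH_SIDE)


def deduplicate(images: Sequence[Sequence[Sequence[Pixel]]], threshold: float = 0.9) -> List[int]:
    """Return the indices of the screenshots that are kept."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for index, image in enumerate(images):
        current = average_hash(image)
        if any(similarity(current, earlier) > threshold for earlier in kept_hashes):
            continue  # near-duplicate of an earlier screenshot: dropped (sub-module 4056)
        kept.append(index)  # sufficiently different: kept for storage (sub-module 4057)
        kept_hashes.append(current)
    return kept
```

Comparing each new screenshot against all previously kept hashes is the simplest strategy; for large crawls an index over the hashes would be a natural refinement.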

Embodiment 6

FIG. 6 is a sub-function block diagram of the training module provided in Embodiment 6 of the present application. The training module 406 includes: an obtaining sub-module 4061, a pre-processing sub-module 4062, a dividing sub-module 4063, a selecting sub-module 4064, and a test sub-module 4065.

The obtaining submodule 4061 is configured to acquire a plurality of pictures.

The pre-processing sub-module 4062 is configured to pre-process the plurality of pictures to obtain a data set for training the picture recognition model.

The dividing sub-module 4063 is configured to divide the data set into a training set and a test set by using a cross-validation method.

The selecting sub-module 4064 is configured to randomly select a first preset number of training samples from the training set to train the picture recognition model.

In the first embodiment, the first preset number may be a preset fixed value, for example 60; that is, 60 samples are randomly selected from the original training set to participate in the training of the picture recognition model.

The test sub-module 4065 is configured to test the accuracy of the trained picture recognition model by using the test set. If the accuracy is greater than or equal to a preset accuracy threshold, the training ends. If the accuracy is less than the preset accuracy threshold, the selecting sub-module 4064 adds a second preset number of training samples, taken from the part of the training set outside the first preset number of training samples, to the first preset number of training samples, the picture recognition model is retrained, and the test sub-module 4065 is executed again, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
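The interplay of the selecting sub-module 4064 and the test sub-module 4065 can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code of the present application: `train_incrementally`, `train`, `accuracy`, `step_count` and the default values are hypothetical, and the stop condition for an exhausted training set is an addition made here so that the loop always terminates.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def train_incrementally(
    training_set: Sequence[Sample],
    test_set: Sequence[Sample],
    train: Callable[[List[Sample]], Model],
    accuracy: Callable[[Model, Sequence[Sample]], float],
    first_count: int = 60,      # first preset number (60 is the example in the text)
    step_count: int = 20,       # second preset number (hypothetical value)
    threshold: float = 0.95,    # preset accuracy threshold (hypothetical value)
    seed: int = 0,
) -> Tuple[Model, int]:
    """Grow the training subset until the accuracy threshold is reached.

    Returns the trained model and the number of samples it was trained on.
    """
    rng = random.Random(seed)
    pool = list(training_set)
    rng.shuffle(pool)                       # random selection of samples
    used, remaining = pool[:first_count], pool[first_count:]
    while True:
        model = train(used)
        # Assumption: also stop when no unused samples are left.
        if accuracy(model, test_set) >= threshold or not remaining:
            return model, len(used)
        # Add the second preset number of samples not yet used, then retrain.
        used, remaining = used + remaining[:step_count], remaining[step_count:]
```

Starting small and adding samples only when the test accuracy falls short is what lets the procedure settle on a training subset that is just large enough for the required accuracy.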

The above-described integrated unit implemented in the form of a software function module can be stored in a non-volatile readable storage medium. The software function module is stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a dual screen device, or a network device, etc.) or a processor to execute part of the methods of the embodiments of the present application.

Embodiment 7

FIG. 7 is a schematic diagram of a terminal according to Embodiment 7 of the present application.

The terminal 7 comprises a memory 71, at least one processor 72, computer readable instructions 73 stored in the memory 71 and operable on the at least one processor 72, and at least one communication bus 74.

The at least one processor 72 implements the steps in the above dynamic chart class page data crawling method embodiments when executing the computer readable instructions 73; alternatively, the at least one processor 72 implements the functions of the modules/units in the above device embodiments when executing the computer readable instructions 73.

Illustratively, the computer readable instructions 73 may be partitioned into one or more modules/units, which are stored in the memory 71 and executed by the at least one processor 72 to carry out the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing particular functions, the instruction segments being used to describe the execution of the computer readable instructions 73 in the terminal 7.

The terminal 7 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. It will be understood by those skilled in the art that FIG. 7 is merely an example of the terminal 7 and does not constitute a limitation of the terminal 7, which may include more or fewer components than those illustrated, combine some components, or use different components. For example, the terminal 7 may further include an input/output device, a network access device, a bus, and the like.

The at least one processor 72 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and the like. The processor 72 may be a microprocessor, or the processor 72 may be any conventional processor or the like. The processor 72 is the control center of the terminal 7 and connects the various parts of the entire terminal 7 by using various interfaces and lines.

The memory 71 can be used to store the computer readable instructions 73 and/or modules/units. The processor 72 implements various functions of the terminal 7 by running or executing the computer readable instructions and/or modules/units stored in the memory 71 and by calling the data stored in the memory 71. The memory 71 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phone book, etc.) created according to the use of the terminal 7. In addition, the memory 71 may include a high-speed random access memory, and may also include a non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash device, or other volatile solid state storage device.

The modules/units integrated in the terminal 7 can be stored in a non-volatile readable storage medium if they are implemented in the form of software functional units and sold or used as a stand-alone product. Based on this understanding, all or part of the processes in the methods of the foregoing embodiments may also be implemented by computer readable instructions, which may be stored in a non-volatile readable storage medium; when executed by a processor, the computer readable instructions implement the steps of the various method embodiments described above. The computer readable instructions comprise computer readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium can include any entity or device capable of carrying the computer readable instruction code, a recording medium, a USB flash drive, a removable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunications signals, and software distribution media. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer readable media do not include electrical carrier signals and telecommunication signals.

In the several embodiments provided by the present application, it should be understood that the disclosed terminal and method may be implemented in other manners. For example, the terminal embodiment described above is only illustrative. For example, the division of the unit is only a logical function division, and the actual implementation may have another division manner.

In addition, each functional unit in each embodiment of the present application may be integrated in the same processing unit, or each unit may exist physically separately, or two or more units may be integrated in the same unit. The above integrated unit can be implemented in the form of hardware or in the form of hardware plus software function modules.

It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments are to be considered in all respects as illustrative and not restrictive, and the scope of the present application is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be included in the present application. Any reference signs in the claims should not be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims can also be implemented by one unit or device through software or hardware. Words such as first and second are used to denote names and do not denote any particular order.

It should be noted that the above embodiments are only used to explain the technical solutions of the present application and are not limiting. Although the present application is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of those technical solutions.

Claims

1. A dynamic chart class page data crawling method, the method comprising:

a) Start the browser with an automated test tool and enter a link to the website where the data is to be crawled;

b) crawling the page information related to the crawling keyword input by the user from the website to be crawled;

c) rendering and parsing the crawled page;

d) taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image;

e) identifying the screenshot picture according to a pre-trained picture recognition model, and obtaining content in the screenshot picture;

f) determining whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed; and

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have been traversed, ending the process; or

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have not been traversed, continuing with b) to f) above.
2. The method according to claim 1, wherein taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image comprises:

Determining, by the automated testing tool, whether a chart exists in the parsed page;

When it is determined that there is no chart in the parsed page, the information in the parsed page is crawled, and the crawled information is saved according to a preset data format;

When it is determined that there is a chart in the parsed page, a screenshot is taken on the chart in the parsed page to obtain a screenshot image.
3. The method according to claim 1 or 2, wherein taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image comprises:

calculating a perceptual hash value of the screenshot picture;

determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold; and

deleting the screenshot picture when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold.
4. The method of claim 3, wherein taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image further comprises:

when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, associating the screenshot picture with the corresponding parsed page and storing them in a preset location.
5. The method of claim 1, wherein the training process of the pre-trained picture recognition model comprises:

acquiring a plurality of pictures;

pre-processing the plurality of pictures to obtain a data set for training the picture recognition model;

dividing the data set into a training set and a test set by using a cross-validation method;

randomly selecting a first preset number of training samples from the training set to train the picture recognition model;

testing the accuracy of the trained picture recognition model by using the test set;

ending the training if the accuracy is greater than or equal to a preset accuracy threshold; and

retraining the picture recognition model if the accuracy is less than the preset accuracy threshold.
6. The method of claim 5, wherein retraining the picture recognition model comprises:

adding, from the part of the training set other than the first preset number of training samples, a second preset number of training samples to the first preset number of training samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
7. The method according to claim 5, wherein the second preset number is a preset fixed value, a preset proportional value, or a preset proportion of the first preset number.
8. A dynamic chart class page data crawling device, characterized in that the device comprises:

a startup module, configured to start a browser by using an automated testing tool and enter a link to a website to be crawled;

a crawling module, configured to crawl, from the website to be crawled, page information related to the crawling keyword input by the user;

a parsing module, configured to render and parse the crawled page;

a screenshot module, configured to take a screenshot of the parsed page by using the automated testing tool to obtain a screenshot image and save the screenshot image; and

an identification module, configured to identify the screenshot image according to a pre-trained picture recognition model to obtain the content in the screenshot image.
9. A terminal, comprising a processor and a memory, wherein the processor implements the following steps when executing the computer readable instructions stored in the memory:

a) Start the browser with an automated test tool and enter a link to the website where the data is to be crawled;

b) crawling the page information related to the crawling keyword input by the user from the website to be crawled;

c) rendering and parsing the crawled page;

d) taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image;

e) identifying the screenshot picture according to a pre-trained picture recognition model, and obtaining content in the screenshot picture;

f) determining whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed; and

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have been traversed, ending the process; or

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have not been traversed, continuing with b) to f) above.
10. The terminal according to claim 9, wherein taking a screenshot of the parsed page by the automated testing tool to obtain a screenshot image and saving the screenshot image comprises:

Determining, by the automated testing tool, whether a chart exists in the parsed page;

When it is determined that there is no chart in the parsed page, the information in the parsed page is crawled, and the crawled information is saved according to a preset data format;

When it is determined that there is a chart in the parsed page, a screenshot is taken on the chart in the parsed page to obtain a screenshot image.
11. The terminal according to claim 9 or 10, wherein taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image comprises:

calculating a perceptual hash value of the screenshot picture;

determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold; and

deleting the screenshot picture when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold.
12. The terminal according to claim 11, wherein taking a screenshot of the parsed page by the automated testing tool to obtain a screenshot image and saving the screenshot image further comprises:

when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, associating the screenshot picture with the corresponding parsed page and storing them in a preset location.
13. The terminal according to claim 9, wherein the training process of the pre-trained picture recognition model comprises:

acquiring a plurality of pictures;

pre-processing the plurality of pictures to obtain a data set for training the picture recognition model;

dividing the data set into a training set and a test set by using a cross-validation method;

randomly selecting a first preset number of training samples from the training set to train the picture recognition model;

testing the accuracy of the trained picture recognition model by using the test set;

ending the training if the accuracy is greater than or equal to a preset accuracy threshold; and

retraining the picture recognition model if the accuracy is less than the preset accuracy threshold.
14. The terminal according to claim 13, wherein retraining the picture recognition model comprises:

adding, from the part of the training set other than the first preset number of training samples, a second preset number of training samples to the first preset number of training samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
15. A non-volatile readable storage medium having stored thereon computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the following steps:

a) Start the browser with an automated test tool and enter a link to the website where the data is to be crawled;

b) crawling the page information related to the crawling keyword input by the user from the website to be crawled;

c) rendering and parsing the crawled page;

d) taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image;

e) identifying the screenshot picture according to a pre-trained picture recognition model, and obtaining content in the screenshot picture;

f) determining whether the website to be crawled and the pages corresponding to the crawling keyword have been traversed; and

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have been traversed, ending the process; or

when it is determined that the website to be crawled and the pages corresponding to the crawling keyword have not been traversed, continuing with b) to f) above.
16. The storage medium according to claim 15, wherein taking a screenshot of the parsed page by the automated testing tool to obtain a screenshot image and saving the screenshot image comprises:

Determining, by the automated testing tool, whether a chart exists in the parsed page;

When it is determined that there is no chart in the parsed page, the information in the parsed page is crawled, and the crawled information is saved according to a preset data format;

When it is determined that there is a chart in the parsed page, a screenshot is taken on the chart in the parsed page to obtain a screenshot image.
17. The storage medium according to claim 15 or 16, wherein taking a screenshot of the parsed page by the automated test tool to obtain a screenshot image and saving the screenshot image comprises:

calculating a perceptual hash value of the screenshot picture;

determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold; and

deleting the screenshot picture when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold.
18. The storage medium of claim 17, wherein taking a screenshot of the parsed page by the automated testing tool to obtain a screenshot image and saving the screenshot image further comprises:

when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, associating the screenshot picture with the corresponding parsed page and storing them in a preset location.
19. The storage medium of claim 15, wherein the training process of the pre-trained picture recognition model comprises:

acquiring a plurality of pictures;

pre-processing the plurality of pictures to obtain a data set for training the picture recognition model;

dividing the data set into a training set and a test set by using a cross-validation method;

randomly selecting a first preset number of training samples from the training set to train the picture recognition model;

testing the accuracy of the trained picture recognition model by using the test set;

ending the training if the accuracy is greater than or equal to a preset accuracy threshold; and

retraining the picture recognition model if the accuracy is less than the preset accuracy threshold.
20. The storage medium of claim 19, wherein retraining the picture recognition model comprises:

adding, from the part of the training set other than the first preset number of training samples, a second preset number of training samples to the first preset number of training samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.