CN108595583B - Dynamic graph page data crawling method, device, terminal and storage medium

Info

Publication number: CN108595583B
Application number: CN201810349975.3A
Authority: CN
Inventors: 阮晓雯; 徐亮; 肖京
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-04-18
Filing date: 2018-04-18
Publication date: 2022-12-02
Anticipated expiration: 2038-04-18
Also published as: CN108595583A; WO2019200783A1

Abstract

A method for crawling data from dynamic chart pages comprises the following steps: starting a browser with an automated testing tool and inputting a link to the website to be crawled; crawling, from the website to be crawled, page information related to a crawling keyword input by a user; rendering and parsing the crawled page; taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture, and storing the screenshot picture; recognizing the screenshot picture with a pre-trained picture recognition model to obtain the content of the screenshot picture; judging whether all pages of the website that correspond to the crawling keyword have been traversed; ending the process when all such pages have been traversed; otherwise, continuing the above process. The invention also provides a dynamic chart page data crawling device, a terminal and a storage medium. The invention can automatically crawl dynamically loaded chart data and recognize the content of the pictures.

Description

Dynamic graph page data crawling method, device, terminal and storage medium

Technical Field

The invention relates to the technical field of web crawlers, in particular to a method, a device, a terminal and a storage medium for crawling dynamic graph page data.

Background

With the spread of modern Web technologies such as Ajax (Asynchronous JavaScript and XML), a popular method for creating interactive Web applications without sacrificing browser compatibility, the form of Web page data has changed profoundly. More and more page content on the Internet is generated dynamically with Ajax, and users often encounter prompts such as "click to load more", or pages that automatically load more content as the mouse scrolls. These new forms of web pages require user interaction to trigger the generation and display of content. They improve the browsing experience to some extent, but pose a significant challenge to traditional data collection methods based on crawling HTML files.

In particular, dynamically loaded chart data in a web page is generally displayed only after being loaded asynchronously and is difficult for a traditional crawler to crawl; some text data is encrypted and then displayed in chart form, so the chart cannot be downloaded directly; pages that require input are often encountered while crawling; in addition, interference information is sometimes added to a chart, making the real data in the chart difficult to obtain. At present, obtaining dynamic chart data generally requires a large investment of manual labor.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a terminal and a storage medium for crawling dynamic graph page data, which can automatically crawl dynamically loaded graph data, perform screenshot on the crawled graph data, and then input the screenshot into a pre-trained picture recognition model to recognize the content in a picture.

The invention provides a method for crawling page data of a dynamic chart class, which comprises the following steps:

a) Starting a browser by adopting an automatic testing tool, and inputting a link of a website to be crawled;

b) Crawling page information related to a crawling keyword input by a user from the website of the data to be crawled;

c) Rendering and analyzing the crawled page;

d) Screenshot is carried out on the analyzed page through the automatic testing tool to obtain a screenshot picture, and the screenshot picture is stored;

e) Identifying the screenshot picture according to a pre-trained picture identification model to obtain the content in the screenshot picture;

f) Judging whether all pages of the website to be crawled that correspond to the crawling keyword have been traversed; and

when it is determined that all such pages have been traversed, ending the process; or

when it is determined that not all such pages have been traversed, continuing to execute steps b) to f).

In a preferred embodiment, taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture and storing the screenshot picture includes:

judging, with the automated testing tool, whether a chart exists in the parsed page;

when it is determined that no chart exists in the parsed page, crawling the information in the parsed page and storing the crawled information in a preset data format; and

when it is determined that a chart exists in the parsed page, taking a screenshot of the chart in the parsed page to obtain a screenshot picture;

calculating a perceptual hash value of the screenshot picture;

judging whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured picture is greater than a preset similarity threshold; and

deleting the screenshot picture when the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is determined to be greater than the preset similarity threshold.

In a preferred embodiment, taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture and storing the screenshot picture further includes:

when the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is determined to be less than or equal to the preset similarity threshold, storing the screenshot picture in association with the corresponding parsed page at a preset specific position.

In a preferred embodiment, the training process of the pre-trained picture recognition model includes:

acquiring a plurality of pictures;

preprocessing the plurality of pictures to obtain a data set of a recognition model of the pictures to be trained;

dividing the data set into a training set and a test set by adopting a cross validation method;

randomly selecting a first preset number of training samples from the training set to train the picture recognition model;

testing the accuracy of the trained picture recognition model by using the test set;

if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;

and if the accuracy is smaller than the preset accuracy threshold, retraining the picture recognition model.

In a preferred embodiment, the retraining the picture recognition model includes:

adding a second preset number of training samples, taken from the part of the training set outside the first preset number of training samples, to the first preset number of training samples, and retraining until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.

In a preferred embodiment, the second preset number is a preset fixed value, or a preset proportional value of the first preset number.

A second aspect of the present invention provides a dynamic graph-like page data crawling apparatus, including:

the starting module is used for starting the browser by adopting an automatic testing tool and inputting a link of a website to be crawled;

the crawling module is used for crawling page information related to crawling keywords input by a user from the website of the data to be crawled;

the analysis module is used for rendering and analyzing the crawled page;

the screenshot module is used for screenshot the analyzed page through the automatic test tool to obtain a screenshot picture and storing the screenshot picture;

and the recognition module is used for recognizing the screenshot picture according to a pre-trained picture recognition model to obtain the content in the screenshot picture.

A third aspect of the present invention provides a terminal, where the terminal includes a processor and a memory, and the processor is configured to implement the method for crawling page data of dynamic graph classes when executing a computer program stored in the memory.

A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the dynamic graph-like page data crawling method.

According to the method, device, terminal and storage medium for crawling dynamic chart page data, Selenium is used to simulate user operations such as logging in through a browser, dynamic loading and downloading screenshots, and this is combined with web crawler technology. Dynamically loaded chart data can thus be crawled automatically, and the crawled information is fully consistent with the images and text seen by a real user. The crawled chart data is captured as a screenshot and input into a pre-trained picture recognition model, which recognizes the content of the picture. Compared with traditional web crawler products, the invention offers good compatibility, high speed and accurate data capture.

Secondly, during training of the picture recognition model, the number of training samples participating in training is increased step by step. On the premise that the recognition rate of the picture recognition model is guaranteed, fewer samples participate in training, which shortens the training time as far as possible, improves training efficiency, and finds the training-set size that best balances the accuracy and the efficiency of the picture recognition model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a method for crawling dynamic chart page data according to a first embodiment of the present invention.

Fig. 2 is a flowchart of a method for taking a screenshot of a parsed page and storing the screenshot picture according to a second embodiment of the present invention.

Fig. 3 is a flowchart of a training method of a picture recognition model according to a third embodiment of the present invention.

Fig. 4 is a structural diagram of a dynamic graph page data crawling apparatus according to a fourth embodiment of the present invention.

Fig. 5 is a sub-functional block diagram of a deduplication module according to a fifth embodiment of the present invention.

Fig. 6 is a sub-function block diagram of a training module according to a sixth embodiment of the present invention.

Fig. 7 is a block diagram of a terminal according to a seventh embodiment of the present invention.

The following detailed description will further illustrate the invention in conjunction with the above-described figures.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The described embodiments are merely some, rather than all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

The dynamic graph page data crawling method provided by the embodiment of the invention is applied to one or more terminals. The method for crawling the page data of the dynamic graph class can also be applied to a hardware environment formed by a terminal and a server connected with the terminal through a network. Networks include, but are not limited to: a wide area network, a metropolitan area network, or a local area network. The dynamic graph page data crawling method can be executed by a server or a terminal; or may be performed by both the server and the terminal.

A terminal that needs to crawl dynamic chart page data may directly integrate the dynamic chart page data crawling function provided by the method of the invention, or install a client that implements the method. As another example, the method provided by the present invention may run on a device such as a server in the form of a Software Development Kit (SDK); an interface to the dynamic chart page data crawling function is provided in the form of the SDK, and the terminal or other devices may crawl dynamic chart page data through the provided interface.

Example one

Fig. 1 is a flowchart of a method for crawling dynamic chart page data according to a first embodiment of the present invention. The execution sequence in the flowchart may be changed and some steps may be omitted according to different requirements.

S11, starting a browser with an automated testing tool, and inputting a link to the website to be crawled.

Selenium WebDriver (hereinafter referred to as Selenium), a software test automation technology, has strong visual automatic interaction capabilities. It simulates the interaction between a person and a web page programmatically, thereby triggering dynamic data loading and obtaining the dynamically generated data. Selenium can faithfully simulate the operations a user performs on a web page, such as clicking, viewing more, logging in automatically, clicking a link, filling in a form, scrolling the mouse, dragging the mouse, scrolling down after the page has finished loading, clicking to turn a page, and saving a screenshot.

In this embodiment, a browser is opened through the Selenium tool, the link (Uniform Resource Locator, URL) of the website to be crawled is entered in the browser, and the Selenium tool calls the get() method to open the web page of the website to be crawled that the user entered.

For example, if the user needs to crawl the "face recognition book" data on the "current" website, the browser (for example, google browser) is opened through the selenium tool, and the "current" website URL "www.

In this embodiment, if the user needs to crawl data of multiple websites, the links of the websites where data are to be crawled may be simultaneously input into the queue of the browser opened by the selenium tool, and the crawler program crawls the data in the websites where data are to be crawled in sequence.
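The queue-based handling of several websites described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `crawl_sites`, `make_chrome_driver` and `handle_page` are hypothetical, and the driver is passed in as a factory so that the loop can be exercised without a real browser. Only the WebDriver calls `get()` and `quit()` are assumed from Selenium.

```python
from collections import deque


def make_chrome_driver():
    """Start a real browser; requires the third-party `selenium` package."""
    from selenium import webdriver  # imported lazily so the sketch loads without it
    return webdriver.Chrome()


def crawl_sites(urls, make_driver, handle_page):
    """Open each queued URL in turn and hand the loaded page to `handle_page`."""
    queue = deque(urls)              # links of the websites to be crawled
    driver = make_driver()
    visited = []
    try:
        while queue:
            url = queue.popleft()
            driver.get(url)          # WebDriver.get() opens the page
            handle_page(driver, url) # hypothetical per-page crawling callback
            visited.append(url)
    finally:
        driver.quit()                # always release the browser
    return visited
```

With Selenium installed, a call such as `crawl_sites(url_list, make_chrome_driver, handler)` would visit the websites one after another in a single browser session.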

S12, crawling, from the website to be crawled, the page information related to the crawling keyword input by the user.

After the website to be crawled is opened through the Selenium tool, the user inputs a crawling keyword, for example "face recognition", and the Selenium tool then simulates the user browsing the page information of all web pages on the website that relate to "face recognition".

S13, rendering and parsing the crawled page.

While crawling a page, the Selenium tool triggers Ajax to asynchronously request data from the server. After the raw data in the reply is received, it is formatted and assembled into new HTML nodes, which are inserted into the initial HTML file, and the browser kernel's rendering engine finally displays the dynamic content. The Selenium service sends a page-acquisition request over the wire protocol and then operates the browser API to obtain the original page loaded by the browser. The page is returned to the Selenium service over the wire protocol, and once the Selenium service has the page, it hands the page to a parsing module for page parsing.

S14, taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture, and storing the screenshot picture.

The driver of the Selenium tool instructs the browser to execute the command, and the browser finally performs the screenshot-and-save operation in its kernel. The result is identical to that of a user clipping and saving a picture on the page with a mouse.

Preferably, taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture and storing the screenshot picture may further include: deduplicating the charts captured from the parsed page according to their perceptual hash values.

For a further refinement of step S14, see Fig. 2 and its corresponding description.

S15, recognizing the screenshot picture with a pre-trained picture recognition model to obtain the content of the screenshot picture.

In this embodiment, for the method of training the picture recognition model, see Fig. 3 and its corresponding description.

S16, judging whether all pages of the website to be crawled that correspond to the crawling keyword have been traversed.

When it is determined that all such pages have been traversed, the process ends; otherwise, when it is determined that not all such pages have been traversed, steps S12 to S16 continue to be executed.
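The control flow of steps S12 to S16 can be sketched as a simple loop. The function and parameter names below (`crawl_keyword_pages`, `render`, `screenshot`, `recognize`) are hypothetical stand-ins for the crawling, parsing, screenshot and recognition components; the sketch only shows how the traversal check ends the process.

```python
def crawl_keyword_pages(pages, render, screenshot, recognize):
    """Run S13 to S15 on every page for the keyword until none remain (S16)."""
    contents = []
    for page in pages:                       # S12: next page related to the keyword
        parsed = render(page)                # S13: render and parse the page
        picture = screenshot(parsed)         # S14: screenshot the parsed page
        contents.append(recognize(picture))  # S15: recognize the picture content
    return contents                          # S16: all pages traversed, process ends
```

Passing the pages as an iterable keeps the traversal test implicit: the loop stops exactly when no untraversed page remains.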

In summary, the method for crawling dynamic chart page data provided by the invention uses Selenium to simulate user operations such as logging in through a browser, dynamic loading and downloading screenshots, and combines this with web crawler technology. Dynamically loaded chart data can thus be crawled automatically, and the crawled information is fully consistent with the images and text seen by a real user. The crawled chart data is captured as a screenshot and input into a pre-trained picture recognition model, which recognizes the content of the picture. Compared with traditional web crawler products, the method offers good compatibility, high speed and accurate data crawling.

Example two

Fig. 2 is a flowchart of a method for taking a screenshot of a parsed page to obtain a screenshot picture and storing the screenshot picture, according to a second embodiment of the present invention. The execution sequence in the flowchart may be changed and some steps may be omitted according to different requirements.

S21, judging, with the automated testing tool, whether a chart exists in the parsed page.

In this embodiment, the automated testing tool determines whether a chart exists in the parsed page by identifying whether tags related to chart display and control exist in the parsed page.

When the automated testing tool identifies that tags related to chart display and control exist in the parsed page, it determines that a chart exists in the parsed page; when it identifies that no such tags exist in the parsed page, it determines that no chart exists in the parsed page.

The tags related to chart display and control include tags such as img, table, tr and td, and attributes such as colspan.

A chart in a web page is written in HTML and may contain a number of chart-related DIV, CSS and HTML tags. Whether a chart exists in the parsed page can therefore be judged by checking whether chart-related tag attributes exist: when chart-related tag attributes are identified, it is determined that a chart exists in the parsed page, and when no chart-related tag attributes are identified, it is determined that no chart exists in the parsed page.

When it is determined that no chart exists in the parsed page, performing step S22; otherwise, when it is determined that the diagram exists in the parsed page, step S23 is performed.
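The tag-based check of step S21 can be illustrated with Python's standard `html.parser` module. This is only a sketch under simple assumptions: the tag set is limited to the tags named above, and `ChartTagFinder` and `page_has_chart` are hypothetical names. A real page may need a richer set of tags or attribute checks.

```python
from html.parser import HTMLParser

# Tags named in the description as related to chart display and control.
CHART_TAGS = {"img", "table", "tr", "td"}


class ChartTagFinder(HTMLParser):
    """Collects every chart-related tag that appears in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in CHART_TAGS:
            self.found.add(tag)


def page_has_chart(html):
    """Return True when the parsed page contains at least one chart-related tag."""
    finder = ChartTagFinder()
    finder.feed(html)
    finder.close()
    return bool(finder.found)
```

A caller would branch on the result: `page_has_chart(html)` being false corresponds to step S22, and true corresponds to step S23.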

S22, crawling the information in the analyzed page, and storing the crawled information according to a preset data format.

When it is determined that no chart exists in the parsed page, no screenshot of the parsed page is taken; a crawler program directly crawls the information in the parsed page and stores it in a preset data format.

In this embodiment, different operations are performed depending on whether a chart exists in the parsed page. When a chart exists, the chart in the page is captured as a screenshot; when no chart exists, no screenshot operation is performed. This saves network resources by avoiding screenshots of every parsed page. Skipping the screenshot when no chart exists also simplifies the operation flow and improves crawling efficiency.

S23, taking a screenshot of the chart in the parsed page to obtain a screenshot picture.

In this embodiment, when the Selenium tool simulates the user capturing the chart in the parsed page, this may further include downloading the chart in the parsed page.
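One possible way to capture each chart element is sketched below. It assumes a Selenium 4 style driver, whose `find_elements(by, value)` method locates elements and whose elements expose `screenshot(filename)`. The function name `screenshot_charts`, the file naming scheme and the default tag list are hypothetical choices for illustration.

```python
from pathlib import Path

TAG_NAME = "tag name"  # same string value as selenium's By.TAG_NAME locator


def screenshot_charts(driver, out_dir, tags=("img", "table")):
    """Save one PNG per chart-related element and return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for tag in tags:
        for i, element in enumerate(driver.find_elements(TAG_NAME, tag)):
            path = out / f"{tag}_{i}.png"
            # WebElement.screenshot() returns True when the file was written.
            if element.screenshot(str(path)):
                saved.append(path)
    return saved
```

Capturing the element rather than the whole window keeps the picture limited to the chart, which is what the recognition step consumes.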

S24, calculating a perceptual hash value of the screenshot picture.

In this embodiment, a perceptual hash algorithm is used to calculate the perceptual hash value of the screenshot picture. The specific process includes:

1) Carrying out graying processing on the screenshot picture;

2) Calculating the gray average value of the grayed screenshot picture;

3) Comparing the gray value of each pixel of the screenshot picture after graying with the average gray value;

4) Recording a 1 for each pixel of the grayed screenshot picture whose gray value is greater than or equal to the average gray value, and a 0 for each pixel whose gray value is less than the average gray value;

5) Concatenating the comparison results of all pixels obtained in step 4) according to a preset concatenation rule to obtain the perceptual hash value of the screenshot picture.
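Steps 1) to 5) can be sketched in pure Python as follows. The sketch assumes the picture is already available as rows of (R, G, B) tuples at its final size; the luma weights used for graying and the row-major concatenation order are assumptions, since the description only requires graying and a preset concatenation rule. The function names are hypothetical.

```python
def to_gray(pixel):
    """Step 1: convert one (R, G, B) pixel to a gray value (assumed luma weights)."""
    r, g, b = pixel
    return (r * 299 + g * 587 + b * 114) // 1000


def perceptual_hash(rgb_rows):
    """Return the hash as a string of '0' and '1', one character per pixel."""
    gray = [to_gray(p) for row in rgb_rows for p in row]  # step 1, row-major order
    mean = sum(gray) / len(gray)                          # step 2: average gray value
    # Steps 3 to 5: compare each pixel with the mean and concatenate the results.
    return "".join("1" if value >= mean else "0" for value in gray)
```

For an 8 × 8 picture this yields a 64-character string, matching the 64-bit value used in the comparison step.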

S25, judging whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured picture is greater than a preset similarity threshold.

In this embodiment, judging whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is greater than the preset similarity threshold specifically includes: counting the number of bit positions at which the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture have the same value; and judging whether that number is greater than the preset similarity threshold.

For example, suppose the grayed screenshot picture is 8 × 8 pixels and the average gray value is 45. When the gray value of the pixel in the first row and first column is greater than or equal to 45, the comparison result is recorded as 1, otherwise as 0; when the gray value of the pixel in the first row and second column is greater than or equal to 45, the comparison result is recorded as 1, otherwise as 0; when the gray value of the pixel in the first row and third column is greater than or equal to 45, the comparison result is recorded as 1, otherwise as 0; and so on. The comparison results are then concatenated from left to right and from top to bottom into a 64-bit number, which is the perceptual hash value of the screenshot picture. When the number of bit positions with the same value between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture (for example, 61) is greater than the preset similarity threshold (for example, 60), the screenshot picture and the previously captured picture are judged to be the same.

When it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured picture is greater than the preset similarity threshold, step S26 is performed; otherwise, when it is determined that the similarity is less than or equal to the preset similarity threshold, step S27 is performed.
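The comparison in step S25 amounts to counting equal bit positions. A minimal sketch, assuming the hash values are equal-length strings of '0' and '1' and using the example threshold of 60; the function names `matching_bits` and `is_duplicate` are hypothetical.

```python
def matching_bits(hash_a, hash_b):
    """Count the bit positions at which two equal-length hash strings agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hash values must have the same length")
    return sum(a == b for a, b in zip(hash_a, hash_b))


def is_duplicate(new_hash, stored_hashes, threshold=60):
    """True when any previously captured picture matches in more than `threshold` bits."""
    return any(matching_bits(new_hash, old) > threshold for old in stored_hashes)
```

A true result corresponds to step S26 (delete the screenshot picture) and a false result to step S27 (store it).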

S26, deleting the screenshot picture.

S27, storing the screenshot picture in association with the corresponding parsed page at a preset specific position.

In this embodiment, the preset specific position is dedicated to storing screenshot pictures and their corresponding parsed pages. The specific position may be a specific folder or a folder with a specific name. Each screenshot picture is stored in association with its corresponding parsed page, so that the page containing a chart can be found quickly afterwards and the content of the chart can be further analyzed with a context-based semantic analysis method, using information such as the position of the chart in the page.
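One way to realize the associated storage of step S27 is a dedicated folder with a small index file. The layout below (a PNG, an HTML file and an `index.json` that links them) and the name `store_screenshot` are assumptions made for illustration; the description only requires that the picture and its page be stored together at a preset position.

```python
import json
from pathlib import Path


def store_screenshot(folder, name, png_bytes, page_url, page_html):
    """Save a screenshot and its parsed page side by side and record the link."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / f"{name}.png").write_bytes(png_bytes)
    (folder / f"{name}.html").write_text(page_html, encoding="utf-8")
    index_path = folder / "index.json"
    # Load the existing index, if any, so earlier associations are kept.
    index = json.loads(index_path.read_text(encoding="utf-8")) if index_path.exists() else {}
    index[f"{name}.png"] = {"url": page_url, "page": f"{name}.html"}
    index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index
```

Looking up a picture name in the index then gives the page on which the chart appeared.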

In summary, the screenshot deduplication method provided by the invention judges from the perceptual hash values whether a screenshot picture is the same as a previously captured picture. The perceptual hash calculation is accurate, downloads with identical content are deleted, redundant screenshot pictures are removed, and storage space is effectively saved. In addition, storing each screenshot picture in association with its corresponding parsed page facilitates later management and analysis.

Example three

Fig. 3 is a flowchart of a training method of a picture recognition model according to a third embodiment of the present invention. The execution sequence in the flowchart may be changed and some steps may be omitted according to different requirements.

S31, acquiring a plurality of pictures.

In this embodiment, a plurality of pictures may be obtained automatically from websites on the Internet by another small crawler, or downloaded manually from search engines (e.g., Baidu, Google, 360), to form a picture data set, which is stored in a local database. The content of the pictures may include, but is not limited to, numbers, characters, letters, images and tables; letters may be upper or lower case.

S32, preprocessing the plurality of pictures to obtain a data set for the picture recognition model to be trained.

In this embodiment, each picture in the picture data set is preprocessed. The preprocessing includes background removal, segmentation, scaling, cropping, flipping and/or distortion, so that the training pictures have the same size and the same viewing angle before the picture recognition model is trained, which effectively improves the authenticity and accuracy of the picture recognition model.

In this embodiment, a binarization method may be used to remove the background: a pixel whose value is greater than a preset threshold is set to white, and any other pixel is set to black. The original picture is thus converted into a picture containing only black and white, which effectively removes interfering elements in the picture background.
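The binarization rule can be written in a few lines. The sketch assumes the picture is given as rows of gray values in the range 0 to 255 and uses 128 as an illustrative threshold; both the name `binarize` and the default threshold are assumptions.

```python
def binarize(gray_rows, threshold=128):
    """Map each gray value to white (255) if above the threshold, else black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray_rows]
```

In practice the threshold would be tuned to the pictures being crawled, since the description only states that it is preset.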

In this embodiment, each picture in the picture data set may be segmented with a segmentation function, so that each number or character in the picture is separated into an individual number or character.

S33, dividing the data set into a training set and a test set by a cross-validation method.

The training set is used for training the picture recognition model, and the test set is used for testing the performance of the trained picture recognition model. If the accuracy rate of the test is higher, the performance of the trained picture recognition model is better; if the accuracy of the test is low, the performance of the trained picture recognition model is poor.

The data set may be partitioned in a suitable ratio (e.g., 3 to 2) to obtain the training set and the test set.
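A 3-to-2 partition can be sketched as a shuffled hold-out split. Note that this is a single split rather than a full k-fold cross-validation; the function name `split_dataset`, its parameters and the fixed seed are assumptions made so that the example is reproducible.

```python
import random


def split_dataset(samples, train_parts=3, test_parts=2, seed=0):
    """Shuffle the samples and split them train:test in the given ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```

With 100 samples and the default ratio, 60 samples go to the training set and 40 to the test set.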

S34, randomly selecting a first preset number of training samples from the training set to train the picture recognition model.

In this embodiment, the picture recognition model is not trained on all pictures in the original training set. Instead, a first preset number of training samples are selected from the original training set to participate in training, which reduces the number of samples participating in training and saves training time.

In addition, a random number generation algorithm is used for the random selection, which increases the randomness of the samples participating in training and improves the robustness of the picture recognition model.

In a first implementation, the first preset number may be a preset fixed value, for example 60; that is, 60 samples are randomly selected from the original training set to participate in the training of the picture recognition model.

In a second implementation, the first preset number may be a preset proportion, for example 1/10; that is, 1/10 of the samples in the original training set are randomly selected to participate in the training of the picture recognition model.

S35, testing the accuracy of the trained picture recognition model by using the test set, and finishing training if the accuracy is greater than or equal to a preset accuracy threshold; and if the accuracy is smaller than the preset accuracy threshold, retraining the picture recognition model.

Preferably, the retraining of the picture recognition model comprises: selecting a second preset number of training samples from the part of the training set outside the first preset number of training samples, adding them to the first preset number of training samples, and re-executing steps S32 to S35 until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.

In the first embodiment, the second preset number may be a preset fixed value, for example, 20; that is, 20 pictures are randomly selected from the part of the training set outside the first preset number of training samples to participate in the training of the picture recognition model.

In the second embodiment, the second preset number may be a preset ratio value, for example, 1/20; that is, 1/20 of the pictures in the part of the training set outside the first preset number of training samples are randomly selected to participate in the training of the picture recognition model.

In a third embodiment, the second preset number may be a preset ratio of the first preset number, for example, 1/5; that is, a number of pictures equal to 1/5 of the first preset number is randomly selected from the part of the training set outside the first preset number of training samples to participate in the training of the picture recognition model.

According to the picture recognition model training method, the number of training samples participating in training is increased step by step. On the premise that the recognition rate of the picture recognition model is guaranteed, fewer samples participate in training, which shortens the training time of the picture recognition model as far as possible and improves its training efficiency. The method thereby finds a number of training samples that balances the accuracy of the picture recognition model against its training efficiency.
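The step-by-step enlargement of the training subset described in S34 and S35 might be sketched as follows. This is a minimal sketch, not the patented implementation: `fit` and `evaluate` are hypothetical callables standing in for the actual model training and testing, and stopping once the whole training set has been used is an added assumption the text does not state.

```python
import random

def train_incrementally(train_set, test_set, fit, evaluate,
                        first_n, second_n, threshold, seed=0):
    """Train on a random subset and enlarge it until the accuracy is reached.

    fit(samples) -> model and evaluate(model, test_set) -> accuracy are
    hypothetical callables supplied by the caller.
    """
    pool = list(train_set)
    random.Random(seed).shuffle(pool)            # random selection of samples
    selected, remaining = pool[:first_n], pool[first_n:]
    while True:
        model = fit(selected)
        accuracy = evaluate(model, test_set)
        # Assumption: also stop when no unused training samples are left.
        if accuracy >= threshold or not remaining:
            return model, accuracy, len(selected)
        selected = selected + remaining[:second_n]   # add the second preset number
        remaining = remaining[second_n:]

# Toy stand-ins: the "model" is just the sample count, and accuracy grows with it.
toy_fit = lambda samples: len(samples)
toy_evaluate = lambda model, _test: min(1.0, model / 100)

model, accuracy, used = train_incrementally(
    range(200), range(50), toy_fit, toy_evaluate,
    first_n=60, second_n=20, threshold=0.9)
print(used, accuracy)  # 100 1.0
```

With the example values from the text (60 initial samples, 20 added per round), the toy run needs two retraining rounds before the threshold is met.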

The above description is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and it will be apparent to those skilled in the art that modifications may be made without departing from the inventive concept of the present invention, and these modifications are within the scope of the present invention.

Next, with reference to fig. 4 to 7, a functional module and a hardware structure of a terminal for implementing the above method for crawling page data of dynamic charts are respectively described.

Example four

Fig. 4 is a functional block diagram of a dynamic graph page data crawling apparatus according to a fourth embodiment of the present invention.

In some embodiments, the dynamic graph page data crawling apparatus 40 operates in a terminal. The dynamic graph page data crawling apparatus 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the dynamic graph page data crawling apparatus 40 may be stored in a memory and executed by at least one processor to crawl dynamic graph page data (see Fig. 1 and its related description for details).

In this embodiment, the dynamic graph page data crawling apparatus 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a starting module 401, a crawling module 402, a parsing module 403, a screenshot module 404, a deduplication module 405, a training module 406, a recognition module 407 and a determining module 408. A module referred to herein is a series of computer program segments that is stored in the memory, can be executed by at least one processor, and performs a fixed function. The functionality of the various modules is described in greater detail in the subsequent embodiments.

The starting module 401 is configured to start a browser by using an automated testing tool, and input a link of a website to be crawled.

The crawling module 402 is configured to crawl page information related to a crawling keyword input by a user from the website of the data to be crawled.

The parsing module 403 is configured to render and parse the crawled page.

When crawling a page, the Selenium tool triggers Ajax to asynchronously request data from the server. After the raw response data is received, it is formatted and assembled into new HTML nodes, which are inserted into the initial HTML file; finally, the rendering engine of the browser kernel displays the dynamic content. The Selenium service sends a page-fetch request over the wire protocol, which then operates the browser API to obtain the original page loaded by the browser. The page is returned to the Selenium service over the wire protocol, and once the Selenium service has obtained the page, it hands the page to the parsing module for page parsing.

The screenshot module 404 is configured to take a screenshot of the parsed page through the automated testing tool to obtain a screenshot picture, and to store the screenshot picture.

The deduplication module 405 is configured to deduplicate the charts in the parsed page according to their perceptual hash values.

The training module 406 is configured to train the picture recognition model.

The recognition module 407 is configured to recognize the screenshot picture according to a pre-trained picture recognition model, so as to obtain the content in the screenshot picture.

The determining module 408 is configured to determine whether the website of the data to be crawled and the pages corresponding to the crawling keyword have been traversed. When the determining module 408 determines that the website of the data to be crawled and the pages corresponding to the crawling keyword have not all been traversed, the above modules 401, 402, 403, 404, 405, and 407 are executed again.
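The repeat-until-traversed control flow of the determining module can be illustrated with a short loop. This is a sketch under assumptions: `crawl_all` and `process_page` are hypothetical names, and having `process_page` return newly discovered page links is an assumption made so that the loop has something to traverse; it stands in for the work of modules 401 to 405 and 407.

```python
def crawl_all(start_pages, process_page):
    """Process every reachable page once and return them in visiting order.

    process_page(page) is a hypothetical callable that handles one page and
    returns an iterable of further page links found on it.
    """
    pending = list(start_pages)
    visited = []
    while pending:                       # not everything has been traversed yet
        page = pending.pop(0)
        if page in visited:
            continue                     # already handled, skip it
        pending.extend(process_page(page))
        visited.append(page)
    return visited                       # traversal finished, flow ends

# Toy site: each page maps to the links it contains.
site = {"a": ["b", "c"], "b": ["c"], "c": []}
order = crawl_all(["a"], lambda page: site[page])
print(order)  # ['a', 'b', 'c']
```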

In summary, the dynamic graph page data crawling apparatus provided by the invention uses the Selenium technology to simulate user operations such as logging in through a browser, dynamic loading and screenshot downloading, and combines this with web crawler technology. Dynamically loaded chart data can therefore be crawled automatically, and the crawled information is consistent with the image-text information seen by a real user. After a screenshot is taken, the crawled chart data is input into a pre-trained picture recognition model, which recognizes the content in the picture. Compared with traditional web crawler products, the apparatus offers good compatibility, high speed and accurate data crawling.

EXAMPLE five

Fig. 5 is a block diagram of sub-functional modules of a deduplication module according to a fifth embodiment of the present invention. The deduplication module 405 includes: a first judging sub-module 4051, a saving sub-module 4052, a screenshot sub-module 4053, a calculating sub-module 4054, a second judging sub-module 4055, a deleting sub-module 4056 and an associating sub-module 4057.

The first judging sub-module 4051 is configured to judge, by using the automatic testing tool, whether a chart exists in the parsed page.

When the automatic testing tool identifies that a tag related to chart display and control exists in the parsed page, it is determined that a chart exists in the parsed page; when the automatic testing tool identifies that no tag related to chart display and control exists in the parsed page, it is determined that no chart exists in the parsed page.

The tags related to chart display and control include: img, table, tr, td, colspan, etc.
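The tag check described above can be approximated on a raw HTML string with Python's standard `html.parser` module. This is a swapped-in illustration, not the Selenium-based check of the embodiment: the names `CHART_TAGS`, `ChartTagFinder` and `page_has_chart` are hypothetical, and treating `colspan` as an attribute rather than a tag is an assumption based on how HTML defines it.

```python
from html.parser import HTMLParser

# Tags taken from the list in the text; colspan is handled as an attribute.
CHART_TAGS = {"img", "table", "tr", "td"}

class ChartTagFinder(HTMLParser):
    """Records whether any chart-related tag or attribute was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag in CHART_TAGS or any(name == "colspan" for name, _ in attrs):
            self.found = True

def page_has_chart(html):
    """Return True if the parsed page contains a chart-related tag."""
    finder = ChartTagFinder()
    finder.feed(html)
    return finder.found

print(page_has_chart("<p>plain text only</p>"))             # False
print(page_has_chart('<div><img src="chart.png"></div>'))   # True
```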

The saving sub-module 4052 is configured to crawl information in the parsed page when the first determining sub-module 4051 determines that no graph exists in the parsed page, and save the crawled information according to a preset data format.

In this embodiment, different operations are executed depending on whether a chart exists in the parsed page. When a chart exists in the parsed page, the chart in the page is captured while the parsed page is screenshotted; when no chart exists in the parsed page, no screenshot operation is performed. This saves network resources that would otherwise be wasted by taking screenshots of every parsed page. In addition, skipping the screenshot operation when no chart exists simplifies the operation flow and improves crawling efficiency.

The screenshot sub-module 4053 is configured to capture a screenshot of the diagram in the parsed page to obtain a screenshot picture when the first determining sub-module 4051 determines that the diagram exists in the parsed page.

In this embodiment, using the Selenium tool to simulate a user taking a screenshot of the chart in the parsed page further includes downloading the chart in the parsed page.

The calculating submodule 4054 is configured to calculate a perceptual hash value of the screenshot picture.

In this embodiment, the specific process of the calculating sub-module 4054 includes:

1) Carrying out graying processing on the screenshot picture;

2) Calculating the gray average value of the grayed screenshot picture;

3) Comparing the gray value of each pixel of the screenshot picture after the graying treatment with the gray value average value;

The second judging sub-module 4055 is configured to judge whether a similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture is greater than a preset similarity threshold.

In this embodiment, determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture is greater than a preset similarity threshold specifically includes: counting the number of bit positions that hold the same value in the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture; and judging whether that number of matching bits is greater than the preset similarity threshold.

For example, the grayed screenshot picture is 8 × 8 pixels, the average grayscale value is 45, when the grayscale value of the pixels in the first row and the first column is greater than 45, the comparison result is recorded as 1, otherwise, the comparison result is recorded as 0; when the gray value of the pixels in the first row and the second column is greater than 45, the comparison result is marked as 1, otherwise, the comparison result is marked as 0; when the gray value of the pixel in the first row and the third column is greater than 45, the comparison result is marked as 1, otherwise, the comparison result is marked as 0; and so on. And then combining the comparison results into 64-bit numbers from left to right and from top to bottom, wherein the 64-bit numbers are the perceptual hash values of the screenshot picture. And when the number of bits (for example, 61) having the same value between the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture is judged to be larger than the preset similarity threshold (for example, 60), the screenshot picture and the captured picture are the same.
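The worked example above can be reproduced in a few lines of plain Python. This is a minimal sketch that assumes the picture has already been grayed and reduced to an 8 × 8 matrix of gray values; the function names `average_hash`, `matching_bits` and `is_duplicate` are hypothetical and not taken from the patent.

```python
def average_hash(gray):
    """Return the 64-bit hash of an 8x8 gray matrix as a string of '0'/'1'.

    Each pixel is compared with the mean gray value, row by row from left
    to right and top to bottom, as in the example in the text.
    """
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if value > mean else "0" for value in pixels)

def matching_bits(hash_a, hash_b):
    """Count the bit positions at which the two hashes hold the same value."""
    return sum(a == b for a, b in zip(hash_a, hash_b))

def is_duplicate(hash_a, hash_b, threshold=60):
    """Two pictures count as the same if more than `threshold` bits match."""
    return matching_bits(hash_a, hash_b) > threshold

# Picture A: left half dark (10), right half bright (80), so the mean is 45.
picture_a = [[10] * 4 + [80] * 4 for _ in range(8)]
# Picture B: identical except for three brightened pixels in the first row.
picture_b = [row[:] for row in picture_a]
picture_b[0][0] = picture_b[0][1] = picture_b[0][2] = 80

hash_a, hash_b = average_hash(picture_a), average_hash(picture_b)
print(hash_a[:8])                     # 00001111
print(matching_bits(hash_a, hash_b))  # 61
print(is_duplicate(hash_a, hash_b))   # True
```

The toy pictures are chosen so that 61 of the 64 bits match, which exceeds the example threshold of 60.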

The deleting sub-module 4056 is configured to delete the screenshot picture when the second judging sub-module 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture is greater than the preset similarity threshold.

The associating sub-module 4057 is configured to, when the second judging sub-module 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the captured picture is smaller than or equal to the preset similarity threshold, associate the screenshot picture with the corresponding parsed page and store them at a preset specific position.
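The delete-or-associate decision of sub-modules 4056 and 4057 could be sketched as a filter over incoming screenshots. This is only an assumption-laden illustration: hashes are taken to be 64-character bit strings, each new screenshot is compared against every picture kept so far, and the name `deduplicate` and the list-of-pairs storage are hypothetical.

```python
def deduplicate(screenshots, threshold=60):
    """Keep only screenshots that are not near-duplicates of earlier ones.

    screenshots: list of (page_url, hash_bits) pairs in capture order.
    Returns the kept (page_url, hash_bits) pairs, i.e. the associations
    between a stored picture and its parsed page.
    """
    kept = []
    for page_url, hash_bits in screenshots:
        duplicate = any(
            sum(a == b for a, b in zip(hash_bits, kept_bits)) > threshold
            for _, kept_bits in kept
        )
        if duplicate:
            continue                        # deleting sub-module: discard it
        kept.append((page_url, hash_bits))  # associating sub-module: store it
    return kept

shots = [
    ("page1", "0" * 64),
    ("page2", "0" * 63 + "1"),   # 63 matching bits with page1: a duplicate
    ("page3", "1" * 64),         # no matching bits with page1: kept
]
kept_pages = [url for url, _ in deduplicate(shots)]
print(kept_pages)  # ['page1', 'page3']
```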

EXAMPLE six

Fig. 6 is a functional block diagram of a training module according to a sixth embodiment of the present invention. The training module 406 includes: an acquisition sub-module 4061, a pre-processing module 4062, a division sub-module 4063, a selection sub-module 4064, and a test sub-module 4065.

The obtaining sub-module 4061 is configured to obtain multiple pictures.

In this embodiment, a plurality of pictures may be obtained automatically from websites on the Internet by another crawler, or downloaded manually from search engines (e.g., Baidu, Google, 360), to form a picture data set, and the picture data set is stored in a local database. The content in the pictures may include, but is not limited to: numbers, characters, letters, images, tables, etc.; the letters may be upper case or lower case.

The preprocessing module 4062 is configured to preprocess the multiple pictures to obtain a data set of the picture recognition model to be trained.

In this embodiment, each picture in the picture data set is preprocessed separately, where the preprocessing includes background removal, segmentation, scaling, cropping, flipping and/or distortion, so that the training pictures have the same size and the same viewing angle before the picture recognition model is trained, which effectively improves the authenticity and accuracy of the picture recognition model.

The dividing sub-module 4063 is configured to divide the data set into a training set and a test set by using a cross validation method.

The selecting sub-module 4064 is configured to randomly select a first preset number of training samples from the training set to train the picture recognition model.

In this embodiment, instead of training all the pictures in the original training set with the picture recognition model, a first preset number of training sets are selected from the original training set to participate in the training, so that the number of training sets participating in the training can be reduced, and the training time of the picture recognition model can be saved.

In addition, random number generation algorithm is adopted for random selection, so that the randomness of the training set participating in training can be increased, and the robustness of the image recognition model can be improved.

The test sub-module 4065 is configured to test the accuracy of the trained picture recognition model by using the test set, and training ends if the accuracy is greater than or equal to a preset accuracy threshold. If the accuracy is smaller than the preset accuracy threshold, the selecting sub-module 4064 selects a second preset number of training samples from the part of the training set outside the first preset number of training samples and adds them to the first preset number of training samples, and the test sub-module 4065 is executed again, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.

In the third embodiment, the second preset number may be a preset proportion value of the first preset number, for example, 1/5, that is, in training sets other than the first preset number of training sets, randomly selecting 1/5 proportion pictures of the first preset number to participate in training of the picture recognition model.

According to the picture recognition model training method provided by the invention, by gradually increasing the number of training sets participating in training, on the premise of ensuring the recognition rate of the picture recognition model, fewer samples are used for participating in training, the training time of the picture recognition model can be shortened to the maximum extent, the training efficiency of the picture recognition model is improved, and the optimal number of training sets is found between the accuracy and the efficiency of the picture recognition model.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a dual-screen device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.

EXAMPLE seven

Fig. 7 is a schematic diagram of a terminal according to a seventh embodiment of the present invention.

The terminal 7 includes: a memory 71, at least one processor 72, a computer program 73 stored in said memory 71 and executable on said at least one processor 72, at least one communication bus 74.

The at least one processor 72 executes the computer program 73 to implement the steps in the above-mentioned dynamic graph class page data crawling method embodiment, or the at least one processor 72 executes the computer program 73 to implement the functions of each module/unit in the above-mentioned apparatus embodiment.

Illustratively, the computer program 73 can be partitioned into one or more modules/units, which are stored in the memory 71 and executed by the at least one processor 72 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program 73 in the terminal 7.

The terminal 7 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. It will be appreciated by those skilled in the art that Fig. 7 is merely an example of the terminal 7 and does not constitute a limitation of the terminal 7, which may comprise more or fewer components than those shown, combine some components, or use different components; for example, the terminal 7 may further comprise an input-output device, a network access device, a bus, etc.

The at least one processor 72 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The processor 72 may be a microprocessor or any conventional processor or the like; the processor 72 is the control center of the terminal 7 and connects the various parts of the whole terminal 7 by means of various interfaces and lines.

The memory 71 may be used for storing the computer program 73 and/or the modules/units, and the processor 72 implements various functions of the terminal 7 by running or executing the computer programs and/or modules/units stored in the memory 71 and calling data stored in the memory 71. The memory 71 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal 7, and the like. Further, the memory 71 may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid state storage device.

The modules/units integrated in the terminal 7, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments may be implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.

In the embodiments provided in the present invention, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of the unit is only one logical function division, and there may be another division manner in actual implementation.

In addition, functional units in the embodiments of the present invention may be integrated into the same processing unit, or each unit may exist alone physically, or two or more units are integrated into the same unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not to denote any particular order.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the same, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit of the technical solutions of the present invention.

Claims

1. A method for crawling data of a dynamic chart page is characterized by comprising the following steps:

b) Crawling page information related to crawling keywords input by a user from the website of the data to be crawled;

c) Rendering and analyzing the crawled page;

d) Judging whether a chart exists in the analyzed page through the automatic testing tool, crawling information in the analyzed page when the fact that the chart does not exist in the analyzed page is determined, storing the crawled information according to a preset data format, and instructing the browser to capture the chart in the analyzed page in a kernel through a driving program of the automatic testing tool to obtain a capture picture when the fact that the chart exists in the analyzed page is determined;

when the website of the data to be crawled and the page corresponding to the crawling keyword are determined to be traversed, ending the flow; or

2. The method of claim 1, wherein said capturing the parsed page by the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:

calculating a perceptual hash value of the screenshot picture;

3. The method of claim 2, wherein said capturing the parsed page with the automated testing tool to obtain a screenshot image and saving the screenshot image further comprises:

4. The method of claim 1, wherein the training process of the pre-trained picture recognition model comprises:

acquiring a plurality of pictures;

preprocessing the plurality of pictures to obtain a data set of a picture recognition model to be trained;

5. The method of claim 4, wherein the retraining the picture recognition model comprises:

and adding a second preset number of training sets to the first preset number of training sets from the training sets except the first preset number of training sets in the training sets until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.

6. The method of claim 5, wherein the second preset number is a preset fixed value, or a preset proportional value of the first preset number.

7. An apparatus for crawling data on a dynamic graph-like page, the apparatus comprising:

the crawling module is used for crawling page information related to a crawling keyword input by a user from the website of the data to be crawled;

the analysis module is used for rendering and analyzing the crawled page;

the screenshot module is used for judging whether a chart exists in the analyzed page through the automatic test tool, crawling information in the analyzed page when the fact that the chart does not exist in the analyzed page is confirmed, storing the crawled information according to a preset data format, and instructing the browser to screenshot the chart in the analyzed page in a kernel through a driving program of the automatic test tool to obtain a screenshot picture when the fact that the chart exists in the analyzed page is confirmed;

8. A terminal, characterized in that the terminal comprises a processor and a memory, the processor is configured to implement the method for crawling dynamic graph-like page data according to any of claims 1 to 6 when executing a computer program stored in the memory.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method for crawling dynamic graph-like page data according to any of claims 1 to 6.