WO2019136960A1

WO2019136960A1 - Method and device for crawling website data, storage medium and server

Info

Publication number: WO2019136960A1
Application number: PCT/CN2018/097499
Authority: WO
Inventors: 李晨光; 王盼
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2018-01-12
Filing date: 2018-07-27
Publication date: 2019-07-18
Also published as: CN108345641B; CN108345641A

Abstract

Disclosed are a method and device for crawling website data, a computer-readable storage medium and a server, solving the problem that many websites require input of a verification code to block a crawler system, resulting in the crawler system not being able to crawl data. The method provided by the present application comprises: initiating an access request to a target website of which data is to be crawled; receiving feedback information that the target website requires input of a verification code, and then acquiring a target verification code picture, on the target website, corresponding to feedback information; putting the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model; executing, according to the output verification code answer, a verification operation of the target website requiring input of a verification code; and when verification of the target website is passed, crawling data from the target website.

Description

Method, storage medium, server and device for crawling website data

This application claims priority to Chinese Patent Application No. 201810029529.4, filed with the Chinese Patent Office on January 12, 2018 and entitled "A method for crawling website data, a storage medium and a server", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of data processing technologies, and in particular, to a method for crawling website data, a computer readable storage medium, a server, and a device.

Background technique

In the Internet environment, data is a very important asset. At present, the crawler system is one of the important ways to effectively obtain data. However, many websites use the method of inputting a verification code to block the crawler system, so that the system cannot access these websites and complete data crawling.

Technical problem

The embodiments of the present application provide a method for crawling website data, a computer readable storage medium, a server and a device, which can automatically complete the verification required by a target website, overcome the website's obstacles to data crawling, and enable the crawler system to crawl the data on the website smoothly.

Technical solution

In a first aspect, a method of crawling website data is provided, including:

Initiating an access request to a target website whose data is to be crawled;

After receiving feedback information indicating that the target website requires input of a verification code, acquiring the target verification code picture on the target website that corresponds to the feedback information;

Inputting the target verification code picture into a pre-trained machine learning model for recognition, and obtaining the verification code answer output by the machine learning model;

Performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

After passing the verification of the target website, crawling data from the target website.

In a second aspect, a computer readable storage medium is provided, the computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the above method of crawling website data.

In a third aspect, a server is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the above method of crawling website data when executing the computer readable instructions.

In a fourth aspect, an apparatus for crawling website data is provided, which can include means for implementing the steps of the method of crawling website data described above.

Beneficial effect

In the embodiments of the present application, when the data of a target website is being crawled and the target website requires input of a verification code, the target verification code picture can be recognized by the machine learning model to obtain the verification code answer, and the verification of the target website is completed automatically according to that answer. This overcomes the website's obstacle to data crawling, so that the crawler system can successfully crawl the data on the website.

DRAWINGS

FIG. 1 is a flowchart of an embodiment of a method for crawling website data in an embodiment of the present application;

FIG. 2 is a schematic flowchart of pre-training a machine learning model in an application scenario according to a method for crawling website data in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a step 103 of a method for crawling website data in an application scenario according to an embodiment of the present application;

FIG. 4 is a structural diagram of an embodiment of an apparatus for crawling website data in an embodiment of the present application;

FIG. 5 is a schematic diagram of a server according to an embodiment of the present application.

Embodiments of the invention

The embodiments of the present application provide a method for crawling website data, a computer readable storage medium, a server, and a device, which are used to solve the problem that many websites require input of a verification code to block crawler systems, so that the crawler system cannot crawl data.

Referring to FIG. 1 , an embodiment of a method for crawling website data in an embodiment of the present application includes:

101. Initiate an access request to a target website whose data is to be crawled;

In this embodiment, when data needs to be crawled, an access request must first be initiated to the target website. The execution subject of this embodiment may be a terminal device or a server. Preferably, the execution subject in this embodiment is a server.

102. After receiving feedback information indicating that the target website requires input of a verification code, acquire the target verification code picture on the target website that corresponds to the feedback information;

It can be understood that, after an access request is initiated to the target website, a target website that guards against crawler systems will often require a verification code and generate a target verification code picture. The server therefore receives feedback information indicating that the target website requires input of a verification code, and then obtains the target verification code picture on the target website that corresponds to the feedback information.
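The source does not say how the server locates the verification code picture in the website's response. One possible approach, given here only as a minimal sketch, is to scan the returned HTML for an image element whose attributes mention a marker string. The marker `"captcha"`, the class `CaptchaImageFinder`, and the function `find_captcha_url` are all hypothetical names introduced for illustration; real websites label their verification code pictures in many different ways.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaImageFinder(HTMLParser):
    """Collects the src of <img> tags that look like verification code pictures."""

    def __init__(self, marker="captcha"):
        super().__init__()
        self.marker = marker  # assumed marker substring; real sites differ
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Search id, class and src together for the marker (case-insensitive).
        haystack = " ".join(
            (attributes.get(key) or "") for key in ("id", "class", "src")
        ).lower()
        if self.marker in haystack and attributes.get("src"):
            self.sources.append(attributes["src"])


def find_captcha_url(page_url, html, marker="captcha"):
    """Return the absolute URL of the first matching picture, or None if absent."""
    finder = CaptchaImageFinder(marker)
    finder.feed(html)
    if not finder.sources:
        return None
    return urljoin(page_url, finder.sources[0])
```

A return value of `None` would correspond to the case in which the website does not ask for a verification code and crawling can proceed directly.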

103. Input the target verification code picture into a pre-trained machine learning model for recognition, and obtain the verification code answer output by the machine learning model;

After obtaining the target verification code picture, the server may input it into a pre-trained machine learning model for recognition. The machine learning model may specifically be a deep learning model or a support vector machine (SVM) model. Because the machine learning model has been pre-trained on a large number of learning samples, it outputs a verification code answer once the target verification code picture is input, and this answer is the verification code requested by the target website.

Further, as shown in FIG. 2, the machine learning model can be pre-trained by the following steps:

201. Acquire multiple verification code pictures;

202. For each verification code picture, cut the verification code picture into picture blocks that each contain a single verification code character;

203. Perform binarization processing on each of the picture blocks;

204. Mark the corresponding verification code answer for each binarized picture block;

205. Input each binarized picture block into a machine learning model, and obtain the training answers output by the machine learning model;

206. Adjust the model parameters of the machine learning model to minimize the error between the obtained training answers and the marked verification code answers;

207. If the error rate between the output training answers and the marked verification code answers is less than a preset threshold, determine that training of the machine learning model is complete.

For the above step 201, before the machine learning model is trained, a large number of learning samples, that is, verification code pictures, must be collected. These verification code pictures can be obtained from websites or collected manually, which is not limited here.

For the above step 202, a verification code picture generally contains several characters. To improve the learning efficiency and recognition accuracy of the machine learning model, each verification code picture may therefore be cut so that a picture containing several verification code characters is divided into picture blocks that each contain a single verification code character. A cutting mode may be set for verification code pictures whose character spacing is similar, chosen according to the size of the verification code picture and the position of each verification code character on it. For example, if five verification code characters are equally spaced on a 100*20 picture, the picture can be cut at equal intervals into five 20*20 picture blocks, and the same cutting mode can be applied to all verification code pictures whose characters are equally spaced.
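The equal-interval cutting in the 100*20 example can be sketched as follows. This is only an illustration under simplifying assumptions: the picture is represented as a list of rows of grayscale pixel values rather than a decoded image file, and `cut_equal_blocks` is a hypothetical helper name, not one taken from the application.

```python
def cut_equal_blocks(picture, n_chars):
    """Split a picture (list of rows of pixel values) into n_chars equal-width blocks."""
    width = len(picture[0])
    if width % n_chars != 0:
        raise ValueError("picture width must be divisible by the number of characters")
    block_width = width // n_chars
    # Each block keeps every row but only its own slice of columns.
    return [
        [row[i * block_width:(i + 1) * block_width] for row in picture]
        for i in range(n_chars)
    ]


# A blank 100*20 picture (100 pixels wide, 20 pixels high) holding five characters.
picture = [[255] * 100 for _ in range(20)]
blocks = cut_equal_blocks(picture, 5)
print(len(blocks), len(blocks[0][0]), len(blocks[0]))  # 5 20 20
```

Pictures whose characters are not equally spaced would need a different cutting mode, as the paragraph above notes.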

It should be noted that the above characters refer to letters, numbers, words and/or symbols.

For the above step 203, after the picture blocks are obtained, each picture block is binarized to obtain the binarized picture blocks.
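The application does not specify a binarization method. A common and simple choice is a fixed global threshold, sketched below under the assumption that pixels are grayscale values from 0 to 255 and that dark pixels are character strokes. The threshold value 128 and the function name `binarize` are illustrative assumptions.

```python
def binarize(block, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (ink) if darker than threshold, else 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in block]


block = [[0, 200], [127, 128]]
print(binarize(block))  # [[1, 0], [1, 0]]
```

Verification code pictures with noisy or coloured backgrounds may need an adaptive threshold instead of a fixed one.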

For the above step 204, each of the binarized picture blocks is marked with a corresponding verification code answer, and the binarized picture blocks and the corresponding verification code answers are used as learning samples of the machine learning model.

For the above steps 205-207, the binarized picture blocks and the corresponding verification code answers are used as the input and the expected output, respectively, for training the machine learning model. The machine learning model in this embodiment may be a deep learning model or a support vector machine (SVM) model. For ease of understanding, this embodiment is described using an SVM model as an example. The SVM model is a supervised classification model whose prediction accuracy can be improved by adjusting its parameters, for example the choice of kernel function (RBF, poly, sigmoid, linear, and the like). After each binarized picture block is input into the machine learning model and the training answers output by the model are obtained, the model parameters are adjusted to minimize the error between the training answers and the marked verification code answers. When the error rate of the SVM model on the learning samples is less than a preset threshold, for example 10%, or equivalently when its recognition accuracy on the learning samples exceeds a preset threshold, for example 90%, training of the SVM model can be considered complete.
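The completion check in step 207 amounts to comparing an error rate against the preset threshold. A minimal sketch, assuming each answer is a single character label per picture block and using the 10% figure from the example above (the function names are hypothetical):

```python
def error_rate(training_answers, marked_answers):
    """Fraction of picture blocks whose training answer differs from its marked answer."""
    if len(training_answers) != len(marked_answers):
        raise ValueError("each training answer needs exactly one marked answer")
    if not marked_answers:
        raise ValueError("at least one learning sample is required")
    wrong = sum(
        1 for got, expected in zip(training_answers, marked_answers) if got != expected
    )
    return wrong / len(marked_answers)


def training_complete(training_answers, marked_answers, max_error_rate=0.10):
    """Step 207: training is complete once the error rate is below the threshold."""
    return error_rate(training_answers, marked_answers) < max_error_rate


print(error_rate(list("ab3d"), list("ab3x")))  # 0.25
```

An error rate below 10% is the same condition as a recognition accuracy above 90%, which is why the text treats the two thresholds as interchangeable.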

Regarding the parameter adjustment of the SVM model, the kernel function selected in this embodiment is the RBF kernel, which involves two parameters: a penalty factor C and a kernel parameter γ. The goal is to find the optimal parameter pair (C, γ) that gives the SVM model the best recognition performance. Preferably, the parameter adjustment problem can be reduced to selecting an optimal pair (C, γ) within a small "good region". Different values of C and γ yield different SVM models, and the aim is to find the combination that gives the best performance, that is, the lowest recognition error rate. In one application scenario of this embodiment, several (C, γ) values are selected, each is used to train on the same learning samples, and the (C, γ) value corresponding to the best-performing SVM model is chosen as the final parameter value for subsequent training.
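The selection procedure described above, trying several candidate pairs of the penalty factor and the kernel parameter on the same learning samples and keeping the best one, is an exhaustive grid search. The sketch below shows only the search loop. The callable `train_and_score` is a hypothetical stand-in that would train an SVM with the given pair and return its recognition error rate; here it is replaced by a toy function so that the example runs without any machine learning library.

```python
from itertools import product


def grid_search(train_and_score, penalty_values, kernel_param_values):
    """Return ((penalty, kernel_param), error) for the pair with the lowest error rate."""
    best_params, best_error = None, float("inf")
    for penalty, kernel_param in product(penalty_values, kernel_param_values):
        error = train_and_score(penalty, kernel_param)
        if error < best_error:
            best_params, best_error = (penalty, kernel_param), error
    return best_params, best_error


def toy_error(penalty, kernel_param):
    """Toy stand-in for training: its error is lowest at penalty=10, kernel_param=0.1."""
    return abs(penalty - 10) / 100 + abs(kernel_param - 0.1)


print(grid_search(toy_error, [1, 10, 100], [0.01, 0.1, 1]))  # ((10, 0.1), 0.0)
```

In practice the error rate would normally be measured on samples held out from training, so that the chosen pair is not simply the one that best memorizes the learning samples.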

Further, based on the step of pre-training the machine learning model in FIG. 2, as shown in FIG. 3, the foregoing step 103 may specifically include:

301. Cut the target verification code picture into target picture blocks each including an independent verification code.

302. Perform binarization processing on each of the target picture blocks.

303. Input the binarized target picture block as input to the machine learning model, and obtain a verification code answer output by the machine learning model.

For the above step 301, the principle is similar to that of step 202. Because a verification code picture generally contains several characters and the machine learning model operates on picture blocks, the target verification code picture must be cut before it is input into the machine learning model for recognition, so that a picture containing several verification code characters is divided into target picture blocks that each contain a single verification code character.

For the above step 302, the principle is similar to the above step 203, and details are not described herein again.

In step 303 above, after obtaining the binarized target picture block, the server inputs each binarized target picture block as input to the machine learning model, and obtains a verification code answer output by the machine learning model.
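Steps 301 to 303 can be chained into a single recognition routine, sketched below. The assumptions are the same as in the training discussion: equal-interval cutting, a fixed binarization threshold, and a picture given as a list of pixel rows. The parameter `classify_block` is a hypothetical callable standing in for the trained machine learning model, which returns one character per binarized block.

```python
def recognize(picture, n_chars, classify_block, threshold=128):
    """Cut, binarize and classify a verification code picture; return the answer string."""
    block_width = len(picture[0]) // n_chars
    answer = []
    for i in range(n_chars):
        # Step 301: cut out the i-th target picture block.
        block = [row[i * block_width:(i + 1) * block_width] for row in picture]
        # Step 302: binarize the block (1 = ink, 0 = background).
        binary = [[1 if pixel < threshold else 0 for pixel in row] for row in block]
        # Step 303: let the model classify the binarized block.
        answer.append(classify_block(binary))
    return "".join(answer)


def count_ink(binary_block):
    """Toy classifier used only for this demonstration: answers with the ink pixel count."""
    return str(sum(sum(row) for row in binary_block))


picture = [
    [0, 255, 0, 0, 255, 255],
    [255, 255, 0, 255, 255, 255],
]
print(recognize(picture, 3, count_ink))  # 130
```

The per-block answers are concatenated in left-to-right order to form the verification code answer that is submitted in step 104.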

104. Perform, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

After obtaining the verification code answer, the server may enter it at the verification code input position specified by the target website and then trigger the “OK” button on the target website to submit it for verification. The target website's back end checks the answer entered by the server: if the answer is correct, the target website feeds back to the server that verification has passed; otherwise, it feeds back that verification has failed.

105. After passing the verification of the target website, crawl data from the target website.

After the verification of the target website is passed, the server can crawl data from the target website through the crawler system.

Further, since the recognition rate of a machine learning model rarely reaches 100%, in actual use the output verification code answer will sometimes be incorrect, causing the target website's verification to fail. Therefore, after the verification operation is performed according to the output verification code answer, if the target website reports that verification failed, the target verification code picture provided by the target website may be refreshed, and the process returns to step 102 to re-acquire the refreshed target verification code picture and re-execute steps 103-105 in another attempt to pass verification. Further, if the number of refreshes of the verification code picture exceeds a preset threshold, for example more than 5 times, the current user of the server may be notified that this data crawl of the target website has failed.
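The refresh-and-retry behaviour can be sketched as a bounded loop. Everything website-related below is a stand-in: `FakeSite` is an in-memory object invented for this illustration (its "picture" is simply the expected answer string), and `recognize` is any callable that maps a picture to an answer. The limit of 5 refreshes follows the example in the text.

```python
class FakeSite:
    """In-memory stand-in for a target website; not a real HTTP client."""

    def __init__(self, answers):
        self._answers = list(answers)  # expected answer for each successive picture
        self._position = 0
        self.refresh_count = 0

    def _current(self):
        return self._answers[self._position % len(self._answers)]

    def captcha_picture(self):
        return self._current()  # the "picture" is just its answer in this fake

    def refresh(self):
        self._position += 1
        self.refresh_count += 1

    def verify(self, answer):
        return answer == self._current()

    def fetch_data(self):
        return "page data"


def crawl_with_refresh(site, recognize, max_refreshes=5):
    """Retry verification, refreshing the picture after each failure; None on giving up."""
    refreshes = 0
    while True:
        picture = site.captcha_picture()   # step 102
        answer = recognize(picture)        # step 103
        if site.verify(answer):            # step 104
            return site.fetch_data()       # step 105
        if refreshes >= max_refreshes:
            return None                    # caller would notify the user of the failure
        site.refresh()
        refreshes += 1


site = FakeSite(["ab12", "cd34", "ef56"])
misread_first_two = lambda picture: picture.upper() if picture[0] in "ac" else picture
print(crawl_with_refresh(site, misread_first_two), site.refresh_count)  # page data 2
```

Returning `None` rather than raising keeps the sketch short; a real crawler would more likely report the failure through whatever notification channel the server uses.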

Further, before step 103, a machine learning model corresponding to the target verification code picture may be selected from a pre-established model set, in which different machine learning models are pre-trained using verification code pictures of different categories as learning samples. The formats of the verification code characters used by different websites often differ greatly; for example, some verification code characters are set in the Kai typeface and others in the Song typeface. If the learning samples of one machine learning model contain verification code characters in many different formats, training becomes much harder and the recognition accuracy of the trained model is reduced. Therefore, in this embodiment, the verification code pictures used as learning samples can be divided into categories in advance, a machine learning model is trained separately for each category, and the trained models for all categories are collected into a model set. When crawling data from a website requires a verification code, the category to which the website's verification code picture belongs is determined first, the corresponding machine learning model is selected from the model set, and the website's verification code picture (that is, the target verification code picture) is input into the selected model for recognition to obtain the output verification code answer. This not only eases training and improves the recognition accuracy of the machine learning models, but also broadens the range of websites from which the server can crawl data.
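A model set of this kind can be represented as a simple mapping from category to trained model. The sketch below is illustrative only: the category keys `"site-a"` and `"site-b"` and the lambda "models" are placeholders, and `select_model` is a hypothetical helper name.

```python
def select_model(model_set, category):
    """Return the recognizer trained for this category of verification code picture."""
    try:
        return model_set[category]
    except KeyError:
        raise LookupError(f"no model was trained for category {category!r}") from None


# Hypothetical model set: each value stands in for a trained model that maps a
# verification code picture to an answer. Real entries would be trained classifiers.
model_set = {
    "site-a": lambda picture: "answer from model A",
    "site-b": lambda picture: "answer from model B",
}
model = select_model(model_set, "site-b")
print(model(b"picture bytes"))  # answer from model B
```

Failing loudly when no model exists for a category makes it obvious that a new category of verification code picture needs its own training run.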

Further, the classification of the learning samples used by the respective machine learning model pre-training may be predetermined by any one of the following three methods:

In the first manner, each verification code picture used as a learning sample is classified according to its source website, where one website corresponds to one category. In other words, a separate machine learning model is trained for each website. In actual use, the target websites from which data must be crawled are generally limited in number, for example only a few websites, so the first manner can meet the needs of practical applications without causing an excessive model training burden.
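Classification by source website can be sketched as grouping the learning samples by hostname. The sample format, a pair of source URL and picture, is an assumption made for this illustration, as is the function name `classify_by_website`.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def classify_by_website(samples):
    """Group (source_url, picture) learning samples into one category per hostname."""
    categories = defaultdict(list)
    for source_url, picture in samples:
        categories[urlsplit(source_url).hostname].append(picture)
    return dict(categories)


samples = [
    ("https://a.example/captcha/1.png", "picture 1"),
    ("https://b.example/code.png", "picture 2"),
    ("https://a.example/captcha/2.png", "picture 3"),
]
print(classify_by_website(samples))
# {'a.example': ['picture 1', 'picture 3'], 'b.example': ['picture 2']}
```

Each resulting group would then be used to train the model for that website.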

In the second manner, the verification code characters in each verification code picture used as a learning sample are extracted first; each verification code picture is then classified according to the type of the extracted characters, where one type corresponds to one category. The type of a character here refers to the form in which the character is written or rendered, such as Song, Kai (楷), cursive, or Roman; the fonts of symbols are included as well. For example, some websites use Song characters as verification codes and others use Kai characters. When a machine learning model is trained, mixing different character types (here, specifically fonts) increases the difficulty of training and reduces the accuracy of model recognition. Therefore, classifying the verification code pictures by the font of their characters makes the training of the machine learning model easier to complete and improves its recognition accuracy in use.

In the third manner, the spacing between the verification code characters in each verification code picture used as a learning sample is obtained first; each verification code picture is then classified according to the preset spacing interval to which its spacing belongs, where one spacing interval corresponds to one category. For example, on the verification code pictures used by some websites, two adjacent verification code characters are 3 pixels apart; on others they are 5 pixels apart; and on still others they are 0 pixels apart. For the training of a machine learning model, different character spacings affect not only the difficulty of model training and the later recognition accuracy, but also the size of the picture blocks produced by cutting. Therefore, classifying the verification code pictures used as learning samples by the spacing between verification code characters makes the training of the machine learning model easier to complete and improves its recognition accuracy in use.
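The application does not state how the spacing is measured. One simple possibility, sketched here as an assumption, is to look for runs of ink-free columns between characters in a binarized picture and then map the measured gap to a preset spacing interval. The interval boundaries `(2, 4)` and both function names are hypothetical.

```python
from bisect import bisect_right


def character_gaps(binary_picture):
    """Widths of the runs of ink-free columns that separate inked regions."""
    width = len(binary_picture[0])
    has_ink = [any(row[x] for row in binary_picture) for x in range(width)]
    gaps, run, seen_ink = [], 0, False
    for ink in has_ink:
        if ink:
            if seen_ink and run:
                gaps.append(run)  # a blank run enclosed by ink on both sides
            seen_ink, run = True, 0
        else:
            run += 1
    return gaps


def spacing_category(gap, boundaries=(2, 4)):
    """Index of the preset spacing interval containing gap.

    With boundaries (2, 4): 0 for gap < 2, 1 for 2 <= gap < 4, 2 for gap >= 4.
    """
    return bisect_right(boundaries, gap)


picture = [
    [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
gaps = character_gaps(picture)
print(gaps, [spacing_category(gap) for gap in gaps])  # [3, 3] [1, 1]
```

This column-projection approach cannot see a spacing of 0 pixels, because touching characters leave no blank column, and it would miscount characters whose strokes contain blank columns; such pictures would need a different measurement.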

It should be noted that the above three classification manners may be used alone or in combination, which is not limited in this embodiment.

In this embodiment, an access request is first initiated to the target website whose data is to be crawled. After feedback information indicating that the target website requires input of a verification code is received, the target verification code picture on the target website that corresponds to the feedback information is acquired. The target verification code picture is then input into a pre-trained machine learning model for recognition, and the verification code answer output by the machine learning model is obtained. Next, the verification operation for which the target website requires input of a verification code is performed according to the output verification code answer. Finally, after the verification of the target website is passed, data is crawled from the target website. In this way, when the data of a target website is being crawled and the target website requires input of a verification code, the target verification code picture can be recognized by the machine learning model to obtain the verification code answer, and the verification of the target website is completed automatically according to that answer. This overcomes the website's obstacle to data crawling, so that the crawler system can successfully crawl the data on the website.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way.

The above mainly describes a method of crawling website data, and a device for crawling website data will be described in detail below.

FIG. 4 is a structural diagram showing an embodiment of an apparatus for crawling website data in an embodiment of the present application.

In this embodiment, an apparatus for crawling website data includes:

A request initiating module 401, configured to initiate an access request to a target website whose data is to be crawled;

A target image obtaining module 402, configured to acquire, after receiving feedback information indicating that the target website requires input of a verification code, the target verification code picture on the target website that corresponds to the feedback information;

A verification code identification module 403, configured to input the target verification code picture into a pre-trained machine learning model for recognition and obtain the verification code answer output by the machine learning model;

A verification operation module 404, configured to perform, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

A crawl data module 405, configured to crawl data from the target website after the verification of the target website is passed.
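One way to picture how modules 401 to 405 fit together is as five replaceable components of a single object. The sketch below is a structural illustration only: the class `CrawlerDevice`, its field names, and the callable signatures are assumptions, and the lambdas in the usage example stand in for real network and model code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CrawlerDevice:
    initiate_request: Callable[[str], str]                    # module 401
    obtain_target_picture: Callable[[str], Optional[bytes]]   # module 402
    identify_code: Callable[[bytes], str]                     # module 403
    perform_verification: Callable[[str, str], bool]          # module 404
    crawl_data: Callable[[str], str]                          # module 405

    def run(self, url):
        """Run the modules in order; return crawled data, or None if verification fails."""
        response = self.initiate_request(url)
        picture = self.obtain_target_picture(response)
        if picture is not None:  # the website asked for a verification code
            answer = self.identify_code(picture)
            if not self.perform_verification(url, answer):
                return None
        return self.crawl_data(url)


device = CrawlerDevice(
    initiate_request=lambda url: "verification code required",
    obtain_target_picture=lambda response: b"picture" if "required" in response else None,
    identify_code=lambda picture: "ab12",
    perform_verification=lambda url, answer: answer == "ab12",
    crawl_data=lambda url: "data from " + url,
)
print(device.run("https://example.com"))  # data from https://example.com
```

Passing the modules in as callables keeps each one independently replaceable, for example to swap in a different recognition model without touching the rest of the device.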

Further, the machine learning model can be pre-trained by the following modules:

A picture acquisition module, configured to obtain multiple verification code pictures;

A picture block cutting module, configured to cut each verification code picture into picture blocks that each contain a single verification code character;

A picture block binarization module, configured to perform binarization processing on each of the picture blocks;

An answer marking module, configured to mark the corresponding verification code answer for each binarized picture block;

A training module, configured to input each binarized picture block into a machine learning model and obtain the training answers output by the machine learning model;

A parameter adjustment module, configured to adjust the model parameters of the machine learning model to minimize the error between the obtained training answers and the marked verification code answers;

A training completion module, configured to determine that training of the machine learning model is complete if the error rate between the output training answers and the marked verification code answers is less than a preset threshold.

Further, the verification code identification module may include:

a cutting unit, configured to cut the target verification code picture into target picture blocks each including an independent verification code;

a binarization unit, configured to perform binarization processing on each of the target picture blocks;

The input model unit is configured to input each binarized target picture block as input to the machine learning model, and obtain a verification code answer output by the machine learning model.

Further, the device for crawling website data may further include:

A model selection module, configured to select, from a pre-established model set, a machine learning model corresponding to the target verification code picture, where different machine learning models in the model set are pre-trained using verification code pictures of different categories as learning samples.

Further, the classification of the learning samples used in the pre-training of the respective machine learning models is predetermined by the following modules:

A first categorization module, configured to classify each verification code picture used as a learning sample according to its source website, where one website corresponds to one category;

or

A character extraction module, configured to extract the verification code characters in each verification code picture used as a learning sample;

A second categorization module, configured to classify each of the verification code pictures according to the type of the extracted characters, where one type corresponds to one category;

or

A spacing acquisition module, configured to obtain the spacing between the verification code characters in each verification code picture used as a learning sample;

A third categorization module, configured to classify each verification code picture according to the preset spacing interval to which its spacing belongs, where one spacing interval corresponds to one category.

Further, the device for crawling website data may further include:

A picture refreshing module, configured to refresh the target verification code picture provided by the target website if the target website reports that verification failed after the verification operation, and to trigger the target image obtaining module 402 again.

FIG. 5 is a schematic diagram of a server according to an embodiment of the present application. As shown in FIG. 5, the server 5 of this embodiment includes a processor 50, a memory 51, and computer readable instructions 52 stored in the memory 51 and executable on the processor 50, for example a program that performs the above method of crawling website data. When the processor 50 executes the computer readable instructions 52, the steps in the above embodiments of the method for crawling website data are implemented, such as steps 101 to 105 shown in FIG. 1. Alternatively, when the processor 50 executes the computer readable instructions 52, the functions of the modules/units in the above apparatus embodiments are implemented, such as the functions of the modules 401 to 405 shown in FIG. 4.

Illustratively, the computer readable instructions 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to carry out the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing particular functions, the instruction segments being used to describe the execution of the computer readable instructions 52 in the server 5.

The server 5 can be a computing device such as a local server or a cloud server. The server may include, but is not limited to, the processor 50 and the memory 51. It will be understood by those skilled in the art that FIG. 5 is merely an example of the server 5 and does not constitute a limitation on it; the server may include more or fewer components than those illustrated, combine some components, or use different components. For example, the server may also include input and output devices, a network access device, a bus, and the like.

The functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium. The software product includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only used to explain the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the embodiments can still be modified, and some of their technical features can be replaced by equivalents. Such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for crawling website data, comprising:

Initiating an access request to a target website whose data is to be crawled;

After receiving feedback information indicating that the target website requires input of a verification code, acquiring the target verification code picture on the target website corresponding to the feedback information;

Inputting the target verification code picture into a pre-trained machine learning model for recognition, and obtaining the verification code answer output by the machine learning model;

Performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

After verification by the target website is passed, crawling data from the target website.
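
The steps of this claim can be read as a single control flow. The following is a minimal sketch of that flow, not the applicant's implementation: the `Site` interface, its method names, the `FakeSite` stand-in and the `recognize` callable are all hypothetical and stand for an HTTP client and a pre-trained model that the claim does not specify.

```python
from typing import Callable, Optional, Protocol


class Site(Protocol):
    """Hypothetical interface to a target website (not defined in the source)."""

    def request_access(self) -> bool: ...   # True if a verification code is required
    def captcha_image(self) -> bytes: ...   # current verification code picture
    def submit_answer(self, answer: str) -> bool: ...  # True if verification passed
    def fetch_data(self) -> str: ...        # page data once access is granted


def crawl(site: Site, recognize: Callable[[bytes], str]) -> Optional[str]:
    """Run the claimed steps once; return crawled data, or None if verification fails."""
    needs_code = site.request_access()      # initiate the access request
    if needs_code:                          # feedback: a verification code is required
        picture = site.captcha_image()      # acquire the target verification code picture
        answer = recognize(picture)         # model outputs the verification code answer
        if not site.submit_answer(answer):  # perform the verification operation
            return None
    return site.fetch_data()                # crawl data after verification is passed


class FakeSite:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, code: str) -> None:
        self.code = code

    def request_access(self) -> bool:
        return True

    def captcha_image(self) -> bytes:
        return self.code.encode()

    def submit_answer(self, answer: str) -> bool:
        return answer == self.code

    def fetch_data(self) -> str:
        return "<html>data</html>"
```

Passing the recognizer in as a callable keeps the flow independent of whichever model is used; in this sketch a failed verification simply ends the attempt.
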
The method for crawling website data according to claim 1, wherein the machine learning model is pre-trained by the following steps:

Obtaining a plurality of verification code pictures;

For each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;

Performing binarization processing on each of the picture blocks;

Marking each binarized picture block with its corresponding verification code answer;

Inputting each binarized picture block into a machine learning model, and obtaining a training answer output by the machine learning model;

Adjusting the model parameters of the machine learning model, with the respective training answers as adjustment targets, to minimize the error between the obtained training answers and the respective marked verification code answers;

If the error rate between the output training answers and the respective marked verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
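
The last three steps of this claim (obtain training answers, adjust parameters, stop below a preset error rate) can be illustrated with a small training loop. This is a sketch under stated assumptions: the claim does not name a model type, so a multi-class perceptron is swapped in purely for illustration, and the names `predict`, `train`, `threshold` and `max_epochs` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Block = Sequence[int]  # flattened binarized picture block, each pixel 0 or 1


def predict(weights: Dict[str, List[float]], block: Block) -> str:
    """Return the label whose weight vector scores the block highest."""
    x = list(block) + [1]  # constant bias input
    return max(weights, key=lambda label: sum(w * v for w, v in zip(weights[label], x)))


def train(samples: List[Tuple[Block, str]], threshold: float = 0.05,
          max_epochs: int = 100) -> Dict[str, List[float]]:
    """Adjust parameters until the error rate on the marked blocks is below threshold."""
    size = len(samples[0][0]) + 1  # +1 for the bias weight
    weights = {label: [0.0] * size for _, label in samples}
    for _ in range(max_epochs):  # assumption: cap the number of passes
        errors = 0
        for block, answer in samples:
            guess = predict(weights, block)  # training answer output by the model
            if guess != answer:              # compare with the marked answer
                errors += 1
                for i, v in enumerate(list(block) + [1]):
                    weights[answer][i] += v  # pull the correct label towards the block
                    weights[guess][i] -= v   # push the wrong label away from it
        if errors / len(samples) < threshold:  # error rate below the preset threshold
            break
    return weights


# Three hypothetical 3x3 binarized blocks, each marked with its answer.
SAMPLES = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "I"),
    ([1, 0, 0, 1, 0, 0, 1, 1, 1], "L"),
    ([1, 1, 1, 0, 1, 0, 0, 1, 0], "T"),
]
MODEL = train(SAMPLES)
```

The error rate here is measured on the training samples themselves, which matches the wording of the claim; a held-out set would be a separate design choice.
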
The method for crawling website data according to claim 2, wherein inputting the target verification code picture into the pre-trained machine learning model for recognition and obtaining the verification code answer output by the machine learning model comprises:

Cutting the target verification code picture into target picture blocks each containing an independent verification code;

Performing binarization processing on each of the target picture blocks;

Inputting each binarized target picture block into the machine learning model, and obtaining a verification code answer output by the machine learning model.
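
Cutting and binarization can be sketched without any imaging library by treating a picture as rows of grayscale values. The sketch below makes assumptions the claim does not state: characters are taken to have equal width, the number of characters is known, and a fixed threshold of 128 separates ink from background. The function names and the `model` callable are hypothetical.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of grayscale pixels, 0 (black) to 255 (white)


def cut_blocks(image: Image, count: int) -> List[Image]:
    """Cut the picture into `count` equal-width blocks (assumed fixed-width characters)."""
    width = len(image[0]) // count
    return [[row[i * width:(i + 1) * width] for row in image] for i in range(count)]


def binarize(block: Image, threshold: int = 128) -> Image:
    """Map each pixel to 1 (darker than threshold) or 0 (background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in block]


def recognize(image: Image, count: int, model: Callable[[Image], str]) -> str:
    """Cut, binarize, and feed each block to the model; join the per-block answers."""
    blocks = [binarize(block) for block in cut_blocks(image, count)]
    return "".join(model(block) for block in blocks)


# A hypothetical 2x4 picture holding two 2x2 character blocks.
PICTURE = [
    [0, 255, 200, 10],
    [30, 220, 90, 250],
]
```

Real verification code pictures rarely have uniform character widths, so a projection-based or connected-component split would likely replace `cut_blocks` in practice.
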
The method for crawling website data according to claim 1, wherein before inputting the target verification code picture into the pre-trained machine learning model for recognition and obtaining the verification code answer output by the machine learning model, the method further comprises:

Selecting, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein different machine learning models in the model set are pre-trained using verification code pictures under different classifications as learning samples.
The method for crawling website data according to claim 4, wherein the classifications of the learning samples used in pre-training the respective machine learning models are predetermined by the following steps:

Classifying the verification code pictures serving as learning samples according to their respective source websites, wherein one website corresponds to one classification;

or

Extracting the characters of the verification code in each verification code picture serving as a learning sample;

Classifying each verification code picture according to the type to which the extracted characters belong, wherein one type corresponds to one classification;

or

Obtaining the spacing between the verification code characters in each verification code picture serving as a learning sample;

Classifying each verification code picture according to the preset spacing interval to which its spacing belongs, wherein one spacing interval corresponds to one classification.
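
The three alternative classification schemes, and the model selection they feed, can be sketched as small key functions. Everything concrete here is an assumption made for illustration: the character types ("digits", "letters", "mixed"), the use of the mean gap as the spacing of a picture, the interval bounds, and all function names.

```python
from bisect import bisect_right
from typing import Callable, Dict, Hashable, Sequence, Tuple


def classify_by_site(host: str) -> str:
    """One website corresponds to one classification."""
    return host.lower()


def classify_by_char_type(answer: str) -> str:
    """Classify by the type of the extracted characters (hypothetical types)."""
    if answer.isdigit():
        return "digits"
    if answer.isalpha():
        return "letters"
    return "mixed"


def mean_spacing(boxes: Sequence[Tuple[int, int]]) -> float:
    """Average gap between consecutive character boxes given as (left, right)."""
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(boxes, boxes[1:])]
    return sum(gaps) / len(gaps)


def classify_by_spacing(spacing: float, bounds: Sequence[float]) -> int:
    """Index of the preset spacing interval; `bounds` must be sorted ascending."""
    return bisect_right(bounds, spacing)


def select_model(models: Dict[Hashable, Callable], category: Hashable) -> Callable:
    """Pick the model pre-trained on samples of the given classification."""
    return models[category]
```

With bounds `[2, 5]`, a spacing below 2 falls in interval 0, a spacing from 2 up to but not including 5 in interval 1, and anything larger in interval 2. The same key function has to be applied to the target picture at recognition time so that the matching model is selected.
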
The method for crawling website data according to any one of claims 1 to 5, wherein after performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code, the method further comprises:

If verification by the target website fails after the verification operation, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website corresponding to the feedback information.
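
The refresh-and-retry behaviour of this claim amounts to a loop around the acquire, recognize and submit steps. The sketch below is illustrative only: the claim sets no limit on the number of retries, so `max_attempts` is an added assumption, and `RefreshingCaptcha` is a hypothetical in-memory website used to exercise the loop.

```python
from typing import Callable, List


class RefreshingCaptcha:
    """Hypothetical website that serves a new verification code after each refresh."""

    def __init__(self, codes: List[str]) -> None:
        self.codes = codes
        self.index = 0

    def picture(self) -> str:
        return self.codes[self.index]

    def refresh(self) -> None:
        self.index = (self.index + 1) % len(self.codes)

    def submit(self, answer: str) -> bool:
        return answer == self.codes[self.index]


def verify_with_retry(site: RefreshingCaptcha, recognize: Callable[[str], str],
                      max_attempts: int = 5) -> bool:
    """Retry verification, refreshing the picture after every failed attempt."""
    for _ in range(max_attempts):  # assumption: bound the number of retries
        picture = site.picture()   # acquire the target verification code picture
        if site.submit(recognize(picture)):
            return True            # verification passed; crawling can proceed
        site.refresh()             # failed: refresh and return to the acquire step
    return False
```

Bounding the loop avoids hammering a website whose verification codes the model cannot read; an unbounded loop would follow the claim wording more literally.
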
A computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the following steps:

Initiating an access request to a target website whose data is to be crawled;

After receiving feedback information indicating that the target website requires input of a verification code, acquiring the target verification code picture on the target website corresponding to the feedback information;

Inputting the target verification code picture into a pre-trained machine learning model for recognition, and obtaining the verification code answer output by the machine learning model;

Performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

After verification by the target website is passed, crawling data from the target website.
The computer readable storage medium of claim 7, wherein the machine learning model is pre-trained by the following steps:

Obtaining a plurality of verification code pictures;

For each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;

Performing binarization processing on each of the picture blocks;

Marking each binarized picture block with its corresponding verification code answer;

Inputting each binarized picture block into a machine learning model, and obtaining a training answer output by the machine learning model;

Adjusting the model parameters of the machine learning model, with the respective training answers as adjustment targets, to minimize the error between the obtained training answers and the respective marked verification code answers;

If the error rate between the output training answers and the respective marked verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
The computer readable storage medium according to claim 8, wherein inputting the target verification code picture into the pre-trained machine learning model for recognition and obtaining the verification code answer output by the machine learning model comprises:

Cutting the target verification code picture into target picture blocks each containing an independent verification code;

Performing binarization processing on each of the target picture blocks;

Inputting each binarized target picture block into the machine learning model, and obtaining a verification code answer output by the machine learning model.
The computer readable storage medium according to claim 7, wherein before inputting the target verification code picture into the pre-trained machine learning model for recognition and obtaining the verification code answer output by the machine learning model, the steps further comprise:

Selecting, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein different machine learning models in the model set are pre-trained using verification code pictures under different classifications as learning samples.
The computer readable storage medium according to claim 10, wherein the classifications of the learning samples used in pre-training the respective machine learning models are predetermined by the following steps:

Classifying the verification code pictures serving as learning samples according to their respective source websites, wherein one website corresponds to one classification;

or

Extracting the characters of the verification code in each verification code picture serving as a learning sample;

Classifying each verification code picture according to the type to which the extracted characters belong, wherein one type corresponds to one classification;

or

Obtaining the spacing between the verification code characters in each verification code picture serving as a learning sample;

Classifying each verification code picture according to the preset spacing interval to which its spacing belongs, wherein one spacing interval corresponds to one classification.
The computer readable storage medium according to any one of claims 7 to 11, wherein after performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code, the steps further comprise:

If verification by the target website fails after the verification operation, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website corresponding to the feedback information.
A server comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:

Initiating an access request to a target website whose data is to be crawled;

After receiving feedback information indicating that the target website requires input of a verification code, acquiring the target verification code picture on the target website corresponding to the feedback information;

Inputting the target verification code picture into a pre-trained machine learning model for recognition, and obtaining the verification code answer output by the machine learning model;

Performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

After verification by the target website is passed, crawling data from the target website.
The server according to claim 13, wherein said machine learning model is pre-trained by the following steps:

Obtaining a plurality of verification code pictures;

For each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;

Performing binarization processing on each of the picture blocks;

Marking each binarized picture block with its corresponding verification code answer;

Inputting each binarized picture block into a machine learning model, and obtaining a training answer output by the machine learning model;

Adjusting the model parameters of the machine learning model, with the respective training answers as adjustment targets, to minimize the error between the obtained training answers and the respective marked verification code answers;

If the error rate between the output training answers and the respective marked verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
The server according to any one of claims 13 to 14, wherein after performing, according to the output verification code answer, the verification operation for which the target website requires input of a verification code, the steps further comprise:

If verification by the target website fails after the verification operation, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website corresponding to the feedback information.
A device for crawling website data, comprising:

a request initiation module, configured to initiate an access request to a target website whose data is to be crawled;

a target image obtaining module, configured to acquire, after receiving feedback information indicating that the target website requires input of a verification code, the target verification code picture on the target website corresponding to the feedback information;

a verification code identification module, configured to input the target verification code picture into a pre-trained machine learning model for recognition and obtain a verification code answer output by the machine learning model;

a verification operation module, configured to perform, according to the output verification code answer, the verification operation for which the target website requires input of a verification code;

a data crawling module, configured to crawl data from the target website after verification by the target website is passed.
The device for crawling website data according to claim 16, further comprising:

a picture acquisition module, configured to obtain a plurality of verification code pictures;

a picture block cutting module, configured to cut each verification code picture into picture blocks each containing an independent verification code;

a picture block binarization module, configured to perform binarization processing on each of the picture blocks;

an answer tag module, configured to mark each binarized picture block with its corresponding verification code answer;

a training module, configured to input each binarized picture block into a machine learning model and obtain a training answer output by the machine learning model;

a parameter adjustment module, configured to adjust the model parameters of the machine learning model, with the respective training answers as adjustment targets, to minimize the error between the obtained training answers and the respective marked verification code answers;

a training completion module, configured to determine that training of the machine learning model is complete if the error rate between the output training answers and the respective marked verification code answers is less than a preset threshold.
The device for crawling website data according to claim 16, wherein the verification code identification module comprises:

a cutting unit, configured to cut the target verification code picture into target picture blocks each including an independent verification code;

a binarization unit, configured to perform binarization processing on each of the target picture blocks;

an input model unit, configured to input each binarized target picture block into the machine learning model and obtain a verification code answer output by the machine learning model.
The device for crawling website data according to claim 16, further comprising:

a model selection module, configured to select, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein different machine learning models in the model set are pre-trained using verification code pictures under different classifications as learning samples.
The device for crawling website data according to claim 19, further comprising:

a first categorization module, configured to classify each verification code picture serving as a learning sample according to its source website, wherein one website corresponds to one classification;

or

a character extraction module, configured to extract the characters of the verification code in each verification code picture serving as a learning sample;

a second categorization module, configured to classify each verification code picture according to the type to which the extracted characters belong, wherein one type corresponds to one classification;

or

a spacing acquisition module, configured to obtain the spacing between the verification code characters in each verification code picture serving as a learning sample;

a third categorization module, configured to classify each verification code picture according to the preset spacing interval to which its spacing belongs, wherein one spacing interval corresponds to one classification.