WO2019136960A1 - Method and device for crawling website data, storage medium and server - Google Patents

Info

Publication number
WO2019136960A1
Authority
WO
WIPO (PCT)
Prior art keywords
verification code
target
picture
machine learning
learning model
Prior art date
Application number
PCT/CN2018/097499
Other languages
French (fr)
Chinese (zh)
Inventor
李晨光
王盼
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2019136960A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a method for crawling website data, a computer readable storage medium, a server, and a device.
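  • the embodiments of the present application provide a method for crawling website data, a computer readable storage medium, a server and a device, which can automatically complete the verification required by a target website and overcome the website's obstacles to data crawling, so that the crawler system can crawl the data on the website smoothly.
  • a method of crawling website data, including:
  • initiating an access request to the target website whose data is to be crawled;
  • after receiving feedback information that the target website requires input of a verification code, acquiring the target verification code picture corresponding to the feedback information on the target website;
  • inputting the target verification code picture into the pre-trained machine learning model for identification, and obtaining the verification code answer output by the machine learning model;
  • performing, according to the output verification code answer, the verification operation in which the target website requires input of a verification code;
  • after passing the verification of the target website, crawling data from the target website.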
  • a computer readable storage medium is provided, the computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the above method of crawling website data.
  • a server comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the above method of crawling website data when executing the computer readable instructions.
  • an apparatus for crawling website data which can include means for implementing the steps of the method of crawling website data described above.
  • the target verification code picture can be identified by the machine learning model to obtain the verification code answer, and the verification of the target website is completed automatically according to that answer, overcoming the website's obstacles to data crawling so that the crawler system can crawl the data on the website smoothly.
  • FIG. 1 is a flow chart of an embodiment of a method for crawling website data in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of pre-training a machine learning model in an application scenario according to a method for crawling website data in an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a step 103 of a method for crawling website data in an application scenario according to an embodiment of the present application;
  • FIG. 4 is a structural diagram of an embodiment of an apparatus for crawling website data in an embodiment of the present application
  • FIG. 5 is a schematic diagram of a server according to an embodiment of the present application.
  • the embodiments of the present application provide a method for crawling website data, a computer readable storage medium, a server, and a device, which are used to solve the problem that many websites block crawler systems by requiring a verification code to be entered, so that the crawler system cannot crawl data.
  • an embodiment of a method for crawling website data in an embodiment of the present application includes:
  • the executor of the embodiment may be a terminal device or a server.
  • the execution subject in this embodiment is a server.
  • after receiving feedback information indicating that the target website requires input of a verification code, the server obtains the target verification code picture corresponding to the feedback information on the target website.
  • the server may input the target verification code picture into a pre-trained machine learning model for identification.
  • the machine learning model may specifically be a deep learning model or a support vector machine (SVM) model. Because the machine learning model has been pre-trained on a large number of learning samples, inputting the target verification code picture yields an output verification code answer, and this answer is the verification code that the target website requires to be entered.
  • the machine learning model can be pre-trained by the following steps:
  • for each verification code picture, the picture is cut into picture blocks, each containing a single verification code character;
  • for step 201, before the machine learning model is trained, a large number of learning samples, that is, verification code pictures, must be collected.
  • These verification code pictures can be obtained from websites or collected manually, which is not limited here.
  • since a verification code picture generally contains several characters, each verification code picture may be cut, in order to improve the learning efficiency and recognition accuracy of the machine learning model, so that a picture containing several verification code characters becomes picture blocks that each contain a single verification code character.
  • a common cutting mode may be set for verification code pictures with similar character spacing, chosen according to the size of the picture and the position of each character on it. For example, if five verification code characters are equally spaced on a 100*20 picture, the picture can be cut at equal intervals into five 20*20 picture blocks, and the same cutting mode can be applied to all verification code pictures whose characters are equally spaced.
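The equal-interval cut in the 100*20 example can be sketched as follows. This is a minimal illustration only: the picture is assumed to be a list of pixel rows, and the name `cut_equal_blocks` is hypothetical rather than taken from the application.

```python
from typing import List

# Assumed representation: a picture is a list of rows of grayscale values.
Image = List[List[int]]


def cut_equal_blocks(image: Image, n_chars: int) -> List[Image]:
    """Cut a picture into n_chars equal-width blocks, left to right."""
    height = len(image)
    width = len(image[0]) if height else 0
    if n_chars <= 0 or width == 0 or width % n_chars != 0:
        raise ValueError("width must be a positive multiple of n_chars")
    block_width = width // n_chars
    return [
        [row[i * block_width:(i + 1) * block_width] for row in image]
        for i in range(n_chars)
    ]


# A blank picture 100 pixels wide and 20 pixels high, holding five characters.
picture = [[255] * 100 for _ in range(20)]
blocks = cut_equal_blocks(picture, 5)
```

Pictures whose characters are not equally spaced would need a different cutting mode, which this sketch does not attempt.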
  • the picture blocks need to be binarized to obtain the binarized picture blocks.
  • each of the binarized picture blocks is marked with a corresponding verification code answer, and the binarized picture blocks and the corresponding verification code answers are used as learning samples of the machine learning model.
  • the binarized picture blocks and the corresponding verification code answers are used as the input and the target output, respectively, to train the machine learning model.
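Binarization and labelling might look like the sketch below. The fixed threshold of 128, the convention that dark pixels map to 1, and the names `binarize` and `make_samples` are all assumptions made for illustration; the application does not specify them.

```python
from typing import List, Tuple

Block = List[List[int]]  # assumed: rows of grayscale values in 0..255


def binarize(block: Block, threshold: int = 128) -> Block:
    """Map each pixel to 1 (darker than threshold) or 0 (otherwise)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in block]


def make_samples(blocks: List[Block], answers: str) -> List[Tuple[List[int], str]]:
    """Pair each binarized block, flattened to a feature vector, with its label."""
    if len(blocks) != len(answers):
        raise ValueError("one answer character is needed per block")
    samples = []
    for block, label in zip(blocks, answers):
        features = [pixel for row in binarize(block) for pixel in row]
        samples.append((features, label))
    return samples


# Two tiny 2x2 blocks marked with the answers "7" and "k".
samples = make_samples([[[0, 200], [255, 30]], [[90, 90], [250, 250]]], "7k")
```

Each resulting pair is one learning sample: the feature vector is the model input and the marked character is the expected output.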
  • the machine learning model in this embodiment may be a deep learning model or a support vector machine (SVM) model.
  • the present embodiment is described by taking an SVM model as an example. The SVM model is a supervised classification model. Its prediction accuracy can be improved by adjusting the parameters of the model, for example by choosing among kernel functions such as RBF, poly, sigmoid and linear.
  • the model parameters of the machine learning model are adjusted, with each training answer as the target, to minimize the error between each training answer obtained and the corresponding marked verification code answer. Training of the SVM model can be considered complete once its error rate on the learning samples is less than a preset threshold, such as 10%, or equivalently once its recognition accuracy on the learning samples exceeds a preset threshold, such as 90%.
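The completion criterion can be expressed as a small check. This is a sketch under the assumption that training answers and marked answers are compared one character at a time; the function names are hypothetical.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of training answers that differ from the marked answers."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return wrong / len(expected)


def training_complete(
    predicted: Sequence[str], expected: Sequence[str], max_error: float = 0.10
) -> bool:
    """True when the error rate is below the preset threshold (10% here)."""
    return error_rate(predicted, expected) < max_error


# One wrong answer out of twenty gives an error rate of 0.05.
done = training_complete(list("a" * 20), list("a" * 19 + "b"))
```

An error rate below 10% and an accuracy above 90% describe the same condition, so only one of the two needs to be tested.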
  • the kernel function selected in this embodiment is RBF, and the RBF kernel function has two parameters: a penalty factor C and a kernel parameter γ. The goal is therefore to find the optimal parameter pair (C, γ) that gives the SVM model the best recognition performance.
  • the problem of parameter adjustment thus comes down to selecting an optimal parameter pair (C, γ) within a small "good area". Different values of C and γ yield different SVM models, and the purpose is to find the combination of parameters that gives the SVM model the best performance, that is, the lowest recognition error rate.
  • a plurality of (C, γ) values may be selected and each used to train on the same learning samples, and finally the (C, γ) value corresponding to the best performing SVM model is selected.
  • the (C, γ) value selected in this way is used as the final parameter value for subsequent training.
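The search over candidate values of the penalty factor and the kernel parameter is essentially a grid search. The sketch below shows only the selection logic; `evaluate` is a hypothetical callable that would train an RBF-kernel SVM on the same learning samples and return its recognition error rate, and `toy_evaluate` is a stand-in so the example runs without an SVM library.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    c_values: Iterable[float],
    gamma_values: Iterable[float],
    evaluate: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Return (C, gamma, error) for the pair with the lowest error rate."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        error = evaluate(c, gamma)
        if best is None or error < best[2]:
            best = (c, gamma, error)
    if best is None:
        raise ValueError("at least one (C, gamma) pair is required")
    return best


# Stand-in scoring function, NOT a real SVM: lowest at C=10, gamma=0.01.
def toy_evaluate(c: float, gamma: float) -> float:
    return abs(c - 10.0) / 100.0 + abs(gamma - 0.01)


best_c, best_gamma, best_error = grid_search(
    [0.1, 1.0, 10.0, 100.0], [0.001, 0.01, 0.1], toy_evaluate
)
```

In practice the error rate would more safely be measured on held-out samples rather than on the training samples themselves, although the text does not say which is used.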
  • the foregoing step 103 may specifically include:
  • the principle is similar to step 202 above. Since a verification code picture generally contains several characters and the machine learning model operates on picture blocks, the target verification code picture must be cut before it is input to the machine learning model for recognition: a picture containing several verification code characters is cut into target picture blocks, each containing a single verification code character.
  • for step 302, the principle is similar to step 203 above, and details are not described here again.
  • for step 303, after obtaining the binarized target picture blocks, the server inputs each binarized target picture block to the machine learning model and obtains the verification code answer output by the machine learning model.
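Steps 301 to 303 can be chained as in the sketch below. It assumes equally spaced characters, a fixed binarization threshold, and a hypothetical `classify_block` callable standing in for the trained model; the per-block outputs are joined to form the verification code answer.

```python
from typing import Callable, List

Image = List[List[int]]  # assumed: rows of grayscale values in 0..255


def recognize_captcha(
    image: Image,
    n_chars: int,
    classify_block: Callable[[Image], str],
    threshold: int = 128,
) -> str:
    """Cut, binarize and classify each block, then join the characters."""
    block_width = len(image[0]) // n_chars
    answer = []
    for i in range(n_chars):
        # Step 301: cut out the i-th target picture block.
        block = [row[i * block_width:(i + 1) * block_width] for row in image]
        # Step 302: binarize the block (dark pixels become 1).
        binary = [[1 if p < threshold else 0 for p in row] for row in block]
        # Step 303: the model outputs one character per block.
        answer.append(classify_block(binary))
    return "".join(answer)


# Toy classifier: "1" if the block contains any dark pixel, otherwise "0".
def toy_classifier(block: Image) -> str:
    return "1" if any(any(row) for row in block) else "0"


result = recognize_captcha(
    [[0, 255, 255, 255], [255, 255, 255, 255]], 2, toy_classifier
)
```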
  • the server may enter the verification code answer in the input field that the target website designates for the verification code, and then trigger the “OK” button on the target website to perform verification.
  • the back end of the target website checks the verification code answer entered by the server. If the answer is correct, the target website feeds back to the server the information that verification has passed; otherwise, the target website feeds back to the server the information that verification has failed.
  • the server can crawl data from the target website through the crawler system.
  • because the recognition rate of a machine learning model rarely reaches 100%, some output verification code answers will be wrong in actual use, causing verification by the target website to fail. Therefore, if the target website reports that verification failed after the verification operation was performed with the output answer, the target verification code picture provided by the target website may be refreshed, and the method returns to step 102 to acquire the refreshed target verification code picture and re-execute steps 103 to 105, attempting verification again. Further, if the number of refreshes exceeds a preset threshold, for example more than 5 times, the current user of the server may be notified that data crawling for the target website has failed this time.
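The refresh-and-retry behaviour can be sketched as a bounded loop. The three callables are hypothetical placeholders for acquiring the picture, running the model and submitting the answer; the sketch assumes that each call to `fetch_picture` after a failure returns a refreshed picture.

```python
from typing import Callable


def verify_with_retries(
    fetch_picture: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    submit_answer: Callable[[str], bool],
    max_refreshes: int = 5,
) -> bool:
    """Return True once an answer is accepted, False when refreshes run out."""
    refreshes = 0
    while True:
        picture = fetch_picture()      # step 102: acquire the (refreshed) picture
        answer = recognize(picture)    # step 103: verification code answer
        if submit_answer(answer):      # step 104: verification operation
            return True
        if refreshes >= max_refreshes:
            return False               # caller would notify the current user
        refreshes += 1


# Fake site that only accepts the answer derived from the third picture.
calls = {"n": 0}


def fake_fetch() -> bytes:
    calls["n"] += 1
    return b"img%d" % calls["n"]


passed = verify_with_retries(fake_fetch, lambda p: p.decode(), lambda a: a == "img3")
```

With the default of 5 refreshes, the loop makes at most six attempts before giving up.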
  • a machine learning model corresponding to the target verification code picture may be selected from a pre-established model set, where the different machine learning models in the set are each pre-trained with verification code pictures of a different category as learning samples. The format of the verification code characters used by different websites often differs considerably; for example, some verification codes are set in one typeface and others in Song typeface. If the learning samples of a machine learning model contain verification code characters in many different formats, the model becomes much harder to train and its recognition accuracy after training is reduced.
  • the verification code pictures used as learning samples can therefore be divided into categories in advance, a machine learning model is trained separately for each category, and the trained models for the respective categories are collected into a model set. When crawling data on a website requires a verification code, the category of that website's verification code picture is determined first, the corresponding machine learning model is then selected from the model set, and the website's verification code picture (that is, the target verification code picture) is input into the selected model for identification to obtain the output verification code answer. This both makes the machine learning models easier to train and more accurate, and widens the range of websites from which the server can crawl data.
  • classification of the learning samples used by the respective machine learning model pre-training may be predetermined by any one of the following three methods:
  • in the first method, each verification code picture used as a learning sample is classified according to the website it came from, where one website corresponds to one category. When training, a machine learning model can then be trained separately for each website. Because the target websites whose data needs to be crawled are generally limited in practice, for example only a few websites, the first method can meet practical needs without imposing an excessive training burden.
  • in the second method, the characters of the verification code in each verification code picture used as a learning sample are extracted first; the verification code pictures are then classified according to the type of the extracted characters, where one type corresponds to one category. The type of a character here refers to its style of writing or rendering, such as Song, cursive or Roman typefaces, and includes the font of symbols. For example, some websites use Song characters as verification codes and others use characters in another typeface. Mixing different character types increases the difficulty of training and reduces the accuracy of model recognition. Classifying by the font of the verification code characters therefore makes it easier to complete the training of the machine learning model and improves its recognition accuracy in use.
  • in the third method, the spacing between the verification code characters in each verification code picture used as a learning sample is obtained first; the verification code pictures are then classified according to the preset spacing interval into which each picture's spacing falls, where one spacing interval corresponds to one category. For example, on the verification code pictures used by some websites, two adjacent verification code characters are 3 pixels apart; on others they are 5 pixels apart; and on others they are 0 pixels apart. Different character spacing affects not only the difficulty of model training and the later recognition accuracy, but also the size of the picture blocks produced by cutting. Classifying the learning samples by the spacing between verification code characters therefore makes it easier to complete the training of the machine learning model and improves its recognition accuracy in use.
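Classification by spacing interval can be sketched as a bucket lookup. The interval boundaries below are hypothetical values chosen so that the 0, 3 and 5 pixel examples fall into different categories; the application does not state the actual intervals.

```python
from typing import Dict, List, Tuple

# Hypothetical spacing intervals in pixels, as half-open ranges [low, high).
SPACING_INTERVALS: List[Tuple[int, int]] = [(0, 2), (2, 4), (4, 8)]


def spacing_category(spacing: int) -> int:
    """Return the index of the preset interval that contains the spacing."""
    for index, (low, high) in enumerate(SPACING_INTERVALS):
        if low <= spacing < high:
            return index
    raise ValueError("spacing %d is outside every preset interval" % spacing)


def group_by_spacing(samples: Dict[str, int]) -> Dict[int, List[str]]:
    """Group sample names by the category of their character spacing."""
    groups: Dict[int, List[str]] = {}
    for name, spacing in samples.items():
        groups.setdefault(spacing_category(spacing), []).append(name)
    return groups


# Sample pictures with character spacings of 3, 5, 0 and 3 pixels.
groups = group_by_spacing({"a.png": 3, "b.png": 5, "c.png": 0, "d.png": 3})
```

One model would then be trained per group, and the same lookup could pick the model for a target verification code picture at recognition time.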
  • in this embodiment, an access request is first initiated to the target website whose data is to be crawled; after feedback information that the target website requires input of a verification code is received, the target verification code picture corresponding to the feedback information on the target website is acquired; the target verification code picture is then input into a pre-trained machine learning model for identification, and the verification code answer output by the machine learning model is obtained; next, the verification operation in which the target website requires input of a verification code is performed according to the output answer; after the verification of the target website is passed, data is crawled from the target website.
  • when the target website requires a verification code during crawling, the target verification code picture can be identified by the machine learning model to obtain the verification code answer, and the verification of the target website is completed automatically according to that answer, overcoming the website's obstacles to data crawling so that the crawler system can crawl the data on the website smoothly.
  • the above mainly describes a method of crawling website data, and a device for crawling website data will be described in detail below.
  • FIG. 4 is a structural diagram showing an embodiment of an apparatus for crawling website data in an embodiment of the present application.
  • an apparatus for crawling website data includes:
  • the request initiating module 401 is configured to initiate an access request to the target website whose data is to be crawled;
  • the target picture obtaining module 402 is configured to: after receiving the feedback information that the target website requires input of a verification code, acquire the target verification code picture corresponding to the feedback information on the target website;
  • the verification code identification module 403 is configured to input the target verification code picture into a pre-trained machine learning model to obtain a verification code answer output by the machine learning model;
  • the verification operation module 404 is configured to perform, according to the output verification code answer, the verification operation in which the target website requires input of a verification code;
  • the crawl data module 405 is configured to crawl data from the target website after the verification of the target website is passed.
  • machine learning model can be pre-trained by the following modules:
  • a picture acquisition module configured to obtain multiple verification code pictures;
  • a picture block cutting module configured to cut each verification code picture into picture blocks that each contain a single verification code character;
  • a picture block binarization module configured to perform binarization processing on each of the picture blocks;
  • an answer marking module configured to mark each binarized picture block with the corresponding verification code answer;
  • a training module configured to input each binarized picture block to the machine learning model and obtain the training answer output by the machine learning model;
  • a parameter adjustment module configured to adjust, with each training answer as the target, the model parameters of the machine learning model to minimize the error between the training answers obtained and the corresponding marked verification code answers;
  • a training completion module configured to determine that training of the machine learning model is complete if the error rate between the output training answers and the corresponding marked verification code answers is less than a preset threshold.
  • verification code identification module may include:
  • a cutting unit configured to cut the target verification code picture into target picture blocks each including an independent verification code
  • a binarization unit configured to perform binarization processing on each of the target picture blocks
  • the input model unit is configured to input each binarized target picture block as input to the machine learning model, and obtain a verification code answer output by the machine learning model.
  • the device for crawling website data may further include:
  • a model selection module configured to select, from a pre-established model set, a machine learning model corresponding to the target verification code picture, where the different machine learning models in the model set are each pre-trained with verification code pictures of a different category as learning samples.
  • classification of the learning samples used in the pre-training of the respective machine learning models is predetermined by the following modules:
  • the first categorization module is configured to classify each verification code image as a learning sample according to a website of a respective source, wherein one website corresponds to one category;
  • a character extraction module configured to extract characters of each verification code in the verification code picture of the learning sample
  • a second categorization module configured to classify each of the verification code pictures according to the type of the extracted characters, where one type corresponds to one category;
  • a spacing acquisition module configured to obtain the spacing between the verification code characters in each verification code picture used as a learning sample;
  • a third categorization module configured to classify the verification code pictures according to the preset spacing interval into which each picture's spacing falls, where one spacing interval corresponds to one category.
  • the device for crawling website data may further include:
  • a picture refreshing module configured to: if the target website reports that verification failed after the verification operation, refresh the target verification code picture provided by the target website and trigger the target picture obtaining module again.
  • FIG. 5 is a schematic diagram of a server according to an embodiment of the present application.
  • the server 5 of this embodiment includes a processor 50, a memory 51, and computer readable instructions 52 stored in the memory 51 and executable on the processor 50, for example a program that performs the above method of crawling website data.
  • when executing the computer readable instructions 52, the processor 50 implements the steps of the above method embodiments for crawling website data, such as steps 101 to 105 shown in FIG. 1.
  • alternatively, when executing the computer readable instructions 52, the processor 50 implements the functions of the modules/units in the apparatus embodiments described above, such as the functions of the modules 401 to 405 shown in FIG. 4.
  • the computer readable instructions 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to carry out the present application.
  • the one or more modules/units may be an instruction segment of a series of computer readable instructions capable of performing a particular function, the instruction segments being used to describe the execution of the computer readable instructions 52 in the server 5.
  • the server 5 can be a computing device such as a local server or a cloud server.
  • the server may include, but is not limited to, a processor 50 and a memory 51. It will be understood by those skilled in the art that FIG. 5 is merely an example of the server 5 and does not constitute a limitation of the server 5, which may include more or fewer components than those illustrated, combine some components, or use different components; for example, the server may also include an input and output device, a network access device, a bus, and the like.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the computer readable storage medium includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • the foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), and a random access memory (RAM).

Abstract

Disclosed are a method and device for crawling website data, a computer-readable storage medium and a server, solving the problem that many websites require input of a verification code to block a crawler system, resulting in the crawler system not being able to crawl data. The method provided by the present application comprises: initiating an access request to a target website of which data is to be crawled; receiving feedback information that the target website requires input of a verification code, and then acquiring a target verification code picture, on the target website, corresponding to feedback information; putting the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model; executing, according to the output verification code answer, a verification operation of the target website requiring input of a verification code; and when verification of the target website is passed, crawling data from the target website.

Description

Method, storage medium, server and device for crawling website data
This application claims priority to Chinese Patent Application No. 201810029529.4, filed with the Chinese Patent Office on January 12, 2018 and entitled "Method, storage medium and server for crawling website data", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data processing technologies, and in particular to a method for crawling website data, a computer readable storage medium, a server, and a device.
Background
In the Internet environment, data is a very important asset. At present, a crawler system is one of the important ways to obtain data effectively. However, many websites block crawler systems by requiring a verification code to be entered, so that a crawler system cannot access these websites and complete data crawling.
Technical problem
The embodiments of the present application provide a method for crawling website data, a computer readable storage medium, a server and a device, which can automatically complete the verification required by a target website and overcome the website's obstacles to data crawling, so that the crawler system can crawl the data on the website smoothly.
Technical solution
In a first aspect, a method of crawling website data is provided, including:
initiating an access request to the target website whose data is to be crawled;
after receiving feedback information that the target website requires input of a verification code, acquiring the target verification code picture corresponding to the feedback information on the target website;
inputting the target verification code picture into a pre-trained machine learning model for identification, and obtaining the verification code answer output by the machine learning model;
performing, according to the output verification code answer, the verification operation in which the target website requires input of a verification code;
after passing the verification of the target website, crawling data from the target website.
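The five steps of the first aspect can be sketched as one control flow. Every callable below is a hypothetical placeholder supplied by the caller (no real HTTP or model library is implied); the sketch only shows the order of the steps and where verification gates the crawl.

```python
from typing import Callable, Optional


def crawl_with_captcha(
    request_site: Callable[[], bool],
    fetch_picture: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    submit_answer: Callable[[str], bool],
    crawl: Callable[[], str],
) -> Optional[str]:
    """Run the access, verification and crawl steps; None if verification fails."""
    needs_code = request_site()        # initiate the access request
    if needs_code:                     # the site asked for a verification code
        picture = fetch_picture()      # acquire the target verification code picture
        answer = recognize(picture)    # answer output by the pre-trained model
        if not submit_answer(answer):  # perform the verification operation
            return None
    return crawl()                     # crawl data after verification passes


# Fake site that requires a code and accepts the answer "ab12".
data = crawl_with_captcha(
    lambda: True,
    lambda: b"picture-bytes",
    lambda picture: "ab12",
    lambda answer: answer == "ab12",
    lambda: "<html>data</html>",
)
```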
第二方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令被处理器执行时实现上述爬取网站数据的方法的步骤。In a second aspect, a computer readable storage medium is stored, the computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the method of crawling website data.
第三方面,提供了一种服务器,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述爬取网站数据的方法的步骤。In a third aspect, a server is provided, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the computer readable instructions The steps of the above method of crawling website data.
第四方面,提供了一种爬取网站数据的装置,可以包括用于实现上述爬取网站数据的方法的步骤的模块。In a fourth aspect, an apparatus for crawling website data is provided, which can include means for implementing the steps of the method of crawling website data described above.
有益效果Beneficial effect
本申请实施例中,在爬取目标网站数据,遇到目标网站要求输入验证码时,可以通过机器学习模型对目标验证码图片进行识别,得到验证码答案,并根据验证码答案自动完成目标网站的验证,突破网站对爬取数据的阻碍,使得爬虫系统可以顺利爬取网站上的数据。In the embodiment of the present application, when the target website data is crawled and the target website is required to input the verification code, the target verification code picture can be identified by the machine learning model, the verification code answer is obtained, and the target website is automatically completed according to the verification code answer. The verification, the breakthrough of the website to block the data, so that the crawler system can successfully crawl the data on the website.
Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of a method for crawling website data in the embodiments of the present application;
FIG. 2 is a schematic flowchart of pre-training a machine learning model, in one application scenario, for the method for crawling website data in the embodiments of the present application;
FIG. 3 is a schematic flowchart of step 103 of the method for crawling website data in the embodiments of the present application, in one application scenario;
FIG. 4 is a structural diagram of an embodiment of an apparatus for crawling website data in the embodiments of the present application;
FIG. 5 is a schematic diagram of a server provided by an embodiment of the present application.
Embodiments of the Invention
The embodiments of the present application provide a method for crawling website data, a computer-readable storage medium, a server, and an apparatus, which address the problem that many websites block crawler systems by requiring a verification code to be input, leaving the crawler system unable to crawl data.
Referring to FIG. 1, an embodiment of a method for crawling website data in the embodiments of the present application includes:
101. Initiate an access request to the target website from which data is to be crawled;
In this embodiment, when data needs to be crawled, an access request must first be initiated to the target website. The execution subject of this embodiment may be a terminal device or a server; preferably, it is a server.
102. After receiving feedback information indicating that the target website requires a verification code to be input, acquire the target verification code picture on the target website that corresponds to the feedback information;
It can be understood that, after an access request is initiated to the target website, a target website that guards against crawler systems will often require a verification code to be input and will generate a target verification code picture. The server therefore receives feedback information indicating that the target website requires a verification code, and then acquires the target verification code picture on the target website that corresponds to that feedback information.
103. Input the target verification code picture into a pre-trained machine learning model for recognition, and obtain the verification code answer output by the machine learning model;
After acquiring the target verification code picture, the server can input it into the pre-trained machine learning model for recognition. The machine learning model may specifically be a deep learning model or a support vector machine (SVM) model. Because the model has been pre-trained on a large number of learning samples, inputting the target verification code picture yields a verification code answer as output, and this answer is taken as the verification code that the target website requires.
Further, as shown in FIG. 2, the machine learning model may be pre-trained through the following steps:
201. Acquire multiple verification code pictures;
202. For each verification code picture, cut the picture into picture blocks that each contain one independent verification code character;
203. Binarize each of the picture blocks;
204. Label each binarized picture block with its corresponding verification code answer;
205. Input each binarized picture block into the machine learning model to obtain the training answers output by the machine learning model;
206. Taking the training answers as the objective, adjust the model parameters of the machine learning model to minimize the error between the training answers obtained and the labeled verification code answers;
207. If the error rate between the output training answers and the labeled verification code answers is less than a preset threshold, determine that training of the machine learning model is complete.
For step 201, before the machine learning model is trained, a large number of learning samples, that is, verification code pictures, must be collected. These pictures may be crawled from websites or collected and organized manually; no limitation is imposed here.
For step 202, a verification code picture generally contains multiple characters. To improve the learning efficiency and recognition accuracy of the machine learning model, each verification code picture can therefore be cut so that a picture containing multiple verification code characters becomes picture blocks that each contain one independent verification code character. A cutting scheme can be set for verification code pictures whose character spacing is similar, chosen according to the size of the picture and the positions of the verification code characters on it. For example, if 5 verification code characters are distributed at equal intervals on a 100*20 picture, the picture can be cut at equal intervals into five 20*20 picture blocks, and this cutting scheme applies to all verification code pictures whose characters are equally spaced.
It should be noted that the characters mentioned above are letters, digits, words, and/or symbols.
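The equal-interval cut in the 100*20 example can be sketched as follows. This is only an illustrative sketch: the picture is assumed to be a list of pixel rows, and the names `cut_equal_width` and `n_chars` are hypothetical, since the application does not specify a data representation.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; an assumed representation


def cut_equal_width(image: Image, n_chars: int) -> List[Image]:
    """Split a picture into n_chars blocks of equal width, left to right."""
    if n_chars <= 0:
        raise ValueError("n_chars must be positive")
    if not image or not image[0]:
        raise ValueError("image must be non-empty")
    width = len(image[0])
    if width % n_chars != 0:
        raise ValueError("image width is not divisible by n_chars")
    block_width = width // n_chars
    return [
        [row[i * block_width:(i + 1) * block_width] for row in image]
        for i in range(n_chars)
    ]


# A 100*20 picture (20 rows of 100 pixels) cut into five 20*20 blocks.
picture = [[0] * 100 for _ in range(20)]
blocks = cut_equal_width(picture, 5)
```

A picture whose characters are not equally spaced would need a different cutting scheme, as the description above indicates.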
For step 203, after the picture blocks are obtained by cutting, they must also be binarized to obtain the binarized picture blocks.
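One common way to binarize a block is a fixed threshold, sketched below. The application does not state which binarization method is used, so the 0 to 255 grayscale range, the threshold of 128, and the convention that dark pixels become 1 are all assumptions made for illustration.

```python
from typing import List


def binarize(block: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map each grayscale pixel to 1 if it is darker than threshold, else 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in block]


sample = [[0, 200, 127], [128, 255, 30]]
binary = binarize(sample)
```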
For step 204, each binarized picture block is labeled with its corresponding verification code answer; the binarized picture blocks together with their corresponding answers serve as the learning samples of the machine learning model.
For steps 205 to 207, the machine learning model is trained with the binarized picture blocks as inputs and their corresponding verification code answers as outputs. The machine learning model in this embodiment may be a deep learning model or an SVM model; for ease of understanding, this embodiment uses an SVM model as the example. An SVM is a supervised classification model, and its prediction accuracy can be improved by adjusting its parameters, for example the choice of kernel function (rbf, poly, sigmoid, linear, and so on). After each binarized picture block is input into the model and the training answers output by the model are obtained, the model parameters are adjusted, with the training answers as the objective, to minimize the error between the training answers and the labeled verification code answers. Training of the SVM model can be considered complete once its error rate on the learning samples falls below a preset threshold, such as 10%, or, equivalently, once its recognition accuracy on the learning samples exceeds a preset threshold, such as 90%.
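The stopping criterion of step 207 can be illustrated with a short sketch. The function names and the default threshold of 10% are illustrative assumptions taken from the example figure above.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], labeled: Sequence[str]) -> float:
    """Fraction of training answers that differ from the labeled answers."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        raise ValueError("at least one sample is required")
    wrong = sum(1 for p, t in zip(predicted, labeled) if p != t)
    return wrong / len(labeled)


def training_complete(
    predicted: Sequence[str], labeled: Sequence[str], max_error_rate: float = 0.10
) -> bool:
    """Step 207: training is complete when the error rate is below the threshold."""
    return error_rate(predicted, labeled) < max_error_rate


labels = list("abcdefghijklmnopqrst")   # 20 labeled answers
outputs = list("abcdefghijklmnopqrsX")  # one training answer is wrong
done = training_complete(outputs, labels)
```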
Regarding parameter adjustment of the SVM model, the kernel function selected in this embodiment is the RBF kernel, which leaves two parameters to tune: the penalty factor C and the kernel parameter γ. The goal is to find the parameter pair (C, γ) that gives the SVM model its best recognition performance. Preferably, the parameter adjustment problem is reduced to selecting the optimal pair (C, γ) within a small "good region". It can be understood that different choices of C and γ produce different SVM models, and the aim is to find the combination that gives the best-performing model, that is, the one with the lowest recognition error rate. In this embodiment, in one application scenario, several candidate (C, γ) pairs can be selected and each trained on the same learning samples; the (C, γ) pair of the best-performing SVM model is then taken as the final parameter values for subsequent training.
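Selecting among several candidate pairs can be sketched as a simple search. In this sketch `evaluate` is a hypothetical callback that would train an SVM with the given C and γ on the same learning samples and return its error rate; the toy stand-in used in the example is not a real training run.

```python
from typing import Callable, Iterable, Tuple

Params = Tuple[float, float]  # (C, gamma)


def select_best_params(
    candidates: Iterable[Params],
    evaluate: Callable[[float, float], float],
) -> Tuple[Params, float]:
    """Return the (C, gamma) pair with the lowest error rate, and that rate."""
    best_params = None
    best_error = float("inf")
    for c, gamma in candidates:
        err = evaluate(c, gamma)  # assumed: train on the same samples, return error rate
        if err < best_error:
            best_params, best_error = (c, gamma), err
    if best_params is None:
        raise ValueError("no candidate parameters were given")
    return best_params, best_error


def toy_evaluate(c: float, gamma: float) -> float:
    """Stand-in for real training; its minimum is placed at C=10, gamma=0.1."""
    return abs(c - 10) / 100 + abs(gamma - 0.1)


best, best_err = select_best_params([(1, 0.01), (10, 0.1), (100, 1.0)], toy_evaluate)
```

In practice the error rate would usually be measured on held-out samples rather than on the training samples themselves, although the application does not address this.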
Further, building on the steps of FIG. 2 for pre-training the machine learning model, and as shown in FIG. 3, step 103 may specifically include:
301. Cut the target verification code picture into target picture blocks that each contain one independent verification code character;
302. Binarize each of the target picture blocks;
303. Input each binarized target picture block into the machine learning model, and obtain the verification code answer output by the machine learning model.
For step 301, the principle is similar to that of step 202. A verification code picture generally contains multiple characters, and the machine learning model was trained on picture blocks, so before recognition the target verification code picture must be cut: the picture containing multiple verification code characters is cut into target picture blocks that each contain one independent verification code character.
For step 302, the principle is similar to that of step 203 and is not repeated here.
For step 303, after the binarized target picture blocks are obtained, the server inputs each of them into the machine learning model and obtains the verification code answer output by the machine learning model.
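Steps 301 to 303 can be sketched end to end as below. This sketch assumes equal-width cutting, fixed-threshold binarization, and a hypothetical `classify` callable standing in for the trained model; joining the per-block outputs into one answer string is also an assumption, since the application does not state how the per-block outputs are combined.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of grayscale pixels; an assumed representation


def recognize(
    image: Image,
    n_chars: int,
    classify: Callable[[Image], str],
    threshold: int = 128,
) -> str:
    """Cut, binarize, and classify each block, then join the characters."""
    block_width = len(image[0]) // n_chars
    answer = []
    for i in range(n_chars):
        # Step 301: cut out the i-th target picture block.
        block = [row[i * block_width:(i + 1) * block_width] for row in image]
        # Step 302: binarize the block (dark pixels become 1).
        binary = [[1 if p < threshold else 0 for p in row] for row in block]
        # Step 303: let the (stand-in) model recognize the block.
        answer.append(classify(binary))
    return "".join(answer)


def toy_classify(block: Image) -> str:
    """Toy stand-in for a trained model: "1" if the block has any dark pixel."""
    return "1" if any(any(row) for row in block) else "0"


target = [[0, 255, 255, 255], [255, 255, 255, 255]]
answer = recognize(target, 2, toy_classify)
```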
104. Perform, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
After obtaining the verification code answer, the server can enter it at the verification code input position specified by the target website and then trigger the "OK" button on the target website to submit it for verification. The back end of the target website checks the answer entered by the server: if the answer is correct, the target website feeds back to the server that verification passed; otherwise, it feeds back that verification failed.
105. After passing the verification of the target website, crawl data from the target website.
After passing the verification of the target website, the server can crawl data from the target website through the crawler system.
Further, because the recognition accuracy of a machine learning model rarely reaches 100%, in actual use the output verification code answer will sometimes be incorrect and the target website's verification will fail. Therefore, if the target website feeds back that verification failed after the verification operation performed according to the output answer, the target verification code picture provided by the target website can be refreshed and execution returns to step 102: the refreshed target verification code picture is acquired and steps 103 to 105 are executed again in another attempt to pass verification. Furthermore, if the number of times the verification code picture has been refreshed exceeds a preset threshold, for example more than 5, the current user of the server can be notified that this data crawl of the target website has failed.
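The refresh-and-retry behavior can be sketched as a loop. All four callables are hypothetical stand-ins for the operations of steps 102 to 104 and for refreshing the picture; the limit of 5 refreshes follows the example figure above.

```python
from typing import Callable


def pass_verification(
    fetch_picture: Callable[[], object],
    recognize: Callable[[object], str],
    submit: Callable[[str], bool],
    refresh: Callable[[], None],
    max_refreshes: int = 5,
) -> bool:
    """Return True once verification passes, False when refreshes run out."""
    refreshes = 0
    while True:
        picture = fetch_picture()    # step 102: acquire the target picture
        answer = recognize(picture)  # step 103: model outputs an answer
        if submit(answer):           # step 104: website accepts or rejects
            return True              # the caller may proceed to step 105
        if refreshes >= max_refreshes:
            return False             # one more refresh would exceed the limit
        refresh()
        refreshes += 1


# Toy stand-ins: the third submitted answer is accepted.
submitted = []


def fake_submit(answer: str) -> bool:
    submitted.append(answer)
    return len(submitted) == 3


passed = pass_verification(lambda: "picture", lambda p: "ab12", fake_submit, lambda: None)
```

When the function returns False, the caller would notify the current user that the crawl of the target website failed.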
Further, before step 103, a machine learning model corresponding to the target verification code picture may be selected from a pre-established model set, in which the different machine learning models are pre-trained with verification code pictures of mutually different categories as learning samples. It can be understood that the format of verification code characters often differs greatly between websites; for example, some use the Kai typeface (楷体) and others the Song typeface (宋体). If the learning samples of one machine learning model contain verification code characters in many different formats, the model becomes much harder to train and its recognition accuracy after training drops. This embodiment can therefore classify the verification code pictures used as learning samples into categories in advance, train one machine learning model per category, and collect the trained models into a model set. When data is crawled from a website and a verification code must be input, the category of the website's verification code picture is determined first, the corresponding machine learning model is selected from the model set, and the website's verification code picture (that is, the target verification code picture) is input into the selected model for recognition to obtain the output verification code answer. This benefits both the training and the recognition accuracy of the machine learning models, and it broadens the range of websites from which the server can crawl data.
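A model set can be sketched as a mapping from category to trained model. The category keys and the lambda "models" below are purely hypothetical placeholders; how the category of a target picture is determined is covered by the classification ways that follow.

```python
from typing import Callable, Dict

Model = Callable[[object], str]  # assumed: maps a picture to an answer


def select_model(model_set: Dict[str, Model], category: str) -> Model:
    """Pick the model trained on samples of the given category."""
    try:
        return model_set[category]
    except KeyError:
        raise LookupError(f"no trained model for category {category!r}") from None


# Hypothetical model set with one placeholder model per typeface category.
model_set: Dict[str, Model] = {
    "song": lambda picture: "answer-from-song-model",
    "kai": lambda picture: "answer-from-kai-model",
}
model = select_model(model_set, "kai")
result = model("target picture")
```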
Furthermore, the category to which the learning samples of each machine learning model belong may be predetermined in any one of the following three ways:
In the first way, the verification code pictures used as learning samples are classified by the website from which each originates, with one website corresponding to one category. It can be understood that a separate machine learning model can then be trained for each website. In actual use, the number of target websites to be crawled is usually limited, for example only a few, so the first way can meet practical needs without imposing an excessive model-training burden.
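The first way can be sketched by grouping samples under the host name of their source URL. Identifying a website by its host name is an assumption made for illustration, and the URLs below are made-up examples.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def group_by_site(samples: Iterable[Tuple[str, object]]) -> Dict[str, List[object]]:
    """Group (source_url, picture) samples by the host name of the URL."""
    groups: Dict[str, List[object]] = defaultdict(list)
    for url, picture in samples:
        host = urlsplit(url).hostname
        if host is None:
            raise ValueError(f"cannot determine website for {url!r}")
        groups[host].append(picture)
    return dict(groups)


samples = [
    ("https://a.example/captcha/1.png", "picture-1"),
    ("https://b.example/code?id=7", "picture-2"),
    ("https://a.example/captcha/2.png", "picture-3"),
]
by_site = group_by_site(samples)
```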
In the second way, the characters of the verification code in each learning-sample picture are extracted first; the pictures are then classified by the type of the extracted characters, with one type corresponding to one category. Here the type of a character refers to its style of writing or presentation, that is, its typeface, such as Song, Kai, cursive, or Roman; note that the typeface of symbols is included as well. For example, some websites use Song-typeface characters as verification codes and others use Kai-typeface characters. Training a machine learning model on a mixture of character types (here, typefaces) makes training harder and lowers the model's recognition accuracy. Classifying by the typeface of the verification code characters therefore makes the model easier to train and improves its recognition accuracy in use.
In the third way, the spacing between the verification code characters in each learning-sample picture is obtained first; the pictures are then classified according to the preset spacing interval into which each picture's spacing falls, with one spacing interval corresponding to one category. For example, on some websites adjacent verification code characters are 3 pixels apart, on others 5 pixels apart, and on others 0 pixels apart. For training, different character spacing affects not only the difficulty of training and the later recognition accuracy, but also the block size used when cutting the verification code picture. Classifying the learning-sample pictures by the spacing between verification code characters therefore makes the model easier to train and improves its recognition accuracy in use.
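The third way can be sketched in two parts: measuring gaps between characters and mapping the measured spacing to a preset interval. The gap measurement below (runs of empty columns in a binarized picture) is an assumed method that the application does not describe, and it cannot detect characters that touch (0-pixel spacing). The interval boundaries are illustrative.

```python
from bisect import bisect_right
from statistics import median
from typing import List, Sequence


def character_gaps(binary: List[List[int]]) -> List[int]:
    """Widths of empty-column runs that lie between columns containing ink."""
    n_cols = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(n_cols)]
    gaps, run, seen_ink = [], 0, False
    for ink in has_ink:
        if ink:
            if seen_ink and run > 0:
                gaps.append(run)  # an empty run enclosed by ink on both sides
            run, seen_ink = 0, True
        else:
            run += 1
    return gaps


def spacing_category(gap: float, boundaries: Sequence[float]) -> int:
    """Index of the preset interval containing gap; boundaries must be sorted."""
    return bisect_right(boundaries, gap)


# One-row toy picture: three single-column "characters" separated by 3 pixels.
toy = [[1, 0, 0, 0, 1, 0, 0, 0, 1, 0]]
gaps = character_gaps(toy)
# Boundaries [2, 4] define three intervals: below 2, from 2 up to 4, and 4 or more.
category = spacing_category(median(gaps), [2, 4])
```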
It should be noted that the three classification ways above may be used individually or in combination; this embodiment imposes no limitation.
In this embodiment, an access request is first initiated to the target website from which data is to be crawled; after feedback information indicating that the target website requires a verification code is received, the target verification code picture corresponding to the feedback information is acquired from the target website; the target verification code picture is then input into a pre-trained machine learning model for recognition to obtain the verification code answer output by the model; next, the verification operation for which the target website requires a verification code is performed according to the output answer; and after the verification of the target website is passed, data is crawled from the target website. Thus, when the target website requires a verification code while its data is being crawled, the target verification code picture can be recognized by the machine learning model to obtain a verification code answer, and the verification of the target website is completed automatically according to that answer. This overcomes the obstacle that the website places on data crawling, so that the crawler system can crawl the data on the website.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed is determined by their functions and internal logic and should not limit the implementation of the embodiments of the present application in any way.
The above mainly describes a method for crawling website data; an apparatus for crawling website data is described in detail below.
FIG. 4 shows a structural diagram of an embodiment of an apparatus for crawling website data in the embodiments of the present application.
In this embodiment, an apparatus for crawling website data includes:
a request initiating module 401, configured to initiate an access request to the target website from which data is to be crawled;
a target picture acquisition module 402, configured to, after receiving feedback information indicating that the target website requires a verification code to be input, acquire the target verification code picture on the target website that corresponds to the feedback information;
a verification code recognition module 403, configured to input the target verification code picture into a pre-trained machine learning model for recognition and obtain the verification code answer output by the machine learning model;
a verification operation module 404, configured to perform, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
a data crawling module 405, configured to crawl data from the target website after the verification of the target website is passed.
Further, the machine learning model may be pre-trained by the following modules:
a picture acquisition module, configured to acquire multiple verification code pictures;
a picture block cutting module, configured to, for each verification code picture, cut the picture into picture blocks that each contain one independent verification code character;
a picture block binarization module, configured to binarize each of the picture blocks;
an answer labeling module, configured to label each binarized picture block with its corresponding verification code answer;
a training module, configured to input each binarized picture block into the machine learning model to obtain the training answers output by the machine learning model;
a parameter adjustment module, configured to, taking the training answers as the objective, adjust the model parameters of the machine learning model to minimize the error between the training answers obtained and the labeled verification code answers;
a training completion module, configured to determine that training of the machine learning model is complete if the error rate between the output training answers and the labeled verification code answers is less than a preset threshold.
Further, the verification code recognition module may include:
a cutting unit, configured to cut the target verification code picture into target picture blocks that each contain one independent verification code character;
a binarization unit, configured to binarize each of the target picture blocks;
a model input unit, configured to input each binarized target picture block into the machine learning model and obtain the verification code answer output by the machine learning model.
Further, the apparatus for crawling website data may also include:
a model selection module, configured to select, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein the different machine learning models in the model set are pre-trained with verification code pictures of mutually different categories as learning samples.
Further, the category to which the learning samples of each machine learning model belong is predetermined by the following modules:
a first classification module, configured to classify the verification code pictures used as learning samples by the website from which each originates, wherein one website corresponds to one category;
or
a character extraction module, configured to extract the characters of the verification code in each verification code picture used as a learning sample;
a second classification module, configured to classify the verification code pictures by the type of the extracted characters, wherein one type corresponds to one category;
or
a spacing acquisition module, configured to obtain the spacing between the verification code characters in each verification code picture used as a learning sample;
a third classification module, configured to classify the verification code pictures according to the preset spacing interval into which each picture's spacing falls, wherein one spacing interval corresponds to one category.
Further, the apparatus for crawling website data may also include:
a picture refresh module, configured to, if the target website feeds back that verification failed after the verification operation, refresh the target verification code picture provided by the target website and trigger the target picture acquisition module again.
FIG. 5 is a schematic diagram of a server provided by an embodiment of the present application. As shown in FIG. 5, the server 5 of this embodiment includes a processor 50, a memory 51, and computer-readable instructions 52 stored in the memory 51 and executable on the processor 50, for example a program that executes the above method for crawling website data. When executing the computer-readable instructions 52, the processor 50 implements the steps of the above method embodiments for crawling website data, for example steps 101 to 105 shown in FIG. 1. Alternatively, when executing the computer-readable instructions 52, the processor 50 implements the functions of the modules/units in the above apparatus embodiments, for example the functions of modules 401 to 405 shown in FIG. 4.
Illustratively, the computer-readable instructions 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer-readable instructions 52 in the server 5.
The server 5 may be a computing device such as a local server or a cloud server. The server may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that FIG. 5 is merely an example of the server 5 and does not limit it; the server may include more or fewer components than shown, combine certain components, or use different components. For example, the server may also include input/output devices, network access devices, buses, and the like.
The functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A method for crawling website data, characterized by comprising:
    initiating an access request to a target website from which data is to be crawled;
    after receiving feedback information indicating that the target website requires a verification code to be input, acquiring the target verification code picture on the target website that corresponds to the feedback information;
    inputting the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model;
    performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
    after passing the verification of the target website, crawling data from the target website.
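The five steps of claim 1 can be sketched as a single control flow. The following is a minimal illustration only: `DemoSite`, its method names, and the `recognize` callable are hypothetical stand-ins introduced here (the claims do not name an API), and the demo site returns the code itself instead of real image bytes.

```python
from typing import Callable, Optional


class DemoSite:
    """Hypothetical stand-in for a target website; not a real HTTP client."""

    def __init__(self, code: str, data: str) -> None:
        self._code = code
        self._data = data
        self._verified = False

    def request_access(self) -> bool:
        # True means the site answered "verification code required".
        return not self._verified

    def fetch_captcha(self) -> bytes:
        # A real site would return image bytes; the demo returns the code text.
        return self._code.encode()

    def submit_answer(self, answer: str) -> bool:
        self._verified = answer == self._code
        return self._verified

    def fetch_data(self) -> str:
        if not self._verified:
            raise PermissionError("verification required")
        return self._data


def crawl(site: DemoSite, recognize: Callable[[bytes], str]) -> Optional[str]:
    """Return the crawled data, or None if verification fails."""
    if site.request_access():                 # step 1: access request
        picture = site.fetch_captcha()        # step 2: target verification code picture
        answer = recognize(picture)           # step 3: model outputs the answer
        if not site.submit_answer(answer):    # step 4: verification operation
            return None
    return site.fetch_data()                  # step 5: crawl after passing
```

In this sketch `recognize` stands in for the pre-trained machine learning model; any callable that maps picture bytes to a string can be passed in.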
  2. The method for crawling website data according to claim 1, characterized in that the machine learning model is pre-trained by the following steps:
    acquiring a plurality of verification code pictures;
    for each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;
    binarizing each of the picture blocks;
    labeling each binarized picture block with the corresponding verification code answer;
    inputting each binarized picture block into a machine learning model, to obtain the training answers output by the machine learning model;
    taking the training answers as the target, adjusting the model parameters of the machine learning model so as to minimize the error between the obtained training answers and the labeled verification code answers;
    if the error rate between the output training answers and the labeled verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
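The training loop of claim 2 (predict, compare with the labeled answer, adjust parameters, stop once the error rate falls below a preset threshold) can be illustrated as follows. The claims do not name a model type, so a multi-class perceptron is swapped in here purely as an assumption; the function names, the default `threshold`, and `max_epochs` are likewise hypothetical. Each picture block is assumed to be already binarized and flattened into a list of 0/1 values.

```python
from typing import Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]


def predict(weights: Weights, block: Sequence[int]) -> str:
    """Return the label whose weight vector scores the block highest."""
    def score(label: str) -> float:
        return sum(w * x for w, x in zip(weights[label], block))
    return max(weights, key=score)


def train(
    blocks: Sequence[Sequence[int]],
    labels: Sequence[str],
    threshold: float = 0.05,
    max_epochs: int = 100,
) -> Tuple[Weights, float]:
    """Train until the per-epoch error rate drops below `threshold`."""
    weights: Weights = {
        label: [0.0] * len(blocks[0]) for label in sorted(set(labels))
    }
    error_rate = 1.0
    for _ in range(max_epochs):
        errors = 0
        for block, label in zip(blocks, labels):
            guess = predict(weights, block)       # training answer
            if guess != label:                    # differs from labeled answer
                errors += 1
                for i, x in enumerate(block):     # adjust model parameters
                    weights[label][i] += x
                    weights[guess][i] -= x
        error_rate = errors / len(blocks)
        if error_rate < threshold:                # training is complete
            break
    return weights, error_rate
```

The function returns the last measured error rate so that a caller can tell whether the threshold was actually reached before `max_epochs` ran out.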
  3. The method for crawling website data according to claim 2, characterized in that inputting the target verification code picture into the pre-trained machine learning model for recognition, to obtain the verification code answer output by the machine learning model, comprises:
    cutting the target verification code picture into target picture blocks each containing an independent verification code;
    binarizing each of the target picture blocks;
    inputting each binarized target picture block into the machine learning model, to obtain the verification code answer output by the machine learning model.
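The cutting and binarization steps can be sketched on a grayscale picture stored as a list of pixel rows. Two assumptions are made here that the claims do not state: the picture is cut into equal-width blocks, and a fixed threshold of 128 separates dark "ink" pixels (mapped to 1) from the background (mapped to 0). The function names are hypothetical.

```python
from typing import List

Picture = List[List[int]]  # rows of grayscale pixels in the range 0..255


def cut_blocks(picture: Picture, count: int) -> List[Picture]:
    """Cut the picture into `count` equal-width blocks, left to right."""
    step = len(picture[0]) // count
    return [
        [row[i * step:(i + 1) * step] for row in picture]
        for i in range(count)
    ]


def binarize(block: Picture, threshold: int = 128) -> Picture:
    """Map dark pixels (below the threshold) to 1 and all others to 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in block]
```

Real verification code pictures often have unevenly spaced or overlapping characters, so a fixed equal-width cut is only the simplest possible choice.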
  4. The method for crawling website data according to claim 1, characterized in that, before inputting the target verification code picture into the pre-trained machine learning model for recognition to obtain the verification code answer output by the machine learning model, the method further comprises:
    selecting, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein the different machine learning models in the model set are pre-trained using verification code pictures of mutually different classifications as learning samples.
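One way to picture the model set is a lookup table from classification to model. The sketch below assumes the classification is the source website and uses the host name of the picture URL as the key; `select_model` and the shape of `model_set` are hypothetical and not taken from the application.

```python
from typing import Dict, TypeVar
from urllib.parse import urlparse

Model = TypeVar("Model")


def select_model(model_set: Dict[str, Model], captcha_url: str) -> Model:
    """Pick the model trained on pictures from the URL's host."""
    host = urlparse(captcha_url).hostname
    if host is None or host not in model_set:
        raise LookupError(f"no model trained for {host!r}")
    return model_set[host]
```

Raising an error for an unknown host is a design choice of this sketch; falling back to a general-purpose model would be another option.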
  5. The method for crawling website data according to claim 4, characterized in that the classifications to which the learning samples used to pre-train the respective machine learning models belong are predetermined by the following steps:
    classifying the verification code pictures serving as learning samples according to their respective source websites, wherein one website corresponds to one classification;
    or
    extracting the characters of the verification code in each verification code picture serving as a learning sample;
    classifying each of the verification code pictures according to the type to which the extracted characters belong, wherein one type corresponds to one classification;
    or
    acquiring the spacing between the verification code characters in each verification code picture serving as a learning sample;
    classifying the verification code pictures according to the preset spacing interval into which the spacing of each verification code picture falls, wherein one spacing interval corresponds to one classification.
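The second and third classification schemes can be illustrated with two small functions. The character types ("digits", "letters", "mixed") and the interval boundaries of 4 and 8 pixels are assumptions chosen for the example; the claims only require that each type or interval maps to one classification.

```python
from bisect import bisect_right
from typing import Sequence


def classify_by_charset(text: str) -> str:
    """Classify a verification code by the type of its characters."""
    if text.isdigit():
        return "digits"
    if text.isalpha():
        return "letters"
    return "mixed"


def classify_by_spacing(gap: float, boundaries: Sequence[float] = (4, 8)) -> int:
    """Return the index of the preset spacing interval that contains `gap`.

    With boundaries (4, 8) the intervals are: below 4 is index 0,
    from 4 up to (not including) 8 is index 1, and 8 or more is index 2.
    """
    return bisect_right(boundaries, gap)
```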
  6. The method for crawling website data according to any one of claims 1 to 5, characterized in that, after performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input, the method further comprises:
    if the target website reports after the verification operation that verification has failed, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website that corresponds to the feedback information.
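The refresh-and-retry behaviour of claim 6 can be sketched as a loop. The claim sets no upper bound on retries, so `max_attempts` is an assumption added to keep the loop finite; `DemoSite` and its method names are hypothetical stand-ins, and pictures are represented by plain strings.

```python
from typing import Callable, List, Optional


class DemoSite:
    """Hypothetical site that cycles through a fixed list of codes."""

    def __init__(self, codes: List[str]) -> None:
        self._codes = list(codes)
        self._index = 0

    def fetch_captcha(self) -> str:
        # The demo uses the code text itself as the "picture".
        return self._codes[self._index]

    def refresh_captcha(self) -> None:
        self._index = (self._index + 1) % len(self._codes)

    def submit_answer(self, answer: str) -> bool:
        return answer == self._codes[self._index]


def verify_with_retry(
    site: DemoSite,
    recognize: Callable[[str], str],
    max_attempts: int = 5,
) -> Optional[int]:
    """Return the attempt number that passed, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        picture = site.fetch_captcha()            # acquire the target picture
        if site.submit_answer(recognize(picture)):
            return attempt
        site.refresh_captcha()                    # failed: refresh, then re-acquire
    return None
```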
  7. A computer-readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
    initiating an access request to a target website from which data is to be crawled;
    after receiving feedback information indicating that the target website requires a verification code to be input, acquiring the target verification code picture on the target website that corresponds to the feedback information;
    inputting the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model;
    performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
    after passing the verification of the target website, crawling data from the target website.
  8. The computer-readable storage medium according to claim 7, characterized in that the machine learning model is pre-trained by the following steps:
    acquiring a plurality of verification code pictures;
    for each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;
    binarizing each of the picture blocks;
    labeling each binarized picture block with the corresponding verification code answer;
    inputting each binarized picture block into a machine learning model, to obtain the training answers output by the machine learning model;
    taking the training answers as the target, adjusting the model parameters of the machine learning model so as to minimize the error between the obtained training answers and the labeled verification code answers;
    if the error rate between the output training answers and the labeled verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
  9. The computer-readable storage medium according to claim 8, characterized in that inputting the target verification code picture into the pre-trained machine learning model for recognition, to obtain the verification code answer output by the machine learning model, comprises:
    cutting the target verification code picture into target picture blocks each containing an independent verification code;
    binarizing each of the target picture blocks;
    inputting each binarized target picture block into the machine learning model, to obtain the verification code answer output by the machine learning model.
  10. The computer-readable storage medium according to claim 7, characterized in that, before inputting the target verification code picture into the pre-trained machine learning model for recognition to obtain the verification code answer output by the machine learning model, the steps further comprise:
    selecting, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein the different machine learning models in the model set are pre-trained using verification code pictures of mutually different classifications as learning samples.
  11. The computer-readable storage medium according to claim 10, characterized in that the classifications to which the learning samples used to pre-train the respective machine learning models belong are predetermined by the following steps:
    classifying the verification code pictures serving as learning samples according to their respective source websites, wherein one website corresponds to one classification;
    or
    extracting the characters of the verification code in each verification code picture serving as a learning sample;
    classifying each of the verification code pictures according to the type to which the extracted characters belong, wherein one type corresponds to one classification;
    or
    acquiring the spacing between the verification code characters in each verification code picture serving as a learning sample;
    classifying the verification code pictures according to the preset spacing interval into which the spacing of each verification code picture falls, wherein one spacing interval corresponds to one classification.
  12. The computer-readable storage medium according to any one of claims 7 to 11, characterized in that, after performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input, the steps further comprise:
    if the target website reports after the verification operation that verification has failed, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website that corresponds to the feedback information.
  13. A server comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
    initiating an access request to a target website from which data is to be crawled;
    after receiving feedback information indicating that the target website requires a verification code to be input, acquiring the target verification code picture on the target website that corresponds to the feedback information;
    inputting the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model;
    performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
    after passing the verification of the target website, crawling data from the target website.
  14. The server according to claim 13, characterized in that the machine learning model is pre-trained by the following steps:
    acquiring a plurality of verification code pictures;
    for each verification code picture, cutting the verification code picture into picture blocks each containing an independent verification code;
    binarizing each of the picture blocks;
    labeling each binarized picture block with the corresponding verification code answer;
    inputting each binarized picture block into a machine learning model, to obtain the training answers output by the machine learning model;
    taking the training answers as the target, adjusting the model parameters of the machine learning model so as to minimize the error between the obtained training answers and the labeled verification code answers;
    if the error rate between the output training answers and the labeled verification code answers is less than a preset threshold, determining that training of the machine learning model is complete.
  15. The server according to any one of claims 13 to 14, characterized in that, after performing, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input, the steps further comprise:
    if the target website reports after the verification operation that verification has failed, refreshing the target verification code picture provided by the target website, and returning to the step of acquiring the target verification code picture on the target website that corresponds to the feedback information.
  16. An apparatus for crawling website data, characterized by comprising:
    a request initiation module, configured to initiate an access request to a target website from which data is to be crawled;
    a target picture acquisition module, configured to acquire, after receiving feedback information indicating that the target website requires a verification code to be input, the target verification code picture on the target website that corresponds to the feedback information;
    a verification code recognition module, configured to input the target verification code picture into a pre-trained machine learning model for recognition, to obtain a verification code answer output by the machine learning model;
    a verification operation module, configured to perform, according to the output verification code answer, the verification operation for which the target website requires a verification code to be input;
    a data crawling module, configured to crawl data from the target website after passing the verification of the target website.
  17. The apparatus for crawling website data according to claim 16, characterized by further comprising:
    a picture acquisition module, configured to acquire a plurality of verification code pictures;
    a picture block cutting module, configured to cut, for each verification code picture, the verification code picture into picture blocks each containing an independent verification code;
    a picture block binarization module, configured to binarize each of the picture blocks;
    an answer labeling module, configured to label each binarized picture block with the corresponding verification code answer;
    a training module, configured to input each binarized picture block into a machine learning model, to obtain the training answers output by the machine learning model;
    a parameter adjustment module, configured to adjust, taking the training answers as the target, the model parameters of the machine learning model so as to minimize the error between the obtained training answers and the labeled verification code answers;
    a training completion module, configured to determine that training of the machine learning model is complete if the error rate between the output training answers and the labeled verification code answers is less than a preset threshold.
  18. The apparatus for crawling website data according to claim 16, characterized in that the verification code recognition module comprises:
    a cutting unit, configured to cut the target verification code picture into target picture blocks each containing an independent verification code;
    a binarization unit, configured to binarize each of the target picture blocks;
    a model input unit, configured to input each binarized target picture block into the machine learning model, to obtain the verification code answer output by the machine learning model.
  19. The apparatus for crawling website data according to claim 16, characterized by further comprising:
    a model selection module, configured to select, from a pre-established model set, a machine learning model corresponding to the target verification code picture, wherein the different machine learning models in the model set are pre-trained using verification code pictures of mutually different classifications as learning samples.
  20. The apparatus for crawling website data according to claim 19, characterized by further comprising:
    a first classification module, configured to classify the verification code pictures serving as learning samples according to their respective source websites, wherein one website corresponds to one classification;
    or
    a character extraction module, configured to extract the characters of the verification code in each verification code picture serving as a learning sample;
    a second classification module, configured to classify each of the verification code pictures according to the type to which the extracted characters belong, wherein one type corresponds to one classification;
    or
    a spacing acquisition module, configured to acquire the spacing between the verification code characters in each verification code picture serving as a learning sample;
    a third classification module, configured to classify the verification code pictures according to the preset spacing interval into which the spacing of each verification code picture falls, wherein one spacing interval corresponds to one classification.
PCT/CN2018/097499 2018-01-12 2018-07-27 Method and device for crawling website data, storage medium and server WO2019136960A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810029529.4 2018-01-12
CN201810029529.4A CN108345641B (en) 2018-01-12 2018-01-12 Method for crawling website data, storage medium and server

Publications (1)

Publication Number Publication Date
WO2019136960A1 true WO2019136960A1 (en) 2019-07-18

Family

ID=62961117

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/097499 WO2019136960A1 (en) 2018-01-12 2018-07-27 Method and device for crawling website data, storage medium and server

Country Status (2)

Country Link
CN (1) CN108345641B (en)
WO (1) WO2019136960A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382750A (en) * 2020-03-05 2020-07-07 北京网众共创科技有限公司 Method and device for identifying graphic verification code
CN111667021A (en) * 2020-06-30 2020-09-15 上海仪电(集团)有限公司中央研究院 Front-end performance problem detection method based on artificial intelligence
CN111966432A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Verification code processing method and device, electronic equipment and storage medium
CN112214750A (en) * 2020-10-16 2021-01-12 上海携旅信息技术有限公司 Character verification code recognition method, system, electronic device and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815380A (en) * 2018-12-20 2019-05-28 山东中创软件工程股份有限公司 A kind of information crawler method, apparatus, equipment and computer readable storage medium
CN109740336B (en) * 2018-12-28 2020-08-18 北京云测信息技术有限公司 Method and device for identifying verification information in picture and electronic equipment
CN111782068A (en) * 2019-04-04 2020-10-16 阿里巴巴集团控股有限公司 Method, device and system for generating mouse track and data processing method
CN110348438A (en) * 2019-06-29 2019-10-18 上海淇馥信息技术有限公司 A kind of picture character identifying method, device and electronic equipment based on artificial nerve network model
CN110489629A (en) * 2019-08-28 2019-11-22 云汉芯城(上海)互联网科技股份有限公司 Data crawling method, data crawl device, data crawl equipment and storage medium
CN112380409A (en) * 2020-10-26 2021-02-19 武汉天宝莱信息技术有限公司 Verification code identification method based on automatic crawler

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514171A (en) * 2012-06-20 2014-01-15 同程网络科技股份有限公司 Method for implementing self-defined crawler based on optical character recognition and vertical search
CN105893583A (en) * 2016-04-01 2016-08-24 北京鼎泰智源科技有限公司 Data acquisition method and system based on artificial intelligence
CN106446123A (en) * 2016-09-19 2017-02-22 成都知道创宇信息技术有限公司 Webpage verification code element identification method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9747494B2 (en) * 2015-11-16 2017-08-29 MorphoTrak, LLC Facial matching system
CN107085730A (en) * 2017-03-24 2017-08-22 深圳爱拼信息科技有限公司 A kind of deep learning method and device of character identifying code identification

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382750A (en) * 2020-03-05 2020-07-07 北京网众共创科技有限公司 Method and device for identifying graphic verification code
CN111667021A (en) * 2020-06-30 2020-09-15 上海仪电(集团)有限公司中央研究院 Front-end performance problem detection method based on artificial intelligence
CN111966432A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Verification code processing method and device, electronic equipment and storage medium
CN111667021B (en) * 2020-06-30 2023-07-21 上海仪电(集团)有限公司中央研究院 Front-end performance problem detection method based on artificial intelligence
CN111966432B (en) * 2020-06-30 2023-07-28 北京百度网讯科技有限公司 Verification code processing method and device, electronic equipment and storage medium
CN112214750A (en) * 2020-10-16 2021-01-12 上海携旅信息技术有限公司 Character verification code recognition method, system, electronic device and storage medium
CN112214750B (en) * 2020-10-16 2023-04-25 上海携旅信息技术有限公司 Character verification code recognition method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108345641B (en) 2021-02-05
CN108345641A (en) 2018-07-31


Legal Events

Date Code Title Description
NENP Non-entry into the national phase. Ref country code: DE
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 18900507; Country of ref document: EP; Kind code of ref document: A1
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/11/2020)
122 Ep: pct application non-entry in european phase. Ref document number: 18900507; Country of ref document: EP; Kind code of ref document: A1