CN104408194B - Method and device for obtaining a web crawler request - Google Patents

Info

Publication number
CN104408194B
Authority
CN
China
Prior art keywords
value
pixel
picture
matrix
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410779511.8A
Other languages
Chinese (zh)
Other versions
CN104408194A (en)
Inventor
李庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410779511.8A
Publication of CN104408194A
Application granted
Publication of CN104408194B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a method and a device for obtaining a web crawler request. The method includes: obtaining the verification code picture for the resource information to be crawled; segmenting and denoising the verification code picture to obtain multiple first pictures; binarizing each first picture to obtain the first matrix of each first picture; reading from a database the second matrix corresponding to each first matrix; obtaining the character indicated by the second matrix to obtain verification code information; and constructing a web crawler request based on the verification code information and user information obtained in advance, where the web crawler request is used to obtain the resource information. This solves, to a certain extent, the prior-art problem of low crawling efficiency caused by manual intervention, and improves crawling efficiency.

Description

Method and device for obtaining a web crawler request
Technical field
The present invention relates to the Internet field, and in particular to a method and a device for obtaining a web crawler request.
Background
With the rapid development of the Internet, online resources have become increasingly abundant. Searching this information manually falls far short of demand, and extracting and using it efficiently has become a major challenge. Web crawlers emerged to extract information effectively and intelligently. Before obtaining a specific resource, a web crawler first constructs the corresponding request header and cookie settings, sends a request to the target website by POST or GET, and then obtains the resource information in the response. However, the resource information on some websites can only be obtained after logging in, and some websites even require verification code information to be entered at the first login. How to simulate a user login intelligently is therefore an urgent problem for web crawlers.
For websites that require a verification code at login, a conventional web crawler handles the problem in one of two ways. In the first, the content at the verification code's address is downloaded and saved as a picture; before the data is sent, the verification code is added to the code manually and the request data is built from it to simulate a user login. In the second, the login and verification code information is entered manually on the website, a tool such as Firebug is used to obtain the returned cookie, and that cookie is carried with the next request.
No effective solution has yet been proposed for the prior-art problem of low crawling efficiency caused by manual intervention in handling verification code logins.
Summary of the invention
The main object of the present invention is to provide a method and a device for obtaining a web crawler request, so as to solve, to a certain extent, the prior-art problem of low crawling efficiency caused by manual intervention in handling verification code logins.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for obtaining a web crawler request is provided, including: obtaining the verification code picture for the resource information to be crawled; segmenting and denoising the verification code picture to obtain multiple first pictures; binarizing each first picture to obtain the first matrix of each first picture; reading from a database the second matrix corresponding to each first matrix; obtaining the character indicated by the second matrix to obtain verification code information; and constructing a web crawler request based on the verification code information and user information obtained in advance, where the web crawler request is used to obtain the resource information.
Further, segmenting and denoising the verification code picture to obtain multiple first pictures includes: segmenting the verification code picture according to a preset width to obtain multiple second pictures; and performing brightness denoising on each first pixel in the second picture to obtain the denoised first picture. Performing brightness denoising on each first pixel in the second picture to obtain the denoised first picture includes: judging whether the brightness of the first pixel is greater than a first preset threshold; if the brightness of the first pixel is greater than the first preset threshold, setting the gray value of the first pixel to a first value; and if the brightness of the first pixel is not greater than the first preset threshold, setting the gray value of the first pixel to a second value.
Further, before performing brightness denoising on each first pixel in the second picture to obtain the denoised first picture, the method includes: obtaining the height and width of the second picture; using the height and width of the second picture to judge whether a next first pixel exists in the second picture; and if a next first pixel exists in the second picture, reading the brightness of that next first pixel.
Further, setting the gray value of the first pixel to the first value if the brightness of the first pixel is greater than the first preset threshold includes: if the brightness of the first pixel is greater than the first preset threshold, obtaining a first chroma value of the first pixel and a second chroma value of each second pixel, where the distance between the first pixel and the second pixel is less than a second preset threshold; calculating the difference between each second chroma value and the first chroma value; counting the number of differences greater than a third preset threshold; if the number of differences greater than the third preset threshold is not greater than a fourth preset threshold, setting the gray value of the first pixel to the first value; and if the number of differences greater than the third preset threshold is greater than the fourth preset threshold, setting the gray value of the first pixel to the second value.
Further, binarizing each first picture to obtain the first matrix of each first picture includes: if the gray value of the first pixel of the first picture is the first value, setting the binary value of the first pixel to a third value; if the gray value of the first pixel of the first picture is the second value, setting the binary value of the first pixel to a fourth value; and thereby obtaining the two-dimensional first matrix.
Further, reading from the database the second matrix corresponding to each first matrix includes: calculating the sum value of the first matrix, and reading from the database the second matrix whose sum value equals that of the first matrix. Obtaining the character indicated by the second matrix to obtain verification code information includes: reading the character indicated by the second matrix according to a mapping relationship, and forming the verification code information according to the order of the characters.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for obtaining a web crawler request is provided, including: a first acquisition module, for obtaining the verification code picture for the resource information to be crawled; a segmentation and denoising module, for segmenting and denoising the verification code picture to obtain multiple first pictures; a matrix module, for binarizing each first picture to obtain the first matrix of each first picture; a read module, for reading from a database the second matrix corresponding to each first matrix; a second acquisition module, for obtaining the character indicated by the second matrix to obtain verification code information; and a generation module, for constructing a web crawler request based on the verification code information and user information obtained in advance, where the web crawler request is used to obtain the resource information.
Further, the segmentation and denoising module includes: a segmentation submodule, for segmenting the verification code picture according to a preset width to obtain multiple second pictures; and a denoising submodule, for performing brightness denoising on each first pixel in the second picture to obtain the denoised first picture. The denoising submodule includes: a judging unit, for judging whether the brightness of the first pixel is greater than a first preset threshold; a first setting unit, for setting the gray value of the first pixel to a first value when the judgment result of the judging unit is yes; and a second setting unit, for setting the gray value of the first pixel to a second value when the judgment result of the judging unit is no.
Further, the segmentation and denoising module also includes: an acquisition submodule, connected to the segmentation submodule, for obtaining the height and width of the second picture; a first judging submodule, connected to the acquisition submodule, for using the height and width of the second picture to judge whether a next first pixel exists in the second picture; and a first reading submodule, connected to the first judging submodule, for reading the brightness of the next first pixel when the judgment result of the first judging submodule is yes.
Further, the first setting unit includes: a chroma subunit, connected to the judging unit, for obtaining, when the judgment result of the judging unit is yes, a first chroma value of the first pixel and a second chroma value of each second pixel, where the distance between the first pixel and the second pixel is less than a second preset threshold; a calculation subunit, connected to the chroma subunit, for calculating the difference between each second chroma value and the first chroma value; a statistics subunit, connected to the calculation subunit, for counting the number of differences greater than a third preset threshold; a first setting subunit, connected to the statistics subunit, for setting the gray value of the first pixel to the first value if the number of differences greater than the third preset threshold is not greater than a fourth preset threshold; and a second setting subunit, connected to the statistics subunit, for setting the gray value of the first pixel to the second value if the number of differences greater than the third preset threshold is greater than the fourth preset threshold.
Further, the matrix module includes: a second judging submodule, for judging whether the gray value of the first pixel of the first picture is the first value; a third-value submodule, for setting the binary value of the first pixel to a third value when the judgment result of the second judging submodule is yes, to obtain the two-dimensional first matrix; and a fourth-value submodule, for setting the binary value of the first pixel to a fourth value when the judgment result of the second judging submodule is no, to obtain the two-dimensional first matrix.
Further, the read module includes: a calculation submodule, for calculating the sum value of the first matrix; and a second reading submodule, for reading from the database the second matrix whose sum value equals that of the first matrix.
The second acquisition module includes: a third reading submodule, for reading the character indicated by the second matrix according to a mapping relationship; and a forming submodule, for forming the verification code information according to the order of the characters.
According to the embodiments of the invention, the verification code picture for the resource information to be crawled is obtained; the verification code picture is segmented and denoised to obtain multiple first pictures; each first picture is binarized to obtain the first matrix of each first picture; the second matrix corresponding to each first matrix is read from a database; the character indicated by the second matrix is obtained to give verification code information; and a web crawler request, used to obtain the resource information, is constructed based on the verification code information and user information obtained in advance. This solves, to a certain extent, the prior-art problem of low crawling efficiency caused by manual intervention, improves the efficiency of the web crawler, and thereby improves the user experience and makes the process more automated.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the invention. The schematic embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for obtaining a web crawler request according to an embodiment of the present invention;
Fig. 2 is a structural diagram of the device for obtaining a web crawler request according to an embodiment of the present invention;
Fig. 3 is the original verification code picture in an optional embodiment of the present invention;
Fig. 4 is a flowchart of the denoising process according to an optional embodiment of the present invention;
Fig. 5 is the verification code picture after segmentation and denoising in an optional embodiment of the present invention;
Fig. 6 is the verification code picture after denoising in an optional embodiment of the present invention;
Fig. 7 is the binarization processing flow in an optional embodiment of the present invention;
Fig. 8 is the binary matrix corresponding to the verification code picture in an optional embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another. The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the scope of protection of the invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
An embodiment of the invention provides a method for obtaining a web crawler request. Fig. 1 is a flowchart of the method for obtaining a web crawler request according to an embodiment of the present invention. As shown in Fig. 1, the flow includes the following steps:
Step S102, obtain the verification code picture for the resource information to be crawled;
Step S104, segment and denoise the verification code picture to obtain multiple first pictures;
Step S106, binarize each first picture to obtain the first matrix of each first picture;
Step S108, read from a database the second matrix corresponding to each first matrix;
Step S110, obtain the character indicated by the second matrix to obtain verification code information;
Step S112, construct a web crawler request based on the verification code information and user information obtained in advance, where the web crawler request is used to obtain the resource information.
Through the above steps, the verification code picture for the resource information to be crawled is obtained; the verification code picture is segmented and denoised to obtain multiple first pictures; each first picture is binarized to obtain the first matrix of each first picture; the second matrix corresponding to each first matrix is read from a database; the character indicated by the second matrix is obtained to give verification code information; and a web crawler request, used to obtain the resource information, is constructed based on the verification code information and user information obtained in advance. Because the verification code information is derived by segmenting, denoising and otherwise processing the verification code picture, this solves, to a certain extent, the prior-art problem of low crawling efficiency caused by manual intervention in handling verification code logins, improves the efficiency of the web crawler, and thereby improves the user experience and makes the process more automated.
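Step S112 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the form field names (`username`, `password`, `captcha`), the `User-Agent` string and the example URL are all assumptions, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_crawler_request(login_url, username, password, captcha_text, cookie=None):
    """Build (but do not send) a POST request that carries the login form."""
    # Field names are hypothetical; a real site defines its own form fields.
    form = {
        "username": username,
        "password": password,
        "captcha": captcha_text,
    }
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "example-crawler/0.1",  # placeholder value
    }
    if cookie:
        # Carry a previously returned cookie, if the caller has one.
        headers["Cookie"] = cookie
    body = urlencode(form).encode("utf-8")
    return Request(login_url, data=body, headers=headers, method="POST")


request = build_crawler_request(
    "https://example.com/login", "alice", "secret", "7k3p", cookie="sid=abc"
)
```

The recognised verification code string and the stored user information go into one form body, so no manual step is needed between downloading the picture and sending the login request.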
Segmenting and denoising the verification code picture in step S104 to obtain multiple first pictures can be implemented in several ways: for example, the picture can be segmented first and then denoised, or denoised first and then segmented. In an optional embodiment, it is implemented as follows: the verification code picture is segmented according to a preset width to obtain multiple second pictures, and brightness denoising is performed on each first pixel in the second picture to obtain the denoised first picture.
In the above optional embodiment, when the verification code picture is segmented according to a preset width to obtain multiple second pictures, the picture can be split by width using software such as PS (Photoshop), according to the layout of the digits in the verification code picture. Because verification code pictures generally do not all have the same width, the preset width is adjustable so that it can be adapted to the width of different verification code pictures.
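The fixed-width segmentation can also be done in code. The sketch below assumes the picture is held as a list of pixel rows and that the characters are evenly spaced; the function name `split_by_width` is hypothetical.

```python
def split_by_width(picture, piece_width):
    """Cut a picture (a list of pixel rows) into vertical strips of piece_width columns."""
    if piece_width <= 0:
        raise ValueError("piece_width must be positive")
    width = len(picture[0]) if picture else 0
    # Each strip keeps every row but only the columns [left, left + piece_width).
    return [
        [row[left:left + piece_width] for row in picture]
        for left in range(0, width, piece_width)
    ]


picture = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10, 11],
]
pieces = split_by_width(picture, 2)
```

If the picture width is not a multiple of `piece_width`, the last strip is simply narrower; adjusting the preset width per site, as the text suggests, avoids that case.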
In addition, in the above optional embodiment, brightness denoising of each first pixel in the second picture to obtain the denoised first picture can also be implemented in several ways, for example with mean filtering, adaptive Wiener filtering or wavelet filtering. In an optional embodiment, it is implemented as follows: judge whether the brightness of the first pixel is greater than the first preset threshold; if the brightness of the first pixel is greater than the first preset threshold, set the gray value of the first pixel to the first value; if the brightness of the first pixel is not greater than the first preset threshold, set the gray value of the first pixel to the second value. The first preset threshold is a brightness value whose size can be adjusted. For example, the first preset threshold can be set to 70: the gray value of every pixel whose brightness is greater than 70 is set to 255, and the gray value of every pixel whose brightness is not greater than 70 is set to 0. The denoised verification code picture is thus obtained from the result of comparing each pixel with the first preset threshold.
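The threshold rule can be sketched as follows. The text does not say how brightness is computed, so this sketch assumes RGB pixels and integer BT.601 luma weights; the function names are hypothetical, and the defaults 70, 255 and 0 are the example values from the paragraph above.

```python
def brightness(rgb):
    """Integer luma of an (R, G, B) pixel; the weighting is an assumption."""
    r, g, b = rgb
    return (299 * r + 587 * g + 114 * b) // 1000


def threshold_picture(picture, threshold=70, first_value=255, second_value=0):
    """Map each pixel to first_value if brighter than threshold, else second_value."""
    return [
        [first_value if brightness(px) > threshold else second_value for px in row]
        for row in picture
    ]


picture = [
    [(255, 255, 255), (0, 0, 0)],
    [(70, 70, 70), (71, 71, 71)],
]
gray = threshold_picture(picture)
```

A pixel whose brightness equals the threshold falls into the "not greater than" branch and receives the second value.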
Before brightness denoising is performed on each first pixel in the second picture to obtain the denoised first picture, in an optional embodiment the method for obtaining a web crawler request can also include: obtaining the height and width of the second picture; using the height and width of the second picture to judge whether a next first pixel exists in the second picture; and, if a next first pixel exists in the second picture, reading the brightness of that next first pixel. In this way, all pixels of each second picture are traversed and processed according to the height and width of that picture, giving the denoised picture.
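A minimal sketch of this traversal, assuming the picture is a list of rows and that pixels are visited row by row (the text does not fix the order); `iter_pixels` is a hypothetical name.

```python
def iter_pixels(picture):
    """Yield (x, y, value) for every pixel, using height and width to detect the end."""
    height = len(picture)
    width = len(picture[0]) if height else 0
    x, y = 0, 0
    # A next pixel exists as long as the cursor is still inside the picture.
    while width > 0 and y < height:
        yield x, y, picture[y][x]
        x += 1
        if x == width:
            # End of the row: continue with the first pixel of the next row.
            x, y = 0, y + 1


visited = list(iter_pixels([[1, 2], [3, 4]]))
```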
Besides the verification code digits, a verification code picture generally also contains some discrete points. In an optional embodiment, setting the gray value of the first pixel to the first value if the brightness of the first pixel is greater than the first preset threshold includes: if the brightness of the first pixel is greater than the first preset threshold, obtaining the first chroma value of the first pixel and the second chroma value of each second pixel, where the distance between the first pixel and the second pixel is less than the second preset threshold; calculating the difference between each second chroma value and the first chroma value; counting the number of differences greater than the third preset threshold; if the number of differences greater than the third preset threshold is not greater than the fourth preset threshold, setting the gray value of the first pixel to the first value; and if the number of differences greater than the third preset threshold is greater than the fourth preset threshold, setting the gray value of the first pixel to the second value.
For example, for discrete pixels that have the same color as the digits, the color difference between the pixel and each of its 8 adjacent pixels is compared. Each time the color difference exceeds the set value, the count of adjacent pixels whose color difference exceeds the set value is incremented by 1. When more than 6 adjacent pixels have a large color difference, the pixel differs markedly from the pixels around it and is a discrete point. By traversing every pixel in each second picture, a denoised picture is obtained.
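The neighbor comparison can be sketched as follows. The sketch assumes one numeric chroma value per pixel and an 8-neighborhood; the color-difference limit of 30 is an arbitrary placeholder, while the count limit of 6 and the gray values 255 and 0 come from the examples in the text. The function name is hypothetical.

```python
def remove_isolated_points(chroma, gray, diff_threshold=30, count_threshold=6,
                           first_value=255, second_value=0):
    """Reset pixels whose chroma differs from more than count_threshold neighbors."""
    height = len(chroma)
    width = len(chroma[0]) if height else 0
    result = [row[:] for row in gray]
    for y in range(height):
        for x in range(width):
            if gray[y][x] != first_value:
                continue  # only pixels that passed the brightness test are re-checked
            differing = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if (dy or dx) and inside:
                        if abs(chroma[ny][nx] - chroma[y][x]) > diff_threshold:
                            differing += 1
            if differing > count_threshold:
                result[y][x] = second_value  # treated as a discrete point
    return result


chroma = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
gray = [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
cleaned = remove_isolated_points(chroma, gray)
```

Pixels on the border have fewer than 8 neighbors, so with a count limit of 6 they are never classified as discrete points in this sketch; the text does not say how borders should be handled.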
After the denoised picture is obtained, in an optional embodiment, binarizing each first picture in step S106 to obtain the first matrix of each first picture can include: if the gray value of the first pixel of the first picture is the first value, setting the binary value of the first pixel to the third value; if the gray value of the first pixel of the first picture is the second value, setting the binary value of the first pixel to the fourth value; and thereby obtaining the two-dimensional first matrix.
For example, every pixel of the picture generated by denoising is traversed, and the gray value of each pixel is judged and a value set accordingly. The value of a pixel with a small gray value can be set to 1 and the value of a pixel with a large gray value set to 0, which generates a 0/1 matrix. The 0/1 matrix corresponding to the verification code picture, that is, the first matrix, is thus obtained.
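A minimal sketch of this binarization, assuming the denoised picture holds only the gray values 255 and 0 from the earlier example, so that 255 maps to 0 and everything else maps to 1; `to_binary_matrix` is a hypothetical name.

```python
def to_binary_matrix(gray_picture, first_value=255, third_value=0, fourth_value=1):
    """Turn a denoised gray picture into a 0/1 matrix (large gray -> 0, small gray -> 1)."""
    return [
        [third_value if g == first_value else fourth_value for g in row]
        for row in gray_picture
    ]


first_matrix = to_binary_matrix([[255, 0, 255], [0, 0, 255]])
# Printing one row per line gives a quick visual check of the character shape.
rendered = "\n".join("".join(str(v) for v in row) for row in first_matrix)
```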
Given the first matrix, reading from the database the second matrix corresponding to each first matrix can be implemented in several ways, for example by comparing the value at each position of the first matrix one by one. An optional embodiment includes: calculating the sum value of the first matrix, and reading from the database the second matrix whose sum value equals that of the first matrix. Obtaining the character indicated by the second matrix to obtain verification code information can include: reading the character indicated by the second matrix according to a mapping relationship, and forming the verification code information according to the order of the characters.
The sum value of the first matrix is compared with the sum value of each matrix stored in advance in the database to find the matching second matrix, and the verification code information is then obtained from the character that the database associates with that second matrix. Because each second picture corresponds to one first matrix, once the second matrix and its character have been obtained for each of them, combining the characters in order gives the verification code information.
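The sum-value lookup can be sketched as follows. An in-memory list of (matrix, character) pairs stands in for the database, and the sketch assumes that no two stored matrices share a sum value, which the text does not guarantee. All names are hypothetical.

```python
def matrix_sum(matrix):
    """Sum of all entries of a 0/1 matrix."""
    return sum(sum(row) for row in matrix)


def recognise(first_matrices, templates):
    """Match each first matrix to a stored template by sum value and join the characters."""
    # templates: list of (second_matrix, character) pairs standing in for the database.
    by_sum = {matrix_sum(matrix): char for matrix, char in templates}
    chars = []
    for matrix in first_matrices:
        total = matrix_sum(matrix)
        if total not in by_sum:
            raise LookupError(f"no stored matrix with sum value {total}")
        chars.append(by_sum[total])
    # The characters are combined in the order of the segmented pictures.
    return "".join(chars)


templates = [
    ([[1, 1], [1, 0]], "7"),
    ([[1, 0], [0, 0]], "1"),
]
code = recognise([[[1, 0], [0, 0]], [[1, 1], [1, 0]]], templates)
```

Matching on a single sum is cheap but can confuse characters that happen to have the same number of set pixels; the position-by-position comparison mentioned in the text is the stricter alternative.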
An embodiment also provides a device corresponding to the method in the above embodiments; what has already been explained is not repeated here. The modules or units in the device may be code that is stored in a memory and can be run by a processor. The memory and the processor may be located in a server, but this is not limiting, and the device may also be implemented in other ways, which are not enumerated here.
Fig. 2 is a structural diagram of the device for obtaining a web crawler request according to an embodiment of the present invention. As shown in Fig. 2, the device includes:
a first acquisition module 202, for obtaining the verification code picture for the resource information to be crawled;
a segmentation and denoising module 204, for segmenting and denoising the verification code picture to obtain multiple first pictures;
a matrix module 206, for binarizing each first picture to obtain the first matrix of each first picture;
a read module 208, for reading from a database the second matrix corresponding to each first matrix;
a second acquisition module 210, for obtaining the character indicated by the second matrix to obtain verification code information;
a generation module 212, for constructing a web crawler request based on the verification code information and user information obtained in advance, where the web crawler request is used to obtain the resource information.
With the above modules, the first acquisition module 202 obtains the verification code picture for the resource information to be crawled; the segmentation and denoising module 204 segments and denoises the verification code picture to obtain multiple first pictures; the matrix module 206 binarizes each first picture to obtain the first matrix of each first picture; the read module 208 reads from the database the second matrix corresponding to each first matrix; the second acquisition module 210 obtains the character indicated by the second matrix to give verification code information; and the generation module 212 constructs a web crawler request, used to obtain the resource information, based on the verification code information and user information obtained in advance. Because the segmentation and denoising module 204 segments, denoises and otherwise processes the verification code picture so that the verification code information is finally obtained, this solves, to a certain extent, the prior-art problem of low crawling efficiency caused by manual intervention in handling verification code logins, improves the efficiency of the web crawler, and thereby improves the user experience and makes the process more automated.
The segmentation and denoising module 204 can be implemented in several ways. In an optional embodiment, it can include: a segmentation submodule, for segmenting the verification code picture according to a preset width to obtain multiple second pictures; and a denoising submodule, for performing brightness denoising on each first pixel in the second picture to obtain the denoised first picture.
In the above optional embodiment, when the segmentation submodule segments the verification code picture according to a preset width to obtain multiple second pictures, the picture can be split by width using software such as PS (Photoshop), according to the layout of the digits in the verification code picture. Because verification code pictures generally do not all have the same width, the preset width is adjustable so that it can be adapted to the width of different verification code pictures.
In addition, in the above optional embodiment, the brightness denoising that the denoising submodule performs on each first pixel in the second picture to obtain the denoised first picture can also be implemented in several ways, for example with a mean filter, an adaptive Wiener filter or a wavelet filter. In an optional embodiment, it is implemented as follows: a judging unit, for judging whether the brightness of the first pixel is greater than the first preset threshold; a first setting unit, for setting the gray value of the first pixel to the first value when the judgment result of the judging unit is yes; and a second setting unit, for setting the gray value of the first pixel to the second value when the judgment result of the judging unit is no. The first preset threshold is a brightness value whose size can be adjusted. For example, the first preset threshold can be set to 70: the gray value of every pixel whose brightness is greater than 70 is set to 255, and the gray value of every pixel whose brightness is not greater than 70 is set to 0. The denoised verification code picture is thus obtained from the result of comparing each pixel with the first preset threshold.
In an optional embodiment, obtaining segmentation noise reduction module also includes:Acquisition submodule, it is connected to segmentation submodule Block, for obtaining the height and width of the second picture;First judging submodule, is connected to acquisition submodule, for using The height and width of second picture judge to whether there is next first pixel on the second picture;First reads submodule Block, the first judging submodule is connected to, it is next for reading in the case where the judged result of the first judging submodule is to be The brightness of first pixel.So as to the height and width according to each second picture, by whole pictures of each second picture Vegetarian refreshments is all traveled through, handled, and obtains the picture after noise reduction.
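The thresholding and traversal described above can be sketched as follows. The sketch assumes the brightness of every pixel is already available as a grid of numbers; the function name `denoise_by_brightness` and its parameter names are hypothetical, while the defaults 70, 255, and 0 are the example values from the text.

```python
def denoise_by_brightness(brightness, threshold=70, first_value=255, second_value=0):
    """Map each pixel's brightness to one of two gray values.

    brightness is a list of rows; the height and width are read from it
    and used to visit every pixel exactly once.
    """
    height = len(brightness)
    width = len(brightness[0]) if height else 0
    gray = [[second_value] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Brighter than the threshold: first value; otherwise second value.
            if brightness[i][j] > threshold:
                gray[i][j] = first_value
    return gray


sample = [
    [200, 30, 71],
    [70, 10, 255],
]
result = denoise_by_brightness(sample)
```

A pixel whose brightness equals the threshold receives the second value, matching the "greater than" test of the judging unit.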
Besides the verification code digits, a verification code picture usually also contains some discrete points. In an optional embodiment, the first setting unit includes: a chroma subunit, connected to the judging unit and configured to obtain, when the judgment result of the judging unit is yes, a first chroma value of the first pixel and second chroma values of second pixels, where the distance between the first pixel and each second pixel is less than a second preset threshold; a calculation subunit, connected to the chroma subunit and configured to calculate the difference between each second chroma value and the first chroma value; a statistics subunit, connected to the calculation subunit and configured to count the number of differences greater than a third preset threshold; a first setting subunit, connected to the statistics subunit and configured to set the gray value of the first pixel to the first value if the number of differences greater than the third preset threshold is not greater than a fourth preset threshold; and a second setting subunit, connected to the statistics subunit and configured to set the gray value of the first pixel to the second value if the number of differences greater than the third preset threshold is greater than the fourth preset threshold.
For example, for a discrete pixel that has the same color as the digits, the chroma subunit obtains the chroma values of that point and of its 8 adjacent pixels, the calculation subunit calculates the color difference between the point and each of the 8 adjacent pixels, and the statistics subunit tallies the comparison results: each adjacent pixel whose color difference exceeds the set value adds 1 to the count. When more than 6 adjacent pixels have a large color difference, the point differs markedly from its surroundings and is a discrete point, and the second setting subunit sets its gray value to the second value, for example 255. Traversing every pixel of every second picture in this way yields the picture after noise reduction.
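The neighbor comparison can be illustrated with a short sketch. Several points here are assumptions rather than details from the text: chroma is represented as a single number per pixel, neighbors that fall outside the picture are skipped, and the names `count_differing_neighbours`, `is_discrete_point`, and the default `diff_threshold=50` are hypothetical. The count threshold of 6 comes from the example above.

```python
def count_differing_neighbours(chroma, i, j, diff_threshold):
    """Count the 8-neighbours of (i, j) whose chroma differs by more than diff_threshold."""
    height, width = len(chroma), len(chroma[0])
    count = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # skip the pixel itself
            ni, nj = i + di, j + dj
            # Assumption: neighbours outside the picture are ignored.
            if 0 <= ni < height and 0 <= nj < width:
                if abs(chroma[ni][nj] - chroma[i][j]) > diff_threshold:
                    count += 1
    return count


def is_discrete_point(chroma, i, j, diff_threshold=50, count_threshold=6):
    """A pixel is discrete when more than count_threshold neighbours differ strongly."""
    return count_differing_neighbours(chroma, i, j, diff_threshold) > count_threshold


isolated = [
    [200, 200, 200],
    [200, 20, 200],
    [200, 200, 200],
]
stroke = [
    [200, 20, 200],
    [200, 20, 200],
    [200, 20, 200],
]
isolated_flag = is_discrete_point(isolated, 1, 1)  # all 8 neighbours differ
stroke_flag = is_discrete_point(stroke, 1, 1)      # only 6 neighbours differ
```

In the second grid the centre pixel belongs to a vertical stroke, so two of its neighbors share its chroma and it is not treated as a discrete point.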
The matrix module 206 processes the picture after noise reduction. In an optional embodiment, the matrix module 206 may include: a second judging submodule, configured to judge whether the gray value of the first pixel of the first picture is the first value; a third value submodule, configured to set the two-dimensional value of the first pixel to a third value when the judgment result of the second judging submodule is yes, to obtain the two-dimensional first matrix; and a fourth value submodule, configured to set the two-dimensional value of the first pixel to a fourth value when the judgment result of the second judging submodule is no, to obtain the two-dimensional first matrix.
For example, the matrix module 206 traverses each pixel of the picture generated after noise reduction, the second judging submodule judges the gray value of each pixel, and the third value submodule or the fourth value submodule sets the corresponding value. For example, the third value submodule can set the value of a pixel with a small gray value to 1, and the fourth value submodule can set the value of a pixel with a large gray value to 0, thereby obtaining the 0/1 matrix corresponding to the verification code picture, that is, the first matrix.
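A minimal sketch of this conversion, following the example in which dark pixels become 1 and light pixels become 0. It assumes the denoised picture contains only the two gray values 0 and 255; the function name `to_binary_matrix` and its parameters are hypothetical.

```python
def to_binary_matrix(gray, dark_value=0, third_value=1, fourth_value=0):
    """Turn a two-level gray picture into a 0/1 matrix.

    Pixels equal to dark_value get third_value (1 by default);
    all other pixels get fourth_value (0 by default).
    """
    return [
        [third_value if g == dark_value else fourth_value for g in row]
        for row in gray
    ]


denoised = [
    [255, 0, 255],
    [0, 0, 255],
]
first_matrix = to_binary_matrix(denoised)
```

Swapping `third_value` and `fourth_value` gives the opposite convention, in which white pixels are 1 and black pixels are 0.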
From the first matrix thus obtained, the verification code information can be obtained using the read module 208 and the second acquisition module 210. In an optional embodiment, the read module 208 includes: a calculation submodule, configured to calculate the sum of the first matrix; and a second reading submodule, configured to read from the database the second matrix whose sum is equal to the sum of the first matrix. The second acquisition module 210 includes: a third reading submodule, configured to read, according to a mapping relation, the character indicated by the second matrix; and a composition submodule, configured to compose the verification code information according to the order of the characters.
The calculation submodule in the read module 208 compares the sum of the first matrix with the sum of each matrix prestored in the database, the second reading submodule obtains the matching second matrix, the third reading submodule reads the character corresponding to that second matrix in the database, and the composition submodule produces the verification code information. Because each second picture corresponds to one first matrix, once every second matrix and its corresponding character have been obtained, combining the characters in order gives the verification code information.
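The sum-based lookup can be sketched as below. This is an illustration under stated assumptions: the "sum" is taken to be the sum of all matrix elements, an in-memory dictionary stands in for the database and the mapping relation, and the names `matrix_sum`, `build_template_index`, and `recognise` are hypothetical. The text does not say what happens when two templates share a sum; the sketch simply refuses to build such an index.

```python
def matrix_sum(matrix):
    """Sum of all elements of a 0/1 matrix."""
    return sum(sum(row) for row in matrix)


def build_template_index(templates):
    """Build {sum: character} from (character, matrix) pairs.

    Assumption: a dict replaces the database. Colliding sums are rejected
    because a sum alone cannot tell two different templates apart.
    """
    index = {}
    for character, matrix in templates:
        total = matrix_sum(matrix)
        if total in index and index[total] != character:
            raise ValueError(f"sum {total} is shared by {index[total]!r} and {character!r}")
        index[total] = character
    return index


def recognise(first_matrices, index, unknown="?"):
    """Look up each first matrix by its sum and join the characters in order."""
    return "".join(index.get(matrix_sum(m), unknown) for m in first_matrices)


one = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]      # sum 3
seven = [[1, 1, 1], [0, 0, 1], [0, 0, 1]]    # sum 5
index = build_template_index([("1", one), ("7", seven)])
code = recognise([seven, one, seven], index)
```

Because the lookup key is a single number, matching costs one dictionary access per character, which is consistent with the stated aim of keeping processing and storage small.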
The method for acquiring a web crawler request of the present invention is further described below with reference to a specific implementation environment. In this environment, the method is similar to that of Fig. 1.
This method includes:
(1) Collect picture samples: using software such as PS, split the picture by width according to the composition features of the verification code digits. Adjust the threshold of the image and, after the interfering pixels are eliminated, use that threshold as the input to noise reduction;
Fig. 3 shows an original verification code picture according to an optional embodiment of the present invention. As can be seen from Fig. 3, the verification code picture contains not only the verification code information but also interference information, such as background color and discrete points.
(2) Noise reduction: a noise reduction method is provided as shown in Fig. 4; its flow includes the following steps:
Step S402, download the picture.
Step S404, set the brightness threshold.
Step S406, obtain the picture width w and height h.
Step S408, take the next pixel.
Step S410, judge whether the pixel has been read.
If the pixel has been read, step S412 is performed; if it has not yet been read, the flow returns to step S408. After all pixels have been read, step S418 is performed.
Step S412, judge whether the brightness of pixel (i, j) is greater than the threshold.
If the brightness of pixel (i, j) is greater than the threshold, step S414 is performed; if the brightness of pixel (i, j) is not greater than the threshold, step S416 is performed.
Step S414, set gray value = 255.
Step S416, set gray value = 0.
Step S418, processing ends.
Specifically, the picture is first denoised. A threshold is set, the height and width of the segmented picture are obtained, and every pixel of the picture is traversed according to the height and width. The brightness of each pixel is obtained; if it is greater than the set threshold, the pixel is marked 255 (white), and if it is less than the set threshold, the pixel is marked 0 (black). For pixels that are discrete but have the same color as the digits, the color difference between the point and each of its 8 adjacent pixels is compared; each difference greater than the set value adds 1 to a count, and when more than 6 adjacent pixels have a large color difference, the point is a discrete point. This yields the picture after noise reduction.
As shown in Fig. 5, the picture after segmentation and noise reduction is more recognizable than the original picture, which increases the accuracy of the obtained verification code.
In (2), noise reduction can also be performed first, followed by segmentation. Fig. 6 shows a verification code picture after noise reduction according to an optional embodiment of the present invention. As shown in Fig. 6, the verification code in the picture after noise reduction is more recognizable than in the original picture, which increases the accuracy of the obtained verification code; and in the picture after both segmentation and noise reduction shown in Fig. 5, the verification code is more recognizable still than in the picture after noise reduction alone.
(3) Binarization: Fig. 7 shows the binarization flow according to an optional embodiment of the present invention. As shown in Fig. 7, the flow may include the following steps:
Step S702, take the picture after noise reduction and segmentation.
Step S704, construct a two-dimensional table.
Step S706, judge whether the gray value of the pixel is equal to 0.
If the gray value of the pixel is not equal to 0, step S708 is performed; if the gray value of the pixel is equal to 0, step S710 is performed.
Step S708, set the corresponding position in the two-dimensional table to 1.
Step S710, judge whether 6 of the 8 adjacent pixels differ greatly in color.
If 6 of the 8 adjacent pixels differ greatly in color, step S714 is performed; if fewer than 6 of the 8 adjacent pixels differ greatly in color, step S712 is performed.
Step S712, set the corresponding position in the two-dimensional table to 1.
Step S714, set the corresponding position in the two-dimensional table to 0.
Specifically, each pixel of the picture generated after noise reduction is traversed. White pixels obtained in step S304 are set to 1 and black pixels are set to 0, generating a 0/1 matrix.
Fig. 8 shows the binary matrix corresponding to the verification code picture according to an optional embodiment of the present invention, that is, the first matrix. Step S306 can be merged with the noise reduction of step S304.
(4) Template matching: a random verification code is put through the same noise reduction processing and its sum is calculated; this sum is compared with the sums of the template 0/1 matrices to recognize the verification code;
(5) Construct the request information: once the verification code has been generated, construct the request information from the password and the verification code;
(6) The crawler sends the request information.
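Step (5) can be illustrated with Python's standard library. The sketch only builds a request object and does not send it. The form field names (`username`, `password`, `captcha`), the use of a form-encoded POST, and the example URL are all assumptions; the text does not specify the request format.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_crawler_request(url, username, password, captcha):
    """Build (but do not send) a form-encoded POST request.

    The field names are hypothetical; a real site defines its own.
    """
    body = urlencode(
        {"username": username, "password": password, "captcha": captcha}
    ).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_crawler_request("https://example.com/login", "alice", "secret", "7A3K")
```

Sending would be a separate step, for example passing the object to `urllib.request.urlopen`, which corresponds to step (6).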
In this optional embodiment, noise reduction filtering is applied to the original verification code picture to obtain the first matrix corresponding to the verification code picture, which eliminates manual intervention to a certain extent and enables verification code information to be acquired intelligently. At the same time, obtaining the verification code information by associating the first matrix with a pre-established matrix library reduces processor and storage usage and improves recognition performance.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should know, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device can be implemented in other ways. For example, the device embodiments described above are merely schematic. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed can be indirect coupling or communication connection through some interfaces, devices, or units, and can be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention can be integrated in one processing unit, each unit can exist physically on its own, or two or more units can be integrated in one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which can be a personal computer, mobile terminal, server, network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), removable hard disk, magnetic disk, or optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A method for acquiring a web crawler request, characterized by comprising:
    obtaining a verification code picture of resource information to be crawled;
    performing segmentation and noise reduction on the verification code picture to obtain multiple first pictures;
    performing binarization on each first picture to obtain a first matrix of each first picture;
    reading, from a database, a second matrix corresponding to each first matrix;
    obtaining the character indicated by the second matrix to obtain verification code information;
    generating a web crawler request based on the verification code information and user information obtained in advance, wherein the web crawler request is used to obtain the resource information;
    wherein reading, from the database, the second matrix corresponding to each first matrix comprises: calculating the sum of the first matrix, and reading from the database the second matrix whose sum is equal to the sum of the first matrix;
    and wherein obtaining the character indicated by the second matrix to obtain the verification code information comprises: reading, according to a mapping relation, the character indicated by the second matrix, and composing the verification code information according to the order of the characters.
  2. The method according to claim 1, characterized in that performing segmentation and noise reduction on the verification code picture to obtain multiple first pictures comprises:
    splitting the verification code picture according to a predetermined width to obtain multiple second pictures;
    performing brightness noise reduction on each first pixel in the second picture to obtain the first picture after noise reduction,
    wherein performing brightness noise reduction on each first pixel in the second picture to obtain the first picture after noise reduction comprises:
    judging whether the brightness of the first pixel is greater than a first preset threshold;
    if the brightness of the first pixel is greater than the first preset threshold, setting the gray value of the first pixel to a first value;
    if the brightness of the first pixel is not greater than the first preset threshold, setting the gray value of the first pixel to a second value.
  3. The method according to claim 2, characterized in that, before performing brightness noise reduction on each first pixel in the second picture to obtain the first picture after noise reduction, the method comprises:
    obtaining the height and width of the second picture;
    using the height and width of the second picture to judge whether a next first pixel exists on the second picture;
    if the next first pixel exists on the second picture, reading the brightness of the next first pixel.
  4. The method according to claim 2, characterized in that, if the brightness of the first pixel is greater than the first preset threshold, setting the gray value of the first pixel to the first value comprises:
    if the brightness of the first pixel is greater than the first preset threshold, obtaining a first chroma value of the first pixel and a second chroma value of a second pixel, wherein the distance between the first pixel and the second pixel is less than a second preset threshold;
    calculating the difference between each second chroma value and the first chroma value;
    counting the number of differences greater than a third preset threshold;
    if the number of differences greater than the third preset threshold is not greater than a fourth preset threshold, setting the gray value of the first pixel to the first value;
    if the number of differences greater than the third preset threshold is greater than the fourth preset threshold, setting the gray value of the first pixel to the second value.
  5. The method according to any one of claims 2 to 4, characterized in that performing binarization on each first picture to obtain the first matrix of each first picture comprises:
    if the gray value of the first pixel of the first picture is the first value, setting the two-dimensional value of the first pixel to a third value; if the gray value of the first pixel of the first picture is the second value, setting the two-dimensional value of the first pixel to a fourth value; and obtaining the two-dimensional first matrix.
  6. A device for acquiring a web crawler request, characterized by comprising:
    a first acquisition module, configured to obtain a verification code picture of resource information to be crawled;
    a segmentation and noise reduction module, configured to perform segmentation and noise reduction on the verification code picture to obtain multiple first pictures;
    a matrix module, configured to perform binarization on each first picture to obtain a first matrix of each first picture;
    a read module, configured to read, from a database, a second matrix corresponding to each first matrix;
    a second acquisition module, configured to obtain the character indicated by the second matrix to obtain verification code information;
    a generation module, configured to generate a web crawler request based on the verification code information and user information obtained in advance, wherein the web crawler request is used to obtain the resource information;
    wherein the read module comprises: a calculation submodule, configured to calculate the sum of the first matrix; and a second reading submodule, configured to read from the database the second matrix whose sum is equal to the sum of the first matrix;
    and the second acquisition module comprises: a third reading submodule, configured to read, according to a mapping relation, the character indicated by the second matrix; and a composition submodule, configured to compose the verification code information according to the order of the characters.
  7. The device according to claim 6, characterized in that the segmentation and noise reduction module comprises:
    a segmentation submodule, configured to split the verification code picture according to a predetermined width to obtain multiple second pictures;
    a noise reduction submodule, configured to perform brightness noise reduction on each first pixel in the second picture to obtain the first picture after noise reduction,
    wherein the noise reduction submodule comprises:
    a judging unit, configured to judge whether the brightness of the first pixel is greater than a first preset threshold;
    a first setting unit, configured to set the gray value of the first pixel to a first value when the judgment result of the judging unit is yes;
    a second setting unit, configured to set the gray value of the first pixel to a second value when the judgment result of the judging unit is no.
  8. The device according to claim 7, characterized in that the segmentation and noise reduction module further comprises:
    an acquisition submodule, connected to the segmentation submodule and configured to obtain the height and width of the second picture;
    a first judging submodule, connected to the acquisition submodule and configured to use the height and width of the second picture to judge whether a next first pixel exists on the second picture;
    a first reading submodule, connected to the first judging submodule and configured to read the brightness of the next first pixel when the judgment result of the first judging submodule is yes.
  9. The device according to claim 7, characterized in that the first setting unit comprises:
    a chroma subunit, connected to the judging unit and configured to obtain, when the judgment result of the judging unit is yes, a first chroma value of the first pixel and a second chroma value of a second pixel, wherein the distance between the first pixel and the second pixel is less than a second preset threshold;
    a calculation subunit, connected to the chroma subunit and configured to calculate the difference between each second chroma value and the first chroma value;
    a statistics subunit, connected to the calculation subunit and configured to count the number of differences greater than a third preset threshold;
    a first setting subunit, connected to the statistics subunit and configured to set the gray value of the first pixel to the first value if the number of differences greater than the third preset threshold is not greater than a fourth preset threshold;
    a second setting subunit, connected to the statistics subunit and configured to set the gray value of the first pixel to the second value if the number of differences greater than the third preset threshold is greater than the fourth preset threshold.
  10. The device according to any one of claims 7 to 9, characterized in that the matrix module comprises:
    a second judging submodule, configured to judge whether the gray value of the first pixel of the first picture is the first value;
    a third value submodule, configured to set the two-dimensional value of the first pixel to a third value when the judgment result of the second judging submodule is yes, to obtain the two-dimensional first matrix;
    a fourth value submodule, configured to set the two-dimensional value of the first pixel to a fourth value when the judgment result of the second judging submodule is no, to obtain the two-dimensional first matrix.
CN201410779511.8A 2014-12-15 2014-12-15 The acquisition methods and device of web crawlers request Active CN104408194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410779511.8A CN104408194B (en) 2014-12-15 2014-12-15 The acquisition methods and device of web crawlers request

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410779511.8A CN104408194B (en) 2014-12-15 2014-12-15 The acquisition methods and device of web crawlers request

Publications (2)

Publication Number Publication Date
CN104408194A CN104408194A (en) 2015-03-11
CN104408194B true CN104408194B (en) 2017-11-21

Family

ID=52645825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410779511.8A Active CN104408194B (en) 2014-12-15 2014-12-15 The acquisition methods and device of web crawlers request

Country Status (1)

Country Link
CN (1) CN104408194B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663672A (en) * 2012-05-03 2012-09-12 杭州朗和科技有限公司 Picture verification code generation method and device
CN102930277A (en) * 2012-09-19 2013-02-13 上海珍岛信息技术有限公司 Character picture verification code identifying method based on identification feedback

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8737745B2 (en) * 2012-03-27 2014-05-27 The Nielsen Company (Us), Llc Scene-based people metering for audience measurement
US9147104B2 (en) * 2012-11-05 2015-09-29 The United States Of America As Represented By The Secretary Of The Air Force Systems and methods for processing low contrast images

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
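Research on Verification Code Recognition Based on Self-Organizing Maps (基于自组织映射的验证码识别研究); Liu Li (刘莉); China Master's Theses Full-text Database, Information Science and Technology; 2014-06-15 (No. 6); pp. 13-38 *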

Also Published As

Publication number Publication date
CN104408194A (en) 2015-03-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Acquisition method and device of web crawler request

Effective date of registration: 20190531

Granted publication date: 20171121

Pledgee: Shenzhen Black Horse World Investment Consulting Co., Ltd.

Pledgor: Beijing Guoshuang Technology Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.