CN102737122A - Method for extracting verification code image from webpage - Google Patents
Method for extracting verification code image from webpage
- Publication number
- CN102737122A
- Authority
- CN
- China
- Prior art keywords
- verification code
- picture
- img
- code picture
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention provides a method for extracting a verification code image from a webpage. A verification code image has no fixed link on the webpage, is generated randomly, and may change its content when it is refreshed or saved, so extracting it is a key problem for the many software applications that need verification code images. The method extracts the verification code image from the webpage using the cursor position, the position of the verification code input box, the image position, the image size, visual and content features of the image, image keywords, the image aspect ratio, and other information.
Description
Technical field
The present invention relates to the field of verification code image recognition, and in particular to a method for extracting a verification code image from a webpage.
Background art
A verification code (CAPTCHA) is a completely automated public program that distinguishes whether a user is a computer or a human. It guards against malicious password cracking, vote rigging, and forum spamming, and it effectively stops a hacker from using a program to make repeated brute-force login attempts against a particular registered user. Verification codes are in common use on many websites today.
A verification code is a randomly generated string of digits or symbols rendered as an image, with interference added to the image, for example a few random straight lines or dots, to defeat OCR. The user reads the verification code in the image by eye, enters it in a form, and submits it to the website for verification; a given function can be used only after verification succeeds. Many software applications now need to extract the verification code image from a webpage. Because the verification code image has no fixed link in the webpage and is generated randomly, refreshing or saving it can change its content. Extracting the verification code image is therefore a key difficulty for many software applications that need it (for example, image verification code service software for blind users).
Summary of the invention
The invention provides a method for extracting a verification code image from a webpage. The verification code image is extracted using information such as the position of the verification code input box in the webpage, the image position, the image size, visual and content features of the image, and image keywords. The invention offers convenience to the many software applications that need to extract verification code images from webpages.
The method for extracting a verification code image from a webpage comprises the following steps:
1) obtain the information of all IMG nodes of the browser's currently active page;
2) score the image information contained in each IMG node according to a predefined verification code image scoring strategy; the IMG node with the highest score is the one that contains the verification code image;
3) if step 2) cannot obtain all the IMG nodes, capture a partial image, centred on the verification code input box, of the surrounding area that includes the verification code image, and use a pre-trained classification and detection model to obtain the exact position of the verification code image;
4) save the verification code image separately.
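The control flow of the four steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `locate_captcha`, `score`, and `detect_near_input` are hypothetical, and rectangles are assumed to be `(left, top, width, height)` tuples in page pixels.

```python
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (left, top, width, height)


def locate_captcha(
    img_boxes: Optional[Sequence[Box]],
    score: Callable[[Box], float],
    detect_near_input: Callable[[], Optional[Box]],
) -> Optional[Box]:
    """Return the rectangle believed to contain the verification code image."""
    if img_boxes:
        # Steps 1-2: the IMG nodes were readable, so pick the best-scoring one.
        return max(img_boxes, key=score)
    # Step 3: fall back to detection in a screenshot around the input box.
    return detect_near_input()
```

Step 4 (saving) would then operate on the returned rectangle. Passing the scoring and detection routines in as callables keeps the two branches independent of each other.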
2. Obtaining the information of all IMG nodes of the browser's currently active page comprises the following specific steps:
1) determine the browser's currently active page;
2) traverse the active page from top to bottom and obtain the information of all its IMG nodes; the IMG node information includes the image position, image size, image width and height, image keywords, and similar information.
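As a rough illustration of the top-down collection of IMG nodes, the sketch below walks static HTML with Python's standard `html.parser` and records each `<img>` tag's attributes in document order. It is an assumption-laden stand-in: the method described here reads the nodes from a live browser page, which also yields rendered positions that static HTML does not provide. The names `ImgCollector` and `collect_img_nodes` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImgCollector(HTMLParser):
    """Collect the attributes of every <img> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            # attrs is a list of (name, value) pairs; value may be None.
            self.images.append({name: value or "" for name, value in attrs})


def collect_img_nodes(html: str) -> List[Dict[str, str]]:
    """Return one attribute dictionary per IMG node found in the markup."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

The attribute dictionaries (for example `id`, `src`, `alt`, `width`, `height`) correspond to the keyword and size information; position would have to come from the browser's layout data.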
3. Scoring the image information contained in each IMG node according to the predefined verification code image scoring strategy, where the IMG node with the highest score is the one that contains the verification code image, comprises the following specific steps:
Obtain the information of all IMG nodes of the browser's currently active page and score it with the predefined verification code scoring strategy; the IMG node with the highest score is the IMG node where the verification code image is located.
4. Capturing a partial image, centred on the verification code input box, of the surrounding area that includes the verification code image, and using the pre-trained classification and detection model to obtain the exact position of the verification code image, comprises the following specific steps:
1) If not all IMG nodes of the active page can be obtained, the IMG node where the verification code image is located may be missed. In that case, a partial image of the surrounding area that includes the verification code image can be captured, centred on the verification code input box.
2) Process the partial image: based on the colour, texture, and gradient features of verification code images, use the verification code classifier model to identify the verification code in the partial image and turn it into a standalone verification code image.
5. Saving the verification code image separately comprises the following specific steps:
Because of the special nature of a verification code image, operating on it may change it, so a special way of saving the image is required. If all IMG nodes can be obtained, the verification code scoring strategy selects the IMG node where the verification code image is located, and an exact screenshot is taken according to the image position information in that node to obtain the verification code image. Otherwise, a partial image of the surrounding area that includes the verification code image is captured, centred on the verification code input box, and the verification code classification model is used to crop the rectangular region where the verification code image is located, yielding the verification code image.
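Once a rectangle is known, whether from an IMG node's position or from the classifier, saving can be pictured as cropping that rectangle out of a screenshot that has already been captured, so the image is never requested again. The sketch below assumes the screenshot is a row-major grid of pixel values and that rectangles are `(left, top, width, height)`; the function name `crop` is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (left, top, width, height)


def crop(pixels: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Copy the rectangle `box` out of an already captured screenshot."""
    left, top, width, height = box
    rows = len(pixels)
    cols = len(pixels[0]) if rows else 0
    # Clamp the rectangle to the screenshot bounds.
    x0, y0 = max(left, 0), max(top, 0)
    x1, y1 = min(left + width, cols), min(top + height, rows)
    return [list(row[x0:x1]) for row in pixels[y0:y1]]
```

Working on captured pixels rather than re-fetching the image source reflects the concern stated above that operating on the image may change it.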
6. Processing the partial image, in which the verification code classifier identifies the verification code in the partial image based on features such as the colour and texture of verification code images and turns it into a standalone verification code image, comprises the following specific steps:
1) build a sample space of verification code images, extract the local colour, texture, and gradient features of the samples, and build a verification code image classifier model through machine learning;
2) for the partial image, use a sliding window model to obtain candidate rectangular regions;
3) for each rectangular region generated in step 2), use the verification code classifier built in step 1) to judge whether it is a verification code image; if the rectangular region matches the features of a verification code image, crop it from the partial image and save it separately as the verification code image.
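Steps 2) and 3) above can be sketched as a sliding-window search. This is only an illustration under stated assumptions: the window size, stride, and the 0.5 acceptance threshold are hypothetical parameters, the image is assumed to be a row-major grid of grayscale values, and `classifier` stands in for the trained model of step 1), which is not reproduced here.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]      # assumed layout: (left, top, width, height)
Image = Sequence[Sequence[int]]      # grayscale pixels, row-major


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    stride: int) -> Iterator[Box]:
    """Yield every win_w x win_h window that fits inside the image."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (left, top, win_w, win_h)


def find_captcha(image: Image, win_w: int, win_h: int, stride: int,
                 classifier: Callable[[List[Sequence[int]]], float]) -> Optional[Box]:
    """Return the best window the classifier accepts, or None if none passes."""
    img_h = len(image)
    img_w = len(image[0]) if img_h else 0
    best: Optional[Box] = None
    best_score = 0.5  # assumed acceptance threshold, not taken from the source
    for left, top, w, h in sliding_windows(img_w, img_h, win_w, win_h, stride):
        patch = [row[left:left + w] for row in image[top:top + h]]
        s = classifier(patch)
        if s > best_score:
            best, best_score = (left, top, w, h), s
    return best
```

In practice several window sizes would likely be tried, since the size of the verification code image is not known in advance; the sketch uses a single size for brevity.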
Description of drawings
Fig. 1 is a flow chart of the method of the present invention for extracting a verification code image from a webpage.
Embodiment
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawing. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
To make the objectives, technical solution, and advantages of the present invention clearer, the embodiment of the invention is described in detail below with reference to the accompanying drawing.
Referring to Fig. 1, the method of the present invention, which helps blind users recognize verification code images, comprises the following steps:
Step S101: determine the browser's currently active page.
Step S102: determine the position of the cursor in the active page.
Step S103: according to the cursor position, obtain the verification code input box node of the active page.
Step S104: obtain all IMG nodes of the currently active page from top to bottom.
Step S105: judge whether all IMG nodes of the active page can be obtained; if so, go to step S106, otherwise go to step S107.
Step S106: score the image contained in each IMG node according to the predefined verification code scoring strategy; the image with the highest score is the verification code image. The image contained in each IMG tag has several attributes: IMG tag keywords, image size, image aspect ratio, the position of the image on the page and its distance from the verification code input box, and image content features. From this information, every image receives a score. For example, the initial score can be set to 0 and the full mark for each attribute to 10 points. The higher the score, the more likely the image is the verification code image. Studying the verification code images of websites shows that their tags contain keywords such as "verification code" and "code"; that the image size falls within a certain range, for example within 200×200 (the range can be widened as needed); and that the image is wider than it is tall. A standard can be set for each attribute: the closer an image is to the standard, the higher its score.
Step S107: capture a partial image around the verification code input box that contains the verification code image.
Step S108: use machine learning to build a verification code classification model and, based on verification code features, extract the verification code image from the partial image.
Step S109: save the verification code image separately. Because of the special nature of a verification code image, operating on it may change it, so a special way of saving the image is required. If all IMG nodes can be obtained, the verification code scoring strategy selects the IMG node where the verification code image is located, and an exact screenshot is taken according to the image position information in that node to obtain the verification code image. Otherwise, a partial image of the surrounding area that includes the verification code image is captured, centred on the verification code input box, and the verification code classification model is used to crop the rectangular region where the verification code image is located, yielding the verification code image.
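The scoring of step S106 can be illustrated with a small sketch. Only the overall scheme comes from the description (initial score 0, up to 10 points per attribute, closer to the standard means a higher score); the linear fall-off, the 500-pixel `reach`, the field names, and the function names are all assumptions, and the image content feature attribute is left out for brevity.

```python
from dataclasses import dataclass
from math import hypot

MAX_POINTS = 10.0                        # full marks per attribute
KEYWORDS = ("verification code", "code")  # keywords named in the description


@dataclass
class ImgInfo:
    keywords: str   # assumed: concatenated id/name/src/alt text of the IMG tag
    left: float
    top: float
    width: float
    height: float


def keyword_points(img: ImgInfo) -> float:
    text = img.keywords.lower()
    return MAX_POINTS if any(k in text for k in KEYWORDS) else 0.0


def size_points(img: ImgInfo, limit: float = 200.0) -> float:
    # Full marks within the limit, falling linearly to 0 at twice the limit.
    excess = max(img.width, img.height) - limit
    return MAX_POINTS if excess <= 0 else max(0.0, MAX_POINTS * (1 - excess / limit))


def aspect_points(img: ImgInfo) -> float:
    # Verification code images are described as wider than they are tall.
    return MAX_POINTS if img.width > img.height > 0 else 0.0


def distance_points(img: ImgInfo, input_x: float, input_y: float,
                    reach: float = 500.0) -> float:
    # Distance from the image centre to the input box; `reach` is an assumption.
    cx, cy = img.left + img.width / 2, img.top + img.height / 2
    return max(0.0, MAX_POINTS * (1 - hypot(cx - input_x, cy - input_y) / reach))


def score(img: ImgInfo, input_x: float, input_y: float) -> float:
    """Sum the per-attribute points, starting from an initial score of 0."""
    return (keyword_points(img) + size_points(img)
            + aspect_points(img) + distance_points(img, input_x, input_y))
```

Selecting the candidate is then a matter of `max(images, key=lambda i: score(i, x, y))`, where `(x, y)` is the position of the verification code input box.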
In the technical solution of the present invention, the verification code image in a webpage is extracted using information such as the cursor position, the position of the verification code input box, the image position, the image size, visual and content features of the image, image keywords, and the image aspect ratio, which offers convenience to the many application programs that need to extract verification code images from webpages.
Finally, it should be pointed out that the above embodiment is only a representative example of the present invention. The technical solution of the present invention is clearly not limited to the above embodiment. Those of ordinary skill in the art can make various modifications or variations to the above embodiment without departing from the spirit of the invention, so the scope of protection of the present invention is not limited by the above embodiment and should be determined by the claims.
Claims (6)
1. A method for extracting a verification code image from a webpage, characterized in that it comprises the following steps:
1) obtain the information of all IMG nodes of the browser's currently active page;
2) score the image information contained in each IMG node according to a predefined verification code image scoring strategy; the IMG node with the highest score is the one that contains the verification code image;
3) if step 2) cannot obtain all the IMG nodes, capture a partial image, centred on the verification code input box, of the surrounding area that includes the verification code image, and use a pre-trained classification and detection model to obtain the exact position of the verification code image;
4) save the verification code image separately.
2. The method according to claim 1, characterized in that obtaining the information of all IMG nodes of the browser's currently active page comprises the following specific steps:
1) determine the browser's currently active page;
2) traverse the active page from top to bottom and obtain the information of all its IMG nodes; the IMG node information includes the image position, image size, image width and height, image keywords, and similar information.
3. The method according to claim 1, characterized in that scoring the image information contained in each IMG node according to the predefined verification code image scoring strategy, where the IMG node with the highest score is the one that contains the verification code image, comprises the following specific steps:
Obtain the information of all IMG nodes of the browser's currently active page and score it with the predefined verification code scoring strategy; the IMG node with the highest score is the IMG node where the verification code image is located.
4. The method according to claim 1, characterized in that capturing a partial image, centred on the verification code input box, of the surrounding area that includes the verification code image, and using the pre-trained classification and detection model to obtain the exact position of the verification code image, comprises the following specific steps:
1) If not all IMG nodes of the active page can be obtained, the IMG node where the verification code image is located may be missed. In that case, a partial image of the surrounding area that includes the verification code image can be captured, centred on the verification code input box.
2) Process the partial image: based on the colour, texture, and gradient features of verification code images, use the verification code classifier model to identify the verification code in the partial image and turn it into a standalone verification code image.
5. The method according to claim 1, characterized in that saving the verification code image separately comprises the following specific steps:
Because of the special nature of a verification code image, operating on it may change it, so a special way of saving the image is required. If all IMG nodes can be obtained, the verification code scoring strategy selects the IMG node where the verification code image is located, and an exact screenshot is taken according to the image position information in that node to obtain the verification code image. Otherwise, a partial image of the surrounding area that includes the verification code image is captured, centred on the verification code input box, and the verification code classification model is used to crop the rectangular region where the verification code image is located, yielding the verification code image.
6. The method according to claim 4, characterized in that processing the partial image, in which the verification code classifier identifies the verification code in the partial image based on features such as the colour and texture of verification code images and turns it into a standalone verification code image, comprises the following specific steps:
1) build a sample space of verification code images, extract the local colour, texture, and gradient features of the samples, and build a verification code image classifier model through machine learning;
2) for the partial image, use a sliding window model to obtain candidate rectangular regions;
3) for each rectangular region generated in step 2), use the verification code classifier built in step 1) to judge whether it is a verification code image; if the rectangular region matches the features of a verification code image, crop it from the partial image and save it separately as the verification code image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210192428.1A CN102737122B (en) | 2012-06-08 | 2012-06-08 | Method for extracting verification code image from webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102737122A (en) | 2012-10-17 |
CN102737122B (en) | 2014-12-10 |
Family
ID=46992623
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210192428.1A Active CN102737122B (en) | Method for extracting verification code image from webpage | 2012-06-08 | 2012-06-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102737122B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937438A (en) * | 2009-06-30 | 2011-01-05 | 富士通株式会社 | Method and device for extracting webpage content |
WO2011072434A1 (en) * | 2009-12-14 | 2011-06-23 | Hewlett-Packard Development Company,L.P. | System and method for web content extraction |
CN102314513A (en) * | 2011-09-16 | 2012-01-11 | 华中科技大学 | Image text semantic extraction method based on GPU (Graphics Processing Unit) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102891A (en) * | 2013-04-02 | 2014-10-15 | 腾讯科技(深圳)有限公司 | Information interaction method based on two dimension code, and mobile terminal |
CN103279503B (en) * | 2013-05-09 | 2017-02-08 | 小米科技有限责任公司 | Method and system for acquiring two-dimension code information from webpage |
CN103279503A (en) * | 2013-05-09 | 2013-09-04 | 北京小米科技有限责任公司 | Method and system for acquiring two-dimension code information from webpage |
CN104144052A (en) * | 2013-05-10 | 2014-11-12 | 孙鑫 | Verification method for corresponding between keyword in characters and picture or video |
CN104144052B (en) * | 2013-05-10 | 2018-05-01 | 孙鑫 | Verification method for corresponding between keyword in characters and picture or video |
CN104021376A (en) * | 2014-06-05 | 2014-09-03 | 北京乐动卓越科技有限公司 | Verification code identifying method and device |
CN104021376B (en) * | 2014-06-05 | 2017-11-21 | 北京乐动卓越科技有限公司 | Verification code identifying method and device |
CN105160236A (en) * | 2015-08-31 | 2015-12-16 | 小米科技有限责任公司 | Method and device for inputting verification code |
CN105160236B (en) * | 2015-08-31 | 2018-04-06 | 小米科技有限责任公司 | Method and device for inputting verification code |
CN105512107A (en) * | 2015-12-10 | 2016-04-20 | 天津海量信息技术有限公司 | Internet regular text page title identification method based on vision |
CN110113354A (en) * | 2016-05-24 | 2019-08-09 | 北京京东尚科信息技术有限公司 | Verification method and system of verification code |
CN110113354B (en) * | 2016-05-24 | 2021-11-02 | 北京京东尚科信息技术有限公司 | Verification method and system of verification code |
CN106203057A (en) * | 2016-06-30 | 2016-12-07 | 北京奇艺世纪科技有限公司 | Verification code image generation method and device |
CN106131000A (en) * | 2016-06-30 | 2016-11-16 | 维沃移动通信有限公司 | Verification code filling method and mobile terminal |
CN106203057B (en) * | 2016-06-30 | 2019-03-12 | 北京奇艺世纪科技有限公司 | Verification code image generation method and device |
CN111966432A (en) * | 2020-06-30 | 2020-11-20 | 北京百度网讯科技有限公司 | Verification code processing method and device, electronic equipment and storage medium |
CN111966432B (en) * | 2020-06-30 | 2023-07-28 | 北京百度网讯科技有限公司 | Verification code processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN102737122B (en) | 2014-12-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |