CN102737122B - Method for extracting verification code image from webpage - Google Patents

Info

Publication number
CN102737122B
Authority
CN
China
Prior art keywords
identifying code
picture
img
code picture
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210192428.1A
Other languages
Chinese (zh)
Other versions
CN102737122A (en)
Inventor
卜佳俊
陈纯
韩冲
王灿
宋明黎
王炜
何占盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201210192428.1A
Publication of CN102737122A
Application granted
Publication of CN102737122B
Active
Anticipated expiration

Abstract

The invention provides a method for extracting a verification code image from a webpage. As a fixed website link is not available for the verification code image on the webpage, the verification code image is generated randomly, and the content of the verification code image can be changed if the verification code image is refreshed or saved, the verification code image extraction is a key problem for a plurality of software applications requiring verification code images. According to the method, the verification code image is extracted from the webpage by virtue of a cursor position, a verification code input box position, an image position, an image size, image vision and content features, image keywords, an image length-to-width ratio and other information.

Description

A method for extracting a verification code image from a webpage
Technical field
The present invention relates to the field of verification code image recognition, and in particular to a method for extracting a verification code image from a webpage.
Background
A verification code (CAPTCHA) is a public, fully automated program for distinguishing whether a user is a computer or a human. It can prevent malicious password cracking, vote rigging, and forum flooding, and it effectively stops an attacker from using a program to make repeated brute-force login attempts against a particular registered user. Verification codes are in fact used by many websites today.
A verification code is a string of randomly generated digits or symbols rendered as an image, with some interference added to the image, for example a few randomly drawn lines or dots (to defeat OCR). The user reads the verification code with the naked eye, enters it, and submits it to the website for verification; a given function can be used only after verification succeeds. Many software applications now need to extract the verification code image from a webpage. Because the verification code image has no fixed link in the webpage, and because the image is randomly generated so that refreshing or saving it can change its content, extracting the verification code image is a key difficulty for many applications that need it (for example, image verification code service software for blind users).
Summary of the invention
The invention provides a method for extracting a verification code image from a webpage. It uses information such as the position of the verification code input box, image position, image size, visual and content features of the image, and image keywords to extract the verification code image from the webpage. The invention offers a convenience to the many software applications that need to extract verification code images from webpages.
The method for extracting a verification code image from a webpage comprises the following steps:
1) obtain the information of all IMG nodes of the browser's currently active page;
2) score the image information contained in each IMG node according to a pre-established verification code image scoring strategy; the IMG node with the highest score is the one containing the verification code image;
3) if step 2) cannot obtain all IMG nodes, capture a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included; then use a pre-trained classification and detection model to obtain the exact position of the verification code image;
4) save the verification code image separately.
2. Obtaining the information of all IMG nodes of the browser's currently active page comprises the following specific steps:
1) determine the browser's currently active page;
2) traverse the active page top-down and obtain the information of all its IMG nodes; the IMG node information includes image position, image size, image length and width, image keywords, and similar information.
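The top-down collection of IMG nodes can be sketched as follows. This is a minimal illustration, assuming the page is available as static HTML and using Python's standard `html.parser`; the record layout and the helper names `ImgCollector`, `collect_img_nodes`, and `_to_int` are hypothetical. Static HTML carries no layout, so the image position mentioned in the text would have to come from the browser itself and is not modeled here.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect <img> nodes in document (top-down) order."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        a = dict(attrs)
        self.images.append({
            "src": a.get("src") or "",
            "width": _to_int(a.get("width")),
            "height": _to_int(a.get("height")),
            # Attribute text that may carry keywords such as "code".
            "keywords": " ".join(
                (a.get(k) or "") for k in ("id", "name", "class", "alt", "src")
            ).lower(),
        })


def _to_int(value):
    """Parse a width/height attribute; None when absent or non-numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def collect_img_nodes(html):
    """Return one record per <img> tag, in the order the tags appear."""
    parser = ImgCollector()
    parser.feed(html)
    return parser.images
```

The `keywords` field simply concatenates a few attributes so that a later scoring step can search them; which attributes are worth inspecting is an assumption, not something the text specifies.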
3. Scoring the image information contained in each IMG node according to the pre-established verification code image scoring strategy, the IMG node with the highest score being the one containing the verification code image, comprises the following specific steps:
Obtain the information of all IMG nodes of the browser's currently active page and score that information with the pre-established verification code scoring strategy; the IMG node with the highest score is the IMG node that holds the verification code image.
4. Capturing a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included, and using a pre-trained classification and detection model to obtain the exact position of the verification code image, comprises the following specific steps:
1) if not all IMG nodes of the active page can be obtained, the IMG node holding the verification code image may be missed. In that case, a local image around the verification code input box can be captured, taking the input box as the focus, so that it includes the verification code image.
2) process the local image: based on the color, texture, and gradient features of verification code images, use the verification code classifier model to identify the verification code image within the local image and process it into a standalone verification code image.
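Choosing the local region around the input box is plain rectangle arithmetic. The sketch below assumes the input box's bounding rectangle and the page (or screenshot) size are already known in pixels; the function name `local_region` and the default margins are hypothetical, since the text does not say how large the captured region should be.

```python
def local_region(input_box, page_size, margin=(300, 150)):
    """Return (left, top, right, bottom) of a crop around the input box.

    input_box: (left, top, width, height) of the verification code input box.
    page_size: (width, height) of the rendered page or screenshot.
    margin:    hypothetical horizontal/vertical padding in pixels.
    """
    x, y, w, h = input_box
    page_w, page_h = page_size
    mx, my = margin
    # Expand the box by the margin on every side, clamped to the page.
    left = max(0, x - mx)
    top = max(0, y - my)
    right = min(page_w, x + w + mx)
    bottom = min(page_h, y + h + my)
    return left, top, right, bottom
```

A generous margin makes it more likely that the verification code image falls inside the crop, at the cost of giving the classifier a larger area to search.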
5. Saving the verification code image separately comprises the following specific steps:
Because of the special nature of the verification code image, operating on it may change the image, so a special way of saving it is required. If all IMG nodes can be obtained, use the verification code scoring strategy to select the IMG node that holds the verification code image, then take a precise screenshot according to the image's position information in that IMG node to obtain the verification code image. Otherwise, capture a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included, and use the verification code classification model to crop the rectangular region containing the verification code image to obtain it.
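The two saving paths described above can be sketched as a small dispatcher. This is an illustration only: `extract_verification_image` and its four parameters are hypothetical hooks, and the `"box"` key is an assumed record field holding an image's on-screen rectangle. The point it tries to capture is that both paths end in a crop of what is already rendered, rather than a new request for the image, which could return a different verification code.

```python
def extract_verification_image(img_nodes, score, crop, detect_in_local_region):
    """Pick one of the two extraction paths and return the cropped image.

    img_nodes: list of IMG node records, or None/empty when unavailable.
    score:     callable(node) -> number, the pre-established scoring strategy.
    crop:      callable(box) -> image, a crop of the rendered page screenshot.
    detect_in_local_region: callable() -> box located by the classifier.
    All four parameters are hypothetical hooks supplied by the caller.
    """
    if img_nodes:
        # Path 1: the highest-scoring IMG node gives the exact rectangle.
        best = max(img_nodes, key=score)
        box = best["box"]
    else:
        # Path 2: fall back to classifier-based detection near the input box.
        box = detect_in_local_region()
    return crop(box)
```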
6. Processing the local image, that is, using the verification code classifier to identify the verification code image within the local image according to its color and texture features and processing it into a standalone verification code image, comprises the following specific steps:
1) build a sample space of verification code images, extract local color, texture, and gradient features from the samples, and build a verification code image classifier model by machine learning;
2) for the local image, use a sliding window model to obtain candidate rectangular regions;
3) for each rectangular region generated in step 2), use the verification code classifier built in step 1) to judge whether it is a verification code image; if the region matches the features of a verification code image, crop it from the local image and save it separately as the verification code image.
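The sliding-window search in steps 2) and 3) can be sketched as below. The trained classifier is not described in enough detail to reproduce, so this sketch swaps in a toy stand-in, `gradient_score`, which measures the mean absolute horizontal gradient of a grayscale region; the window size, step, and threshold are likewise assumptions. The image is a plain list of rows of gray values so that the example needs no third-party library.

```python
def sliding_windows(width, height, win_w, win_h, step):
    """Yield (left, top, right, bottom) candidate rectangles."""
    for top in range(0, height - win_h + 1, step):
        for left in range(0, width - win_w + 1, step):
            yield left, top, left + win_w, top + win_h


def gradient_score(gray, box):
    """Mean absolute horizontal gradient inside box (toy texture feature)."""
    left, top, right, bottom = box
    total = 0
    count = 0
    for row in gray[top:bottom]:
        for x in range(left, right - 1):
            total += abs(row[x + 1] - row[x])
            count += 1
    return total / count if count else 0.0


def detect(gray, win_w, win_h, step=1, threshold=50.0, classify=gradient_score):
    """Return the best-scoring window above threshold, or None."""
    height, width = len(gray), len(gray[0])
    best_box, best_score = None, threshold
    for box in sliding_windows(width, height, win_w, win_h, step):
        s = classify(gray, box)
        if s > best_score:
            best_box, best_score = box, s
    return best_box
```

A real classifier trained on color, texture, and gradient features would replace `gradient_score` through the `classify` parameter; the window enumeration itself stays the same.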
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention for extracting a verification code image from a webpage.
Embodiment
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawing. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the invention, without creative work, fall within the scope of protection of the invention.
To make the objects, technical solution, and advantages of the present invention clearer, the embodiment is described in detail below with reference to the accompanying drawing.
With reference to Fig. 1, the method of the present invention, which helps blind users identify a verification code image, comprises the following steps:
Step S101: determine the browser's currently active page.
Step S102: determine the position of the cursor in the active page.
Step S103: according to the cursor position, obtain the verification code input box node of the active page.
Step S104: from top to bottom, obtain all IMG nodes of the currently active page.
Step S105: determine whether all IMG nodes of the active page can be obtained; if so, go to step S106; otherwise, go to step S107.
Step S106: score the image contained in each IMG node according to the pre-established verification code scoring strategy; the image with the highest score is the verification code image. The image contained in each IMG tag has several attributes: IMG tag keywords, image size, image aspect ratio, the image's position in the page and its distance from the verification code input box, and image content features. From this information every image receives a score. For example, the initial score can be set to 0 and the full mark for each attribute to 10 points. The higher the score, the more likely the image is a verification code image. Studying the verification code images of websites shows that the tag of a verification code image contains keywords such as "verification code" or "code"; that the image size falls within a certain range, for example within 200×200 (the range can be widened as needed); and that the image's length exceeds its width. A standard can be set for each attribute: the closer an attribute is to its standard, the higher its score.
Step S107: capture a local image around the verification code input box that contains the verification code image.
Step S108: use machine learning to build a verification code classification model and, according to the verification code features, extract the verification code image from the local image.
Step S109: save the verification code image separately. Because of the special nature of the verification code image, operating on it may change the image, so a special way of saving it is required. If all IMG nodes can be obtained, use the verification code scoring strategy to select the IMG node that holds the verification code image, then take a precise screenshot according to the image's position information in that IMG node to obtain the verification code image. Otherwise, capture a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included, and use the verification code classification model to crop the rectangular region containing the verification code image to obtain it.
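The scoring strategy of step S106 can be sketched as a sum of per-attribute scores, each from 0 to 10, starting from an initial score of 0. Everything concrete here is an assumption made for illustration: the keyword list beyond "code", the linear fall-off for oversized images, the 500-pixel distance scale, the reading of "length" as horizontal extent, and the record fields `keywords`, `width`, `height`, and `center`. The image content feature is omitted because the text does not specify how it is scored.

```python
import math

# Only "code" comes from the text; the other keywords are assumed examples.
KEYWORDS = ("code", "captcha", "verify")


def keyword_score(text):
    """10 if any keyword occurs in the tag text, else 0."""
    return 10.0 if any(k in text.lower() for k in KEYWORDS) else 0.0


def size_score(width, height, limit=200):
    """Full marks within limit x limit; falls linearly to 0 at twice the limit."""
    over = max(width, height) - limit
    return 10.0 if over <= 0 else max(0.0, 10.0 * (1 - over / limit))


def aspect_score(width, height):
    """10 when the image is wider than it is tall (assumed meaning of length > width)."""
    return 10.0 if width > height > 0 else 0.0


def distance_score(img_center, input_center, max_dist=500.0):
    """Closer to the input box scores higher; 0 at max_dist pixels or beyond."""
    d = math.dist(img_center, input_center)
    return max(0.0, 10.0 * (1 - d / max_dist))


def score_image(node, input_center):
    """Sum the attribute scores, starting from an initial score of 0."""
    s = 0.0
    s += keyword_score(node["keywords"])
    s += size_score(node["width"], node["height"])
    s += aspect_score(node["width"], node["height"])
    s += distance_score(node["center"], input_center)
    return s
```

With records like these, the candidate is simply `max(nodes, key=lambda n: score_image(n, input_center))`. Equal weighting of the attributes is a simplification; a deployment might weight the distance or keyword attribute more heavily.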
In the technical solution of the present invention, information such as cursor position, verification code input box position, image position, image size, visual and content features of the image, image keywords, and image aspect ratio is used to extract the verification code image from a webpage, which offers a convenience to the many application programs that need to extract verification code images from webpages.
Finally, it should be pointed out that the above embodiment is only a representative example of the present invention. The technical solution of the present invention is plainly not limited to this embodiment. A person of ordinary skill in the art can make various modifications or variations to the embodiment without departing from the spirit of the invention; the scope of protection of the invention is therefore not limited by the embodiment and should be determined by the claims.

Claims (4)

1. A method for extracting a verification code image from a webpage, characterized in that it comprises the following steps:
1) obtain the information of all IMG nodes of the browser's currently active page;
2) score the image information contained in each IMG node according to a pre-established verification code image scoring strategy, the IMG node with the highest score being the one containing the verification code image; the specific steps are:
obtain the information of all IMG nodes of the browser's currently active page and score that information with the pre-established verification code scoring strategy, the IMG node with the highest score being the IMG node that holds the verification code image;
3) if step 2) cannot obtain all IMG nodes, capture a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included; use a pre-trained classification and detection model to obtain the exact position of the verification code image; the specific steps are:
3.1) if not all IMG nodes of the active page can be obtained, the IMG node holding the verification code image may be missed; in that case, capture a local image around the verification code input box, taking the input box as the focus, so that it includes the verification code image;
3.2) process the local image: based on the color, texture, and gradient features of verification code images, use the verification code classifier model to identify the verification code image within the local image and process it into a standalone verification code image;
4) save the verification code image separately.
2. The method for extracting a verification code image from a webpage according to claim 1, characterized in that the specific steps of obtaining the information of all IMG nodes of the browser's currently active page are:
1) determine the browser's currently active page;
2) traverse the active page top-down and obtain the information of all its IMG nodes, the IMG node information including image position, image size, and image keyword information.
3. The method for extracting a verification code image from a webpage according to claim 1, characterized in that the specific steps of saving the verification code image separately are:
because of the special nature of the verification code image, operating on it may change the image, so a special way of saving it is required; if all IMG nodes are obtained, use the verification code scoring strategy to select the IMG node that holds the verification code image and take a precise screenshot according to the image's position information in that IMG node to obtain the verification code image; otherwise, capture a local image around the verification code input box, taking the input box as the focus, so that the verification code image is included, and use the verification code classification model to crop the rectangular region containing the verification code image to obtain it.
4. The method for extracting a verification code image from a webpage according to claim 2 or 3, characterized in that the local image is processed by using the verification code classifier to identify the verification code image within the local image according to its color and texture features and processing it into a standalone verification code image, the specific steps being:
1) build a sample space of verification code images, extract local color, texture, and gradient features from the samples, and build a verification code image classifier model by machine learning;
2) for the local image, use a sliding window model to obtain candidate rectangular regions;
3) for each rectangular region generated in step 2), use the verification code classifier built in step 1) to judge whether it is a verification code image; if the region matches the features of a verification code image, crop it from the local image and save it separately as the verification code image.
CN201210192428.1A 2012-06-08 2012-06-08 Method for extracting verification code image from webpage Active CN102737122B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210192428.1A CN102737122B (en) 2012-06-08 2012-06-08 Method for extracting verification code image from webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210192428.1A CN102737122B (en) 2012-06-08 2012-06-08 Method for extracting verification code image from webpage

Publications (2)

Publication Number Publication Date
CN102737122A CN102737122A (en) 2012-10-17
CN102737122B true CN102737122B (en) 2014-12-10

Family

ID=46992623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210192428.1A Active CN102737122B (en) 2012-06-08 2012-06-08 Method for extracting verification code image from webpage

Country Status (1)

Country Link
CN (1) CN102737122B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102891A (en) * 2013-04-02 2014-10-15 腾讯科技(深圳)有限公司 Information interaction method based on two dimension code, and mobile terminal
CN103279503B (en) * 2013-05-09 2017-02-08 小米科技有限责任公司 Method and system for acquiring two-dimension code information from webpage
CN104144052B (en) * 2013-05-10 2018-05-01 孙鑫 A kind of keyword verification method corresponding with picture or video among word
CN104021376B (en) * 2014-06-05 2017-11-21 北京乐动卓越科技有限公司 Method for recognizing verification code and device
CN105160236B (en) * 2015-08-31 2018-04-06 小米科技有限责任公司 A kind of method and apparatus of input validation code
CN105512107A (en) * 2015-12-10 2016-04-20 天津海量信息技术有限公司 Internet regular text page title identification method based on vision
CN110113354B (en) * 2016-05-24 2021-11-02 北京京东尚科信息技术有限公司 Verification method and system of verification code
CN106203057B (en) * 2016-06-30 2019-03-12 北京奇艺世纪科技有限公司 Identifying code Picture Generation Method and device
CN106131000B (en) * 2016-06-30 2019-12-03 维沃移动通信有限公司 Identifying code fill method and its mobile terminal
CN111966432B (en) * 2020-06-30 2023-07-28 北京百度网讯科技有限公司 Verification code processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content
CN102314513A (en) * 2011-09-16 2012-01-11 华中科技大学 Image text semantic extraction method based on GPU (Graphics Processing Unit)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8819028B2 (en) * 2009-12-14 2014-08-26 Hewlett-Packard Development Company, L.P. System and method for web content extraction

Also Published As

Publication number Publication date
CN102737122A (en) 2012-10-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant