The present application is a divisional application of the Chinese invention patent application filed on March 30, 2012, with application number 201210091311.4, entitled "A markup file parsing method and device".
Summary of the Invention
The technical problem to be solved by the present application is to provide a markup file parsing method and device that effectively address the low success rate of the prior art when parsing HTML web pages.
To solve the above problem, the present application discloses a markup file parsing method, comprising:
obtaining the tag objects in a markup file and generating a tag set;
grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table; and
obtaining, from the matched tag groups, the data used for parsing the markup file.
Preferably, grouping the tag objects according to the common attribute of the tag objects in the tag set comprises:
placing the tag objects in the tag set that have the same parent node into the same tag group.
Preferably, after the one or more tag groups are obtained from the grouping result, the method further comprises regrouping the tag groups, comprising:
examining the one or more tag groups in the current grouping result; if the current tag group contains two or more tag objects and those tag objects do not all have the same parent node, placing the tag objects in the current tag group that have the same parent node into another tag group; and repeating the above step on the current tag group until it can no longer be divided.
Preferably, the markup file is a HyperText Markup Language (HTML) file.
Preferably, the tag object is an <input> input box;
matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
if the current tag group contains only 1 password input box, and the tag object at the next sibling node of the password input box is not a text input box, the current tag group is a login form.
Preferably, if the current tag group is a login form, the password input box in the current tag group has multiple levels of parent nodes, the current tag group cannot be divided further, and the nearest parent node of the password input box contains at least one text input box, the tag objects within the nearest parent node of the password input box are placed into a new tag group.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
if the current tag group contains only 1 password input box, and the tag object at the next sibling node of the password input box is a text input box, the current tag group is a registration form.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
if the current tag group contains 2 password input boxes that are consecutive in position and siblings of each other, the current tag group is a registration form.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
if the current tag group contains 3 password input boxes that are consecutive in position and siblings of each other, the current tag group is a password-change form.
Preferably, before the one or more tag groups are parsed according to a predefined rule, the method further comprises:
To solve the above problem, the present application also discloses a web page filling method, comprising: markup file parsing; target data storage; and target data filling;
the markup file parsing comprises:
obtaining the tag objects in a markup file and generating a tag set;
grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
wherein the matched tag group is the target input item;
the target data storage comprises:
obtaining target data from the target input item and storing it in configuration information;
the target data filling comprises:
obtaining the target data from the configuration information and filling it into the target input item corresponding to the markup file.
To solve the above problem, the present application also discloses a markup file parsing device, comprising:
an acquisition module, configured to obtain the tag objects in a markup file;
a set generation module, configured to generate a tag set from the tag objects obtained by the acquisition module;
a grouping module, configured to group the tag objects according to a common attribute of the tag objects in the tag set;
a tag group acquisition module, configured to obtain the one or more tag groups generated by the grouping module;
a parsing module, configured to match the attributes of the tag objects in the one or more tag groups obtained by the tag group acquisition module against a preset markup file parsing mapping table, and to obtain, from the matched tag groups, the data used for parsing the markup file.
Preferably, the grouping module further comprises:
a first grouping unit, configured to place the tag objects in the tag set that have the same parent node into the same tag group.
Preferably, the grouping module further comprises:
a second grouping unit, configured to determine whether the current tag group contains two or more tag objects that do not all have the same parent node and, if so, to place the tag objects in the current tag group that have the same parent node into another tag group; the second grouping unit repeats this operation until the current tag group can no longer be divided.
Preferably, the markup file is a HyperText Markup Language (HTML) file.
Preferably, the tag object obtained by the acquisition module is an <input> input box.
Preferably, the parsing module further comprises:
a login form recognition unit, configured to determine that the current tag group is a login form if it contains only 1 password input box and the tag object at the next sibling node of the password input box is not a text input box.
Preferably, the parsing module further comprises:
a first registration form recognition unit, configured to determine that the current tag group is a registration form if it contains only 1 password input box and the tag object at the next sibling node of the password input box is a text input box.
Preferably, the parsing module further comprises:
a second registration form recognition unit, configured to determine that the current tag group is a registration form if it contains 2 password input boxes that are consecutive in position and siblings of each other.
Preferably, the parsing module further comprises:
a password-change form recognition unit, configured to determine that the current tag group is a password-change form if it contains 3 password input boxes that are consecutive in position and siblings of each other.
Compared with the prior art, the present application has the following advantages:
When identifying an HTML web page file, the prior art must first find a <form> tag and then search for the <input> input boxes under that <form> tag. However, a large number of HTML web page files lack <form> tags, so existing identification techniques cannot obtain the <input> input boxes because no <form> tag can be found. As a result, parsing fails for many HTML web page files on the Internet, and the parsing success rate is low. The present application therefore proposes searching the whole HTML web page file for <input> tags and grouping the resulting <input> tag set by a common attribute shared by the <input> tags. <input> tags that originally had no obvious connection are thereby associated through their shared attribute into one or more tag groups; because the <input> tags within a tag group share the same common attribute, they are well suited to further analysis. This process does not need to consider the position or form of the <input> tags in the HTML web page file at all, so web page files without <form> tags can still be parsed successfully, which greatly improves the parsing success rate for HTML web page files. According to statistics, the success rate of web page file parsing can be raised from about 40% in the prior art to about 90%.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the drawings and specific embodiments.
At present, describing or storing data with a markup language has become one of the most important ways of representing and storing data. Examples include HTML, HTML5, eXtensible HyperText Markup Language (XHTML), and Extensible Markup Language (XML). The main characteristic of such markup languages is that they all use a set of markup tags to organize or store data. The markup file referred to below in the present application is a file that organizes data with markup tags.
Referring to Fig. 1, a schematic flowchart of a markup file parsing method of the present application is shown, as follows:
Step 101: obtain the tag objects in the markup file and generate a tag set.
The markup file is the file to be parsed, which organizes or stores data with markup tags. Depending on the needs of data storage and presentation, a particular tag may appear many times in a markup file. For example, when a web page needs to display several links, the link tag <a></a> appears repeatedly, in a fixed format, in the markup file corresponding to that web page:
<a href="http://www.google.com" name="google">this is a link</a>
<a href="http://www.360.com" name="360">this is a link</a>
When the data organized by tag objects needs to be parsed, the tag objects are searched for in the markup file and a tag set is generated.
It should be noted that a markup file normally organizes its data with a variety of tags, such as <label>, <table>, <form>, and <input>. A tag object is one object instance of a tag in the markup file. For example, the tag <a> above appears twice in the web page, so obtaining the tag objects of tag <a> from this web page yields the two corresponding object instances of <a>:
<a href="http://www.google.com" name="google">this is a link</a>
<a href="http://www.360.com" name="360">this is a link</a>
In addition, obtaining tag objects from a markup file, as described herein, usually means obtaining the tag objects of one or several particular kinds of tag. Which kinds of tag are obtained can be decided by those skilled in the art, when implementing the present application, according to the purpose of parsing the markup file; the present application places no limitation on this.
Preferably, obtaining the tag objects in the markup file means performing a full-text search of the markup file to obtain all tag objects of a given kind of tag, so that none are omitted and parsing accuracy is improved.
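The full-text search described above can be sketched in C++ as follows. This is only an illustration under stated assumptions: the markup is held in a std::string, tag names are lowercase, and no attribute value contains '>'. CollectTags is a hypothetical helper and is not taken from the application's own code; a practical implementation would more likely walk a parsed DOM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of every opening tag <tagName ...>
// found anywhere in `markup`, in document order.
// Assumptions: lowercase tag names, no '>' inside attribute values.
std::vector<std::string> CollectTags(const std::string& markup,
                                     const std::string& tagName) {
    std::vector<std::string> found;
    const std::string open = "<" + tagName;
    std::size_t pos = 0;
    while ((pos = markup.find(open, pos)) != std::string::npos) {
        const std::size_t after = pos + open.size();
        // Reject longer names that merely start with tagName
        // (for example <abbr> when searching for <a>).
        const bool boundary =
            after < markup.size() &&
            (markup[after] == ' ' || markup[after] == '>' ||
             markup[after] == '/' || markup[after] == '\t' ||
             markup[after] == '\n');
        const std::size_t end = markup.find('>', after);
        if (end == std::string::npos) break;  // unterminated tag: stop scanning
        if (boundary) found.push_back(markup.substr(pos, end - pos + 1));
        pos = end + 1;
    }
    return found;
}
```

Calling CollectTags(page, "input") on a page's source would yield the tag set used by the grouping step, whether or not the page contains a <form> tag.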
Step 102: group the tag objects according to a common attribute of the tag objects in the tag set.
Different common attributes of the tag objects can be selected according to the type of the markup file or the purpose of the current parsing, and the tag objects are grouped according to the selected common attribute. The common attribute of the tag objects is attribute information that the tag objects share, such as the position of the tag object, the tag type, or the length.
Step 103: obtain one or more tag groups.
One or more tag groups are obtained from the grouping result of the preceding step.
Step 104: match the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table, and obtain, from the matched tag groups, the data used for parsing the markup file.
The markup file parsing mapping table stores in advance the information needed to match tag groups. Those skilled in the art will readily understand that the concrete content of the parsing mapping table can be organized according to factors such as the purpose of the current parsing and the type of markup file. For example, different types of markup file such as HTML and XML, or the same type of markup file when different tags are parsed, may correspond to different parsing mapping tables.
In the prior art, when the content of the tag objects of a certain tag in a markup file needs to be parsed, that tag is located directly and its instances are analyzed one by one. The biggest problem with this approach is that the data to be parsed contains too much noise, which seriously affects the recognition success rate. This is because the same kind of tag may be used in several ways within a markup file. An <input> tag, for example, may be located inside a <form> tag or outside it, or the file may not use <form> tags at all. In this situation, if the <input> tags in the whole markup file are searched, the relationships among the <input> tags are unknown, so all of them can only be analyzed as a single group, which inevitably brings in a great deal of irrelevant content and therefore too much noise. If, on the other hand, only the <input> tags inside <form> tags are considered, valid <input> tags may be missed because no <form> tag can be found, and the recognition success rate is low.
To address the above problems, the present application proposes the following: for the tag set obtained from the markup file, the tag objects in the tag set are grouped by a common attribute to obtain tag groups, and the resulting one or more tag groups are then parsed to obtain the parsing result for the whole markup file. Because the tag objects within a group all share the same common attribute, they are related to one another, which makes further analysis easier, unlike the prior art, in which the tags are unordered. At the same time, because the different forms that tag objects take in different markup files need not be considered, the applicability and success rate of file parsing are improved.
Referring to Fig. 2, a schematic flowchart of a markup file parsing method of the present application is shown, as follows:
Step 201: obtain all the tag objects in the markup file and generate a tag set.
Step 202: group the tag objects in the tag set, specifically: place the tag objects that have the same parent node into the same tag group.
In this example, having the same parent node is the common attribute on which the grouping of the tag objects in the tag set is based. Because the purpose of this parsing is to analyze tags that are related by position, the same parent node is used as the grouping criterion. Specifically, the node path of each tag object is computed first, and the tag objects that have the same parent node are then placed into the same tag group.
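A minimal sketch of grouping by parent node, assuming each tag object already carries the node path of its parent as a string. TagObject, its fields, and GroupByParent are illustrative names introduced here and are not taken from the application.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation of a tag object: its name and the node path
// of its parent, for example "/html/body/div[1]".
struct TagObject {
    std::string name;
    std::string parentPath;
};

// Places tag objects that have the same parent path into the same group.
// The map key is the parent path; the value lists the group's tag names
// in their original order.
std::map<std::string, std::vector<std::string>> GroupByParent(
    const std::vector<TagObject>& tags) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const TagObject& tag : tags) {
        groups[tag.parentPath].push_back(tag.name);
    }
    return groups;
}
```

Keying the groups by the parent path keeps the grouping independent of whether any <form> element is present in the file.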
Step 203: use a recursive operation to regroup the tag objects under the same parent node, specifically:
examine the one or more tag groups in the current grouping result; if the current tag group contains two or more tag objects and those tag objects do not all have the same parent node, place the tags in the current tag group that have the same parent node into another tag group; repeat the above step on the current tag group until it can no longer be divided.
An example of the above process described in code is as follows:
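A minimal C++ sketch of this recursive regrouping, under stated assumptions: each tag carries its chain of ancestor node names from the outermost node inward, and a group is split one ancestor level at a time. Tag, Group, SameParent, and Regroup are hypothetical names; this is not the applicant's listing.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag object: ancestors are listed from the outermost node
// down to the immediate parent.
struct Tag {
    std::string name;
    std::vector<std::string> ancestors;
};
using Group = std::vector<Tag>;

// True when every tag in the group has exactly the same parent chain.
bool SameParent(const Group& group) {
    for (std::size_t i = 1; i < group.size(); ++i) {
        if (group[i].ancestors != group[0].ancestors) return false;
    }
    return true;
}

// Recursively splits `group` by the ancestor found at `depth` until each
// resulting group has a single tag or a common parent; results go to `out`.
void Regroup(const Group& group, std::size_t depth, std::vector<Group>& out) {
    if (group.size() < 2 || SameParent(group)) {
        out.push_back(group);  // cannot be divided any further
        return;
    }
    // Key: (has an ancestor at this depth, its name). Tags that sit directly
    // under the current node get the key (false, "").
    std::map<std::pair<bool, std::string>, Group> parts;
    for (const Tag& tag : group) {
        const bool deeper = depth < tag.ancestors.size();
        parts[std::make_pair(deeper, deeper ? tag.ancestors[depth]
                                            : std::string())].push_back(tag);
    }
    for (const auto& part : parts) {
        Regroup(part.second, depth + 1, out);
    }
}
```

The recursion ends because tags that stay together at a given depth share all ancestors above it, so a group that survives past its deepest path necessarily has a common parent.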
Step 203: obtain one or more tag groups from the grouping result.
Step 204: parse the one or more tag groups in the grouping result.
The attributes of the tag objects in the one or more tag groups are matched against a preset markup file parsing mapping table, and the data used for parsing the markup file is obtained from the matched tag groups.
The markup file parsing method proposed by the present application has been described in general terms above. Taking the parsing of an HTML web page file as an example, the process of parsing and identifying the <input> input boxes in an HTML web page file is described in detail below. The purpose of parsing the HTML web page in this embodiment is to identify the user login form in the web page; the identified login form is often used to support automatic form filling on web pages.
Referring to Fig. 3, a schematic diagram of the distribution of <input> input boxes in an HTML web page file to be parsed is shown. The parsing process is described below:
The HTML web page file shown in Fig. 3 contains 8 <input> input boxes: text0, text1, password1, text2, password2, text3, password3, and text4. Here, text means that the <input> is a text input box, and password means that the <input> is a password input box. The <input> code shown below each input box is the source code corresponding to that input box in the HTML file.
S1: obtain all <input> input boxes from the HTML web page file.
The result is an <input> set containing the 8 <input> input boxes:
{text0, text1, password1, text2, password2, text3, password3, text4}.
S2: group the set of <input> input boxes.
Grouping rule: <input> boxes that have the same parent node are placed in the same <input> group.
Result of the first grouping:
tag group group0 (text0, text1, password1), belonging to parent node A;
tag group group2 (text2, password2, password3, text3, text4), belonging to parent node C.
The groups in the first grouping result fall into two classes, those that cannot be divided further and those that can, which satisfy the following conditions respectively:
Cannot be divided further: the group satisfies either of the following two conditions, a or b:
a. the group contains no password box;
b. the group contains only 1 password box (N consecutive password boxes are counted as 1).
Can be divided further: the group contains several non-consecutive password boxes, for example (text1, password1, text2, password2).
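The divisibility conditions above can be sketched by counting runs of adjacent password boxes. InputKind, CountPasswordRuns, and CanDivideAgain are names assumed for this illustration; the group is represented simply as the sequence of box types in document order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical type of an <input> box within a group.
enum class InputKind { Text, Password };

// Counts maximal runs of adjacent password boxes, so that N consecutive
// password boxes are counted as one.
int CountPasswordRuns(const std::vector<InputKind>& group) {
    int runs = 0;
    bool inRun = false;
    for (InputKind kind : group) {
        const bool isPassword = (kind == InputKind::Password);
        if (isPassword && !inRun) ++runs;  // a new run starts here
        inRun = isPassword;
    }
    return runs;
}

// A group can be divided further only when it contains several
// non-consecutive password boxes, i.e. more than one run.
bool CanDivideAgain(const std::vector<InputKind>& group) {
    return CountPasswordRuns(group) > 1;
}
```

With this representation, (text1, password1, text2, password2) has two runs and is divisible, while a group with no password box or a single run is not.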
S3: if the grouping result contains groups that can be divided further, continue grouping.
For the grouping results group0 and group2, judge by the above conditions whether each can be divided further; if so, regroup that tag group according to the above grouping rule, then repeat step S3 until no group in the result can be divided further. The specific judgment is: for each tag group in the grouping result, if the current tag group contains two or more tag objects and those tag objects do not all have the same parent node, the tag objects in the current tag group that have the same parent node are placed into another tag group.
S4: delete redundant <text> tags from the tag group.
If the current tag group is a login form that cannot be divided further, and the number of <text> input boxes in the tag group is greater than 1, the redundant <text> tags are deleted, as for example in tag group group0 (text0, text1, password1). The specific method is as follows:
First, find the nearest parent node of the <password> input box in tag group group0.
According to the node path, find the nearest parent node B of the <password> input box and determine whether a <text> input box exists in parent node B; if so, take the <text> and <password> tags in parent node B as a new tag group.
As shown in Fig. 3, in tag group group0, <password1> has two parent nodes, parent node A and parent node B, and group0 contains only one password input box, <password1>, so it cannot be divided further. The nearest parent node B of <password1> is found; this parent node B contains the <text1> input box. The <text> and <password> tags in parent node B are therefore taken as a new tag group group1 (text1, password1).
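A minimal sketch of step S4, assuming each box records the node path of its immediate parent. InputTag and NarrowToNearestParent are hypothetical names, and the paths "/A" and "/A/B" merely stand in for parent nodes A and B of Fig. 3.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical <input> box: name, whether it is a password box, and the
// node path of its nearest (immediate) parent.
struct InputTag {
    std::string name;
    bool isPassword;
    std::string parentPath;
};

// Keeps only the boxes that share the password box's nearest parent,
// provided that parent also holds at least one text box; otherwise the
// group is returned unchanged.
std::vector<InputTag> NarrowToNearestParent(const std::vector<InputTag>& group) {
    const InputTag* password = nullptr;
    for (const InputTag& tag : group) {
        if (tag.isPassword) { password = &tag; break; }
    }
    if (password == nullptr) return group;  // nothing to narrow around

    std::vector<InputTag> narrowed;
    bool hasText = false;
    for (const InputTag& tag : group) {
        if (tag.parentPath == password->parentPath) {
            narrowed.push_back(tag);
            if (!tag.isPassword) hasText = true;
        }
    }
    return hasText ? narrowed : group;
}
```

Applied to group0 with text0 under "/A" and text1 and password1 under "/A/B", the sketch yields the new group (text1, password1).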
S5: parse the grouping result.
Parsing is performed according to the matching relationships stored in advance in the parsing mapping table, as follows:
The form type is determined from the number of <input> boxes of type password in each tag group, together with their association and positional relationship with the text boxes:
1) Login form: if the current tag group contains only 1 password input box, and the tag object at the next sibling node of the password input box is not a text input box, the current tag group is a login form. In other words, if the current tag group contains only 1 password box and no text box follows it, the current tag group is a login form.
2) Registration form: if the current tag group contains only 1 password input box, and the tag object at the next sibling node of the password input box is a text input box, the current tag group is a registration form. In other words, if the current tag group contains only 1 password box but a text box follows it, the current tag group is a registration form.
3) Registration form: if the current tag group contains 2 password input boxes that are consecutive in position and siblings of each other, the current tag group is a registration form. In other words, if the current tag group contains 2 password boxes (necessarily consecutive, since non-consecutive ones could be divided further), the current tag group is a registration form.
4) Password-change form: if the current tag group contains 3 password input boxes that are consecutive in position and siblings of each other, the current tag group is a password-change form. In other words, if the current tag group contains 3 password boxes, the current tag group is a password-change form.
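The four matching rules above can be sketched as a single function. This is an illustration under assumptions: the group is given as the sequence of box types in document order, adjacency in that sequence stands in for the sibling relationship, and InputKind, FormType, and ClassifyGroup are names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class InputKind { Text, Password };
enum class FormType { Login, Registration, PasswordChange, Unknown };

// Classifies a group that can no longer be divided, following rules 1) to 4).
FormType ClassifyGroup(const std::vector<InputKind>& group) {
    std::size_t first = 0;  // index of the first password box
    std::size_t count = 0;  // total number of password boxes
    for (std::size_t i = 0; i < group.size(); ++i) {
        if (group[i] == InputKind::Password) {
            if (count == 0) first = i;
            ++count;
        }
    }
    if (count == 0) return FormType::Unknown;
    // All password boxes must be adjacent; otherwise the group is still
    // divisible and should not be classified yet.
    for (std::size_t i = first; i < first + count; ++i) {
        if (group[i] != InputKind::Password) return FormType::Unknown;
    }
    if (count == 1) {
        const bool textFollows = first + 1 < group.size() &&
                                 group[first + 1] == InputKind::Text;
        return textFollows ? FormType::Registration : FormType::Login;
    }
    if (count == 2) return FormType::Registration;
    if (count == 3) return FormType::PasswordChange;
    return FormType::Unknown;
}
```

Groups with no password box, or with more than three, fall outside the four rules and are reported as Unknown in this sketch.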
S6: determine the user name of the login form.
For a login form group, the text box immediately before the password box is taken to be the user name. In this way the correspondence between the text box and the password box of the login form is made explicit and can be used for subsequent automatic form filling or other operations.
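A minimal sketch of this user name rule, assuming the login form group is an ordered list of boxes. InputBox and FindUserNameIndex are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical box of a login form group, in document order.
struct InputBox {
    std::string name;
    bool isPassword;
};

// Returns the index of the text box immediately before the first password
// box, which is taken to be the user name box, or -1 if there is none.
int FindUserNameIndex(const std::vector<InputBox>& loginForm) {
    for (std::size_t i = 0; i < loginForm.size(); ++i) {
        if (loginForm[i].isPassword) {
            const bool textBefore = i > 0 && !loginForm[i - 1].isPassword;
            return textBefore ? static_cast<int>(i - 1) : -1;
        }
    }
    return -1;  // no password box in the group
}
```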
In the prior art, identifying a login form in a web page file requires first finding the <form> tag in the web page, then searching for the <input> input boxes inside that <form> tag, treating the <input> boxes found as one form, and then judging whether that form is a login form. However, because some <input> input boxes lie outside any <form>, the prior art cannot identify such web page files. To work around this problem, the prior art relies on manual analysis to identify login forms. The login form identification method proposed by the present application does not need to consider whether a <form> exists in the web page and, without any manual intervention, can raise the login form identification success rate from the original 40% to 90%.
The markup file parsing method of the present application has been described above through several embodiments. To help those skilled in the art better understand its implementation, the method is described in further detail below with reference to a concrete C++ code example. The detailed process is:
First, find all <input> input boxes in the target web page and generate an <input> tag set. Specifically, this is done by calling the find methods of the form class CFormFinder.
Then, group the <input> input boxes that were found. Specifically, compute the path of each <input> object in the <input> tag set and group the <input> objects that have the same parent node; this is implemented by recursively calling the DivideIntoGroups() method defined in the input-box grouping class CInputGroup.
Finally, determine whether each group of <input> input boxes is a login form, a registration form, or a password-change form. Specifically, this is done by calling methods of the CInputGroup class.
Part of the reference code is described below:
The bstrType used below is the value of the type attribute of the current <input> object. According to the HTML specification, the values of the type attribute of <input> include the following: button, checkbox, file, hidden, image, password, radio, reset, submit, text.
Identifying the type of an <input> object means recognizing the cases in which the type value is text (text input box), password (password input box), or submit (submit button). The specific identification method is:
If the value of bstrType is "password", the <input> is a password input box;
If the value of bstrType is "text", the <input> is a text input box; it is then further determined whether it is a user name input box: if the value of the name attribute contains "user name", "account", "login name", or "login account", or the value of the id attribute contains "userid", the input box is considered to be a user name input box;
If the value of bstrType is "submit", the <input> is a submit button.
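The type identification above can be sketched as follows. This illustration uses std::string in place of the COM string type implied by the name bstrType, and uses the keyword strings exactly as they are written in this text; InputRole and IdentifyRole are hypothetical names and do not reproduce the CInputItem class.

```cpp
#include <cassert>
#include <string>

enum class InputRole { Password, UserName, Text, Submit, Other };

// Maps the type, name, and id attribute values of an <input> to a role.
InputRole IdentifyRole(const std::string& type, const std::string& name,
                       const std::string& id) {
    if (type == "password") return InputRole::Password;
    if (type == "submit") return InputRole::Submit;
    if (type == "text") {
        // Keywords whose presence in the name attribute marks a user name box.
        static const char* const kNameHints[] = {
            "user name", "account", "login name", "login account"};
        for (const char* hint : kNameHints) {
            if (name.find(hint) != std::string::npos) return InputRole::UserName;
        }
        if (id.find("userid") != std::string::npos) return InputRole::UserName;
        return InputRole::Text;
    }
    return InputRole::Other;  // button, checkbox, file, hidden, and so on
}
```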
The CInputItem class represents an input object. It provides methods for judging whether the input object is a password box, a text box, a user name box, and so on.
The CFormFinder class is used to find forms. The following important methods are implemented in this class:
Method 1: find forms from a Browser.
Method 2: find forms from a Document.
Method 3: find forms from a Collection. This method finds forms within a set of input boxes; it calls the DivideIntoGroups method of the CInputGroup class described below, which groups the input boxes and determines whether each group is a form.
The CInputGroup class represents a group of input boxes; the input boxes placed in the same group all have the same parent node. The important methods implemented in CInputGroup are as follows:
(1) Initialization function: identifies and classifies the current tag group to determine the type of the group, for example a discarded group, a group that can be split, a login form, or a registration form.
The pseudo-code of this initialization function is expressed as follows:
(2) Judging whether the group is a login form.
IsCanDivide(): the function used for grouping, implemented by recursive calls.
DivideIntoGroups(): groups the input boxes and determines whether each group is a form. Part of the pseudo-code of this method is as follows:
The markup file parsing method of the present application has been described in detail above through specific embodiments. With reference to the above embodiments, a web page filling method of the present application is described below:
Web page auto-fill means that when a user needs to enter an account and password on a web page to log in to a website, the system automatically displays the account and password information previously entered by that user in the corresponding input boxes of the web page file, for the user's convenience.
To achieve this, the web page filling method of the present application comprises at least: markup file parsing; target data storage; and target data filling;
the markup file parsing comprises:
obtaining the tag objects in a markup file and generating a tag set;
grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
wherein the matched tag group is the target input item;
the target data storage comprises:
obtaining target data from the target input item and storing it in configuration information;
the target data filling comprises:
obtaining the target data from the configuration information and filling it into the target input item corresponding to the markup file.
Further, the markup file is an HTML web page file and the tag object is <input>.
If the target input item obtained after parsing the HTML web page file is a user login form containing a user name input box and a password input box, then after the user fills in the login form and submits the login request to the server, the user name and password information in the login form are recorded and stored in a cookie file on the client.
When the HTML web page file is again requested for display in the browser, the user name and password information corresponding to the login form of the current HTML web page file are extracted from the client cookie file and displayed at the corresponding positions of the login form in the HTML web page.
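The store-then-fill flow can be sketched as below. An in-memory map keyed by page address stands in for the client cookie file, which is an assumption of this illustration; Credentials and CredentialStore are hypothetical names, and a real implementation would also need to protect the stored password.

```cpp
#include <cassert>
#include <map>
#include <string>

// Target data of a login form: the user name and the password.
struct Credentials {
    std::string userName;
    std::string password;
};

// Hypothetical stand-in for the configuration information (for example a
// client cookie file) that holds target data per web page.
class CredentialStore {
public:
    // Target data storage: remember what the user submitted on this page.
    void Save(const std::string& pageUrl, const Credentials& submitted) {
        saved_[pageUrl] = submitted;
    }

    // Target data filling: copy stored data into the login form of the page.
    // Returns false, leaving `form` untouched, when nothing is stored.
    bool Fill(const std::string& pageUrl, Credentials& form) const {
        const auto it = saved_.find(pageUrl);
        if (it == saved_.end()) return false;
        form = it->second;
        return true;
    }

private:
    std::map<std::string, Credentials> saved_;
};
```

Save corresponds to the moment the login request is submitted, and Fill to the moment the same page is requested again.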
The above is a brief description of one embodiment of the web page filling method of the present application; for further details, refer to the corresponding parts of the above embodiments, which are not repeated here.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present application is not limited by the order of actions described, because according to the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present application.
Referring to Fig. 4, a schematic structural diagram of one embodiment of a markup file parsing device of the present application is shown. The device comprises:
an acquisition module 410, configured to obtain the tag objects in a markup file;
a set generation module 420, configured to generate a tag set from the tag objects obtained by the acquisition module 410;
a grouping module 430, configured to group the tag set generated by the set generation module 420 according to a common attribute of the tag objects in the tag set;
a tag group acquisition module 440, configured to obtain the one or more tag groups generated by the grouping module 430;
a markup file parsing mapping table 460, configured to store in advance the parsing mapping information of the markup file;
a parsing module 450, configured to match the attributes of the tag objects in the one or more tag groups obtained by the tag group acquisition module 440 against the markup file parsing mapping table 460, and to obtain, from the matched tag groups, the data used for parsing the markup file.
Preferably, the grouping module 430 further comprises:
a first grouping unit 431, configured to place the tag objects in the tag set that have the same parent node into the same tag group.
Preferably, the grouping module 430 further comprises:
a second grouping unit 432, configured to determine whether the current tag group contains two or more tag objects that do not all have the same parent node and, if so, to place the tags in the current tag group that have the same parent node into another tag group; the second grouping unit repeats this operation until the current tag group can no longer be divided.
Preferably, the markup file is a HyperText Markup Language (HTML) file.
Preferably, the tag object obtained by the acquisition module 410 is an <input> input box.
Preferably, the parsing module 450 further comprises:
a login form recognition unit 451, configured to determine that the current tag group is a login form if it contains only 1 password input box and the tag object at the next sibling node of the password input box is not a text input box.
Preferably, the parsing module 450 further comprises:
a first registration form recognition unit 452, configured to determine that the current tag group is a registration form if it contains only 1 password input box and the tag object at the next sibling node of the password input box is a text input box.
Preferably, the parsing module 450 further comprises:
a second registration form recognition unit 453, configured to determine that the current tag group is a registration form if it contains 2 password input boxes that are consecutive in position and siblings of each other.
Preferably, the parsing module 450 further comprises:
a password-change form recognition unit 454, configured to determine that the current tag group is a password-change form if it contains 3 password input boxes that are consecutive in position and siblings of each other.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.
The markup file parsing method and device and the web page filling method provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.