CN103577578A - Marker file parsing method and device - Google Patents


Info

Publication number
CN103577578A
Authority
CN
China
Prior art keywords
label
packet
tab file
input frame
current group
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310548150.1A
Other languages
Chinese (zh)
Other versions
CN103577578B (en)
Inventor
杭程
李超
万勇
任寰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201310548150.1A
Priority claimed from CN2012100913114A (CN102651019B)
Publication of CN103577578A
Application granted
Publication of CN103577578B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Abstract

The invention provides a marker file parsing method and device that address the low success rate of marker file parsing in the prior art. The method includes: acquiring tag objects in the marker file to generate a tag set; grouping the tag objects according to common attributes of the tag objects in the tag set; acquiring one or more tag groups from the grouping result; matching the attributes of the tag objects in the one or more tag groups against a preset marker file parsing mapping table; and acquiring data for file parsing from the matched tag groups. Grouping the tag objects by their common attributes establishes a correlation between the originally unordered tag objects in the marker file, which greatly facilitates further matching analysis and effectively improves the success rate of parsing the marker file.

Description

A markup file (marker file) parsing method and device
This application is a divisional application of Chinese invention patent application No. 201210091311.4, filed on March 30, 2012 and entitled "A markup file parsing method and device".
Technical field
The application relates to the technical field of data parsing, and in particular to a markup file parsing method and device.
Background art
Internet technology now deeply affects people's lives. Applications such as e-mail, forums and web games have become an indispensable part of daily work and entertainment. Most of these Internet applications, however, can only be used after the user has registered and logged in, so users have to remember a large number of user names and passwords. For account security, users usually also need to set fairly complex passwords that combine digits, letters and special symbols, which makes them harder to remember, and the passwords must be typed in manually at every login. All of this is a burden on the user. Automatic web form filling is a technique that addresses this problem: it saves the user name and password that the user enters on a web page and, the next time the user opens the same page, automatically fills in the saved user name and password. The user no longer has to remember and type a large number of user names and passwords and can use network resources and services more conveniently.
The key to automatic form filling is to detect in advance whether the page contains a form that the user fills in and submits, and to determine whether it is a login form or a registration form. Only after these forms have been identified can the subsequent steps of saving the data and helping the user fill in the form be carried out.
At present, existing techniques for identifying web forms generally comprise the following steps:
First, obtain the forms in the web page. The HyperText Markup Language (HTML) code of the page is checked for a <form> tag; if one exists, all <input> input boxes inside that <form> tag are treated as one form. See the following code sample of a <form> form:
Second, identify the type of the obtained form to determine whether it is a target for automatic form filling. The key is to recognise whether the form is a login form. Specifically, this is decided by counting the <input> input boxes in the form whose type attribute is password. If the form contains exactly one password input box, it is considered a login form.
Finally, operations such as automatic filling are performed on the identified login form.
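The type check in the second step above can be illustrated with a minimal C++ sketch. The `FormInput` struct and the `IsLoginFormPriorArt` function are hypothetical names introduced here for illustration only; the sketch assumes the <input> boxes of one <form> have already been collected.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified model of an <input> box: only its type attribute.
struct FormInput {
    std::string type;  // e.g. "text", "password", "submit"
};

// Prior-art heuristic as described above: a form is treated as a login form
// when it contains exactly one password input box.
bool IsLoginFormPriorArt(const std::vector<FormInput>& form_inputs) {
    const auto password_count = std::count_if(
        form_inputs.begin(), form_inputs.end(),
        [](const FormInput& input) { return input.type == "password"; });
    return password_count == 1;
}
```

Because this check only ever runs on inputs found inside a <form> tag, it never sees input boxes placed in other containers.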
As can be seen, the existing form identification process must first find a <form> tag in the web page; only after the input boxes inside that <form> tag have been taken as a form can further analysis proceed. However, not all forms in today's web pages are implemented with a <form> tag. Forms are implemented in many ways, for example by placing the input boxes inside a <div> tag, as in the following HTML code:
[Figure BDA0000409501800000022: HTML code of a form implemented inside a <div> tag, not reproduced in this text]
Faced with the large number of forms that lack a <form> tag, the prior-art form identification method fails, so its success rate is very low. According to statistics, the form identification success rate of the above prior art is only about 40%.
In short, a technical problem that those skilled in the art urgently need to solve is how to overcome the identification failures, and the resulting low success rate, that existing web form identification techniques suffer when a web page lacks a <form> tag.
Summary of the invention
The technical problem to be solved by the application is to provide a markup file parsing method and device that effectively address the low success rate of the prior art when parsing HTML web pages.
To solve the above problem, the application discloses a markup file parsing method, comprising:
Obtaining the tag objects in a markup file and generating a tag set;
Grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
Matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
Obtaining, from the matched tag groups, the data used for parsing the markup file.
Preferably, grouping the tag objects according to the common attribute of the tag objects in the tag set comprises:
Placing the tag objects in the tag set that have the same parent node in the same tag group.
Preferably, after the one or more tag groups are obtained from the grouping result, the method further comprises regrouping the tag groups, comprising:
Examining the one or more tag groups in the current grouping result; if the current tag group contains two or more tag objects that do not share the same parent node, placing the tag objects in the current tag group that have the same parent node in another tag group; and repeating the above step on the current tag group until it can no longer be regrouped.
Preferably, the markup file is a HyperText Markup Language (HTML) file.
Preferably, the tag object is an <input> input box;
Matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains only one password input box, and the tag object at the next sibling node of the password input box is not a text input box, the current tag group is a login form.
Preferably, if the current tag group is a login form, the password input box in the current tag group has multiple levels of parent nodes, the current tag group cannot be divided further, and the nearest parent node of the password input box contains at least one text input box, then the tag objects in the nearest parent node of the password input box are placed in a new tag group.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains only one password input box, and the tag object at the next sibling node of the password input box is a text input box, the current tag group is a registration form.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains 2 password input boxes that are consecutive in position and are siblings of each other, the current tag group is a registration form.
Preferably, matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains 3 password input boxes that are consecutive in position and are siblings of each other, the current tag group is a password change form.
Preferably, before the one or more tag groups are parsed according to the predetermined rule, the method further comprises:
To solve the above problem, the application also discloses a web page filling method, comprising: markup file parsing; target data storage; and target data filling;
The markup file parsing comprises:
Obtaining the tag objects in a markup file and generating a tag set;
Grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
Matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
Taking the matched tag groups as target input items;
The target data storage comprises:
Obtaining target data from the target input items and storing it in configuration information;
The target data filling comprises:
Obtaining the target data from the configuration information and filling it into the target input items of the corresponding markup file.
To solve the above problem, the application also discloses a markup file parsing device, comprising:
An acquisition module, configured to obtain the tag objects in a markup file;
A set generation module, configured to generate a tag set from the tag objects obtained by the acquisition module;
A grouping module, configured to group the tag objects according to a common attribute of the tag objects in the tag set;
A tag group acquisition module, configured to obtain the one or more tag groups generated by the grouping module;
A parsing module, configured to match the attributes of the tag objects in the one or more tag groups obtained by the tag group acquisition module against a preset markup file parsing mapping table, and to obtain the data used for parsing the markup file from the matched tag groups.
Preferably, the grouping module further comprises:
A first grouping unit, configured to place the tag objects in the tag set that have the same parent node in the same tag group.
Preferably, the grouping module further comprises:
A second grouping unit, configured so that, if the current tag group is judged to contain two or more tag objects that do not share the same parent node, the tag objects in the current tag group that have the same parent node are placed in another tag group; the second grouping unit repeats this operation until the current tag group can no longer be regrouped.
Preferably, the markup file is a HyperText Markup Language (HTML) file.
Preferably, the tag objects obtained by the acquisition module are <input> input boxes.
Preferably, the parsing module further comprises:
A login form identification unit, configured to judge that the current tag group is a login form if it contains only one password input box and the tag object at the next sibling node of the password input box is not a text input box.
Preferably, the parsing module further comprises:
A first registration form identification unit, configured to judge that the current tag group is a registration form if it contains only one password input box and the tag object at the next sibling node of the password input box is a text input box.
Preferably, the parsing module further comprises:
A second registration form identification unit, configured to judge that the current tag group is a registration form if it contains 2 password input boxes that are consecutive in position and are siblings of each other.
Preferably, the parsing module further comprises:
A password change form identification unit, configured to judge that the current tag group is a password change form if it contains 3 password input boxes that are consecutive in position and are siblings of each other.
Compared with the prior art, the application has the following advantages:
When identifying an HTML web page file, the prior art must first find a <form> tag and then look up the <input> input boxes under that <form> tag. Because a large number of HTML web page files lack a <form> tag, existing identification techniques cannot obtain the <input> input boxes when no <form> tag is found, so parsing fails for many HTML web page files on the Internet and the parsing success rate is low. The application therefore proposes to search the whole HTML web page file for <input> tags and to group the resulting <input> tag set by a common attribute of the <input> tags. <input> tags that had no obvious connection are thereby associated through their shared attribute into one or more tag groups, and because the <input> tags in one tag group share the same common attribute, they are easy to analyse further. This process does not need to consider the position or form of the <input> tags in the HTML web page file at all, so web page files without a <form> tag can still be parsed successfully, which greatly improves the success rate of parsing HTML web page files. According to statistics, the parsing success rate for web page files can be raised from about 40% in the prior art to about 90%.
Brief description of the drawings
Fig. 1 is a schematic flowchart of one embodiment of the markup file parsing method of the application;
Fig. 2 is a schematic flowchart of another embodiment of the markup file parsing method of the application;
Fig. 3 is a schematic diagram of the tag distribution in an HTML web page file;
Fig. 4 is a schematic structural diagram of one embodiment of the markup file parsing device of the application.
Detailed description
To make the above objects, features and advantages of the application clearer, the application is described in further detail below with reference to the drawings and specific embodiments.
At present, describing or storing data in a markup language has become the most important way of representing and storing data. Examples include HTML, HTML5, eXtensible HyperText Markup Language (XHTML) and Extensible Markup Language (XML). The main characteristic of such markup languages is that they all use a set of markup tags to organise or store data. In the application, a markup file is a file that organises its data with markup tags.
Referring to Fig. 1, a schematic flowchart of a markup file parsing method of the application is shown. The method is as follows:
Step 101: obtain the tag objects in the markup file and generate a tag set.
The markup file is the file to be parsed, which organises or stores data with markup tags. Depending on how the data is stored and presented, a particular tag may occur many times in one markup file. For example, when a web page needs to show several links, the <a></a> tag used for links occurs repeatedly, in a fixed format, in the markup file of that page:
<a href="http://www.google.com" name="google">This is a link</a>
<a href="http://www.360.com" name="360">This is a link</a>
When the data organised by the tag objects needs to be parsed, the tag objects are looked up in the markup file and a tag set is generated.
Note that a markup file usually organises its data with a variety of tags, such as <label>, <table>, <form> and <input>. A tag object is an object instance of a tag in the markup file. For example, the <a> tag above occurs twice in the web page, so obtaining the tag objects of the <a> tag from that page yields two object instances of the <a> tag:
<a href="http://www.google.com" name="google">This is a link</a>
<a href="http://www.360.com" name="360">This is a link</a>
In addition, obtaining tag objects from a markup file here usually means obtaining the tag objects of one or several specific classes of tag. Which classes of tag to obtain can be decided by those skilled in the art according to the purpose of the parsing when implementing the application, and the application places no limit on this.
Preferably, the tag objects are obtained by searching the full text of the markup file for all tag objects of a given class of tag, so that none are missed and parsing accuracy improves.
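As an illustration of the full-text search described above, the following C++ sketch collects every opening tag of one class from the markup text. It is only a sketch under stated assumptions: `CollectTagObjects` is a hypothetical helper, the tag name is assumed to consist of plain letters, and a regular-expression scan stands in for whatever document traversal an actual implementation would use.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every opening tag with the given name found anywhere in the markup
// text. The scan is case-insensitive and ignores where the tag is nested.
std::vector<std::string> CollectTagObjects(const std::string& markup,
                                           const std::string& tag_name) {
    // Matches "<name>" or "<name ...>"; requiring whitespace or ">" after the
    // name keeps "<a" from matching tags such as "<abbr>".
    const std::regex pattern("<" + tag_name + "(\\s[^>]*)?>", std::regex::icase);
    std::vector<std::string> tag_set;
    for (std::sregex_iterator it(markup.begin(), markup.end(), pattern), end;
         it != end; ++it) {
        tag_set.push_back(it->str());
    }
    return tag_set;
}
```

Calling `CollectTagObjects(page, "a")` on a page holding the two links above would return two entries, one per <a> tag object.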
Step 102: group the tag objects according to a common attribute of the tag objects in the tag set.
Different common attributes of the tag objects can be selected according to the type of markup file or the purpose of the parsing, and the tag objects are grouped by the selected attribute. A common attribute of tag objects is attribute information that the tag objects share, such as the position of the tag object, the tag type or the length.
Step 103: obtain one or more tag groups.
One or more tag groups are obtained from the grouping result of the preceding step.
Step 104: match the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table, and obtain the data used for parsing the markup file from the matched tag groups.
The markup file parsing mapping table stores in advance the information needed to match tag groups. Those skilled in the art will readily understand that the content of the parsing mapping table can be set according to factors such as the purpose of the parsing and the type of markup file. For example, different types of markup file such as HTML and XML, or the same type of markup file parsed for different tags, may use different parsing mapping tables.
In the prior art, when the content of the tag objects of a certain tag in a markup file needs to be parsed, the tags are located directly and then analysed one by one. The biggest problem with this approach is that the data to be parsed contains too much noise, which seriously reduces the identification success rate. The same class of tag may be written in several ways in a markup file. An <input> tag, for example, may be placed inside a <form> tag or outside it, or the file may not use a <form> tag at all. If the <input> tags are searched for across the whole markup file, the relationships among them are unknown, so all <input> tags can only be analysed as a single group, which inevitably brings in a lot of unrelated content and hence too much noise. If only the <input> tags inside <form> tags are considered, valid <input> tags may be missed because no <form> tag is found, and the identification success rate is low.
To address these problems, the application proposes the following: the tag objects in the tag set obtained from the markup file are grouped by a common attribute to obtain tag groups, and the resulting one or more tag groups are then parsed to obtain the parsing result for the whole markup file. Because the tag objects within a group all share the same common attribute, they are related to one another, which makes further analysis easier, unlike the prior art, in which the tags are unordered. In addition, because the different forms that tag objects take in different markup files need not be considered, the applicability and success rate of file parsing are improved.
Referring to Fig. 2, a schematic flowchart of another embodiment of the markup file parsing method of the application is shown. The method is as follows:
Step 201: obtain all tag objects in the markup file and generate a tag set.
Step 202: group the tag objects in the tag set, which specifically comprises placing the tag objects that have the same parent node in the same tag group.
In this example, having the same parent node is the common attribute on which the grouping of the tag objects in the tag set is based. Because the purpose of this parsing is to analyse tags that are related by position, the same parent node is used as the grouping criterion. Specifically, the node path of each tag object is computed first, and the tag objects that have the same parent node are then placed in the same tag group.
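The node path computation mentioned above might look like the following C++ sketch. The `DomNode` struct, with an identifier and a parent pointer, is an assumed stand-in for a real document node, and `NodePath` and `HasSameParent` are hypothetical helper names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumed minimal document node: an illustrative id plus a link to its parent.
struct DomNode {
    std::string id;         // e.g. "A", "B", "text1"
    const DomNode* parent;  // nullptr for a top-level node
};

// Walk the parent links upward and return the ancestor ids, outermost first.
std::vector<std::string> NodePath(const DomNode& node) {
    std::vector<std::string> path;
    for (const DomNode* current = node.parent; current != nullptr;
         current = current->parent) {
        path.push_back(current->id);
    }
    std::reverse(path.begin(), path.end());
    return path;
}

// Two tag objects have the same parent node when their parent links coincide.
bool HasSameParent(const DomNode& lhs, const DomNode& rhs) {
    return lhs.parent == rhs.parent;
}
```

With a container B nested inside a container A, an input box placed in B would get the path {"A", "B"}, while an input box placed directly in A would get {"A"}.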
Step 203: use recursion to regroup the tag objects under the same parent node, which specifically comprises:
Examining the one or more tag groups in the current grouping result; if the current tag group contains two or more tag objects that do not share the same parent node, placing the tag objects in the current tag group that have the same parent node in another tag group; and repeating the above step on the current tag group until it can no longer be regrouped.
An example of this process is described in code as follows:
[Figures BDA0000409501800000091 and BDA0000409501800000101: code listing for the regrouping process, not reproduced in this text]
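Since the original listing is not available in this text, the following C++ sketch shows one possible reading of the recursive regrouping of step 203. It is not the patent's own code: `InputNode`, `TagGroup`, `SharesDirectParent` and `DivideIntoGroupsSketch` are hypothetical names, each tag object is assumed to carry its ancestor path (outermost first), and only the same-parent test of step 203 is applied.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct InputNode {
    std::string name;               // illustrative label, e.g. "text1"
    std::vector<std::string> path;  // ancestor ids, outermost first (assumed)
};
using TagGroup = std::vector<InputNode>;

// True when every member has the same ancestor path, i.e. the same parent.
bool SharesDirectParent(const TagGroup& group) {
    return std::all_of(group.begin(), group.end(),
                       [&group](const InputNode& node) {
                           return node.path == group.front().path;
                       });
}

// Split `group`, whose members share their first `depth` ancestors, until
// every resulting group has a single common parent node.
void DivideIntoGroupsSketch(const TagGroup& group, std::size_t depth,
                            std::vector<TagGroup>& result) {
    if (group.empty()) return;
    if (group.size() < 2 || SharesDirectParent(group)) {
        result.push_back(group);  // cannot be regrouped any further
        return;
    }
    TagGroup direct_children;                  // parent is the node at this level
    std::map<std::string, TagGroup> by_child;  // nested under a deeper container
    for (const InputNode& node : group) {
        if (node.path.size() <= depth) {
            direct_children.push_back(node);
        } else {
            by_child[node.path[depth]].push_back(node);
        }
    }
    if (!direct_children.empty()) result.push_back(direct_children);
    for (const auto& entry : by_child) {
        DivideIntoGroupsSketch(entry.second, depth + 1, result);
    }
}
```

Starting the recursion at depth 0 with the whole tag set performs the initial grouping and the regrouping in one pass. The recursion ends because each call either emits a group or descends one level along finite paths.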
Step 204: obtain one or more tag groups from the grouping result.
Step 205: parse the one or more tag groups in the grouping result.
The attributes of the tag objects in the one or more tag groups are matched against a preset markup file parsing mapping table, and the data used for parsing the markup file is obtained from the matched tag groups.
The markup file parsing method proposed by the application has been described in general terms above. Below, parsing an HTML web page file is taken as an example, and the process of parsing and identifying the <input> input boxes in the HTML web page file is described in detail. In this embodiment, the purpose of parsing the HTML web page is to identify the user login form in the page; the identified login form is typically used to support automatic web form filling.
Referring to Fig. 3, a schematic diagram of the distribution of <input> input boxes in an HTML web page file to be parsed is shown. The parsing process is as follows:
The HTML web page file shown in Fig. 3 contains 8 <input> input boxes: text0, text1, password1, text2, password2, text3, password3 and text4. Here text means that the type of the <input> is a text input box, and password means that the type of the <input> is a password input box. The <input> code shown below each input box is the source code of that input box in the HTML file.
S1: obtain all <input> input boxes from the HTML web page file.
The result is an <input> set containing the 8 <input> input boxes:
{text0, text1, password1, text2, password2, text3, password3, text4}.
S2: group the <input> input box set.
Grouping rule: <input> boxes that have the same parent node are placed in the same <input> group.
First grouping result:
Tag group group0 (text0, text1, password1), belonging to parent node A;
Tag group group2 (text2, password2, password3, text3, text4), belonging to parent node C.
Each group in the first grouping result falls into one of two classes, not regroupable or regroupable, defined by the following conditions:
Not regroupable: the group satisfies either of the following conditions a and b:
a. the group contains no password;
b. the group contains only one password (N consecutive passwords are counted as one).
Regroupable: the group contains several non-consecutive passwords, for example (text1, password1, text2, password2).
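The regroupable test above reduces to counting runs of consecutive password boxes. The C++ sketch below illustrates this; `CountPasswordRuns` and `CanRegroup` are hypothetical names, and a group is assumed to be given as the ordered list of its input types.

```cpp
#include <string>
#include <vector>

// Count runs of consecutive password boxes in document order.
// N adjacent passwords form a single run, matching condition b above.
int CountPasswordRuns(const std::vector<std::string>& types) {
    int runs = 0;
    bool in_run = false;
    for (const std::string& type : types) {
        const bool is_password = (type == "password");
        if (is_password && !in_run) ++runs;  // a new run starts here
        in_run = is_password;
    }
    return runs;
}

// A group is regroupable only when it holds several non-consecutive passwords.
bool CanRegroup(const std::vector<std::string>& types) {
    return CountPasswordRuns(types) > 1;
}
```

Under this test, group2 (text2, password2, password3, text3, text4) is not regroupable, because its two passwords are adjacent and count as one run.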
S3: if the grouping result contains a regroupable group, continue grouping.
Each of the groups group0 and group2 is checked against the above conditions to see whether it is regroupable. A regroupable tag group is regrouped according to the above grouping rule, and step S3 is then repeated until no group in the result can be divided further. The specific check is as follows: for each tag group in the grouping result, if the current tag group contains two or more tag objects that do not share the same parent node, the tag objects in the current tag group that have the same parent node are placed in another tag group.
S4: delete surplus <text> input boxes from the tag group.
If the current tag group is a login form, cannot be divided further, and contains more than one <text> input box, the surplus <text> input boxes are deleted; tag group group0 (text0, text1, password1) is an example. The specific method is as follows:
First, find the nearest parent node of the <password> input box in tag group group0.
The nearest parent node B of the <password> input box is found from the node path, and parent node B is checked for a <text> input box; if one exists, the <text> and <password> input boxes in parent node B form a new tag group.
As shown in Fig. 3, <password1> in tag group group0 has two parent nodes, parent node A and parent node B, and group0 contains only one <password1> input box, so it cannot be divided further. The nearest parent node of <password1> is B, and parent node B contains the <text1> input box. The <text> and <password> input boxes in parent node B are therefore taken as a new tag group group1 (text1, password1).
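Step S4 can be sketched in C++ as follows. This is an illustrative reading rather than the original implementation: `InputNode` and `KeepNearestParentInputs` are hypothetical names, and each input box is assumed to carry its ancestor path so that "nearest parent node" can be compared by path equality.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct InputNode {
    std::string name;               // label from Fig. 3, e.g. "text1"
    std::string type;               // "text" or "password"
    std::vector<std::string> path;  // ancestor ids, outermost first (assumed)
};

// Keep only the inputs that sit in the nearest parent node of the password
// box, provided that parent also holds a text box; otherwise keep the group.
std::vector<InputNode> KeepNearestParentInputs(const std::vector<InputNode>& group) {
    const auto password = std::find_if(
        group.begin(), group.end(),
        [](const InputNode& node) { return node.type == "password"; });
    if (password == group.end()) return group;  // no password box to anchor on

    std::vector<InputNode> nearest;
    bool has_text = false;
    for (const InputNode& node : group) {
        if (node.path == password->path) {  // same nearest parent as the password
            nearest.push_back(node);
            has_text = has_text || node.type == "text";
        }
    }
    if (has_text) return nearest;
    return group;
}
```

Applied to group0 with text0 under A and text1, password1 under A/B, the function returns (text1, password1), which corresponds to group1 in the example above.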
S5: parse the grouping result.
Parsing follows the matching relationships stored in advance in the parsing mapping table, and comprises:
Determining the form type from the number of <input> boxes of type password in each tag group and from their association with, and position relative to, the text boxes:
1) Login form: if the current tag group contains only one password input box, and the tag object at the next sibling node of the password input box is not a text input box, the current tag group is a login form. In other words, if the current tag group has only one password and no text after it, the group is a login form.
2) Registration form: if the current tag group contains only one password input box, and the tag object at the next sibling node of the password input box is a text input box, the current tag group is a registration form. In other words, if the current tag group has only one password but a text follows it, the group is a registration form.
3) Registration form: if the current tag group contains 2 password input boxes that are consecutive in position and are siblings of each other, the current tag group is a registration form. In other words, if the current group has 2 passwords (necessarily consecutive, since non-consecutive passwords would allow further division), the group is a registration form.
4) Password change form: if the current tag group contains 3 password input boxes that are consecutive in position and are siblings of each other, the current tag group is a password change form. In other words, if the current tag group has 3 passwords, the group is a password change form.
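The four matching rules above can be expressed compactly in code. The C++ sketch below is one possible encoding under simplifying assumptions: a group is given as the ordered list of its input types, "next sibling" is approximated by the next element in that list, and `FormKind` and `ClassifyGroup` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class FormKind { kLogin, kRegistration, kPasswordChange, kUnknown };

// Classify a group that can no longer be regrouped by the length of its single
// run of password boxes and by what follows that run.
FormKind ClassifyGroup(const std::vector<std::string>& types) {
    std::size_t first = 0;
    while (first < types.size() && types[first] != "password") ++first;
    if (first == types.size()) return FormKind::kUnknown;  // no password box

    std::size_t end = first;  // one past the last password of the first run
    while (end < types.size() && types[end] == "password") ++end;

    for (std::size_t i = end; i < types.size(); ++i) {
        // A second, separate password run means the group is still regroupable.
        if (types[i] == "password") return FormKind::kUnknown;
    }

    const std::size_t run_length = end - first;
    if (run_length == 1) {
        const bool text_follows = end < types.size() && types[end] == "text";
        return text_follows ? FormKind::kRegistration : FormKind::kLogin;
    }
    if (run_length == 2) return FormKind::kRegistration;
    if (run_length == 3) return FormKind::kPasswordChange;
    return FormKind::kUnknown;
}
```

For group1 (text1, password1) the result is a login form, and for group2 (text2, password2, password3, text3, text4) it is a registration form, in line with rules 1) and 3).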
S6: determine the user name of the login form.
For a login form group, the text box before the password is taken to be the user name. This establishes the correspondence between the text and password boxes of the login form, which is then used for subsequent automatic form filling or other operations.
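A small C++ sketch of this pairing follows. `FindUserNameIndex` is a hypothetical helper, and it assumes that "the text before the password" means the nearest text box preceding the password box in the group's order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the nearest text box that precedes the first password
// box, or -1 when the group has no password box or no text box before it.
int FindUserNameIndex(const std::vector<std::string>& types) {
    int candidate = -1;
    for (std::size_t i = 0; i < types.size(); ++i) {
        if (types[i] == "password") return candidate;
        if (types[i] == "text") candidate = static_cast<int>(i);
    }
    return -1;  // no password box in this group
}
```

For group1 (text1, password1) the function returns 0, so text1 is treated as the user name box paired with password1.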
If will identify the logon form in web page files in the prior art, be could further search the <input> input frame of this <form> label inside after the <form> label that will first find in webpage, and using the <input> result finding as 1 list, then judge whether this list is logon form.But because existence is positioned at the <input> input frame of <form> outside, cause such web page files of prior art None-identified.For eliminating the problems referred to above, the mode that prior art utilization manually participates in analyzing is identified logon form.And the logon form recognition methods that the application proposes is without considering in webpage whether have <form>, without any manual intervention in the situation that, logon form recognition success rate can be brought up to 90% from original 40%.
The markup file parsing method of this application has been described above through several embodiments. To help those skilled in the art better understand how it is implemented, the method is described in further detail below with reference to a concrete C++ code sample. The detailed process is:
First, find all <input> input boxes in the target web page and generate the <input> tag set. Specifically, this is done by calling the search methods of the form class CFormFinder.
Then, group the <input> input boxes that were found. Specifically, compute the path of each <input> object in the <input> tag set and place the <input> objects that have the same parent node in one group; this is done by recursively calling the DivideIntoGroups() method defined in the input box grouping class CInputGroup.
Finally, determine whether each group of <input> input boxes is a login form, a registration form, or a password change form. Specifically, this is done by calling methods of the CInputGroup class.
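The grouping step can be illustrated with a short C++ sketch that groups <input> objects by the path of their parent node. It is only an assumption-laden illustration: InputNode and GroupByParent are hypothetical names, and the parent path is represented as a plain string rather than as a DOM reference.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record for one <input> element found in the page.
struct InputNode {
    std::string id;          // identifier of the <input> element
    std::string parentPath;  // path from the document root to its parent node
};

// Places <input> elements that share the same parent path in one group.
std::map<std::string, std::vector<InputNode>> GroupByParent(
        const std::vector<InputNode>& inputs) {
    std::map<std::string, std::vector<InputNode>> groups;
    for (const InputNode& input : inputs) {
        groups[input.parentPath].push_back(input);
    }
    return groups;
}
```

Each resulting group can then be examined on its own to decide which kind of form, if any, it represents.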
Part of the reference code is described below.
The bstrType used below is the value of the type attribute of the current <input> object. According to the HTML specification, the type of an <input> can take the following values: button, checkbox, file, hidden, image, password, radio, reset, submit, text.
Identifying the type of an <input> object means recognizing the cases where the type value is text (text input box), password (password input box), or submit (submit button). The concrete rules are:
If bstrType is "password", the <input> is a password input box;
If bstrType is "text", the <input> is a text input box. It is then further checked whether it is a user name input box: if the value of the name attribute contains "user name", "account", "login name" or "login account", or the value of the id attribute contains "userid", the input box is considered a user name input box;
If bstrType is "submit", the <input> is a submit button.
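These type rules can be sketched in C++ as shown below. The sketch uses std::string in place of the COM string type implied by the name bstrType, and InputRole, Contains and IdentifyInput are hypothetical names. The keyword list is copied from the translated description and may differ from the strings used in the original implementation.

```cpp
#include <string>

// Hypothetical roles an <input> element can play.
enum class InputRole { Password, UserName, Text, Submit, Other };

// Returns true if `haystack` contains `needle`.
static bool Contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Maps the type, name and id attributes of an <input> to a role.
InputRole IdentifyInput(const std::string& type,
                        const std::string& name,
                        const std::string& id) {
    if (type == "password") return InputRole::Password;
    if (type == "submit") return InputRole::Submit;
    if (type == "text") {
        // Keywords as given in the translated description (assumption).
        static const char* const kNameHints[] = {
            "user name", "account", "login name", "login account"};
        for (const char* hint : kNameHints) {
            if (Contains(name, hint)) return InputRole::UserName;
        }
        if (Contains(id, "userid")) return InputRole::UserName;
        return InputRole::Text;
    }
    return InputRole::Other;
}
```

All other type values (button, checkbox, and so on) fall through to the Other role in this sketch, since the description does not use them for form recognition.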
The CInputItem class represents one input object. It provides methods for judging whether the input object is a password box, a text box, a user name box, and so on.
The CFormFinder class is used to find forms. It implements the following important methods:
Method 1: find forms from a Browser.
Method 2: find forms from a Document.
Method 3: find forms from a Collection. This method searches for forms in a set of input boxes. It calls the DivideIntoGroups method of the CInputGroup class described below, which groups the input boxes and determines whether each group is a form.
The CInputGroup class represents a group of input boxes; all input boxes placed in the same group have the same parent node. The important methods implemented in CInputGroup are as follows:
(1) The initialization function, which identifies and locates the current tag group and determines the type of the group: is it a discarded group, a group that can be split, a login form, or a registration form?
The pseudo-code of this initialization function is expressed as follows:
[Figures BDA0000409501800000141 and BDA0000409501800000151: pseudo-code of the initialization function, provided as images in the original publication]
(2) Judging whether a group is a login form.
IsCanDivide(): a function used in grouping, implemented by recursive calls.
DivideIntoGroups(): groups the input boxes and determines whether each group is a form. Part of the pseudo-code of this method is as follows:
[Figures BDA0000409501800000152 and BDA0000409501800000161: partial pseudo-code of DivideIntoGroups(), provided as images in the original publication]
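Since the pseudo-code itself is only available as figure images, the following C++ sketch shows one way the recursive division could work. It is a reconstruction under assumptions, not the code from the figures: the Input structure, the representation of a node's ancestors as a list of unique node labels, and the bodies of IsCanDivide and DivideIntoGroups are all hypothetical, and only the two function names come from the description.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: an <input> and the chain of its ancestors,
// from the document root down to its parent node.
struct Input {
    std::string id;
    std::vector<std::string> path;
};

using Group = std::vector<Input>;

// True if the members do not all share one parent node,
// which means the group can still be divided.
bool IsCanDivide(const Group& group) {
    for (std::size_t i = 1; i < group.size(); ++i) {
        if (group[i].path != group[0].path) return true;
    }
    return false;
}

// Recursively divides `group` by the ancestor at level `depth` until every
// resulting group has a single common parent node. Call with depth 0.
void DivideIntoGroups(const Group& group, std::size_t depth,
                      std::vector<Group>& out) {
    if (group.empty()) return;
    if (!IsCanDivide(group)) {
        out.push_back(group);
        return;
    }
    Group direct;                         // parent is the node at this level
    std::map<std::string, Group> deeper;  // nested under a child subtree
    for (const Input& in : group) {
        if (in.path.size() <= depth) {
            direct.push_back(in);
        } else {
            deeper[in.path[depth]].push_back(in);
        }
    }
    if (!direct.empty()) out.push_back(direct);
    for (const auto& entry : deeper) {
        DivideIntoGroups(entry.second, depth + 1, out);
    }
}
```

Each group left in the output has members with one common parent, so it can be passed to the form-type checks without further splitting.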
The markup file parsing method of this application has been described in detail above through specific embodiments. With reference to those embodiments, a web page filling method of this application is described below.
Automatic web page filling means that when a user needs to enter an account and password on a web page to log in to a website, the system automatically displays the account and password information the user entered previously in the corresponding input boxes of the web page, for the user's convenience.
To this end, the web page filling method of this application comprises at least: markup file parsing; target data storage; and target data filling.
The markup file parsing comprises:
Obtaining the tag objects in the markup file and generating a tag set;
Grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
Matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
Taking the matched tag group as the target input item.
The target data storage comprises:
Obtaining target data from the target input item and storing it in configuration information.
The target data filling comprises:
Obtaining the target data from the configuration information and filling it into the target input item corresponding to the markup file.
Further, the markup file is an HTML web page file and the tag object is <input>.
If the target input item obtained by parsing the HTML web page file is a user login form containing a user name input box and a password input box, then after the user fills in the login form and submits the login request to the server, the user name and password in the login form are recorded and stored in a cookie file on the client.
When the HTML web page file is requested and displayed in the browser again, the user name and password corresponding to the login form of the current HTML web page file are extracted from the client cookie file and displayed at the corresponding positions of the login form in the HTML page.
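The storage and filling steps can be sketched in C++ as follows. This is an illustration under assumptions: an in-memory map stands in for the client-side cookie file, the page is identified by its URL, and Credentials and CredentialStore are hypothetical names. How the stored values are protected is outside the scope of the sketch.

```cpp
#include <map>
#include <string>

// Hypothetical record of the data taken from a login form.
struct Credentials {
    std::string userName;
    std::string password;
};

// Stand-in for the client-side configuration store
// (the description uses a cookie file).
class CredentialStore {
public:
    // Target data storage: saves the values taken from the login form
    // of the page identified by `pageUrl`.
    void Save(const std::string& pageUrl, const Credentials& c) {
        entries_[pageUrl] = c;
    }

    // Target data filling: copies the stored values into `out`.
    // Returns false if nothing has been stored for the page.
    bool Fill(const std::string& pageUrl, Credentials& out) const {
        auto it = entries_.find(pageUrl);
        if (it == entries_.end()) return false;
        out = it->second;
        return true;
    }

private:
    std::map<std::string, Credentials> entries_;
};
```

Save would be called after the login request is submitted, and Fill when the same page is displayed again, with the returned values written into the user name and password input boxes identified by the parsing step.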
The above is a brief description of one embodiment of the web page filling method of this application. For further details, refer to the corresponding parts of the embodiments above; they are not repeated here.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, those skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by this application.
Referring to Fig. 4, which shows a schematic structural diagram of one embodiment of a markup file parsing device of this application, the device specifically comprises:
An acquisition module 410, configured to obtain the tag objects in a markup file;
A set generation module 420, configured to generate a tag set from the tag objects obtained by the acquisition module 410;
A grouping module 430, configured to group the tag set generated by the set generation module 420 according to a common attribute of the tag objects in the tag set;
A tag group acquisition module 440, configured to obtain the one or more tag groups generated by the grouping module 430;
A markup file parsing mapping table 460, configured to pre-store the parsing mapping information of the markup file;
A parsing module 450, configured to match, according to the markup file parsing mapping table 460, the attributes of the tag objects in the one or more tag groups obtained by the tag group acquisition module 440, and to obtain the data used for markup file parsing from the matched tag groups.
Preferably, the grouping module 430 further comprises:
A first grouping unit 431, configured to place the tag objects in the tag set that have the same parent node in the same tag group.
Preferably, the grouping module 430 further comprises:
A second grouping unit 432, configured to, if the current tag group is judged to contain two or more tag objects that do not all have the same parent node, place the tag objects in the current tag group that have the same parent node in another tag group; the second grouping unit repeats this operation until the current tag group cannot be divided further.
Preferably, the markup file is a Hypertext Markup Language (HTML) file.
Preferably, the tag objects obtained by the acquisition module 410 are <input> input boxes.
Preferably, the parsing module 450 further comprises:
A login form recognition unit 451, configured to determine that the current tag group is a login form if the current tag group contains only 1 password input box and the tag object at the next sibling node of the password input box is not a text input box.
Preferably, the parsing module 450 further comprises:
A first registration form recognition unit 452, configured to determine that the current tag group is a registration form if the current tag group contains only 1 password input box and the tag object at the next sibling node of the password input box is a text input box.
Preferably, the parsing module 450 further comprises:
A second registration form recognition unit 453, configured to determine that the current tag group is a registration form if the current tag group contains 2 password input boxes that are consecutive and are sibling nodes of each other.
Preferably, the parsing module 450 further comprises:
A password change form recognition unit 454, configured to determine that the current tag group is a password change form if the current tag group contains 3 password input boxes that are consecutive and are sibling nodes of each other.
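One way to picture how the four recognition units relate to a preset parsing mapping table is to express their conditions as data. The C++ sketch below does this under assumptions: the layout of the table (MappingRule), the function names, and the form-type strings are hypothetical, since the description does not spell out the table's actual format.

```cpp
#include <string>
#include <vector>

// Hypothetical row of a parsing mapping table.
struct MappingRule {
    int passwordCount;      // consecutive sibling password boxes in the group
    int nextSiblingIsText;  // 1 = yes, 0 = no, -1 = not checked
    std::string formType;   // result when the rule matches
};

// Rules corresponding to recognition units 451 to 454.
const std::vector<MappingRule>& ParsingMappingTable() {
    static const std::vector<MappingRule> table = {
        {1, 0, "login"},
        {1, 1, "registration"},
        {2, -1, "registration"},
        {3, -1, "password change"},
    };
    return table;
}

// Returns the form type of the first matching rule, or "" if none matches.
std::string MatchGroup(int passwordCount, bool nextSiblingIsText) {
    for (const MappingRule& rule : ParsingMappingTable()) {
        if (rule.passwordCount != passwordCount) continue;
        if (rule.nextSiblingIsText != -1 &&
            (rule.nextSiblingIsText == 1) != nextSiblingIsText) {
            continue;
        }
        return rule.formType;
    }
    return "";
}
```

Keeping the rules in a table means a new form type could be supported by adding a row rather than a new recognition unit, which is one plausible reason for storing the mapping information separately from the parsing module.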
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar the embodiments may be referred to one another. Because the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for related details, refer to the description of the method embodiment.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.
The markup file parsing method and device and the web page filling method provided by this application have been described in detail above. Specific examples have been used herein to explain the principle and implementation of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of this application, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

1. A markup file parsing method, characterized by comprising:
Obtaining the tag objects in a markup file and generating a tag set;
Grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
Matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
Obtaining the data used for markup file parsing from the matched tag groups.
2. The method of claim 1, characterized in that grouping the tag objects according to a common attribute of the tag objects in the tag set comprises:
Placing the tag objects in the tag set that have the same parent node in the same tag group.
3. The method of claim 2, characterized in that, after the one or more tag groups are obtained from the grouping result, the method further comprises regrouping the tag groups, comprising:
Examining the one or more tag groups in the current grouping result;
If the current tag group contains two or more tag objects and those tag objects do not all have the same parent node, placing the tag objects in the current tag group that have the same parent node in another tag group;
Repeating the above steps on the current tag group until it cannot be grouped further.
4. The method of claim 1, characterized in that the markup file is a Hypertext Markup Language (HTML) file.
5. The method of claim 1, characterized in that the tag object is an <input> input box;
Matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains only 1 password input box and the tag object at the next sibling node of the password input box is not a text input box, the current tag group is a login form.
6. The method of claim 5, characterized by further comprising:
If the current tag group is a login form, the password input box in the current tag group has multiple levels of parent nodes, the current tag group cannot be divided further, and the nearest parent node of the password input box contains at least one text input box, placing the tag objects in the nearest parent node of the password input box in a new tag group.
7. The method of claim 5, characterized in that matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains only 1 password input box and the tag object at the next sibling node of the password input box is a text input box, the current tag group is a registration form.
8. The method of claim 7, characterized in that matching the attributes of the tag objects in the one or more tag groups against the preset markup file parsing mapping table further comprises:
If the current tag group contains 2 password input boxes that are consecutive and are sibling nodes of each other, the current tag group is a registration form.
9. A web page filling method, characterized by comprising: markup file parsing; target data storage; and target data filling;
The markup file parsing comprises:
Obtaining the tag objects in a markup file and generating a tag set;
Grouping the tag objects according to a common attribute of the tag objects in the tag set, and obtaining one or more tag groups from the grouping result;
Matching the attributes of the tag objects in the one or more tag groups against a preset markup file parsing mapping table;
Taking the matched tag group as the target input item;
The target data storage comprises:
Obtaining target data from the target input item and storing it in configuration information;
The target data filling comprises:
Obtaining the target data from the configuration information and filling it into the target input item corresponding to the markup file.
10. A markup file parsing device, characterized by comprising:
An acquisition module, configured to obtain the tag objects in a markup file;
A set generation module, configured to generate a tag set from the tag objects obtained by the acquisition module;
A grouping module, configured to group the tag objects according to a common attribute of the tag objects in the tag set;
A tag group acquisition module, configured to obtain the one or more tag groups generated by the grouping module;
A markup file parsing mapping table, configured to pre-store the parsing mapping information of the markup file;
A parsing module, configured to match, according to the markup file parsing mapping table, the attributes of the tag objects in the one or more tag groups obtained by the tag group acquisition module, and to obtain the data used for markup file parsing from the matched tag groups.
CN201310548150.1A 2012-03-30 2012-03-30 A kind of tab file analysis method and device Active CN103577578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310548150.1A CN103577578B (en) 2012-03-30 2012-03-30 A kind of tab file analysis method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310548150.1A CN103577578B (en) 2012-03-30 2012-03-30 A kind of tab file analysis method and device
CN2012100913114A CN102651019B (en) 2012-03-30 2012-03-30 Method and device for parsing tagged file

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN2012100913114A Division CN102651019B (en) 2012-03-30 2012-03-30 Method and device for parsing tagged file

Publications (2)

Publication Number Publication Date
CN103577578A true CN103577578A (en) 2014-02-12
CN103577578B CN103577578B (en) 2017-04-05

Family

ID=50049354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310548150.1A Active CN103577578B (en) 2012-03-30 2012-03-30 A kind of tab file analysis method and device

Country Status (1)

Country Link
CN (1) CN103577578B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6571253B1 (en) * 2000-04-28 2003-05-27 International Business Machines Corporation Hierarchical view of data binding between display elements that are organized in a hierarchical structure to a data store that is also organized in a hierarchical structure
CN101071446A (en) * 2007-06-22 2007-11-14 腾讯科技(深圳)有限公司 Marked language archive analytical method, analytical module and user terminal
CN101464879A (en) * 2008-11-28 2009-06-24 中国地质大学(武汉) Method and system for implementing dynamic catalog based on regulation
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102978A (en) * 2017-05-24 2017-08-29 北京小度信息科技有限公司 Data earth-filling method, device and mobile terminal
CN107102978B (en) * 2017-05-24 2020-11-24 北京星选科技有限公司 Data backfilling method and device and mobile terminal
CN109246069A (en) * 2018-06-15 2019-01-18 华为技术有限公司 Webpage login method, device and readable storage medium storing program for executing
CN109246069B (en) * 2018-06-15 2020-10-16 华为技术有限公司 Webpage login method and device and readable storage medium
CN109918540A (en) * 2019-02-26 2019-06-21 深圳市元征科技股份有限公司 A kind of XML document analytic method, system and electronic equipment and storage medium
CN109918540B (en) * 2019-02-26 2023-04-21 深圳市元征科技股份有限公司 XML document analysis method and system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103577578B (en) 2017-04-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20151225

Address after: 100088 Beijing city Xicheng District xinjiekouwai Street 28, block D room 112 (Desheng Park)

Applicant after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Applicant after: Qizhi software (Beijing) Co.,Ltd.

Address before: 100015, No. 6, building 2, building B, No. 301-306, Jiuxianqiao Road, Chaoyang District, Beijing, room 2, room 3

Applicant before: Qizhi software (Beijing) Co.,Ltd.

Applicant before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220721

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.
