CN103605675B - XML (Extensible Markup Language) path expression extraction method and device


Info

Publication number
CN103605675B
Authority
CN
China
Prior art keywords
node
text
path expression
content
xml path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310524422.4A
Other languages
Chinese (zh)
Other versions
CN103605675A (en)
Inventor
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201310524422.4A
Publication of CN103605675A
Application granted
Publication of CN103605675B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83: Querying
    • G06F16/832: Query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558: Details of hyperlinks; Management of linked annotations

Abstract

The invention discloses an XML path expression extraction method and device. The method comprises: step (1), setting a plurality of restrictive conditions that have a hierarchical relationship and initializing the restrictive condition of the lowest level as the current restrictive condition; step (2), extracting the XML path expression of an element node to be identified under the current restrictive condition; step (3), performing location according to the XML path expression and, if the XML path expression locates a unique element node, ending the extraction; otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition and re-executing step (2). By using an XML path expression that is as strict as possible, the method and device reduce the likelihood that the XML path expression fails when a web page changes slightly.

Description

An XML path expression extraction method and device
Technical field
The present invention relates to the field of XML-related technologies, and in particular to an XML path expression extraction method and device.
Background
XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath uses XML path expressions to select a node or a node set in an XML document. These path expressions closely resemble the paths used in a conventional computer file system.
In automated web page testing, XPath is commonly used to locate an element node, and the located element node is then operated on. For example, a button is located through an XPath expression and a click on it is then triggered automatically; or a text box is located through an XPath expression and a value is then assigned to it automatically.
The existing method of obtaining an XML path expression starts from the specified element node, obtains the tag name of the current node and determines whether the current node has sibling nodes, then searches upward level by level until it reaches the XML root node of the page or a node that has an id attribute, and finally concatenates the collected steps in order.
For example, consider a simple web page that contains only three links, to Baidu, Jingdong and Taobao, with the following source code:
To locate the element node of the Jingdong link, the commonly used extraction method starts from the Jingdong link node: its tag name is a, and it has one sibling node (the Baidu link). Searching upward, the method finds the p node, which has an id attribute, so the search ends and the XPath expression of the node is //*[@id="lj"]/a[2], where [2] denotes the second child node. As a further example, to locate the Taobao link node, which has no sibling node, the method walks from this node up to the root node and finds, in order, nodes with the tag names a, p, body and html. The p node has one sibling node before it, and no node with an id attribute is encountered, so the XML path expression of the Taobao link node is /html/body/p[2]/a.
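The conventional procedure described above might be sketched in Python as follows, for illustration only. The page markup is an assumption (the original listing is not reproduced in this text) and was chosen so that the two expressions quoted above are obtained; the function name `conventional_xpath` and the `href` values are hypothetical. The positional index is counted among siblings with the same tag, which is how XPath interprets `a[2]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, assumed only for illustration.
PAGE = (
    '<html><body>'
    '<p id="lj"><a href="/baidu">Baidu</a><a href="/jd">Jingdong</a></p>'
    '<p><a href="/taobao">Taobao</a></p>'
    '</body></html>'
)

def conventional_xpath(root, target):
    """Walk upward from target until an id-bearing node or the root is reached."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    path, node = "", target
    while True:
        node_id = node.get("id")
        if node_id is not None:
            # Anchor the expression at the node that carries an id attribute.
            return '//*[@id="%s"]%s' % (node_id, path)
        parent = parent_of.get(node)
        step = node.tag
        if parent is not None:
            same_tag = [c for c in parent if c.tag == node.tag]
            if len(same_tag) > 1:
                # 1-based position among same-tag siblings.
                step += "[%d]" % (same_tag.index(node) + 1)
        path = "/" + step + path
        if parent is None:
            return path  # reached the root node
        node = parent

root = ET.fromstring(PAGE)
baidu, jingdong, taobao = root.findall(".//a")
print(conventional_xpath(root, jingdong))  # //*[@id="lj"]/a[2]
print(conventional_xpath(root, taobao))    # /html/body/p[2]/a
```

Both expressions depend only on tag names, one id and sibling positions, which is why a small change in link order can invalidate them.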
However, the XML path expressions that the prior art extracts for element nodes are generally not stable enough. When the page layout changes slightly, the original XML path expression is likely to become invalid, and using the invalid expression either fails to find the element node or finds the wrong one.
For example, suppose the order of the links in the above page is changed from Baidu, Jingdong, Taobao to Taobao, Jingdong, Baidu, and the page source after the change is as follows:
The XML path expression that previously denoted the Jingdong link, //*[@id="lj"]/a[2], now wrongly locates the Baidu link; the XML path expression that previously denoted the Taobao link, /html/body/p[2]/a, now wrongly locates the Jingdong link.
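The failure mode can be reproduced with a small Python sketch. Both page listings below are assumptions, since the original listings are not reproduced in this text; the rearranged page was chosen so that the stale expressions mislocate in the way stated above. The helper `locate` is hypothetical: it wraps the root in a synthetic holder element so that the absolute expressions can be evaluated by the XPath subset of `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Assumed markup before and after a small rearrangement of the links.
BEFORE = ('<html><body>'
          '<p id="lj"><a>Baidu</a><a>Jingdong</a></p>'
          '<p><a>Taobao</a></p>'
          '</body></html>')
AFTER = ('<html><body>'
         '<p id="lj"><a>Taobao</a><a>Baidu</a></p>'
         '<p><a>Jingdong</a></p>'
         '</body></html>')

JD_XPATH = '//*[@id="lj"]/a[2]'      # extracted for the Jingdong link
TAOBAO_XPATH = '/html/body/p[2]/a'   # extracted for the Taobao link

def locate(root, xpath):
    """Evaluate an absolute expression by searching from a synthetic parent."""
    holder = ET.Element("holder")
    holder.append(root)
    return holder.findall("." + xpath)

for label, source in (("before", BEFORE), ("after", AFTER)):
    root = ET.fromstring(source)
    print(label,
          [e.text for e in locate(root, JD_XPATH)],
          [e.text for e in locate(root, TAOBAO_XPATH)])
```

Neither expression raises an error on the rearranged page; each silently selects a different link, which is the harder failure to detect in automated testing.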
Summary of the invention
In view of the above, it is necessary to provide an XML path expression extraction method and device that address the technical problem that XML path expressions extracted by the prior art are unstable, so that an element node cannot be found, or the wrong element node is found, when a web page changes slightly.
An XML path expression extraction method, including:
Step 1: setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified;
Step 2: extracting the XML path expression of the element node to be identified under the current restrictive condition;
Step 3: performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step 2.
An XML path expression extraction device, including:
a restrictive condition setting module, configured to set a plurality of restrictive conditions that have a hierarchical relationship and to initialize the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified;
a path extraction module, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
an element node locating module, configured to perform location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, to end the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, to select the restrictive condition one level above the current restrictive condition as the current restrictive condition and to execute the path extraction module.
By setting a plurality of restrictive conditions that have a hierarchical relationship, the present invention uses the strictest possible XML path expression, and uses the looser XML path expression of the next level up only when the element node cannot be uniquely located under the current restrictive condition. Because the XML path expression is as strict as possible, the likelihood that it becomes invalid when the web page changes slightly is reduced.
Brief description of the drawings
Fig. 1 is a workflow diagram of an XML path expression extraction method of the present invention;
Fig. 2 is a detailed workflow diagram of step 2 of the XML path expression extraction method of the present invention;
Fig. 3 is a workflow diagram of one example of the present invention;
Fig. 4 is a structural module diagram of an XML path expression extraction device of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows a workflow diagram of an XML path expression extraction method of the present invention, including:
Step 1: setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified;
Step 2: extracting the XML path expression of the element node to be identified under the current restrictive condition;
Step 3: performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step 2.
In step 1, a restrictive condition specifies which means are allowed for identifying the element node to be identified. The restrictive conditions have a hierarchical relationship, and this relationship must ensure that the restrictive condition of the lowest level is the strictest with respect to the identification means. For each element node, the XML path expression is extracted preferentially under the restrictive condition whose identification means are strictest, so the resulting XML path expression is as strict as possible.
The restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified; the restrictive conditions of the other levels are comparatively looser. The identification means of the different levels all differ, so that when the XML path expression obtained under the restrictive condition of one level cannot uniquely locate the element node to be identified, the restrictive condition of another level can be used to obtain the XML path expression of that element node.
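The control flow of steps 1 to 3 can be sketched in Python independently of any XML library. This is a minimal illustration under stated assumptions, not the patented implementation: the function `extract_with_fallback`, the dictionary-based toy elements and the two toy levels are all hypothetical, and a real implementation would extract and evaluate XML path expressions instead.

```python
def extract_with_fallback(levels, extract, locate, target):
    """Try each restrictive condition in turn, strictest (lowest level) first."""
    for level in levels:                     # step 1, then one level up per retry
        expression = extract(target, level)  # step 2
        if locate(expression) == [target]:   # step 3: unique hit on the target
            return level, expression
    raise LookupError("no restrictive condition located the target uniquely")

# Toy stand-ins: elements are dictionaries and an "expression" is the subset
# of features that the current level allows.
elements = [
    {"tag": "a", "text": "Baidu"},
    {"tag": "a", "text": "Jingdong"},
]
levels = [("tag",), ("tag", "text")]  # strictest first

def extract(target, level):
    return {name: target[name] for name in level}

def locate(expression):
    return [e for e in elements
            if all(e[key] == value for key, value in expression.items())]

level, expression = extract_with_fallback(levels, extract, locate, elements[1])
print(level, expression)
```

In this toy run the tag name alone matches both elements, so the loop moves one level up and succeeds once text content is allowed.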
In one embodiment, the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the array.
In this embodiment, the restrictive condition of the lowest level is the strictest with respect to the identification means, and the restrictive condition of the highest level is the loosest. Because the highest level uses the array, a corresponding XML path expression can be extracted for any element node, and that XML path expression is as strict as possible. The restrictive conditions of different levels may be independent of one another, or the restrictive condition of an upper level may include the restrictive condition of the level below it.
In one embodiment, the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
The text content restrictive condition and the hyperlink restrictive condition lie between the restrictive condition of the lowest level and that of the highest level. The text content restrictive condition may be at a lower level than the hyperlink restrictive condition, or the hyperlink restrictive condition may be at a lower level than the text content restrictive condition. The restrictive conditions of different levels may be independent of one another; for example, when the text content restrictive condition is used, the hyperlink is not allowed for identifying the element node to be identified, which simplifies the content of the XML path expression. Alternatively, the restrictive condition of an upper level may include the restrictive condition of the level below it; for example, when the text content restrictive condition is at a lower level than the hyperlink restrictive condition, then under the hyperlink restrictive condition the element node to be identified may be identified using the text content in addition to the tag name, the id attribute and the hyperlink. When an upper level includes the restrictive condition of the level below it, the XML path expression contains more information and is therefore more likely to locate the element node to be identified uniquely.
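The two ways of relating the levels can be written down as plain data. The Python sketch below is illustrative only; the feature names (`tag`, `id`, `text`, `href`, `array`) and both function names are assumptions introduced here, and each level is modelled as the set of identification means that it allows.

```python
# The lowest level always allows the tag name and the id attribute.
BASE = frozenset({"tag", "id"})

def independent_levels(extras):
    """Each upper level allows the base means plus exactly one extra means."""
    return [BASE] + [BASE | {extra} for extra in extras]

def cumulative_levels(extras):
    """Each upper level also allows everything the level below it allows."""
    levels = [BASE]
    for extra in extras:
        levels.append(levels[-1] | {extra})
    return levels

# Text content below hyperlink here; the description allows either order.
order = ["text", "href", "array"]
for level in independent_levels(order):
    print(sorted(level))
for level in cumulative_levels(order):
    print(sorted(level))
```

With independent levels the expression stays short at every level; with cumulative levels each relaxation can only add information, which matches the observation above that such expressions are more likely to be unique.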
In one embodiment, as shown in Fig. 2, step 2 specifically includes:
Step 21: setting the current node to the element node to be identified;
Step 22: obtaining the tag name of the current node and assigning it to the XML path expression;
Step 23: determining whether the current node has an id attribute and, if so, adding the id attribute of the current node to the XML path expression;
Step 24: determining whether the current restrictive condition includes identifying the element node to be identified using text content; if the current restrictive condition includes identifying the element node to be identified using text content and the current node is a text node, adding the text content of the current node to the XML path expression;
Step 25: determining whether the current restrictive condition includes identifying the element node to be identified using a hyperlink; if the current restrictive condition includes identifying the element node to be identified using a hyperlink and the tag name of the current node is the hyperlink tag, adding the hyperlink content of the current node to the XML path expression;
Step 26: determining whether the current restrictive condition includes identifying the element node to be identified using the array; if the current restrictive condition includes identifying the element node to be identified using the array and the current node has sibling nodes under the same parent node, obtaining the node index of the current node under the parent node and adding the node index to the XML path expression.
Steps 24 to 26 are executed selectively according to the restrictive condition in use. A person of ordinary skill in the art will understand, after reading the embodiments of the present invention, that in actual operation only one of steps 24, 25 and 26 is fully executed for a given restrictive condition, while the other two steps exit after the restrictive condition is checked. Steps 24, 25 and 26 can therefore be executed in any order: 24, 25, 26; 24, 26, 25; 25, 24, 26; 25, 26, 24; 26, 24, 25; or 26, 25, 24.
In one embodiment, step 2 further includes:
if the current node has an id attribute, or the current node is the root node, executing step 3; otherwise setting the current node to the parent node of the current node and executing step 22.
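Steps 21 to 26 and the upward walk might be sketched as follows in Python. This is a sketch under several assumptions rather than the patented implementation: the page markup, the function names and the feature names in `allowed` are hypothetical; "the current node is a text node" is interpreted as "the element has non-blank direct text"; the index is counted among same-tag siblings; and a second leading slash is added when the walk stops at an id-bearing node so that the result is a descendant search.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, assumed only for illustration.
PAGE = (
    '<html><body>'
    '<p id="lj"><a href="/baidu">Baidu</a><a href="/jd">Jingdong</a></p>'
    '<p><a href="/taobao">Taobao</a></p>'
    '</body></html>'
)

def build_component(node, parent, allowed):
    """Build one path step for node under the allowed identification means."""
    component = node.tag                                    # step 22
    if node.get("id") is not None:                          # step 23
        component += "[@id='%s']" % node.get("id")
    text = (node.text or "").strip()
    if "text" in allowed and text:                          # step 24
        component += "[text()='%s']" % text
    if ("href" in allowed and node.tag == "a"
            and node.get("href") is not None):              # step 25
        component += "[@href='%s']" % node.get("href")
    if "array" in allowed and parent is not None and len(parent) > 1:  # step 26
        same_tag = [c for c in parent if c.tag == node.tag]
        component += "[%d]" % (same_tag.index(node) + 1)
    return component

def extract_path(root, target, allowed):
    """Prepend components while walking up to an id-bearing node or the root."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    xpath, node = "", target                                # step 21
    while True:
        parent = parent_of.get(node)
        xpath = "/" + build_component(node, parent, allowed) + xpath
        if parent is None:
            return xpath             # stopped at the root node
        if node.get("id") is not None:
            return "/" + xpath       # assumed '//' prefix for an id anchor
        node = parent

root = ET.fromstring(PAGE)
baidu, jingdong, taobao = root.findall(".//a")
print(extract_path(root, jingdong, {"tag", "id", "text"}))
print(extract_path(root, taobao, {"tag", "id", "array"}))
```

One caveat when adapting this: in standard XPath a positional predicate that follows other predicates counts positions within the already filtered set, so an implementation that combines the index with text or hyperlink predicates may need to place the index predicate first. The sketch also does not escape quotation marks inside attribute or text values.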
In one embodiment, step 24 specifically includes:
determining whether the current restrictive condition includes identifying the element node to be identified using text content; if the current restrictive condition includes identifying the element node to be identified using text content and the current node is a text node, adding the full text content or part of the text content of the current node to the XML path expression.
Preferably, in step 24, adding the full text content or part of the text content of the current node to the XML path expression specifically includes:
if the text content of the current node contains a newline character, adding to the XML path expression up to a preset text threshold number of characters that precede the newline character in the text content of the current node;
if the length of the text content of the current node is less than the preset text threshold and the text content contains no newline character, adding the full text content of the current node to the XML path expression;
if the length of the text content of the current node is greater than the preset text threshold and the text content contains no newline character, adding to the XML path expression the first preset text threshold number of characters of the text content of the current node.
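The three text cases can be expressed as a small Python function. This is a sketch under assumptions: the function name `text_predicate` is hypothetical, the default threshold of 10 is borrowed from the worked example later in the description, and text whose length exactly equals the threshold (a case the description leaves open) is handled here by the prefix branch. Quotation marks inside the text are not escaped.

```python
def text_predicate(text, threshold=10):
    """Return an XPath predicate that identifies a node by its text content."""
    if "\n" in text:
        # Use at most `threshold` characters that precede the first newline.
        prefix = text.split("\n", 1)[0][:threshold]
        return "[contains(text(),'%s')]" % prefix
    if len(text) < threshold:
        # Short single-line text: match the full text content.
        return "[text()='%s']" % text
    # Long single-line text: match on the first `threshold` characters.
    return "[contains(text(),'%s')]" % text[:threshold]

print(text_predicate("Jingdong"))
print(text_predicate("A very long link label"))
print(text_predicate("First line\nsecond line"))
```

Matching on a short prefix keeps the expression compact and avoids embedding newlines in the predicate, at the cost of being less specific than a full-text comparison.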
Fig. 3 shows a workflow diagram of one example of the present invention, in which:
The restrictive conditions are divided into four levels, each later level including the content of the level before it, denoted A to D: A. only the tag name and the id attribute may be used; B. text content may also be used; C. hyperlinks may also be used; D. the array may also be used. Starting from level A, which is the most restrictive, extraction of the XML path expression is attempted. If extraction succeeds (that is, the XML path expression uniquely identifies the element), the process ends. If extraction does not succeed, the process moves to the next level, relaxing the restrictive condition, and extracts the XML path expression again, until an XML path expression that uniquely represents the element is extracted and returned. Because level D uses the array, an element can always be uniquely identified, so extraction is certain to succeed in the end.
The specific extraction method is as follows:
Step 31: set the restrictive condition to the strictest level, A;
Step 32: set the current xpath to the empty string: xpath = "";
Step 33: set the current node el to the element node to be identified;
Step 34: under the current restrictive condition, extract the XML path expression of node el and save it in xpath, as follows:
1) obtain the tag name of the current node and assign it to a newly initialized variable: component = el.tagName;
2) determine whether the current node has an id attribute; if so, component += "[@id='" + el.id + "']";
3) if the current restriction allows text content and the current node is a text node, add the text content textValue to component as follows:
(1) if the text length is less than 10 and the text contains no newline character, component += "[text()='" + textValue + "']";
(2) if the text contains a newline character, take at most the 10 characters preceding the newline character as firstlinewords, and component += "[contains(text(),'" + firstlinewords + "')]";
(3) otherwise, if the text contains no newline character, take the first 10 characters as firstwords, and component += "[contains(text(),'" + firstwords + "')]";
4) if the current restriction allows link content and the current tag name is the hyperlink tag a, component += "[@href='" + el.href + "']";
5) if the current restriction allows the array and the current node has sibling nodes, obtain the position index of this node among the child nodes of its parent, and component += '[' + index + ']';
6) update xpath = '/' + component + xpath;
7) if the current node has an id attribute, or the current node is the root node, attempt location with the current xpath:
(1) if exactly one element is located, extraction of xpath succeeds; go to step 35;
(2) if more than one element is located, the current xpath does not contain enough information and the restrictive condition must be relaxed: set the restrictive condition to the next level after the current restrictive condition and go to step 32;
8) if the current node has no id attribute and is not the root node, set the current node el to the parent node of the current node, el = el.parentNode, and go to step 34;
Step 35: return the current xpath.
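Steps 31 to 35 can be put together in a runnable Python sketch. It is an illustration under stated assumptions, not the patented implementation. Because the XPath subset of `xml.etree.ElementTree` supports neither `text()` nor `contains()`, the sketch uses the `[.='...']` predicate (Python 3.7 or later) for full-text matching and omits the truncation rule; the index predicate is placed directly after the tag name so that it keeps its positional meaning; a second leading slash is added when the walk stops at an id-bearing node; and the page markup, `LEVELS`, `component`, `extract`, `locate` and `stable_xpath` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

PAGE = (  # hypothetical markup, assumed only for illustration
    '<html><body>'
    '<p id="lj"><a href="/baidu">Baidu</a><a href="/jd">Jingdong</a></p>'
    '<p><a href="/taobao">Taobao</a></p>'
    '</body></html>'
)

LEVELS = [  # cumulative levels A to D, strictest first
    {"tag", "id"},
    {"tag", "id", "text"},
    {"tag", "id", "text", "href"},
    {"tag", "id", "text", "href", "array"},
]

def component(node, parent, allowed):
    part = node.tag
    if "array" in allowed and parent is not None and len(parent) > 1:
        same_tag = [c for c in parent if c.tag == node.tag]
        part += "[%d]" % (same_tag.index(node) + 1)
    if node.get("id") is not None:
        part += "[@id='%s']" % node.get("id")
    text = node.text or ""
    if "text" in allowed and len(node) == 0 and text.strip():
        part += "[.='%s']" % text   # full-text match on a leaf element
    if "href" in allowed and node.tag == "a" and node.get("href"):
        part += "[@href='%s']" % node.get("href")
    return part

def extract(root, target, allowed):           # steps 32 to 34
    parent_of = {child: parent for parent in root.iter() for child in parent}
    xpath, node = "", target
    while True:
        parent = parent_of.get(node)
        xpath = "/" + component(node, parent, allowed) + xpath
        if parent is None:
            return xpath
        if node.get("id") is not None:
            return "/" + xpath                # assumed '//' id anchor
        node = parent

def locate(root, xpath):
    holder = ET.Element("holder")             # synthetic parent for absolute paths
    holder.append(root)
    return holder.findall("." + xpath)

def stable_xpath(root, target):
    for allowed in LEVELS:                    # step 31, relaxing on each retry
        xpath = extract(root, target, allowed)
        if locate(root, xpath) == [target]:   # step 34, item 7)
            return xpath                      # step 35
    raise LookupError("no level identified the element uniquely")

root = ET.fromstring(PAGE)
baidu, jingdong, taobao = root.findall(".//a")
print(stable_xpath(root, jingdong))
print(stable_xpath(root, taobao))
```

In this assumed page both targets are resolved at level B, so neither expression contains a positional index, and each would still select the intended link if the links were reordered within or between the two paragraphs.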
Fig. 4 shows a structural module diagram of an XML path expression extraction device of the present invention, including:
a restrictive condition setting module 410, configured to set a plurality of restrictive conditions that have a hierarchical relationship and to initialize the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified;
a path extraction module 420, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
an element node locating module 430, configured to perform location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, to end the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, to select the restrictive condition one level above the current restrictive condition as the current restrictive condition and to execute the path extraction module.
In one embodiment, the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the array.
In one embodiment, the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
In one embodiment, the path extraction module 420 specifically includes:
a node setting submodule 421, configured to set the current node to the element node to be identified;
a tag name assignment submodule 422, configured to obtain the tag name of the current node and assign it to the XML path expression;
an id attribute assignment submodule 423, configured to determine whether the current node has an id attribute and, if so, to add the id attribute of the current node to the XML path expression;
a text content assignment submodule 424, configured to determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if the current restrictive condition includes identifying the element node to be identified using text content and the current node is a text node, to add the text content of the current node to the XML path expression;
a hyperlink assignment submodule 425, configured to determine whether the current restrictive condition includes identifying the element node to be identified using a hyperlink and, if the current restrictive condition includes identifying the element node to be identified using a hyperlink and the tag name of the current node is the hyperlink tag, to add the hyperlink content of the current node to the XML path expression;
an array assignment submodule 426, configured to determine whether the current restrictive condition includes identifying the element node to be identified using the array and, if the current restrictive condition includes identifying the element node to be identified using the array and the current node has sibling nodes under the same parent node, to obtain the node index of the current node under the parent node and add the node index to the XML path expression.
In one embodiment, the path extraction module 420 further includes:
an upper-level node submodule, configured to execute the element node locating module if the current node has an id attribute or the current node is the root node, and otherwise to set the current node to the parent node of the current node and execute the tag name assignment submodule.
In one embodiment, the text content assignment submodule 424 is specifically configured to:
determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if the current restrictive condition includes identifying the element node to be identified using text content and the current node is a text node, add the full text content or part of the text content of the current node to the XML path expression.
In one embodiment, in the text content assignment submodule 424, adding the full text content or part of the text content of the current node to the XML path expression specifically includes:
if the text content of the current node contains a newline character, adding to the XML path expression up to a preset text threshold number of characters that precede the newline character in the text content of the current node;
if the length of the text content of the current node is less than the preset text threshold and the text content contains no newline character, adding the full text content of the current node to the XML path expression;
if the length of the text content of the current node is greater than the preset text threshold and the text content contains no newline character, adding to the XML path expression the first preset text threshold number of characters of the text content of the current node.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be understood as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent is therefore defined by the appended claims.

Claims (14)

1. An XML path expression extraction method, characterized by including:
step (1), setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest with respect to the means of identifying the element node to be identified;
step (2), extracting the XML path expression of the element node to be identified under the current restrictive condition;
step (3), performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step (2).
2. The XML path expression extraction method according to claim 1, characterized in that the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the array.
3. The XML path expression extraction method according to claim 2, characterized in that the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
4. XML path expression extracting method according to claim 3, it is characterised in that described step (2), is specifically wrapped Include:
Step (21), setting present node is described node element to be identified;
Step (22), obtains the bookmark name of present node and is assigned to XML path expression;
Step (23), judges whether present node has id attribute, if there are id attribute, then adds to described XML path expression The id attribute of present node;
Step (24), judges whether current restrictive condition includes identifying described node element to be identified using content of text, if Current restrictive condition includes identifying described node element to be identified using content of text, and present node is text node, then to Described XML path expression adds the content of text of present node;
Step (25), judges whether current restrictive condition includes identifying described node element to be identified using hyperlink, if worked as Front restrictive condition includes identifying described node element to be identified using hyperlink, and the bookmark name of present node is hyperlink mark Sign, then add the hyperlink content of present node to described XML path expression;
Step (26), judges whether current restrictive condition includes identifying described node element to be identified using array, if currently Restrictive condition includes identifying described node element to be identified using array, and present node includes the brother's section under same father node Point, then obtain node index under father node for the described present node, add described node rope to described XML path expression Draw.
5. The XML path expression extracting method according to claim 4, characterized in that step (2) further comprises:
If the current node has an id attribute, or the current node is the root node, executing step (3); otherwise, setting the current node to the parent node of the current node and executing step (22).
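The upward walk described here, which stops at the first node carrying an id attribute or at the root, might look like the following sketch. It uses only the tag name and id attribute for each segment. The function name `path_to`, the parent map, and the choice of a `//` prefix when the walk stops at an id-bearing node (versus `/` when it reaches the root) are illustrative assumptions, not details taken from the claims.

```python
import xml.etree.ElementTree as ET


def path_to(target, root):
    """Walk from `target` towards `root`, stopping early at an id attribute."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    segments = []
    node = target
    while True:
        segment = node.tag
        node_id = node.get("id")
        if node_id is not None:
            segment += f"[@id='{node_id}']"
        segments.append(segment)
        # Stop condition: current node has an id, or it is the root node.
        if node_id is not None or node is root:
            break
        node = parent_of[node]  # otherwise continue with the parent node
    # Assumption: a path that stops at the root is absolute, else relative.
    prefix = "/" if node is root else "//"
    return prefix + "/".join(reversed(segments))


doc = ET.fromstring(
    "<html><body><div id='box'><ul><li>a</li></ul></div><p>t</p></body></html>"
)
print(path_to(doc.find(".//li"), doc))
```

Stopping at an id keeps the expression short and makes it less sensitive to changes higher up in the tree, on the assumption that id values are unique within the document.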
6. The XML path expression extracting method according to claim 4, characterized in that step (24) specifically comprises:
Determining whether the current restrictive condition includes identifying the element node to be identified using text content; if it does and the current node is a text node, adding the full text content or partial text content of the current node to the XML path expression.
7. The XML path expression extracting method according to claim 6, characterized in that, in step (24), adding the full text content or partial text content of the current node to the XML path expression specifically comprises:
If the text content of the current node contains a newline character, adding to the XML path expression the preset-text-threshold number of characters that precede the newline character in the text content of the current node;
If the text length of the text content of the current node is less than the preset text threshold and the text content contains no newline character, adding the full text content of the current node to the XML path expression;
If the text length of the text content of the current node is greater than the preset text threshold and the text content contains no newline character, adding the first preset-text-threshold number of characters of the text content of the current node to the XML path expression.
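The three text-selection cases can be condensed into a small Python sketch. The function name `text_for_path` and the default threshold of 20 are hypothetical. For text containing a newline, the sketch assumes the intended fragment is the part before the first newline, cut to at most the threshold length. Text whose length equals the threshold is not covered by the claim and is kept whole here.

```python
def text_for_path(text: str, threshold: int = 20) -> str:
    """Choose the text fragment to add to the path expression."""
    if "\n" in text:
        # Case 1: keep only what precedes the first newline, capped at threshold.
        before_newline = text.split("\n", 1)[0]
        return before_newline[:threshold]
    if len(text) <= threshold:
        # Case 2: short text without a newline is used in full.
        return text
    # Case 3: long text without a newline is cut to the first `threshold` chars.
    return text[:threshold]


print(text_for_path("Add to cart\nIn stock", threshold=6))
```

When only a fragment is kept, the resulting predicate would presumably need a prefix match (for example the XPath `starts-with` function) rather than an exact comparison. The claim itself does not state which form is used.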
8. An XML path expression extraction device, characterized by comprising:
A restrictive condition setting module, configured to set a plurality of restrictive conditions having a hierarchical relationship and to initialize the lowest-level restrictive condition as the current restrictive condition, wherein the lowest-level restrictive condition is the strictest with respect to the means of identifying the element node to be identified;
A path extraction module, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
An element node locating module, configured to perform locating according to the XML path expression, and:
If the XML path expression uniquely locates the element node to be identified, the extraction of the XML path expression ends;
Otherwise, if the XML path expression locates more than one element node, the restrictive condition one level above the current restrictive condition is selected as the current restrictive condition, and the path extraction module is executed.
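The interplay of the three modules amounts to an escalation loop, sketched below in Python. The function `extract_unique_path` and its two callables are hypothetical stand-ins for the path extraction module and the element node locating module. Returning `None` when even the highest level fails to give a unique match, and treating a match count of zero like a non-unique match, are assumptions, since the claim does not address those cases.

```python
from typing import Callable, Optional, Sequence, TypeVar

Level = TypeVar("Level")


def extract_unique_path(
    levels: Sequence[Level],
    extract: Callable[[Level], str],
    count_matches: Callable[[str], int],
) -> Optional[str]:
    """Try each restrictive condition, strictest first, until one is unique."""
    for level in levels:                      # levels are ordered lowest to highest
        expression = extract(level)           # path extraction under this level
        if count_matches(expression) == 1:    # locating: exactly one node hit
            return expression
        # Otherwise move up one level and extract again.
    return None                               # assumption: no level was unique


# Toy stand-ins: a lookup table per level and a table of match counts.
paths = {"tag+id": "//li", "tag+id+text": "//li[text()='b']"}
hits = {"//li": 3, "//li[text()='b']": 1}
print(extract_unique_path(["tag+id", "tag+id+text"], paths.get, hits.get))
```

Starting from the strictest condition means that looser identification means, such as text content or position, are only added when the simpler expression is ambiguous.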
9. The XML path expression extraction device according to claim 8, characterized in that the lowest-level restrictive condition is: identifying the element node to be identified using the tag name and the id attribute; and the highest-level restrictive condition is: identifying the element node to be identified using at least one or more of the tag name, the id attribute, and the array.
10. The XML path expression extraction device according to claim 9, characterized in that the plurality of restrictive conditions further comprises:
A text content restrictive condition, under which the element node to be identified is identified using at least one or more of the tag name, the id attribute, and the text content;
A hyperlink restrictive condition, under which the element node to be identified is identified using at least one or more of the tag name, the id attribute, and the hyperlink.
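One possible way to represent the hierarchy of restrictive conditions is an ordered tuple of flag records, as in the Python sketch below. The class name `Restriction`, the field names, and the helper `next_level` are hypothetical. The claims fix only the lowest level (tag name and id attribute) and the highest level (array); placing the text content condition before the hyperlink condition, and making each level non-cumulative, are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Restriction:
    """Identification means allowed in addition to tag name and id attribute."""
    name: str
    use_text: bool = False
    use_link: bool = False
    use_index: bool = False


# Ordered from the lowest (strictest) level to the highest level.
LEVELS = (
    Restriction("tag+id"),
    Restriction("tag+id+text", use_text=True),
    Restriction("tag+id+hyperlink", use_link=True),
    Restriction("tag+id+array", use_index=True),
)


def next_level(current: Restriction) -> Optional[Restriction]:
    """Return the restrictive condition one level up, or None at the top."""
    position = LEVELS.index(current)
    return LEVELS[position + 1] if position + 1 < len(LEVELS) else None


print(next_level(LEVELS[0]).name)
```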
11. The XML path expression extraction device according to claim 10, characterized in that the path extraction module specifically comprises:
A node setting submodule, configured to set the current node to the element node to be identified;
A tag name assignment submodule, configured to obtain the tag name of the current node and assign it to the XML path expression;
An id attribute assignment submodule, configured to determine whether the current node has an id attribute and, if so, to add the id attribute of the current node to the XML path expression;
A text content assignment submodule, configured to determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if it does and the current node is a text node, to add the text content of the current node to the XML path expression;
A hyperlink assignment submodule, configured to determine whether the current restrictive condition includes identifying the element node to be identified using a hyperlink and, if it does and the tag name of the current node is a hyperlink tag, to add the hyperlink content of the current node to the XML path expression;
An array assignment submodule, configured to determine whether the current restrictive condition includes identifying the element node to be identified using an array and, if it does and the current node has sibling nodes under the same parent node, to obtain the node index of the current node under its parent node and add the node index to the XML path expression.
12. The XML path expression extraction device according to claim 11, characterized in that the path extraction module further comprises:
An upper-level node submodule, configured to execute the element node locating module if the current node has an id attribute or the current node is the root node, and otherwise to set the current node to the parent node of the current node and execute the tag name assignment submodule.
13. The XML path expression extraction device according to claim 11, characterized in that the text content assignment submodule is specifically configured to:
Determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if it does and the current node is a text node, add the full text content or partial text content of the current node to the XML path expression.
14. The XML path expression extraction device according to claim 13, characterized in that, in the text content assignment submodule, adding the full text content or partial text content of the current node to the XML path expression specifically comprises:
If the text content of the current node contains a newline character, adding to the XML path expression the preset-text-threshold number of characters that precede the newline character in the text content of the current node;
If the text length of the text content of the current node is less than the preset text threshold and the text content contains no newline character, adding the full text content of the current node to the XML path expression;
If the text length of the text content of the current node is greater than the preset text threshold and the text content contains no newline character, adding the first preset-text-threshold number of characters of the text content of the current node to the XML path expression.
CN201310524422.4A 2013-10-30 2013-10-30 XML (extensive markup language) path expression extracting method and device Active CN103605675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310524422.4A CN103605675B (en) 2013-10-30 2013-10-30 XML (extensive markup language) path expression extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310524422.4A CN103605675B (en) 2013-10-30 2013-10-30 XML (extensive markup language) path expression extracting method and device

Publications (2)

Publication Number Publication Date
CN103605675A CN103605675A (en) 2014-02-26
CN103605675B true CN103605675B (en) 2017-02-15

Family

ID=50123898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310524422.4A Active CN103605675B (en) 2013-10-30 2013-10-30 XML (extensive markup language) path expression extracting method and device

Country Status (1)

Country Link
CN (1) CN103605675B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036026B (en) * 2014-06-27 2018-02-23 吴涛军 Storage and location structure document choose the method and system of content
CN109683999A (en) * 2017-10-19 2019-04-26 北京国双科技有限公司 A kind of cross-page surface element localization method and device
CN109214172B (en) * 2018-09-20 2021-08-31 郑州云海信息技术有限公司 Method and device for acquiring key value name of valid registry
CN109977267B (en) * 2019-02-02 2021-09-03 北京云测信息技术有限公司 Method and device for determining XPath path
CN110276039B (en) * 2019-06-27 2021-09-28 北京金山安全软件有限公司 Page element path generation method and device and electronic equipment
CN116680444B (en) * 2023-08-03 2024-01-19 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6286013B1 (en) * 1993-04-01 2001-09-04 Microsoft Corporation Method and system for providing a common name space for long and short file names in an operating system
US6405325B1 (en) * 1999-01-30 2002-06-11 Inventec Corp. Method and tool for restoring crashed operation system of computer
CN1752976A (en) * 2004-09-22 2006-03-29 精工爱普生株式会社 File management program, data structure, and file management device
CN101551818A (en) * 2009-04-14 2009-10-07 北京红旗中文贰仟软件技术有限公司 A unidirectional multi-mapping file matching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
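Effective extent join based on XML index technology (基于XML索引技术的有效外延连接); Jiang Xuefeng et al.; Journal of Computer Research and Development (《计算机研究与发展》); 2008-06-15 (No. 06); 1043-1055 *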

Also Published As

Publication number Publication date
CN103605675A (en) 2014-02-26

Similar Documents

Publication Publication Date Title
CN103605675B (en) XML (extensive markup language) path expression extracting method and device
CN1786965B (en) Method for acquiring news web page text information
CN103399872B (en) The method and apparatus that webpage capture is optimized
CN107423391B (en) Information extraction method of webpage structured data
CN101551800B (en) Marked information generation device, inquiry unit and sharing system
CN103886023B (en) The storage of Excel tables of data, extracting method and system
CN104657377B (en) A kind of multichannel webpage control localization method and device
CN103034583A (en) Method and system for processing automatic test scrip of software
CN107577783A (en) The type of webpage automatic identifying method excavated based on Web architectural features
CN105630941A (en) Statistics and webpage structure based Wen body text content extraction method
ES2375403T3 (en) A METHOD FOR THE AUTOMATIC INDEXATION OF DOCUMENTS.
CN109086361B (en) A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint
CN103853760A (en) Method and device for extracting contents of bodies of web pages
CN107066576A (en) A kind of big data web crawlers paging system of selection and system
CN102779169A (en) Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label
CN100447793C (en) Method for extracting page query interface based on character of vision
CN106919624B (en) Method and device for improving webpage loading speed
CN104331438B (en) To novel web page contents selectivity abstracting method and device
CN103246732A (en) Online Web news content extracting method and system
CN103853770B (en) The method and system of model content in a kind of extraction forum Web pages
CN107368546A (en) A kind of method and apparatus for generating outline
CN102567337A (en) Method and system for quickly recognizing webpage types through links
CN106033387B (en) The method and apparatus for testing flash intrinsic controls
CN103186690A (en) Method for identifying short-circuit path in integrated circuit layout verification process
CN107145591A (en) A kind of effective content metadata extracting method of webpage based on title

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant