CN103605675B - XML (Extensible Markup Language) path expression extracting method and device - Google Patents
XML (Extensible Markup Language) path expression extracting method and device
- Publication number
- CN103605675B (CN201310524422.4A)
- Authority
- CN
- China
- Prior art keywords
- node
- text
- path expression
- content
- xml path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/80—Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
- G06F16/83—Querying
- G06F16/832—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9558—Details of hyperlinks; Management of linked annotations
Abstract
The invention discloses a method and device for extracting an XML path expression. The method comprises: step (1), setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition; step (2), extracting the XML path expression of the element node to be identified under the current restrictive condition; step (3), performing location according to the XML path expression and, if the XML path expression locates a unique element node, ending the extraction; otherwise, that is, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition and executing step (2) again. By using an XML path expression that is as strict as possible, the method and device reduce the likelihood that the XML path expression fails when a web page changes slightly.
Description
Technical field
The present invention relates to the field of XML-related technology, and in particular to a method and device for extracting an XML path expression.
Background art
XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document. XPath uses XML path expressions to select a node or a set of nodes in an XML document. These path expressions closely resemble the path expressions used in a conventional computer file system.
In automated web page testing, XPath is commonly used to locate an element node, which is then operated on. For example, a button is located through XPath and a click on it is then triggered automatically; or a text box is located through XPath and a value is then assigned to it automatically.
The existing method of obtaining an XML path expression starts from the specified element node, obtains the tag name of the current node and determines whether the current element node has sibling nodes, then searches upward level by level until it reaches the XML root node of the web page or a node that has an id attribute, and finally concatenates the results in order.
For example, consider a simple web page that contains only three links: Baidu, Jingdong and Taobao. The source code of the web page is as follows:
To locate the element node of the Jingdong link, the current general extraction method starts from the Jingdong link node: its tag name is a and it has one sibling node (the Baidu link). The search then moves up to the enclosing p node, finds that it has an id attribute, and stops. The XPath expression of the node is //*[@id="lj"]/a[2], where [2] denotes the second child node. Likewise, to locate the Taobao link element node, which has no sibling nodes, the search moves from this node up to the root node and finds, in order, the nodes with tag names a, p, body and html, where the p node has one preceding sibling. No node with an id attribute is encountered, so the XML path expression of the Taobao link element node is /html/body/p[2]/a.
However, the XML path expressions that the prior art extracts for element nodes are usually not stable enough. When the page layout changes slightly, the original XML path expression is likely to fail, and the failed expression either cannot find the element node or finds the wrong one.
For example, suppose the order of the links in the above web page is changed from Baidu, Jingdong, Taobao to Taobao, Jingdong, Baidu. The source code of the adjusted web page is as follows:
The XML path expression that previously denoted the Jingdong link, //*[@id="lj"]/a[2], now wrongly locates the Baidu link; the XML path expression that previously denoted the Taobao link, /html/body/p[2]/a, now wrongly locates the Jingdong link.
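The page listings referred to above are not reproduced in this text. The following Python sketch uses a hypothetical reconstruction of both pages that is merely consistent with the description (a p node with id "lj" holding two links, and a second p node holding one link); the markup and the placeholder link URLs are assumptions. Because the standard-library ElementTree module supports only a subset of XPath, the two expressions are written relative to the html root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the example page (the original listing is missing).
ORIGINAL = """<html><body>
<p id="lj"><a href="https://example.com/baidu">Baidu</a><a href="https://example.com/jd">Jingdong</a></p>
<p><a href="https://example.com/taobao">Taobao</a></p>
</body></html>"""

# The same links after the order is changed to Taobao, Jingdong, Baidu (also assumed).
REORDERED = """<html><body>
<p><a href="https://example.com/taobao">Taobao</a></p>
<p id="lj"><a href="https://example.com/jd">Jingdong</a><a href="https://example.com/baidu">Baidu</a></p>
</body></html>"""


def locate(page: str) -> tuple:
    """Return the link texts found by the two prior-art expressions."""
    root = ET.fromstring(page)  # root is the <html> element
    # //*[@id="lj"]/a[2], written in ElementTree's XPath subset
    by_id = root.find(".//*[@id='lj']/a[2]")
    # /html/body/p[2]/a, relative to the <html> root; take the first match
    by_position = root.findall("./body/p[2]/a")[0]
    return by_id.text, by_position.text


print(locate(ORIGINAL))   # ('Jingdong', 'Taobao')
print(locate(REORDERED))  # ('Baidu', 'Jingdong')
```

On the reordered page both position-based expressions still return a node, but not the intended one, which is the failure mode described above.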
Summary of the invention
In view of this, it is necessary to provide an XML path expression extracting method and device that address the technical problem that XML path expressions extracted by the prior art are unstable, so that the element node cannot be found, or a wrong element node is found, when the web page changes slightly.
An XML path expression extracting method, comprising:
Step 1: setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest regarding the means used to identify the element node to be identified;
Step 2: extracting the XML path expression of the element node to be identified under the current restrictive condition;
Step 3: performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step 2.
An XML path expression extracting device, comprising:
a restrictive condition setting module, configured to set a plurality of restrictive conditions that have a hierarchical relationship and to initialize the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest regarding the means used to identify the element node to be identified;
a path extraction module, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
an element node locating module, configured to perform location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, to end the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, to select the restrictive condition one level above the current restrictive condition as the current restrictive condition, and to execute the path extraction module.
By setting a plurality of restrictive conditions that have a hierarchical relationship, the present invention uses the strictest possible XML path expression, and uses the looser XML path expression of the next level up only when the element node cannot be uniquely located under the current restrictive condition. Because the XML path expression is as strict as possible, the likelihood that it fails when the web page changes slightly is reduced.
Brief description of the drawings
Fig. 1 is a workflow diagram of the XML path expression extracting method of the present invention;
Fig. 2 is a detailed workflow diagram of step 2 of the XML path expression extracting method of the present invention;
Fig. 3 is a workflow diagram of an example of the present invention;
Fig. 4 is a structural module diagram of the XML path expression extracting device of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows a workflow diagram of the XML path expression extracting method of the present invention, comprising:
Step 1: setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest regarding the means used to identify the element node to be identified;
Step 2: extracting the XML path expression of the element node to be identified under the current restrictive condition;
Step 3: performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step 2.
In step 1, a restrictive condition specifies which means are allowed for identifying the element node to be identified. The plurality of restrictive conditions have a hierarchical relationship, which must ensure that the restrictive condition of the lowest level is the strictest regarding the identification means. For each element node, the XML path expression is therefore extracted preferentially under the restrictive condition with the strictest identification means, so the resulting XML path expression is as strict as possible.
The restrictive condition of the lowest level is the strictest regarding the identification means for the element node to be identified; the restrictive conditions of the other levels are comparatively looser. The identification means differ from level to level, so that when the XML path expression obtained under the restrictive condition of one level cannot uniquely locate the element node to be identified, another restrictive condition can be used to obtain the XML path expression of that element node.
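The fallback across levels described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the names `extract_with_fallback`, `build` and `count_matches` are hypothetical, the caller supplies how a path is built and how matches are counted, and "uniquely locates" is approximated by a match count of exactly one.

```python
from typing import Callable, Sequence


def extract_with_fallback(levels: Sequence[str],
                          build: Callable[[str], str],
                          count_matches: Callable[[str], int]) -> str:
    """Try each restrictive condition, strictest first, until the path is unique."""
    xpath = ""
    for level in levels:                 # step 1: begin with the strictest level
        xpath = build(level)             # step 2: extract under the current level
        if count_matches(xpath) == 1:    # step 3: a unique location ends extraction
            return xpath
        # Otherwise continue with the next, looser level.
    return xpath  # result of the loosest level, if no level was unique


# Toy stand-ins (assumed data): the path built at each level and its match count.
paths = {"A": "//p/a", "B": "//p/a[text()='Taobao']"}
hits = {"//p/a": 3, "//p/a[text()='Taobao']": 1}
print(extract_with_fallback(["A", "B"], paths.__getitem__, hits.__getitem__))
# //p/a[text()='Taobao']
```

Passing the builder and the match counter as parameters keeps the loop independent of any particular XPath engine.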
In one embodiment, the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and an array (that is, the index of the node among its sibling nodes).
In this embodiment, the restrictive condition of the lowest level is the strictest regarding the identification means, while the restrictive condition of the highest level is the loosest and uses the array approach. This ensures that a corresponding XML path expression can be extracted for any element node and that this expression is as strict as possible. The restrictive conditions of different levels may be independent of one another, or the restrictive condition of an upper level may include the restrictive condition of the level below it.
In one embodiment, the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
The text content restrictive condition and the hyperlink restrictive condition lie between the restrictive condition of the lowest level and that of the highest level. The text content restrictive condition may be at a lower level than the hyperlink restrictive condition, or the hyperlink restrictive condition may be at a lower level than the text content restrictive condition. The restrictive conditions of different levels may be independent of one another; for example, when the text content restrictive condition is used, identifying the element node by hyperlink is not allowed, which simplifies the content of the XML path expression. Alternatively, the restrictive condition of an upper level may include that of the level below it; for example, when the text content restrictive condition is at a lower level than the hyperlink restrictive condition, then under the hyperlink restrictive condition the element node to be identified may be identified by text content in addition to the tag name, the id attribute and the hyperlink. When an upper level includes the level below it, the XML path expression contains more information and is therefore more likely to locate the element node to be identified uniquely.
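The two arrangements of levels, independent or cumulative, can be illustrated with plain sets. This is a minimal Python sketch; the string labels and function names are hypothetical, and placing text content below hyperlink is only one of the two orders the description allows.

```python
# Identification means named in the description (labels are assumed).
TAG, ID, TEXT, HREF, INDEX = "tag", "id", "text", "href", "index"

# Extra means introduced at each level above the strictest one.
EXTRAS = [TEXT, HREF, INDEX]


def independent_levels():
    """Each level allows tag name and id plus only its own extra means."""
    return [{TAG, ID}] + [{TAG, ID, extra} for extra in EXTRAS]


def inclusive_levels():
    """Each level also allows everything the stricter levels allowed."""
    levels = [{TAG, ID}]
    for extra in EXTRAS:
        levels.append(levels[-1] | {extra})
    return levels


# At the hyperlink level, only the inclusive arrangement still allows text content.
print(TEXT in independent_levels()[2])  # False
print(TEXT in inclusive_levels()[2])    # True
```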
In one embodiment, as shown in Fig. 2, step 2 specifically comprises:
Step 21: setting the current node to the element node to be identified;
Step 22: obtaining the tag name of the current node and assigning it to the XML path expression;
Step 23: determining whether the current node has an id attribute and, if so, adding the id attribute of the current node to the XML path expression;
Step 24: determining whether the current restrictive condition includes identifying the element node to be identified using text content; if it does, and the current node is a text node, adding the text content of the current node to the XML path expression;
Step 25: determining whether the current restrictive condition includes identifying the element node to be identified using a hyperlink; if it does, and the tag name of the current node is the hyperlink tag, adding the hyperlink content of the current node to the XML path expression;
Step 26: determining whether the current restrictive condition includes identifying the element node to be identified using an array; if it does, and the current node has sibling nodes under the same parent node, obtaining the node index of the current node under the parent node and adding the node index to the XML path expression.
Steps 24 to 26 are executed selectively according to the restrictive condition in force. A person of ordinary skill in the art will understand, after reading the embodiments of the present invention, that in actual operation, depending on the restrictive condition selected, only one of steps 24, 25 and 26 is executed in full, while the other two exit after checking the restrictive condition. Steps 24, 25 and 26 may therefore be executed in any order: 24, 25, 26; 24, 26, 25; 25, 24, 26; 25, 26, 24; 26, 24, 25; or 26, 25, 24.
In one embodiment, the method further comprises:
if the current node has an id attribute, or the current node is the root node, executing step 3; otherwise, setting the current node to the parent node of the current node and executing step 22.
In one embodiment, step 24 specifically comprises:
determining whether the current restrictive condition includes identifying the element node to be identified using text content; if it does, and the current node is a text node, adding the full text content or part of the text content of the current node to the XML path expression.
Preferably, in step 24, adding the full text content or part of the text content of the current node to the XML path expression specifically comprises:
if the text content of the current node contains a newline, adding to the XML path expression up to a preset text threshold number of characters that precede the newline in the text content of the current node;
if the text length of the text content of the current node is less than the preset text threshold and the text contains no newline, adding the full text content of the current node to the XML path expression;
if the text length of the text content of the current node is greater than the preset text threshold and the text contains no newline, adding to the XML path expression the first preset text threshold number of characters of the text content of the current node.
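The three text cases above can be sketched as a single helper. This is a minimal Python sketch under stated assumptions: the function name `text_predicate` is hypothetical, the default threshold of 10 is borrowed from the example of Fig. 3, a text length exactly equal to the threshold (a case the description leaves open) is treated as short, and the text is assumed to contain no single quote because XPath 1.0 string literals have no escape syntax.

```python
def text_predicate(text: str, threshold: int = 10) -> str:
    """Build an XPath predicate on a node's text content."""
    if "\n" in text:
        # Text with a line break: match on at most `threshold` characters
        # taken from before the first line break.
        prefix = text.split("\n", 1)[0][:threshold]
        return f"[contains(text(),'{prefix}')]"
    if len(text) <= threshold:
        # Short single-line text: match the full text exactly.
        return f"[text()='{text}']"
    # Long single-line text: match on the first `threshold` characters.
    return f"[contains(text(),'{text[:threshold]}')]"


print(text_predicate("Taobao"))            # [text()='Taobao']
print(text_predicate("Hello\nWorld"))      # [contains(text(),'Hello')]
print(text_predicate("abcdefghijklmnop"))  # [contains(text(),'abcdefghij')]
```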
Fig. 3 shows a workflow diagram of an example of the present invention, in which:
The restrictive conditions are divided into four levels, each later level including the content of the previous level, denoted A to D: A. only the tag name and the id attribute may be used; B. text content may be used; C. hyperlinks may be used; D. arrays may be used. Starting from level A, the most restrictive, extraction of the XML path expression is attempted. If extraction succeeds (the XML path expression uniquely identifies the element), the process ends. If extraction does not succeed, the process moves to the next level, which relaxes the restrictive condition, and extracts the XML path expression again, until an XML path expression that uniquely denotes the element is extracted and returned. Because level D uses arrays, an element can always be uniquely identified, so extraction is certain to succeed in the end.
The specific extraction method is as follows:
Step 31: set the restrictive condition to the strictest level, A;
Step 32: set the current xpath to the empty string: xpath = "";
Step 33: set the current node el to the element node to be identified;
Step 34: under the current restrictive condition, extract the XML path expression of node el and save it in xpath, as follows:
1) obtain the tag name of the current node and assign it to a newly initialized variable: component = el.tagName;
2) determine whether the current node has an id attribute and, if so, append it: component += "[@id='" + el.id + "']";
3) if the current restriction allows text content and the current node is a text node, add the text content textValue to component as follows:
1. if the text is shorter than 10 characters and contains no newline, component += "[text()='" + textValue + "']";
2. if the text contains a newline, take at most 10 characters before the newline as firstlinewords, and component += "[contains(text(),'" + firstlinewords + "')]";
3. otherwise, if the text contains no newline, take the first 10 characters as firstwords, and component += "[contains(text(),'" + firstwords + "')]";
4) if the current restriction allows hyperlink content and the current tag name is the hyperlink tag a, component += "[@href='" + el.href + "']";
5) if the current restriction allows arrays and the current node has sibling nodes, obtain the index of this node among the child nodes, and component += '[' + index + ']';
6) update xpath = '/' + component + xpath;
7) if the current node has an id attribute, or the current node is the root node, try to perform location with the current xpath:
1. if exactly one element is located, xpath has been extracted successfully; go to step 35;
2. if more than one element is located, the current xpath does not contain enough information and the restrictive condition must be relaxed: set the restrictive condition to the next level after the current one and go to step 32;
8) if the current node has no id attribute and is not the root node, set the current node el to its parent node, el = el.parentNode, and go to step 34;
Step 35: return the current xpath.
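Steps 31 to 35 can be sketched end to end as follows. This is a minimal Python sketch under several assumptions, not the patented implementation: `Node` is a hypothetical stand-in for a DOM element, `count_matches` is a small custom matcher standing in for a real XPath engine, the text truncation rules are omitted, and the example page is a hypothetical reconstruction with placeholder URLs. Two details are choices of this sketch: the index predicate is written directly after the tag name so that it counts among same-tag siblings (XPath applies predicates in order), and a path that starts at a node with an id is prefixed with `//`.

```python
# Extra means allowed at levels A-D; tag name and id are always allowed.
LEVELS = [set(), {"text"}, {"text", "href"}, {"text", "href", "index"}]
FORMATS = {"index": "[{}]", "id": "[@id='{}']",
           "text": "[text()='{}']", "href": "[@href='{}']"}


class Node:
    """Minimal stand-in for a DOM element (hypothetical, not a browser API)."""

    def __init__(self, tag, parent=None, id="", text="", href=""):
        self.tag, self.id, self.text, self.href = tag, id, text, href
        self.parent, self.children = parent, []
        if parent is not None:
            parent.children.append(self)


def twins(node):
    """The node plus the siblings that share its tag name, in document order."""
    peers = node.parent.children if node.parent else [node]
    return [c for c in peers if c.tag == node.tag]


def make_step(node, allowed):
    """Sub-steps 1) to 5): describe one node with the currently allowed means."""
    step = {"tag": node.tag}
    if "index" in allowed and len(twins(node)) > 1:
        step["index"] = twins(node).index(node) + 1
    if node.id:
        step["id"] = node.id
    if "text" in allowed and node.text:
        step["text"] = node.text
    if "href" in allowed and node.tag == "a" and node.href:
        step["href"] = node.href
    return step


def render(steps):
    """Turn the step list into an XPath string; '//' when anchored at an id."""
    parts = [s["tag"] + "".join(FORMATS[k].format(v) for k, v in s.items() if k != "tag")
             for s in steps]
    return ("//" if "id" in steps[0] else "/") + "/".join(parts)


def matches(node, step):
    """True if the node satisfies every constraint recorded in the step."""
    actual = {"tag": node.tag, "id": node.id, "text": node.text,
              "href": node.href, "index": twins(node).index(node) + 1}
    return all(actual[key] == value for key, value in step.items())


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def count_matches(root, steps):
    """Count located nodes; a tiny stand-in for a real XPath engine."""
    total = 0
    for node in walk(root):
        cur = node
        for step in reversed(steps):
            if cur is None or not matches(cur, step):
                break
            cur = cur.parent
        else:
            # Without an id anchor the path is absolute and must start at the root.
            if "id" in steps[0] or cur is None:
                total += 1
    return total


def extract_xpath(target, root):
    xpath = ""
    for allowed in LEVELS:                        # step 31, then looser levels
        steps, el = [], target                    # steps 32 and 33
        while True:
            steps.insert(0, make_step(el, allowed))
            if el.id or el.parent is None:        # sub-step 7): id or root reached
                break
            el = el.parent                        # sub-step 8): move to the parent
        xpath = render(steps)
        if count_matches(root, steps) == 1:       # unique, so extraction ends
            return xpath                          # step 35
    return xpath


html = Node("html")
body = Node("body", html)
p1 = Node("p", body, id="lj")
baidu = Node("a", p1, text="Baidu", href="https://example.com/baidu")
jingdong = Node("a", p1, text="Jingdong", href="https://example.com/jd")
taobao = Node("a", Node("p", body), text="Taobao", href="https://example.com/taobao")

print(extract_xpath(jingdong, html))  # //p[@id='lj']/a[text()='Jingdong']
print(extract_xpath(taobao, html))    # /html/body/p/a[text()='Taobao']
```

In this sketch both links are resolved at level B, so neither expression depends on the position of a link or of its p node.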
Fig. 4 shows a structural module diagram of the XML path expression extracting device of the present invention, comprising:
a restrictive condition setting module 410, configured to set a plurality of restrictive conditions that have a hierarchical relationship and to initialize the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest regarding the means used to identify the element node to be identified;
a path extraction module 420, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
an element node locating module 430, configured to perform location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, to end the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, to select the restrictive condition one level above the current restrictive condition as the current restrictive condition, and to execute the path extraction module.
In one embodiment, the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and an array.
In one embodiment, the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
In one embodiment, the path extraction module 420 specifically comprises:
a node setting submodule 421, configured to set the current node to the element node to be identified;
a tag name assignment submodule 422, configured to obtain the tag name of the current node and assign it to the XML path expression;
an id attribute assignment submodule 423, configured to determine whether the current node has an id attribute and, if so, to add the id attribute of the current node to the XML path expression;
a text content assignment submodule 424, configured to determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if it does and the current node is a text node, to add the text content of the current node to the XML path expression;
a hyperlink assignment submodule 425, configured to determine whether the current restrictive condition includes identifying the element node to be identified using a hyperlink and, if it does and the tag name of the current node is the hyperlink tag, to add the hyperlink content of the current node to the XML path expression;
an array assignment submodule 426, configured to determine whether the current restrictive condition includes identifying the element node to be identified using an array and, if it does and the current node has sibling nodes under the same parent node, to obtain the node index of the current node under the parent node and add the node index to the XML path expression.
In one embodiment, the path extraction module 420 further comprises:
an upper-level node submodule, configured to execute the element node locating module if the current node has an id attribute or the current node is the root node, and otherwise to set the current node to the parent node of the current node and execute the tag name assignment submodule.
In one embodiment, the text content assignment submodule 424 is specifically configured to:
determine whether the current restrictive condition includes identifying the element node to be identified using text content and, if it does and the current node is a text node, add the full text content or part of the text content of the current node to the XML path expression.
In one embodiment, in the text content assignment submodule 424, adding the full text content or part of the text content of the current node to the XML path expression specifically comprises:
if the text content of the current node contains a newline, adding to the XML path expression up to a preset text threshold number of characters that precede the newline in the text content of the current node;
if the text length of the text content of the current node is less than the preset text threshold and the text contains no newline, adding the full text content of the current node to the XML path expression;
if the text length of the text content of the current node is greater than the preset text threshold and the text contains no newline, adding to the XML path expression the first preset text threshold number of characters of the text content of the current node.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the invention, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.
Claims (14)
1. An XML path expression extracting method, characterized by comprising:
step (1), setting a plurality of restrictive conditions that have a hierarchical relationship, and initializing the restrictive condition of the lowest level as the current restrictive condition, wherein the restrictive condition of the lowest level is the strictest regarding the means used to identify the element node to be identified;
step (2), extracting the XML path expression of the element node to be identified under the current restrictive condition;
step (3), performing location according to the XML path expression, and:
if the XML path expression uniquely locates the element node to be identified, ending the extraction of the XML path expression;
otherwise, if the XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing step (2).
2. The XML path expression extracting method according to claim 1, characterized in that the restrictive condition of the lowest level is: identifying the element node to be identified using the tag name and the id attribute; and the restrictive condition of the highest level is: identifying the element node to be identified using at least one or more of the tag name, the id attribute and an array.
3. The XML path expression extracting method according to claim 2, characterized in that the plurality of restrictive conditions further include:
a text content restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the text content;
a hyperlink restrictive condition: identifying the element node to be identified using at least one or more of the tag name, the id attribute and the hyperlink.
4. The XML path expression extracting method according to claim 3, characterized in that step (2) specifically comprises:
step (21), setting the current node to the element node to be identified;
step (22), obtaining the tag name of the current node and assigning it to the XML path expression;
step (23), determining whether the current node has an id attribute and, if so, adding the id attribute of the current node to the XML path expression;
step (24), determining whether the current restrictive condition includes identifying the element node to be identified using text content; if it does, and the current node is a text node, adding the text content of the current node to the XML path expression;
step (25), determining whether the current restrictive condition includes identifying the element node to be identified using a hyperlink; if it does, and the tag name of the current node is the hyperlink tag, adding the hyperlink content of the current node to the XML path expression;
step (26), determining whether the current restrictive condition includes identifying the element node to be identified using an array; if it does, and the current node has sibling nodes under the same parent node, obtaining the node index of the current node under the parent node and adding the node index to the XML path expression.
5. The XML path expression extracting method according to claim 4, characterized in that step (2) further comprises:
if the current node has an id attribute, or the current node is the root node, executing step (3); otherwise, setting the current node to the parent node of the current node and executing step (22).
6. The XML path expression extracting method according to claim 4, characterized in that step (24) specifically comprises:
determining whether the current restrictive condition includes identifying the element node to be identified using text content; if it does, and the current node is a text node, adding the full text content or part of the text content of the current node to the XML path expression.
7. The XML path expression extracting method according to claim 6, characterized in that, in said step (24), adding the full text content or partial text content of the current node to said XML path expression specifically comprises:
if the text content of the current node contains a line break, adding to said XML path expression a preset-text-threshold number of characters preceding the line break in the text content of the current node;
if the text length of the text content of the current node is less than the preset text threshold and the text content contains no line break, adding the full text content of the current node to said XML path expression;
if the text length of the text content of the current node is greater than the preset text threshold and the text content contains no line break, adding to said XML path expression the first preset-text-threshold number of characters of the text content of the current node.
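The three text rules of claim 7 can be sketched as a single function. The name `text_fragment` and the default threshold of 20 characters are hypothetical. For the line-break case the sketch assumes that the characters are taken from the start of the text that precedes the first line break; the claim wording would also allow other readings.

```python
def text_fragment(text, threshold=20):
    """Pick the part of a node's text that goes into the path expression."""
    head, line_break, _ = text.partition("\n")
    if line_break:
        # Text contains a line break: keep at most `threshold` characters
        # of the text before it (assumed interpretation).
        return head[:threshold]
    if len(text) < threshold:
        return text  # short text without a line break: use it in full
    return text[:threshold]  # long text: keep the first `threshold` characters


print(text_fragment("Latest news\nmore text", threshold=6))
print(text_fragment("Home", threshold=6))
print(text_fragment("Contact us today", threshold=6))
```

Using only a prefix of long text means that later edits to the tail of the text do not invalidate the expression.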
8. An XML path expression extracting device, characterized by comprising:
a restrictive condition setting module, configured to set a plurality of restrictive conditions having a hierarchical relationship and to initialize the restrictive condition of the lowest level as the current restrictive condition, the restrictive condition of said lowest level identifying the element node to be identified in the strictest manner;
a path extraction module, configured to extract the XML path expression of the element node to be identified under the current restrictive condition;
an element node locating module, configured to perform location according to said XML path expression, and:
if said XML path expression uniquely locates said element node to be identified, ending the XML path expression extraction;
otherwise, if said XML path expression locates more than one element node, selecting the restrictive condition one level above the current restrictive condition as the current restrictive condition, and executing the path extraction module.
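The interaction of the three modules in claim 8 amounts to a loop over restriction levels. The following is a minimal sketch under stated assumptions: `extract_unique_path`, the `extract` and `locate` callables, and the level labels are hypothetical names, the expressions are pre-computed rather than extracted, and `xml.etree.ElementTree` stands in for a real XPath engine. Returning `None` when every level fails is also an assumption, since the claim does not cover that case.

```python
import xml.etree.ElementTree as ET


def extract_unique_path(levels, extract, locate, target):
    """Try restriction levels from lowest to highest and return the first
    (level, xpath) pair whose expression locates only the target node."""
    for level in levels:
        xpath = extract(level)  # role of the path extraction module
        found = locate(xpath)   # role of the element node locating module
        if len(found) == 1 and found[0] is target:
            return level, xpath  # unique match: extraction ends
        # More than one match (or, by assumption, a wrong match):
        # move one level up and extract again.
    return None  # assumed fallback when no level yields a unique match


doc = ET.fromstring("<body><ul><li>a</li><li>b</li></ul></body>")
target = doc[0][1]
# Hypothetical expressions that the extraction step might produce per level.
paths = {"tag+id": ".//li", "tag+id+array": "./ul/li[2]"}
result = extract_unique_path(list(paths), paths.get, doc.findall, target)
print(result)
```

In this example the lowest level matches both `<li>` elements, so the loop moves up one level and accepts the expression that carries an array index.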
9. The XML path expression extracting device according to claim 8, characterized in that the restrictive condition of said lowest level is: identifying said element node to be identified by tag name and id attribute; and the restrictive condition of the highest level is: identifying said element node to be identified by at least one or more of tag name, id attribute and array.
10. The XML path expression extracting device according to claim 9, characterized in that said plurality of restrictive conditions further comprises:
a text content restrictive condition: identifying said element node to be identified by at least one or more of tag name, id attribute and text content;
a hyperlink restrictive condition: identifying said element node to be identified by at least one or more of tag name, id attribute and hyperlink.
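One possible way to represent the restrictive conditions of claims 9 and 10 is as sets of permitted identification features. The names `Feature`, `LEVELS` and `next_level` are hypothetical. The relative order of the text content and hyperlink conditions is an assumption, because these claims fix only the lowest and highest levels.

```python
from enum import Flag, auto


class Feature(Flag):
    """Identification features a restrictive condition may allow."""
    TAG = auto()
    ID = auto()
    TEXT = auto()
    HREF = auto()
    ARRAY = auto()


# Lowest (strictest) level first; the middle ordering is assumed.
LEVELS = [
    Feature.TAG | Feature.ID,                  # lowest level (claim 9)
    Feature.TAG | Feature.ID | Feature.TEXT,   # text content condition
    Feature.TAG | Feature.ID | Feature.HREF,   # hyperlink condition
    Feature.TAG | Feature.ID | Feature.ARRAY,  # highest level (claim 9)
]


def next_level(current):
    """Return the condition one level above `current`, or None at the top."""
    position = LEVELS.index(current)
    return LEVELS[position + 1] if position + 1 < len(LEVELS) else None


print(next_level(LEVELS[0]))
```

A check such as `Feature.TEXT in current` then tells a submodule whether it may contribute to the expression at the current level.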
11. The XML path expression extracting device according to claim 10, characterized in that said path extraction module specifically comprises:
a node setting submodule, configured to set the current node to said element node to be identified;
a tag name assignment submodule, configured to obtain the tag name of the current node and assign it to the XML path expression;
an id attribute assignment submodule, configured to judge whether the current node has an id attribute and, if it has an id attribute, to add the id attribute of the current node to said XML path expression;
a text content assignment submodule, configured to judge whether the current restrictive condition includes identifying said element node to be identified by text content and, if the current restrictive condition includes identifying said element node to be identified by text content and the current node is a text node, to add the text content of the current node to said XML path expression;
a hyperlink assignment submodule, configured to judge whether the current restrictive condition includes identifying said element node to be identified by hyperlink and, if the current restrictive condition includes identifying said element node to be identified by hyperlink and the tag name of the current node is a hyperlink tag, to add the hyperlink content of the current node to said XML path expression;
an array assignment submodule, configured to judge whether the current restrictive condition includes identifying said element node to be identified by array and, if the current restrictive condition includes identifying said element node to be identified by array and the current node has sibling nodes under the same parent node, to obtain the node index of said current node under the parent node and add said node index to said XML path expression.
12. The XML path expression extracting device according to claim 11, characterized in that said path extraction module further comprises:
an upper-layer node submodule, configured to: if the current node includes an id attribute, or the current node is the root node, execute the element node locating module; otherwise set the current node to the parent node of the current node and execute the tag name assignment submodule.
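The upward walk performed by the upper-layer node submodule can be sketched as follows. This is a simplified illustration that covers only the tag name and id attribute submodules and omits text content, hyperlink and array index. The function name `build_path`, the `//` prefix, and the `[@id='...']` predicate syntax are assumptions, as is the use of `xml.etree.ElementTree` with a parent lookup table.

```python
import xml.etree.ElementTree as ET


def build_path(target, root):
    """Walk from `target` towards `root`, stopping at the first node that
    has an id attribute or at the root, and join the collected steps."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while True:
        node_id = node.get("id")
        # Tag name first, then the id attribute when the node has one.
        step = node.tag if node_id is None else f"{node.tag}[@id='{node_id}']"
        steps.append(step)
        parent = parents.get(node)
        if node_id is not None or parent is None:
            break  # id attribute found or root reached: hand over to location
        node = parent  # otherwise continue with the parent node
    return "//" + "/".join(reversed(steps))


doc = ET.fromstring(
    "<html><body><div id='main'><ul><li>a</li><li>b</li></ul></div></body></html>"
)
print(build_path(doc[0][0][0][1], doc))
```

Stopping at the nearest ancestor with an id keeps the expression short, so changes higher up in the page do not affect it.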
13. The XML path expression extracting device according to claim 11, characterized in that said text content assignment submodule is specifically configured to:
judge whether the current restrictive condition includes identifying said element node to be identified by text content; if the current restrictive condition includes identifying said element node to be identified by text content, and the current node is a text node, add the full text content or partial text content of the current node to said XML path expression.
14. The XML path expression extracting device according to claim 13, characterized in that, in said text content assignment submodule, adding the full text content or partial text content of the current node to said XML path expression specifically comprises:
if the text content of the current node contains a line break, adding to said XML path expression a preset-text-threshold number of characters preceding the line break in the text content of the current node;
if the text length of the text content of the current node is less than the preset text threshold and the text content contains no line break, adding the full text content of the current node to said XML path expression;
if the text length of the text content of the current node is greater than the preset text threshold and the text content contains no line break, adding to said XML path expression the first preset-text-threshold number of characters of the text content of the current node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310524422.4A (CN103605675B) | 2013-10-30 | 2013-10-30 | XML (extensive markup language) path expression extracting method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103605675A | 2014-02-26 |
CN103605675B | 2017-02-15 |
Family
ID=50123898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310524422.4A (Active; CN103605675B) | XML (extensive markup language) path expression extracting method and device | 2013-10-30 | 2013-10-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103605675B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104036026B (en) * | 2014-06-27 | 2018-02-23 | 吴涛军 | Method and system for storing and locating selected content of a structured document |
CN109683999A (en) * | 2017-10-19 | 2019-04-26 | 北京国双科技有限公司 | Cross-page element locating method and device |
CN109214172B (en) * | 2018-09-20 | 2021-08-31 | 郑州云海信息技术有限公司 | Method and device for acquiring key value name of valid registry |
CN109977267B (en) * | 2019-02-02 | 2021-09-03 | 北京云测信息技术有限公司 | Method and device for determining XPath path |
CN110276039B (en) * | 2019-06-27 | 2021-09-28 | 北京金山安全软件有限公司 | Page element path generation method and device and electronic equipment |
CN116680444B (en) * | 2023-08-03 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6286013B1 (en) * | 1993-04-01 | 2001-09-04 | Microsoft Corporation | Method and system for providing a common name space for long and short file names in an operating system |
US6405325B1 (en) * | 1999-01-30 | 2002-06-11 | Inventec Corp. | Method and tool for restoring crashed operation system of computer |
CN1752976A (en) * | 2004-09-22 | 2006-03-29 | 精工爱普生株式会社 | File management program, data structure, and file management device |
CN101551818A (en) * | 2009-04-14 | 2009-10-07 | 北京红旗中文贰仟软件技术有限公司 | A unidirectional multi-mapping file matching method |
Non-Patent Citations (1)
Title |
---|
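Jiang Xuefeng et al.; "Effective extent join based on XML indexing technology" (基于XML索引技术的有效外延连接); Journal of Computer Research and Development (《计算机研究与发展》); 2008-06-15 (No. 06); pp. 1043-1055 * |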
Also Published As
Publication number | Publication date |
---|---|
CN103605675A (en) | 2014-02-26 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN103605675B (en) | XML (extensive markup language) path expression extracting method and device | |
CN1786965B (en) | Method for acquiring news web page text information | |
CN103399872B (en) | The method and apparatus that webpage capture is optimized | |
CN107423391B (en) | Information extraction method of webpage structured data | |
CN101551800B (en) | Marked information generation device, inquiry unit and sharing system | |
CN103886023B (en) | The storage of Excel tables of data, extracting method and system | |
CN104657377B (en) | A kind of multichannel webpage control localization method and device | |
CN103034583A (en) | Method and system for processing automatic test scrip of software | |
CN107577783A (en) | The type of webpage automatic identifying method excavated based on Web architectural features | |
CN105630941A (en) | Statistics and webpage structure based Wen body text content extraction method | |
ES2375403T3 (en) | A METHOD FOR THE AUTOMATIC INDEXATION OF DOCUMENTS. | |
CN109086361B (en) | A kind of automatic abstracting method of webpage article information and system based on mutual information between web page joint | |
CN103853760A (en) | Method and device for extracting contents of bodies of web pages | |
CN107066576A (en) | A kind of big data web crawlers paging system of selection and system | |
CN102779169A (en) | Extracting method and device for webpage content based on HTML (Hypertext Markup Language) label | |
CN100447793C (en) | Method for extracting page query interface based on character of vision | |
CN106919624B (en) | Method and device for improving webpage loading speed | |
CN104331438B (en) | To novel web page contents selectivity abstracting method and device | |
CN103246732A (en) | Online Web news content extracting method and system | |
CN103853770B (en) | The method and system of model content in a kind of extraction forum Web pages | |
CN107368546A (en) | A kind of method and apparatus for generating outline | |
CN102567337A (en) | Method and system for quickly recognizing webpage types through links | |
CN106033387B (en) | The method and apparatus for testing flash intrinsic controls | |
CN103186690A (en) | Method for identifying short-circuit path in integrated circuit layout verification process | |
CN107145591A (en) | A kind of effective content metadata extracting method of webpage based on title |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |