CN103544281A - Method, device and system for retrieving keywords - Google Patents

Method, device and system for retrieving keywords

Info

Publication number
CN103544281A
Authority
CN
China
Prior art keywords
node
scks
slca
subtree
piecemeal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310503091.6A
Other languages
Chinese (zh)
Inventor
徐光剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Security and Fire Technology Co Ltd
Original Assignee
China Security and Fire Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Security and Fire Technology Co Ltd filed Critical China Security and Fire Technology Co Ltd
Priority to CN201310503091.6A priority Critical patent/CN103544281A/en
Publication of CN103544281A publication Critical patent/CN103544281A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8373 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/832 Query formulation

Abstract

The invention is applicable to the field of computer retrieval technology and provides a keyword retrieval method, device and system. The method includes: receiving an input keyword retrieval request containing the identity (ID) of a target file and a keyword retrieval statement; reading file information according to the ID of the target file; dividing the target file into blocks; running a Mapper program on each block, so that the Mapper program queries, according to the keyword retrieval statement and the file information, the SCKS (structurally complete keyword subtree) and the SLCA (smallest lowest common ancestor) nodes corresponding to each block, and generates a subtree SCKS'; and sending the SLCA nodes and the subtrees SCKS' to a Reduce device, so that the Reduce device computes the retrieval result from the SLCA nodes and the subtrees SCKS'. Because MapReduce performs distributed parallel computation over massive XML (extensible markup language) files, keyword retrieval over large volumes of data can be supported.

Description

Keyword retrieval method, device and system
Technical field
The invention belongs to the technical field of computer retrieval, and in particular relates to a keyword retrieval method, device and system.
Background
Retrieval is a basic core function in computer applications. With the widespread use of XML technology, efficient retrieval over massive XML data sets has become a difficult problem.
XML Path Language (XPath) and XML Query (XQuery) are full-featured XML query languages formulated by the World Wide Web Consortium (W3C). They use the hierarchical structure (schema) of the target document, known in advance, and query by means of path expressions. When they are applied to a massive XML data set, however, obtaining the complete hierarchical structure of all documents is a challenge, which makes some complex queries impossible to implement. Even when the hierarchical structure of all documents is known, the complexity of that structure makes query statements hard to write and retrieval algorithms hard to optimize.
XML keyword search performs computation on the labeled directed tree of an XML document and attempts to return the most compact subtree containing the keywords. It does not require the user to know the hierarchical structure of the XML document in advance, and offers a friendly user interface.
At present, most XML keyword query methods are based on the lowest common ancestor (LCA) technique; further research has also produced the smallest lowest common ancestor (SLCA), among others.
Let the expression $v <_a v'$ denote that node $v$ is an ancestor of node $v'$, and let $v \le_a v'$ denote that node $v$ is an ancestor of node $v'$ or is $v'$ itself. LCA and SLCA are formally defined as follows:
LCA: given $m$ nodes $v_1, v_2, \ldots, v_{m-1}, v_m$, if a node $w$ satisfies $w \le_a v_i$ for every $v_i$ ($1 \le i \le m$), and there is no node $u$ that satisfies $u \le_a v_i$ for every $v_i$ ($1 \le i \le m$) and $w <_a u$, then $w$ is called the LCA node of $v_1, v_2, \ldots, v_{m-1}, v_m$, denoted $w = \mathrm{LCA}(v_1, v_2, \ldots, v_m)$.
SLCA: given $n$ LCA nodes $w_1, w_2, \ldots, w_{n-1}, w_n$, a node $s$ among them is an SLCA node if there is no $w_i$ ($1 \le i \le n$) such that $s <_a w_i$, that is, no other LCA node is a descendant of $s$; this is denoted $s \in \mathrm{SLCA}(w_1, w_2, \ldots, w_{n-1}, w_n)$.
Fig. 1 shows a test document commonly used in the XML retrieval field: the structure of a bibliographic database (hereinafter dblp), drawn as an XML hierarchy with Dewey codes. As can be seen from Fig. 1, the root node of dblp is {0, dblp}, where 0 is the Dewey code and dblp is the tag name. When the query uses the keywords "Ling" and "Wei" (expressed as Q=[Ling, Wei]), the resulting LCA node set is [{0.0, article}, {0.1, article}, {0, dblp}] and the resulting SLCA node set is [{0.0, article}, {0.1, article}]. An XML keyword query algorithm based on SLCA therefore ultimately returns the most compact XML data fragments rooted at the SLCA nodes. Compared with LCA, the data fragments returned by SLCA clearly match the user's needs better.
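The definitions above can be illustrated with a small Python sketch that works directly on Dewey codes, since a node is an ancestor of another exactly when its code is a proper prefix of the other's. The function names (`lca`, `lca_set`, `slca_set`) are hypothetical, the keyword positions are the value nodes listed for Q=[Ling, Wei] in Embodiment 4, and the brute-force enumeration of keyword combinations is chosen for readability only; it is not presented as the patent's algorithm.

```python
from itertools import product


def parse(code):
    """Turn a Dewey code such as '0.0.1' into a tuple of integers."""
    return tuple(int(part) for part in code.split("."))


def lca(codes):
    """LCA of several nodes = longest common prefix of their Dewey codes."""
    prefix = []
    for column in zip(*(parse(c) for c in codes)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return ".".join(str(p) for p in prefix)


def is_ancestor(a, b):
    """True when node a is a proper ancestor of node b."""
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def lca_set(hits_per_keyword):
    """LCA of every combination that takes one hit per keyword."""
    return {lca(combo) for combo in product(*hits_per_keyword)}


def slca_set(hits_per_keyword):
    """Keep only the LCA nodes that have no other LCA node below them."""
    lcas = lca_set(hits_per_keyword)
    return {s for s in lcas if not any(is_ancestor(s, w) for w in lcas)}


# Value nodes matching each keyword, taken from Embodiment 4.
ling = ["0.0.1.0", "0.1.0.0"]
wei = ["0.0.2.0", "0.1.1.0", "0.2.1.0"]
print(sorted(lca_set([ling, wei])))   # ['0', '0.0', '0.1']
print(sorted(slca_set([ling, wei])))  # ['0.0', '0.1']
```

The root {0, dblp} is dropped from the SLCA set because both article nodes lie below it, which agrees with the node sets quoted for Fig. 1.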
However, keyword retrieval of XML documents based on SLCA (hereinafter SLCA retrieval) has the following problems:
1) When faced with massive XML data, SLCA retrieval cannot be computed directly.
2) The data fragments returned by SLCA retrieval do not necessarily meet the user's needs. For example, when a user runs the keyword retrieval Q=[Ling] on the dblp of Fig. 1, expecting to obtain all articles whose author is Ling, the retrieved SLCA node set is [{0.0.1, author}, {0.1.0, title}], which clearly does not meet the user's needs.
3) SLCA retrieval does not support complex retrieval with semantics. Take the retrieval Q=[Ling, Wei] mentioned above: the user who issues it expects to obtain the articles co-written by the authors Ling and Wei. The SLCA nodes obtained, however, include the node {0.1, article}, and that article was in fact written by Wei alone.
Summary of the invention
The embodiments of the present invention provide a keyword retrieval method, device and system, which aim to solve the problems that existing SLCA retrieval methods cannot retrieve massive XML data and that their retrieval results are inaccurate.
In one aspect, a keyword retrieval method is provided, the method comprising:
receiving an input keyword retrieval request, the request comprising the ID of a target file and a keyword retrieval statement;
reading file information according to the ID of the target file;
dividing the target file into blocks;
running a Mapper program on each block, so that the Mapper program queries, according to the keyword retrieval statement and the file information, the structurally complete keyword subtree SCKS and the smallest lowest common ancestor SLCA node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS';
sending the SLCA node and the subtree SCKS' to a Reduce device, so that the Reduce device computes the retrieval result from the SLCA node and the subtree SCKS'.
Further, the file information comprises the node information of all nodes of the target file; the node information comprises the unique identifier of a node, the tag name of the node and the type of the node; and the node types comprise entity node, attribute node, connection node and value node.
Further, running the Mapper program on each block to query, according to the keyword retrieval statement and the file information, the structurally complete keyword subtree SCKS and the smallest lowest common ancestor SLCA node corresponding to each block comprises:
searching each block, according to the node information, for the value nodes that satisfy the keyword retrieval statement, and generating a value node set;
performing a completion operation on the value nodes in the value node set in turn, according to the node information, to obtain the structurally complete keyword subtree SCKS corresponding to each block;
finding the common ancestors of the value nodes in the SCKS corresponding to each block, and taking these common ancestors as lowest common ancestor LCA nodes;
comparing the ancestor-descendant relationships among the LCA nodes, and taking the LCA nodes of the lowest generation, that is, those with no other LCA node below them, as the smallest lowest common ancestor SLCA nodes.
Further, performing the completion operation on the value nodes in the value node set in turn, according to the node information, to obtain the structurally complete keyword subtree SCKS corresponding to each block comprises:
Step 1: locating the first value node in the value node set;
Step 2: reading the unique identifier of the value node;
Step 3: completing upward according to the unique identifier, and storing the completed result in the corresponding structurally complete keyword subtree SCKS;
Step 4: locating the next value node in the value node set and repeating steps 2 to 3 until the last value node has been completed.
Further, the Reduce device computing the retrieval result from the SLCA node and the subtree SCKS' comprises:
the Reduce device calling a Reduce program;
the Reduce program saving the received SLCA nodes and denoting them P1;
the Reduce program merging the SCKS' corresponding to each block to generate SCKS'';
the Reduce program performing the SLCA computation on the SCKS'' to obtain SLCA nodes, denoted P2;
the Reduce program merging P1 and P2 to obtain the final retrieval result.
Further, the keyword retrieval statement Q satisfies Q → T_n | T_s | Q T_n | Q T_s, where → means "is defined as", | means "or", T_s is a keyword substatement that carries semantics, T_n is a T that does not belong to T_s, and T is a non-empty character string containing no blank characters. T_s comprises T > T | T<T | T>T | T<=T | T>=T | T:T-T, where > can mean "contains" and T:T-T denotes a specified interval range.
In another aspect, a keyword retrieval device is provided, the device comprising:
a retrieval request receiving unit, configured to receive an input keyword retrieval request, the request comprising the ID of a target file and a keyword retrieval statement;
a file information acquiring unit, configured to read file information according to the ID of the target file;
a blocking unit, configured to divide the target file into blocks;
an SCKS and SLCA node generation unit, configured to run a Mapper program on each block, so that the Mapper program queries, according to the keyword retrieval statement and the file information, the structurally complete keyword subtree SCKS and the smallest lowest common ancestor SLCA node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS';
a sending unit, configured to send the SLCA node and the subtree SCKS' to a Reduce device, so that the Reduce device computes the retrieval result from the SLCA node and the subtree SCKS'.
Further, the file information comprises the node information of all nodes of the target file; the node information comprises the unique identifier of a node, the tag name of the node and the type of the node; and the node types comprise entity node, attribute node, connection node and value node.
Further, the SCKS and SLCA node generation unit comprises:
a value node set generation module, configured to search each block, according to the node information, for the value nodes that satisfy the keyword retrieval statement, and to generate a value node set;
an SCKS generation module, configured to perform the completion operation on the value nodes in the value node set in turn, according to the node information, to obtain the structurally complete keyword subtree SCKS corresponding to each block;
an LCA node acquisition module, configured to find the common ancestors of the value nodes in the SCKS corresponding to each block, and to take these common ancestors as lowest common ancestor LCA nodes;
an SLCA node acquisition module, configured to compare the ancestor-descendant relationships among the LCA nodes, and to take the LCA nodes of the lowest generation as the smallest lowest common ancestor SLCA nodes.
Further, the SCKS generation module comprises:
a first value node locating submodule, configured to locate the first value node in the value node set;
a unique identifier obtaining submodule, configured to read the unique identifier of the value node;
a completion operation submodule, configured to complete upward according to the unique identifier and to store the completed result in the corresponding structurally complete keyword subtree SCKS;
a second value node locating submodule, configured to locate the next value node in the value node set and to call the unique identifier obtaining submodule and the completion operation submodule repeatedly until the last value node has been completed.
Further, the keyword retrieval statement Q satisfies Q → T_n | T_s | Q T_n | Q T_s, where → means "is defined as", | means "or", T_s is a keyword substatement that carries semantics, T_n is a T that does not belong to T_s, and T is a non-empty character string containing no blank characters. T_s comprises T > T | T<T | T>T | T<=T | T>=T | T:T-T, where > can mean "contains" and T:T-T denotes a specified interval range.
In yet another aspect, a keyword retrieval system is provided. The keyword retrieval system is a system in a cloud computing platform based on the cloud computing framework MapReduce, and comprises a Map device storing a Mapper program and a Reduce device storing a Reduce program, the Map device comprising the keyword retrieval device described above.
In the embodiments of the present invention, MapReduce performs distributed parallel computation over massive XML files, so keyword retrieval over large volumes of data can be supported. The method of the embodiments has O(n) computational efficiency and achieves efficient, accurate keyword retrieval over massive XML data sets.
Brief description of the drawings
Fig. 1 is a schematic diagram of a test document commonly used in the XML retrieval field;
Fig. 2 is a flowchart of the keyword retrieval method provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the blocks obtained by dividing the test document shown in Fig. 1, taken as the target file, into blocks;
Fig. 4 is a schematic structural diagram of the SCKS'' obtained in Embodiment 2 when querying with the keyword retrieval statement Q "author=Ling year>=2000";
Fig. 5 is a structural block diagram of the keyword retrieval device provided by Embodiment 5 of the present invention;
Fig. 6 is a structural block diagram of the keyword retrieval system provided by Embodiment 6 of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
In the embodiments of the present invention, after receiving a keyword retrieval request, the Map device first reads file information according to the ID of the target file contained in the request, and then divides the target file into blocks. It then runs a Mapper program on each block, which queries, according to the keyword retrieval statement contained in the request and the file information that was read, the structurally complete keyword subtree SCKS and the smallest lowest common ancestor SLCA node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS'. Finally, it sends the SLCA nodes and the generated subtrees SCKS' to the Reduce device, which computes the retrieval result from them. This is a distributed architecture based on MapReduce that can perform distributed parallel computation over massive XML files and can therefore support keyword retrieval over large volumes of data.
The implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment 1
Fig. 2 shows the flow of the keyword retrieval method provided by Embodiment 1 of the present invention. The method is a distributed computing application based on the cloud computing framework MapReduce. The data source (an XML data set comprising at least one XML file) is stored in a distributed manner in the cloud computing platform, specifically in a distributed file system (DFS), and this DFS may be stored on the Map device. The Map device and the Reduce device are the computers in the cloud computing platform on which the Mapper program and the Reduce program respectively reside, that is, nodes of the cluster; the Map device stores the Mapper program and the Reduce device stores the Reduce program. This embodiment is described from the Map device side, as follows:
In step S201, an input keyword retrieval request is received; the request comprises the ID of a target file and a keyword retrieval statement.
In this embodiment of the present invention, the Map device can receive a keyword retrieval request input by a user. The request comprises the ID of the target file to be queried in the XML data set and the keyword retrieval statement to be used for the query.
A keyword retrieval request may comprise the IDs of at least two target files and at least two keyword retrieval statements, so as to initiate at least two retrievals in parallel, each on a single XML file. For ease of description, this embodiment takes the retrieval of one XML file as an example.
In this embodiment, the grammar of the keyword retrieval statement is extended to make the retrieval result more accurate.
Specifically, the grammar of the traditional keyword retrieval statement is as follows:
T → a non-empty character string containing no blank characters;
Q → T | Q T.
Here, T is a keyword, Q is the keyword retrieval statement, → means "is defined as", and | means "or". Clearly, Q can be a single T or, recursively, a sequence of at least two T.
The grammar of the extended keyword retrieval statement is as follows:
T → a non-empty character string containing no blank characters;
T_s → T>T | T<T | T>T | T<=T | T>=T | T:T-T;
T_n → a T that does not belong to T_s;
Q → T_n | T_s | Q T_n | Q T_s
Here, T_s is a keyword substatement that carries semantics, > can mean "contains", and T:T-T denotes a specified interval range. For example:
(1) Example 1: "author=Ling year>=2000" retrieves records whose author is Ling and that were published in 2000 or later.
(2) Example 2: "title > Ling year:1990-2013" retrieves records whose title contains Ling and that were published from 1990 to 2013.
(3) Example 3: "Ling Wei" is a keyword query that carries no semantics, so no semantic analysis is performed.
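As an illustration of how such a statement might be split into semantic substatements (T_s) and plain keywords (T_n), here is a minimal Python sketch. The `Term` class, the `parse_query` function and the regular expression are hypothetical. The sketch also accepts the `=` operator, which appears in Example 1 although the grammar listing does not show it, and it tolerates blanks around an operator as written in Example 2; both choices are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

# field, operator, value; two-character operators are tried first.
_SEMANTIC = re.compile(r"(\w+)\s*(>=|<=|=|<|>|:)\s*(\S+)")


@dataclass(frozen=True)
class Term:
    field: Optional[str]  # None for a plain keyword (T_n)
    op: Optional[str]     # None for a plain keyword (T_n)
    value: str


def parse_query(statement):
    """Split a keyword retrieval statement into T_s and T_n terms."""
    terms = []
    pos = 0
    for match in _SEMANTIC.finditer(statement):
        # Anything between two semantic substatements is plain keywords.
        terms += [Term(None, None, w) for w in statement[pos:match.start()].split()]
        terms.append(Term(match.group(1), match.group(2), match.group(3)))
        pos = match.end()
    terms += [Term(None, None, w) for w in statement[pos:].split()]
    return terms


for q in ("author=Ling year>=2000", "title > Ling year:1990-2013", "Ling Wei"):
    print(parse_query(q))
```

A statement without operators, such as "Ling Wei", yields only plain keyword terms, so no semantic analysis would be applied to it.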
In step S202, file information is read according to the ID of the target file.
In this embodiment of the present invention, a model can be created for each XML file stored in the DFS according to the following rules, generating the file information of that XML file.
The file information of an XML file provided by the DFS can be represented by the 5-tuple D={FileId, StartOffset, Length, V, v0}, where:
(1) FileId: the unique ID of document D in the DFS;
(2) StartOffset: the physical start position of document D in the DFS, in the form SNode_Offset, where SNode is the DFS storage node and Offset is the physical disk offset;
(3) Length: the length of the XML document;
(4) V: the set of all nodes of document D, containing the node information of every node in document D. The structure of a node v can be represented by a 5-tuple:
vi={DeweyCode, TagName, StartOffset, Length, Type}
where vi is the i-th node of document D;
DeweyCode is the Dewey code, the unique ID of a node; see Fig. 1 for details. Its coding rules are:
(1) the DeweyCode of v0 is 0;
(2) during a breadth-first traversal of D, if node v is the i-th child node of node u, the Dewey code of v is Dewey(u).(i-1), where Dewey(u) denotes the Dewey code of node u;
TagName is the tag name of the node;
StartOffset is the start offset of the node's tag;
Length is the length, that is, the difference between the position of the end tag and StartOffset;
Type is the node type, one of entity node, attribute node, connection node and value node.
(5) v0: the root node of document D, v0 ∈ V.
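The Dewey coding rule can be sketched in Python with the standard `xml.etree.ElementTree` parser. This is only an illustration under stated assumptions: the function name `dewey_nodes` and the sample document are hypothetical, the text of a leaf element is treated as a value node with index 0 under it (which matches codes such as {0.0.1.0, Ling} used in the embodiments), and a depth-first walk is used because it assigns the same codes as the breadth-first traversal named in the rule.

```python
import xml.etree.ElementTree as ET


def dewey_nodes(xml_text):
    """Return (DeweyCode, label) pairs for every node of an XML document."""
    nodes = []

    def visit(element, code):
        nodes.append((code, element.tag))
        children = list(element)
        # Assumption: the text of a leaf element is its single value node.
        if not children and element.text and element.text.strip():
            nodes.append((code + ".0", element.text.strip()))
        # The i-th child (counting from 1) gets the suffix i-1.
        for index, child in enumerate(children):
            visit(child, f"{code}.{index}")

    visit(ET.fromstring(xml_text), "0")  # the root node v0 has code 0
    return nodes


# Hypothetical miniature document, not the full dblp of Fig. 1.
sample = ("<dblp><article><title>XML</title>"
          "<author>Ling</author><author>Wei</author></article></dblp>")
for code, label in dewey_nodes(sample):
    print(code, label)
```

In this sample the two authors receive the codes 0.0.1.0 (Ling) and 0.0.2.0 (Wei), the same positions the embodiments use for the first article.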
Specifically, the node types are defined as follows:
Entity node: may occur repeatedly under its parent node; its child nodes may be entity, attribute, connection or value nodes.
Attribute node: its parent node can only be an entity node; it may occur multiple times; its child nodes can only be value nodes.
Connection node: belongs neither to the entity nodes nor to the attribute nodes; its child nodes can only be entity nodes.
Value node: its parent node can only be an attribute node; it can occur only once; it has no child nodes.
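The parent and child constraints in these definitions can be written down as two lookup tables. The sketch below is a hypothetical encoding (the names `NodeType`, `ALLOWED_CHILDREN`, `REQUIRED_PARENT` and `is_valid_edge` are not from the source); where the definitions overlap, it assumes that a parent-child edge must satisfy both the parent's child rule and the child's parent rule.

```python
from enum import Enum


class NodeType(Enum):
    ENTITY = "entity"
    ATTRIBUTE = "attribute"
    CONNECTION = "connection"
    VALUE = "value"


# Child types each parent type may contain, as listed in the definitions.
ALLOWED_CHILDREN = {
    NodeType.ENTITY: {NodeType.ENTITY, NodeType.ATTRIBUTE,
                      NodeType.CONNECTION, NodeType.VALUE},
    NodeType.ATTRIBUTE: {NodeType.VALUE},
    NodeType.CONNECTION: {NodeType.ENTITY},
    NodeType.VALUE: set(),
}

# Parent types a child type requires; types not listed are unrestricted.
REQUIRED_PARENT = {
    NodeType.ATTRIBUTE: {NodeType.ENTITY},
    NodeType.VALUE: {NodeType.ATTRIBUTE},
}


def is_valid_edge(parent, child):
    """Check one parent-child edge against both rule tables (assumption)."""
    if child not in ALLOWED_CHILDREN[parent]:
        return False
    required = REQUIRED_PARENT.get(child)
    return required is None or parent in required


print(is_valid_edge(NodeType.ENTITY, NodeType.ATTRIBUTE))  # True
print(is_valid_edge(NodeType.ENTITY, NodeType.VALUE))      # False
```

The second call returns False because a value node's parent can only be an attribute node, even though the entity definition lists value nodes among its possible children.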
In step S203, the target file is divided into blocks.
In this embodiment of the present invention, the target file is divided into blocks. Taking the target file of Fig. 1 as an example, the blocks obtained are shown in Fig. 3: the divided target file comprises three blocks, Block1, Block2 and Block3. How exactly the target file is divided is not limited in this embodiment. In addition, step S203 may be performed before step S201 or before step S202, and is not restricted to being performed after step S202.
In step S204, a Mapper program is run on each block; according to the keyword retrieval statement and the file information, it queries the structurally complete keyword subtree SCKS and the smallest lowest common ancestor SLCA node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS'.
In this embodiment of the present invention, the SCKS corresponding to each block can be obtained by the following steps:
Step 1: according to the node information, search each block for the value nodes that satisfy the keyword retrieval statement, and generate a value node set.
Step 2: according to the node information, perform the completion operation on the value nodes in the value node set in turn, to obtain the SCKS corresponding to each block.
Specifically, step 2 comprises the following steps:
Step 2a: locate the first value node in the value node set;
Step 2b: read the unique identifier of this value node;
Step 2c: complete upward according to this unique identifier, and store the completed result in the corresponding SCKS;
Step 2d: locate the next value node in the value node set and repeat steps 2b to 2c until the last value node has been completed.
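Steps 2a to 2d can be sketched as follows, assuming the unique identifier is the Dewey code, so that "completing upward" means adding every prefix of the code. The function name `complete_upward` is hypothetical, and the sketch returns Dewey codes only; the tag names would be looked up in the node information. Sorting by depth and then by code is an assumption chosen to reproduce the ordering used in the embodiments.

```python
def complete_upward(value_codes):
    """Build the node list of an SCKS from the Dewey codes of value nodes."""
    nodes = set()
    for code in value_codes:                      # steps 2a and 2d: visit each value node
        parts = code.split(".")                   # step 2b: read the unique identifier
        for depth in range(1, len(parts) + 1):    # step 2c: add every ancestor and the node itself
            nodes.add(".".join(parts[:depth]))
    # Order by depth, then numerically by code.
    return sorted(nodes, key=lambda c: (c.count("."), [int(p) for p in c.split(".")]))


# Value node sets of Block1 and Block2 from Embodiment 2.
print(complete_upward(["0.0.1.0"]))
# ['0', '0.0', '0.0.1', '0.0.1.0']
print(complete_upward(["0.0.3.0", "0.1.2.0"]))
# ['0', '0.0', '0.1', '0.0.3', '0.1.2', '0.0.3.0', '0.1.2.0']
```

Because shared ancestors are stored in a set, a node such as the root is added only once however many value nodes lie below it.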
Specifically, after the SCKS corresponding to each block has been obtained, the SLCA computation can be performed on it to obtain the SLCA node corresponding to each block. The detailed steps for obtaining the SLCA node are:
Step 11: find the common ancestors of the value nodes in the SCKS corresponding to each block, and take these common ancestors as lowest common ancestor LCA nodes;
Step 12: compare the ancestor-descendant relationships among the LCA nodes, and take the LCA nodes of the lowest generation as the smallest lowest common ancestor SLCA nodes.
Specifically, after the SCKS and the SLCA node corresponding to each block have been obtained, the data fragment rooted at the SLCA node can be removed from the SCKS to generate the subtree SCKS'.
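The removal that produces SCKS' can be sketched on Dewey codes. The sketch assumes that "the data fragment rooted at the SLCA node" means the SLCA node together with all of its descendants, which is the reading that reproduces SCKS2' in Embodiment 3; the function names are hypothetical.

```python
def in_fragment(code, root):
    """True when the node is the fragment root or one of its descendants."""
    return code == root or code.startswith(root + ".")


def remove_fragments(scks, slca_nodes):
    """Drop every fragment rooted at an SLCA node, leaving SCKS'."""
    return [c for c in scks if not any(in_fragment(c, r) for r in slca_nodes)]


# SCKS2 and its SLCA node {0.1, article} from Embodiment 3.
scks2 = ["0", "0.0", "0.1", "0.0.3", "0.1.0", "0.1.2",
         "0.0.3.0", "0.1.0.0", "0.1.2.0"]
print(remove_fragments(scks2, ["0.1"]))
# ['0', '0.0', '0.0.3', '0.0.3.0']
```

When a block has no SLCA node, the list of roots is empty and SCKS' equals SCKS.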
In step S205, the SLCA node and the subtree SCKS' are sent to the Reduce device, so that the Reduce device computes the retrieval result from the SLCA node and the subtree SCKS'.
In this embodiment of the present invention, the Map device sends the SLCA node and the subtree SCKS' computed for each block to the Reduce device. The Reduce device calls the Reduce program, which first saves the received SLCA nodes and denotes them P1, then merges the SCKS' corresponding to each block to generate SCKS'', then performs the SLCA computation on this SCKS'' to obtain SLCA nodes, denoted P2, and finally merges P1 and P2 to obtain the final retrieval result.
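The Reduce-side procedure can be sketched in Python under a simplifying assumption about the data exchanged: each block's output is modelled as a pair (SLCA nodes, hits), where `hits` maps every query term to the Dewey codes of the matching value nodes that remain in that block's SCKS'. This representation, the function names and the brute-force SLCA computation are hypothetical and serve only to show the P1/P2 merge.

```python
from itertools import product


def lca(codes):
    """Longest common prefix of several Dewey codes."""
    prefix = []
    for column in zip(*(c.split(".") for c in codes)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return ".".join(prefix)


def slca(hits_per_term):
    """SLCA nodes over all combinations taking one hit per query term."""
    lcas = {lca(combo) for combo in product(*hits_per_term)}
    return {s for s in lcas if not any(w.startswith(s + ".") for w in lcas)}


def reduce_blocks(block_outputs):
    p1 = set()      # SLCA nodes already found by the Mappers
    merged = {}     # stands in for SCKS'': hits of every block, merged per term
    for slca_nodes, hits in block_outputs:
        p1 |= set(slca_nodes)
        for term, codes in hits.items():
            merged.setdefault(term, []).extend(codes)
    p2 = slca(list(merged.values())) if merged else set()
    return p1 | p2  # final retrieval result


# Embodiment 2: no block has an SLCA node of its own.
embodiment2 = [
    (set(), {"author": ["0.0.1.0"], "year": []}),
    (set(), {"author": [], "year": ["0.0.3.0", "0.1.2.0"]}),
    (set(), {"author": [], "year": []}),
]
print(reduce_blocks(embodiment2))  # {'0.0'}
```

For Embodiment 2 the merge yields the LCA nodes 0.0 and 0, of which only 0.0 is an SLCA node, so the result is {0.0, article}.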
In this embodiment, MapReduce performs distributed parallel computation over massive XML files, so keyword retrieval over large volumes of data can be supported. In addition, determining the type of each node yields nodes that better match the user's needs, and extending the query conditions enables keyword retrieval with semantic recognition, which makes the retrieval result more accurate. The method of this embodiment has O(n) computational efficiency and achieves efficient, accurate keyword retrieval over massive XML data sets.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disc.
Embodiment 2
Embodiment 2 of the present invention describes the process of querying with the keyword retrieval statement Q "author=Ling year>=2000", as follows:
Step 1: the Mapper queries Block1, Block2 and Block3 for the value nodes that satisfy the keyword retrieval statement.
The Map device traverses the data of each block shown in Fig. 3 starting from the physical offset at which the block begins, until it reads the end-of-block marker. When it reads the tag author and the node is an attribute node, it attempts to read the value node under it and checks whether the value is Ling; when it reads the tag year and the node is an attribute node, it attempts to read the value node under it and checks whether the value is >= 2000. The value nodes finally found in each block are as follows:
The value node set of Block1 is [{0.0.1.0, Ling}];
The value node set of Block2 is [{0.0.3.0, 2011}, {0.1.2.0, 2000}];
The value node set of Block3 is [].
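The per-node check performed in this step can be sketched as a small predicate. It is an illustration only: the function `matches` and its argument order are hypothetical, the `=` operator is taken from the example statements, and the sketch assumes that `>` means "contains" whenever either side is not numeric and "greater than" otherwise, since the source uses the same symbol for both.

```python
def _num(text):
    """Return the text as a float, or None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def matches(op, expected, actual):
    """Check one value node's text against one semantic substatement."""
    if op == "=":
        return actual == expected
    if op == ":":                      # range written as low-high
        low, _, high = expected.partition("-")
        lo, hi, a = _num(low), _num(high), _num(actual)
        return None not in (lo, hi, a) and lo <= a <= hi
    e, a = _num(expected), _num(actual)
    if e is None or a is None:         # non-numeric: only "contains" applies
        return op == ">" and expected in actual
    return {"<": a < e, ">": a > e, "<=": a <= e, ">=": a >= e}[op]


print(matches("=", "Ling", "Ling"))        # True
print(matches(">=", "2000", "2011"))       # True
print(matches(">", "Ling", "Ling's war"))  # True
print(matches(":", "1990-2013", "2000"))   # True
```

The four calls correspond to the checks needed for the example statements: author equals Ling, year at least 2000, title contains Ling, and year within 1990 to 2013.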
Step 2: the Map device performs the completion operation on the value nodes in each value node set, and obtains the structurally complete keyword subtrees corresponding to Block1, Block2 and Block3 respectively:
SCKS1=[{0, dblp}, {0.0, article}, {0.0.1, author}, {0.0.1.0, Ling}];
SCKS2=[{0, dblp}, {0.0, article}, {0.1, article}, {0.0.3, year}, {0.1.2, year}, {0.0.3.0, 2011}, {0.1.2.0, 2000}];
SCKS3=[].
Step 3: the Map device performs the SLCA computation on SCKS1, SCKS2 and SCKS3 to compute the SLCA nodes in SCKS1 and SCKS2, and sends the SCKS' and the SLCA nodes to the Reduce device.
In this embodiment, neither SCKS1 nor SCKS2 has an SLCA node, so SCKS'=SCKS.
Step 4: the Reduce device merges the SCKS' corresponding to each block to form SCKS'', performs the SLCA computation on SCKS'', and denotes the resulting SLCA node P2.
Because no block produced an SLCA node, P1=[].
In this embodiment, the merged SCKS'' is shown in Fig. 4. The SLCA computation on SCKS'' searches upward for the lowest common ancestor starting from [{0.0.1.0, Ling}, {0.0.3.0, 2011}] and from [{0.0.1.0, Ling}, {0.1.2.0, 2000}] respectively. It is easy to see that the LCA nodes are {0.0, article} and {0, dblp}, so the SLCA node is {0.0, article}, which is denoted P2.
Step 5: P1 and P2 are merged to obtain the final result, namely {0.0, article}.
In this embodiment, the query with the keyword retrieval statement Q "author=Ling year>=2000" includes semantic retrieval; the query result is exactly what the user needs, and the retrieval accuracy is high.
Embodiment 3
Embodiment 3 of the present invention describes the process of querying with the keyword retrieval statement Q "title > Ling year:1990-2013", as follows:
Step 1: the Map device queries Block1, Block2 and Block3 for the value nodes that satisfy the keyword retrieval statement.
The value node set of Block1 is [];
The value node set of Block2 is [{0.0.3.0, 2011}, {0.1.0.0, Ling's war}, {0.1.2.0, 2000}];
The value node set of Block3 is [].
Step 2: obtain the structurally complete keyword subtrees corresponding to Block1, Block2 and Block3.
SCKS1=[].
SCKS2=[{0, dblp}, {0.0, article}, {0.1, article}, {0.0.3, year}, {0.1.0, title}, {0.1.2, year}, {0.0.3.0, 2011}, {0.1.0.0, Ling's war}, {0.1.2.0, 2000}].
SCKS3=[].
Step 3: The SLCA node in SCKS2 is computed, namely {0.1, article}.
Step 4: The data fragment rooted at the SLCA node {0.1, article} is removed from SCKS2, giving SCKS2' = [{0, dblp}, {0.0, article}, {0.0.3, year}, {0.0.3.0, 2011}], and SCKS2' and the SLCA node are sent to the Reduce device.
Step 5: The Reduce device records the SLCA node as P1; in this embodiment, P1 = [{0.1, article}].
Step 6: The SCKS' of each block are merged to form SCKS''. The SLCA computation is performed on SCKS'' to obtain an SLCA node, which is recorded as P2.
In this embodiment, P2 = [].
Step 7: P1 and P2 are merged to obtain the final result, namely {0.1, article}.
In this embodiment, the keyword search statement Q is title > Ling year:1990-2013, which includes semantic retrieval. Compared with Embodiment 2, it additionally includes keyword search over a range.
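Steps 5 to 7 on the Reduce side can be illustrated with a minimal Python sketch. It assumes that each block delivers a pair (SLCA nodes, SCKS'), that an SCKS' is a dict from Dewey tuples to labels, and that each query term is re-evaluated on the merged SCKS'' as a predicate over a value node and the tag of its parent. The function names and the predicate representation are hypothetical, and the brute-force SLCA enumeration is for illustration only.

```python
from itertools import product

def common_prefix(a, b):
    """Longest common prefix of two Dewey tuples, i.e. their LCA."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return tuple(out)

def is_proper_prefix(a, b):
    """True if a is a proper ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a

def slca(match_lists):
    """Keep the LCA candidates that have no other candidate below them."""
    if not match_lists:
        return []
    candidates = set()
    for combo in product(*match_lists):
        node = combo[0]
        for other in combo[1:]:
            node = common_prefix(node, other)
        candidates.add(node)
    return sorted(c for c in candidates
                  if not any(is_proper_prefix(c, d) for d in candidates))

def reduce_step(block_outputs, predicates):
    p1, merged = [], {}
    for slca_nodes, scks_prime in block_outputs:
        p1.extend(slca_nodes)        # Step 5: collect block-level SLCA nodes as P1
        merged.update(scks_prime)    # Step 6: merge every SCKS' into SCKS''
    # Value nodes are taken to be the leaves of SCKS''.
    leaves = [n for n in merged
              if not any(is_proper_prefix(n, m) for m in merged)]
    match_lists = [[n for n in leaves if pred(merged.get(n[:-1]), merged[n])]
                   for pred in predicates]
    p2 = slca(match_lists)           # Step 6: SLCA computation on SCKS''
    return sorted(set(p1) | set(p2)) # Step 7: merge P1 and P2

# Data of Embodiment 3: only Block2 contributes.
block_outputs = [
    ([], {}),
    ([(0, 1)], {(0,): "dblp", (0, 0): "article",
                (0, 0, 3): "year", (0, 0, 3, 0): "2011"}),
    ([], {}),
]
predicates = [
    lambda tag, value: tag == "title" and "Ling" in value,
    lambda tag, value: tag == "year" and 1990 <= int(value) <= 2013,
]
result = reduce_step(block_outputs, predicates)
print(result)  # [(0, 1)]
```

Because no title node remains in SCKS'', one term has no match, P2 stays empty, and the final result is the single node 0.1.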
Embodiment 4
Embodiment 4 of the present invention describes the process of querying with the keyword search statement Q = Ling Wei. The details are as follows:
Step 1: The Map device queries Block1, Block2 and Block3 for the value nodes that satisfy the keyword search statement.
The value node set of Block1 is [{0.0.1.0, Ling}, {0.0.2.0, Wei}];
The value node set of Block2 is [{0.1.0.0, Ling's war}, {0.1.1.0, Wei}];
The value node set of Block3 is [{0.2.1.0, Wei}].
Step 2: The structurally complete keyword subtree (SCKS) corresponding to each of Block1, Block2 and Block3 is obtained. SCKS1 = [{0, dblp}, {0.0, article}, {0.0.1, author}, {0.0.2, author}, {0.0.1.0, Ling}, {0.0.2.0, Wei}].
SCKS2 = [{0, dblp}, {0.1, article}, {0.1.0, title}, {0.1.1, author}, {0.1.0.0, Ling's war}, {0.1.1.0, Wei}].
SCKS3 = [{0, dblp}, {0.2, proceedings}, {0.2.1, editor}, {0.2.1.0, Wei}].
Step 3: The SLCA nodes are computed: the SLCA node of SCKS1 is {0.0, article}, the SLCA node of SCKS2 is {0.1, article}, and SCKS3 has no SLCA node.
Step 4: The data fragment rooted at the SLCA node {0.0, article} is removed from SCKS1, giving SCKS1' = [{0, dblp}, {0.0.1, author}, {0.0.2, author}, {0.0.1.0, Ling}, {0.0.2.0, Wei}]; the data fragment rooted at the SLCA node {0.1, article} is removed from SCKS2, giving SCKS2' = [{0, dblp}, {0.1.0, title}, {0.1.1, author}, {0.1.0.0, Ling's war}, {0.1.1.0, Wei}]; and SCKS3' = SCKS3. The SCKS' and the SLCA node of each block are then sent to the Reduce device.
Step 5: The Reduce device records the SLCA nodes of the structurally complete keyword subtrees of the blocks as P1; in this embodiment, P1 = [{0.0, article}, {0.1, article}].
Step 6: The SCKS' of each block are merged to form SCKS''. The SLCA computation is performed on SCKS'' to obtain an SLCA node, which is recorded as P2.
In this embodiment, P2 = [].
Step 7: P1 and P2 are merged to obtain the final result, [{0.0, article}, {0.1, article}].
In this embodiment, the keyword search statement Q is Ling Wei. This query contains no semantic sub-statement, so no semantic analysis is performed at query time.
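Step 1 of this embodiment can be illustrated with a minimal Python sketch. Because the value node {0.1.0.0, Ling's war} is returned for the keyword Ling, the sketch assumes that a plain keyword matches a value node when it occurs as a case-sensitive substring of the node's text. The function name match_value_nodes and the choice of which value nodes belong to which block are assumptions made only for illustration.

```python
def match_value_nodes(value_nodes, keywords):
    """Return the value nodes whose text contains at least one keyword.

    value_nodes: list of (dewey_label, text) pairs found in one block.
    keywords:    plain keywords of the query (no semantic sub-statements).
    """
    return [(label, text) for label, text in value_nodes
            if any(keyword in text for keyword in keywords)]

keywords = "Ling Wei".split()

# Assumed value nodes per block (block contents are illustrative).
block1 = [("0.0.1.0", "Ling"), ("0.0.2.0", "Wei"), ("0.0.3.0", "2011")]
block2 = [("0.1.0.0", "Ling's war"), ("0.1.1.0", "Wei"), ("0.1.2.0", "2000")]
block3 = [("0.2.1.0", "Wei")]

for block in (block1, block2, block3):
    print(match_value_nodes(block, keywords))
```

Under these assumptions the three calls return the value node sets listed in Step 1; the year nodes are dropped because they contain neither keyword.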
Embodiment 5
Fig. 5 shows a structural block diagram of the keyword search device provided by Embodiment 5 of the present invention. For ease of description, only the parts relevant to this embodiment are shown. The keyword search device 5 is a software unit, a hardware unit, or a combined software and hardware unit built into the Map device, and comprises: a retrieval request receiving unit 51, a file information acquiring unit 52, a blocking unit 53, an SCKS and SLCA node generation unit 54, and a sending unit 55.
The retrieval request receiving unit 51 is configured to receive an input keyword search request, the request comprising the ID of a target file and a keyword search statement;
The file information acquiring unit 52 is configured to read file information according to the ID of the target file;
The blocking unit 53 is configured to divide the target file into blocks;
The SCKS and SLCA node generation unit 54 is configured to run the Mapper program on each block, so that the Mapper program obtains, by querying according to the keyword search statement and the file information, the structurally complete keyword subtree (SCKS) and the smallest lowest common ancestor (SLCA) node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate the subtree SCKS';
The sending unit 55 is configured to send the SLCA node and the subtree SCKS' to the Reduce device, so that the Reduce device computes the retrieval result according to the SLCA node and the subtree SCKS'.
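The removal of the data fragment rooted at the SLCA node, performed by unit 54, can be illustrated with a minimal Python sketch. It assumes that an SCKS is held as a dict from Dewey labels to node labels, so that the fragment rooted at a node consists of that node plus every label that extends it. The function name remove_fragment is hypothetical, and the data is the SCKS2 of Embodiment 3.

```python
def remove_fragment(scks, slca_label):
    """Drop the SLCA node and all of its descendants from an SCKS.

    The trailing dot keeps '0.1' from matching an unrelated label like '0.10'.
    """
    prefix = slca_label + "."
    return {label: name for label, name in scks.items()
            if label != slca_label and not label.startswith(prefix)}

scks2 = {
    "0": "dblp", "0.0": "article", "0.1": "article",
    "0.0.3": "year", "0.1.0": "title", "0.1.2": "year",
    "0.0.3.0": "2011", "0.1.0.0": "Ling's war", "0.1.2.0": "2000",
}
scks2_prime = remove_fragment(scks2, "0.1")
print(scks2_prime)
```

The result keeps the nodes 0, 0.0, 0.0.3 and 0.0.3.0, which is the SCKS2' given in Step 4 of Embodiment 3.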
Specifically, the file information comprises the node information of all nodes of the target file. The node information comprises the unique identifier of the node, the tag name of the node, and the type of the node. The node types are entity node, attribute node, connection node and value node.
Specifically, the SCKS and SLCA node generation unit 54 comprises:
A value node set generation module, configured to search each block, according to the node information, for the value nodes that satisfy the keyword search statement, and to generate a value node set;
An SCKS generation module, configured to perform the completion operation on the value nodes in the value node set in turn according to the node information, to obtain the structurally complete keyword subtree (SCKS) corresponding to each block;
An LCA node acquisition module, configured to find the common ancestors of the value nodes in the SCKS corresponding to each block, and to take these common ancestors as lowest common ancestor (LCA) nodes;
An SLCA node acquisition module, configured to compare the ancestor-descendant relationships among the LCA nodes, and to take the LCA node of the lowest generation, that is, the deepest one, as the smallest lowest common ancestor (SLCA) node.
Specifically, the SCKS generation module comprises:
A first value node locating submodule, configured to position to the first value node in the value node set;
A unique identifier obtaining submodule, configured to read the unique identifier of the value node;
A completion operation submodule, configured to complete upward according to the unique identifier that was read, and to store the completed result in the corresponding structurally complete keyword subtree (SCKS);
A second value node locating submodule, configured to position to the next value node in the value node set, and to recursively invoke the unique identifier obtaining submodule and the completion operation submodule until the last value node has completed the completion operation.
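The completion operation carried out by these submodules can be illustrated with a minimal Python sketch. It assumes that the unique identifier is a Dewey-style label, so that the ancestors of a value node are the proper prefixes of its label, and that the node information is available as a dict from label to tag name or value. The function name complete and the dict representation are hypothetical; the data is Block1 of Embodiment 4.

```python
def complete(value_node_labels, node_info):
    """Build an SCKS by adding every value node together with its ancestors.

    value_node_labels: Dewey labels of the value nodes, visited in order.
    node_info:         dict mapping each label to its tag name or value.
    """
    scks = {}
    for label in value_node_labels:
        parts = label.split(".")
        # Walk upward from the value node to the root.
        for depth in range(len(parts), 0, -1):
            ancestor = ".".join(parts[:depth])
            if ancestor in scks:
                # Already completed by an earlier value node,
                # so everything above it is present as well.
                break
            scks[ancestor] = node_info[ancestor]
    return scks

node_info = {
    "0": "dblp", "0.0": "article",
    "0.0.1": "author", "0.0.1.0": "Ling",
    "0.0.2": "author", "0.0.2.0": "Wei",
}
scks1 = complete(["0.0.1.0", "0.0.2.0"], node_info)
print(sorted(scks1.items()))
```

The resulting node set equals the SCKS1 of Embodiment 4; only the order in which the nodes are stored differs.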
Specifically, the keyword search statement Q satisfies Q → T_N | T_S | Q T_N | Q T_S, where → means "is defined as", | means "or", T_S is a keyword sub-statement that carries semantics, T_N is a T that does not belong to T_S, and T is a non-empty string containing no blank characters. T_S comprises T > T | T<T | T>T | T<=T | T>=T | T:T-T, where > in the first form means "contains" and T:T-T denotes a specified interval range.
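The split of a statement Q into T_N and T_S sub-statements can be illustrated with a minimal Python sketch. It assumes that sub-statements are separated by blanks, handles the comparison forms and the range form T:T-T, and also accepts =, which appears in the query of Embodiment 2. The "contains" form is left out of the sketch. The function names and the tuple layout of the result are hypothetical.

```python
import re

# Range form, e.g. year:1990-2013
RANGE = re.compile(r"^([^:\s]+):([^-\s]+)-(\S+)$")
# Comparison form, e.g. year>=2000; two-character operators are tried first.
COMPARE = re.compile(r"^([^<>=\s]+)(<=|>=|<|>|=)(\S+)$")

def classify(token):
    """Classify one blank-free sub-statement as T_S (semantic) or T_N (plain)."""
    m = RANGE.match(token)
    if m:
        return ("T_S", "range", m.group(1), (m.group(2), m.group(3)))
    m = COMPARE.match(token)
    if m:
        return ("T_S", m.group(2), m.group(1), m.group(3))
    return ("T_N", token)

def parse_query(query):
    """Split Q on blanks and classify each sub-statement."""
    return [classify(token) for token in query.split()]

print(parse_query("Ling Wei"))
print(parse_query("author=Ling year>=2000"))
print(parse_query("year:1990-2013"))
```

A statement made only of T_N sub-statements, such as Ling Wei, needs no semantic analysis, while each T_S sub-statement carries a field name, an operator, and an operand or interval.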
In this embodiment, MapReduce is used to perform distributed parallel computation over massive XML files, so that keyword search over large data volumes can be supported. In addition, determining the type of each node yields nodes that better match the user's needs. Furthermore, extending the query conditions enables keyword search with semantic recognition, which makes the retrieval results more accurate. The method of this embodiment has a time complexity of O(n) and achieves efficient and accurate keyword search over massive XML data sets.
The keyword search device provided by this embodiment of the present invention can be applied in the corresponding method Embodiment 1 above. For details, refer to the description of Embodiment 1, which is not repeated here.
Embodiment 6
Fig. 6 shows a structural block diagram of the keyword search system provided by Embodiment 6 of the present invention. For ease of description, only the parts relevant to this embodiment are shown. The keyword search system 6 is a system on a cloud computing platform based on the cloud computing framework MapReduce. It comprises a Map device 61 storing the Mapper program and a Reduce device 62 storing the Reduce program, where there is at least one Map device and at least one Reduce device. The Map device and the Reduce device are the computers on which the Mapper program and the Reduce program respectively reside, that is, nodes of the cluster in the cloud computing platform.
In this embodiment, MapReduce is used to perform distributed parallel computation over massive XML files, so that keyword search over large data volumes can be supported. In addition, determining the type of each node yields nodes that better match the user's needs. Furthermore, extending the query conditions enables keyword search with semantic recognition, which makes the retrieval results more accurate. The method of this embodiment has a time complexity of O(n) and achieves efficient and accurate keyword search over massive XML data sets.
It should be noted that the units included in the above system embodiment are divided only according to functional logic. The division is not limited to the above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and do not limit the protection scope of the present invention.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (12)

1. A keyword search method, characterized in that the method comprises:
receiving an input keyword search request, the request comprising the ID of a target file and a keyword search statement;
reading file information according to the ID of the target file;
dividing the target file into blocks;
running a Mapper program on each block, so that the Mapper program obtains, by querying according to the keyword search statement and the file information, the structurally complete keyword subtree (SCKS) and the smallest lowest common ancestor (SLCA) node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS';
sending the SLCA node and the subtree SCKS' to a Reduce device, so that the Reduce device computes the retrieval result according to the SLCA node and the subtree SCKS'.
2. the method for claim 1, it is characterized in that, described fileinfo comprises all nodal informations of described file destination, described nodal information comprises the unique identification of node, the type of the tag name of node, node, and the type of described node comprises entity node, attribute node, connected node and value node.
3. The method according to claim 2, characterized in that running the Mapper program on each block, so that the Mapper program obtains, by querying according to the keyword search statement and the file information, the structurally complete keyword subtree (SCKS) and the smallest lowest common ancestor (SLCA) node corresponding to each block, comprises:
searching each block, according to the node information, for the value nodes that satisfy the keyword search statement, and generating a value node set;
performing the completion operation on the value nodes in the value node set in turn according to the node information, to obtain the structurally complete keyword subtree (SCKS) corresponding to each block;
finding the common ancestors of the value nodes in the SCKS corresponding to each block, and taking the common ancestors as lowest common ancestor (LCA) nodes;
comparing the ancestor-descendant relationships among the LCA nodes, and taking the LCA node of the lowest generation as the smallest lowest common ancestor (SLCA) node.
4. The method according to claim 3, characterized in that performing the completion operation on the value nodes in the value node set in turn according to the node information, to obtain the structurally complete keyword subtree (SCKS) corresponding to each block, comprises:
Step 1: positioning to the first value node in the value node set;
Step 2: reading the unique identifier of the value node;
Step 3: completing upward according to the unique identifier, and storing the completed result in the corresponding structurally complete keyword subtree (SCKS);
Step 4: positioning to the next value node in the value node set, and recursively performing steps 2 to 3 until the last value node has completed the completion operation.
5. the method for claim 1, is characterized in that, described Reduce equipment calculates result for retrieval according to described SLCA node and described subtree SCKS ' and comprises:
Described Reduce equipment calls Reduce program;
Described Reduce program is preserved the SLCA node receiving, and this SLCA node is designated as to P1;
Described Reduce program merges the SCKS ' that each piecemeal is corresponding, generates SCKS ";
Described Reduce program is to described SCKS " carry out SLCA computing, obtain SLCA node, and remember that this node is P2;
Described Reduce program merges described P1 and P2, obtains final result for retrieval.
6. The method according to any one of claims 1 to 5, characterized in that the keyword search statement Q satisfies Q → T_N | T_S | Q T_N | Q T_S, where → means "is defined as", | means "or", T_S is a keyword sub-statement that carries semantics, T_N is a T that does not belong to T_S, and T is a non-empty string containing no blank characters; T_S comprises T > T | T<T | T>T | T<=T | T>=T | T:T-T, where > in the first form means "contains" and T:T-T denotes a specified interval range.
7. A keyword search device, characterized in that the device comprises:
a retrieval request receiving unit, configured to receive an input keyword search request, the request comprising the ID of a target file and a keyword search statement;
a file information acquiring unit, configured to read file information according to the ID of the target file;
a blocking unit, configured to divide the target file into blocks;
an SCKS and SLCA node generation unit, configured to run a Mapper program on each block, so that the Mapper program obtains, by querying according to the keyword search statement and the file information, the structurally complete keyword subtree (SCKS) and the smallest lowest common ancestor (SLCA) node corresponding to each block, and removes from the SCKS the data fragment rooted at the SLCA node to generate a subtree SCKS';
a sending unit, configured to send the SLCA node and the subtree SCKS' to a Reduce device, so that the Reduce device computes the retrieval result according to the SLCA node and the subtree SCKS'.
8. The device according to claim 7, characterized in that the file information comprises the node information of all nodes of the target file, the node information comprises the unique identifier of the node, the tag name of the node and the type of the node, and the node types are entity node, attribute node, connection node and value node.
9. The device according to claim 8, characterized in that the SCKS and SLCA node generation unit comprises:
a value node set generation module, configured to search each block, according to the node information, for the value nodes that satisfy the keyword search statement, and to generate a value node set;
an SCKS generation module, configured to perform the completion operation on the value nodes in the value node set in turn according to the node information, to obtain the structurally complete keyword subtree (SCKS) corresponding to each block;
an LCA node acquisition module, configured to find the common ancestors of the value nodes in the SCKS corresponding to each block, and to take the common ancestors as lowest common ancestor (LCA) nodes;
an SLCA node acquisition module, configured to compare the ancestor-descendant relationships among the LCA nodes, and to take the LCA node of the lowest generation as the smallest lowest common ancestor (SLCA) node.
10. The device according to claim 9, characterized in that the SCKS generation module comprises:
a first value node locating submodule, configured to position to the first value node in the value node set;
a unique identifier obtaining submodule, configured to read the unique identifier of the value node;
a completion operation submodule, configured to complete upward according to the unique identifier, and to store the completed result in the corresponding structurally complete keyword subtree (SCKS);
a second value node locating submodule, configured to position to the next value node in the value node set, and to recursively invoke the unique identifier obtaining submodule and the completion operation submodule until the last value node has completed the completion operation.
11. The device according to any one of claims 7 to 10, characterized in that the keyword search statement Q satisfies Q → T_N | T_S | Q T_N | Q T_S, where → means "is defined as", | means "or", T_S is a keyword sub-statement that carries semantics, T_N is a T that does not belong to T_S, and T is a non-empty string containing no blank characters; T_S comprises T > T | T<T | T>T | T<=T | T>=T | T:T-T, where > in the first form means "contains" and T:T-T denotes a specified interval range.
12. A keyword search system, characterized in that the keyword search system is a system on a cloud computing platform based on the cloud computing framework MapReduce and comprises a Map device storing a Mapper program and a Reduce device storing a Reduce program, the Map device comprising the keyword search device according to any one of claims 7 to 11.
CN201310503091.6A 2013-10-23 2013-10-23 Method, device and system for retrieving keywords Pending CN103544281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310503091.6A CN103544281A (en) 2013-10-23 2013-10-23 Method, device and system for retrieving keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310503091.6A CN103544281A (en) 2013-10-23 2013-10-23 Method, device and system for retrieving keywords

Publications (1)

Publication Number Publication Date
CN103544281A true CN103544281A (en) 2014-01-29

Family

ID=49967733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310503091.6A Pending CN103544281A (en) 2013-10-23 2013-10-23 Method, device and system for retrieving keywords

Country Status (1)

Country Link
CN (1) CN103544281A (en)



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140129