Summary of the invention
Technical problem: the present invention provides a user-friendly database query method based on an ontology and natural language processing, which automatically generates an SQL query statement from the natural language entered by the user and returns the query result to the user.
Technical solution: in the database query method of the present invention, based on an ontology and natural language processing, an ontology is first extracted from the relational schema of the database. Each relational table is abstracted into a concrete class, and classes may be related by inheritance, equivalence and similar relations; each column of a relational table is abstracted into a data type property of the ontology; and each foreign key of the relational database is converted into an object property of the ontology. The ontology is then converted into a graph data structure. Finally, natural language processing techniques are applied to the restricted natural language entered by the user: the input is segmented into words, a keyword index is built, connected graphs linking the keywords are searched for, and the result is converted to SQL, thereby translating natural language into SQL (a database query and programming language).
The database query method of the present invention, based on an ontology and natural language processing, comprises the following steps:
1) Convert the ontology constructed from the relational schema of the database into a graph data structure: each class in the ontology is converted into a class node; each data type property is converted into a property node, which has one edge connecting it to the class node it belongs to; and each object property is converted into an edge connecting two class nodes;
2) Build the dedicated word segmentation dictionary and the keyword index: read every record in the database in turn, and add each value read to the dictionary used to segment user queries. As each record is read, its value is also used as a key, and the names of the relational table and column in which the value occurs are used as the value, forming a key-value pair. The key-value pairs are stored in a non-relational database as the keyword index, which allows a given keyword to be located quickly and improves search efficiency;
3) After the system receives a restricted natural language query from the user, use the dedicated dictionary built in step 2) to decompose the query into multiple meaningful keywords;
4) Use each keyword obtained in step 3) as a key and look up the corresponding value in the keyword index, that is, find the relational table name and column name corresponding to the keyword. Then find, in the graph data structure generated in step 1), the nodes corresponding to all of these table names and column names, and finally extract the connected components containing these nodes from the graph data structure as the search space;
5) Traverse the connected components of the search space built in step 4) and find all connected subgraphs that link together all of the relevant keywords in the search space. If no connected subgraph satisfies this condition, find the connected subgraphs that contain as many keywords as possible. Sort the connected subgraphs found in descending order of the number of keywords they contain; subgraphs containing the same number of keywords are further sorted in ascending order of the number of edges they contain. Finally, select the k highest-ranked connected subgraphs, where k is determined by the size of the database and the total number of connected subgraphs found;
6) Convert the k connected subgraphs selected in step 5), in ranked order, into SQL statements according to the following rules: the Select clause is filled with *, indicating that all columns are returned; the class nodes of the connected subgraph are written into the From clause; each edge connecting two class nodes is converted into a foreign key relation and written into the Where clause; and each keyword entered by the user is written into the Where clause according to its corresponding relational table name and column name;
After the SQL statements are generated, the database is queried and the query results are returned to the user.
In a preferred embodiment of the present invention, the non-relational database in step 2) is a MongoDB database.
Beneficial effects: compared with the prior art, the present invention has the following advantages:
The method of the present invention uses the ontology to store the semantic information between the relational tables of the database, quickly constructs the search space by means of the keyword index, finds in the search space one or more minimal connected graphs containing all of the user's query keywords, and converts these minimal connected graphs into SQL query statements according to fixed rules, thereby effectively realizing semantic queries.
In this method, the user's restricted natural language query is decomposed into multiple meaningful keywords, and the latent associations between keywords are discovered by searching the corresponding graph data structure. The method therefore involves neither part-of-speech tagging nor syntactic analysis, and is not restricted to particular sentence patterns. For example, the queries "student number of Zhang San" and "Zhang San of student number" are both decomposed by the system into the keywords "Zhang San" and "student number", so the two queries return exactly the same result.
In addition, unlike traditional methods based on the database E-R model, this method proposes a novel approach: the ontology extracted from the relational schema of the database is converted into a graph data structure, and by traversing this graph data structure, the connected subgraphs corresponding to the keywords in the user's query are found, realizing natural language queries on the database. A large database may contain tens of thousands of records, yet its relational schema is comparatively simple. Relative to the data, the relational schema is very small, and every record in the database must have a corresponding relational schema, so the schema can be regarded as a unified standard abstracted from the data, to which every record in the database conforms. This method extracts an ontology from the relational schema of the database, converts the ontology into a graph data structure, and uses the keyword index to extract from the graph all connected components corresponding to the keywords as the search space, which greatly narrows the search scope and thus greatly improves search efficiency.
The present invention uses a non-relational database to store the keyword index. A non-relational database has a flexible storage structure and is not bound by the ACID transaction constraints of relational databases (Atomicity, Consistency, Isolation, Durability), so it has a strong advantage over a traditional relational database for storing the keyword index.
Case analysis shows that, with the natural language database query method proposed by the present invention, the user does not need a detailed understanding of the database or any knowledge of the SQL query language, and can query the database simply by entering restricted natural language. The automatic conversion from natural language to SQL is fully transparent to the user. The present invention is a user-friendly method, and verification shows that it is feasible.
Detailed description of the invention
The implementation of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
The relational database query method of the present invention, based on an ontology and restricted natural language processing, comprises the following six steps:
1) Convert the ontology into a PCDO graph data structure: an ontology (Ontology) is a specification proposed by the World Wide Web Consortium (W3C) for describing the various kinds of resource information on the World Wide Web. The ontology in the present invention is built from the schema information of the relational database according to fixed rules and describes the various kinds of resource information in the database. The ontology construction rules are as follows:
(a) Build the classes (class) of the ontology: for every relational table in the relational database, construct one corresponding class, so that classes and relational tables correspond one to one;
(b) Build the data type properties (dataTypeProperty) of the ontology: for every column c in each relational table t, construct one data type property corresponding to c; the class this data type property belongs to is the class corresponding to relational table t;
(c) Build the object properties (objectProperty) of the ontology: for every foreign key f associating two relational tables t1 and t2 in the relational database, construct one object property corresponding to f between the two classes corresponding to t1 and t2; the two classes connected by this object property are the classes corresponding to t1 and t2.
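Rules (a) to (c) can be illustrated with a minimal Python sketch. The `Ontology` container, the `build_ontology` function and the `table_column` naming of properties are assumptions made for illustration; the description does not prescribe a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # Hypothetical in-memory ontology; not a structure defined in the text.
    classes: set = field(default_factory=set)
    # data type property name -> name of the class it belongs to
    datatype_properties: dict = field(default_factory=dict)
    # object property name -> the two classes it connects
    object_properties: dict = field(default_factory=dict)


def build_ontology(tables, foreign_keys):
    """Apply rules (a)-(c) to a relational schema.

    tables: {table name: [column names]}
    foreign_keys: [((table1, column1), (table2, column2)), ...]
    """
    onto = Ontology()
    for table, columns in tables.items():
        onto.classes.add(table)                                    # rule (a)
        for column in columns:
            # Assumed naming convention: "<table>_<column>".
            onto.datatype_properties[f"{table}_{column}"] = table  # rule (b)
    for (t1, c1), (t2, c2) in foreign_keys:
        onto.object_properties[f"{t1}_{c1}_{t2}_{c2}"] = (t1, t2)  # rule (c)
    return onto


# Part of the student course-selection schema used in the application example.
tables = {
    "student": ["name", "student number"],
    "curricula-variable": ["course code name", "student number"],
}
foreign_keys = [
    (("curricula-variable", "student number"), ("student", "student number")),
]
onto = build_ontology(tables, foreign_keys)
print(sorted(onto.classes))
print(sorted(onto.datatype_properties))
print(onto.object_properties)
```

In a real system the `tables` and `foreign_keys` inputs would be read from the database catalog rather than written by hand.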
The ontology contains only classes, object properties and data type properties. To make full use of the information in the ontology, we propose the PCDO graph data structure (hereinafter the PCDO graph). The PCDO graph consists mainly of two data structures: the node (Node) and the edge (Edge).
The data structure of a node is shown in Table 1:
Table 1: Data structure of a node
The node data structure has six attributes: Type, Name, Edges, Keyword, Value and KeywordType. The Type attribute identifies the type of the node, which is either class node (C_Node) or property node (P_Node); the Name attribute identifies the name of the node; the Edges attribute records all edges adjacent to the current node; the Keyword, Value and KeywordType attributes are all left empty during the conversion in step 1), and their initialization is described in detail in step 4).
The data structure of an edge is shown in Table 2:
Table 2: Data structure of an edge
The edge data structure has four attributes: Type, Name, Node1 and Node2. The Type attribute identifies the type of the edge, which is either data type property edge (D_Edge) or object property edge (O_Edge); the Name attribute identifies the name of the edge; the Node1 and Node2 attributes record the two nodes connected by the current edge. Because the PCDO graph is an undirected graph, the values of Node1 and Node2 are interchangeable.
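The node and edge structures described above might be written as Python dataclasses as follows. This is only a sketch: the lowercase field names mirror the attributes Type, Name, Edges, Keyword, Value, KeywordType, Node1 and Node2, while the `other` helper is an addition for illustration and is not part of the described structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    type: str                            # "C_Node" or "P_Node"
    name: str
    edges: List["Edge"] = field(default_factory=list)
    keyword: Optional[str] = None        # left empty until step 4)
    value: Optional[int] = None          # left empty until step 4)
    keyword_type: Optional[str] = None   # "table", "column" or "value"


@dataclass(eq=False)
class Edge:
    type: str                            # "D_Edge" or "O_Edge"
    name: str
    node1: Node
    node2: Node

    def other(self, node: Node) -> Node:
        """Return the opposite endpoint; the graph is undirected."""
        return self.node2 if node is self.node1 else self.node1


# A class node, one of its property nodes, and the edge joining them.
student = Node("C_Node", "student")
name = Node("P_Node", "student_name")
edge = Edge("D_Edge", "hasProperty", name, student)
student.edges.append(edge)
name.edges.append(edge)
print(edge.other(name).name)
```

Because `other` works from either endpoint, swapping `node1` and `node2` does not change how the graph is traversed, which matches the undirected nature of the PCDO graph.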
Steps for converting the ontology into the PCDO graph:
(1) Conversion of class nodes (C_Node): every class in the ontology is converted into a node whose Type attribute is set to "C_Node", whose Name attribute is set to the name of the corresponding class, and whose Edges attribute is set to empty;
(2) Conversion of property nodes (P_Node): every data type property in the ontology is converted into a node whose Type attribute is set to "P_Node", whose Name attribute is set to the name of the corresponding data type property, and whose Edges attribute is set to empty;
(3) Conversion of data type property edges (D_Edge): an edge is added between each property node P and the class node C it belongs to. The Type attribute of the edge is set to "D_Edge", its Node1 and Node2 attributes are set to P and C respectively, and its Name attribute is set to "hasProperty"; the newly created data type property edge is added to the Edges attributes of both node P and node C;
(4) Conversion of object property edges (O_Edge): every object property in the ontology is converted into an edge between the two class nodes C1 and C2 that it connects. The Type attribute of the edge is set to "O_Edge", its Node1 and Node2 attributes are set to C1 and C2 respectively, and its Name attribute is set to the name of the corresponding object property; the newly created object property edge is added to the Edges attributes of both C1 and C2.
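Conversion steps (1) to (4) can be sketched in Python as follows. Nodes and edges are plain dictionaries here, and Node1/Node2 hold node names rather than object references; both choices, like the function name `ontology_to_pcdo`, are assumptions made to keep the example short.

```python
def ontology_to_pcdo(classes, datatype_properties, object_properties):
    """Convert an ontology into PCDO nodes and edges.

    classes: iterable of class names
    datatype_properties: {property name: owning class name}
    object_properties: {property name: (class1, class2)}
    """
    nodes, edges = {}, []

    def add_node(node_type, name):
        nodes[name] = {"Type": node_type, "Name": name, "Edges": [],
                       "Keyword": None, "Value": None, "KeywordType": None}

    def add_edge(edge_type, name, node1, node2):
        edge = {"Type": edge_type, "Name": name, "Node1": node1, "Node2": node2}
        edges.append(edge)
        # Register the new edge on both endpoints.
        nodes[node1]["Edges"].append(edge)
        nodes[node2]["Edges"].append(edge)

    for cls in classes:                                   # step (1)
        add_node("C_Node", cls)
    for prop in datatype_properties:                      # step (2)
        add_node("P_Node", prop)
    for prop, cls in datatype_properties.items():         # step (3)
        add_edge("D_Edge", "hasProperty", prop, cls)
    for name, (cls1, cls2) in object_properties.items():  # step (4)
        add_edge("O_Edge", name, cls1, cls2)
    return nodes, edges


nodes, edges = ontology_to_pcdo(
    classes=["student", "curricula-variable"],
    datatype_properties={
        "student_student number": "student",
        "curricula-variable_student number": "curricula-variable",
    },
    object_properties={
        "curricula-variable_student number_student_student number":
            ("curricula-variable", "student"),
    },
)
print(len(nodes), len(edges))
```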
2) Build the dedicated word segmentation dictionary and the keyword index:
Traverse all relational tables in the database and read each record of each table in turn. Each record value is written to the dictionary; at the same time, the record value is used as a key and the names of the relational table and column in which it occurs are used as the value, forming a key-value pair that serves as an entry of the keyword index. In a preferred embodiment of the present invention, the non-relational database is MongoDB, and the key-value pairs are stored in the MongoDB database as the keyword index. The method of the present invention is of course not limited to MongoDB; any non-relational (NoSQL) database can be used here.
The structure of a key-value pair and an example are shown in Table 3 below. For example, suppose the "student" table has a record whose value in the "name" column is "Zhang San". When the record "Zhang San" is read, "Zhang San" is set as the key, the TableName attribute is set to "student", the relational table name corresponding to "Zhang San", and the ColumnName attribute is set to "name", the name of the column corresponding to "Zhang San". The resulting key-value pair is stored in the keyword index. After the dictionary has been built, it is saved to disk as a file and read from that location whenever it is used. The purpose of the dedicated segmentation dictionary is to extract keywords from the natural language query entered by the user.
Table 3: Key-value pair structure and example
While each relational table is traversed, the table name and all of its column names are also used as keys to build key-value pairs of the form shown in Table 3, which are stored in the keyword index. For a table name, the TableName attribute is set to "table" and the ColumnName attribute is set to the name of the table; for a column name, the TableName attribute is set to "column" and the ColumnName attribute is set to the name of the column.
In addition, a keyword may correspond to multiple elements of the database. For example, "Zhang San" above may also occur in another column of another table, in which case that key-value pair must also be stored in the keyword index. When the user queries "Zhang San", all table names and column names corresponding to the word "Zhang San" can be found quickly by looking up the keyword index.
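The construction of the dictionary and the keyword index can be sketched as follows. An in-memory Python dict stands in for the non-relational database, the input format `{table: {column: [values]}}` is hypothetical, and adding table and column names to the segmentation dictionary is an assumption (the text only states that they are added to the keyword index).

```python
from collections import defaultdict


def build_dictionary_and_index(database):
    """database: {table name: {column name: [record values]}} (hypothetical)."""
    dictionary = set()          # words known to the segmenter
    index = defaultdict(list)   # keyword -> list of index entries

    def add(key, table_name, column_name):
        dictionary.add(key)
        entry = {"TableName": table_name, "ColumnName": column_name}
        if entry not in index[key]:   # one keyword may map to several entries
            index[key].append(entry)

    for table, columns in database.items():
        add(table, "table", table)                # entry for the table name
        for column, values in columns.items():
            add(column, "column", column)         # entry for the column name
            for value in values:
                add(str(value), table, column)    # entry for the record value
    return dictionary, dict(index)


database = {
    "student": {
        "name": ["Zhang San"],
        "student number": ["09131011"],
    },
}
dictionary, index = build_dictionary_and_index(database)
print(index["Zhang San"])
print(index["student"])
```

With MongoDB, each entry would typically become one document keyed by the keyword; the lookup logic stays the same.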
3) Segment the query using the dedicated dictionary: after the system receives the user's natural language query, the dedicated dictionary built in step 2) is used to decompose the query into multiple meaningful keywords;
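The description does not name a segmentation algorithm. As one possible illustration, the sketch below uses forward maximum matching against the dedicated dictionary, which is an assumed choice and not necessarily the method used by the invention; characters that start no dictionary entry are skipped.

```python
def segment(query, dictionary):
    """Forward maximum matching: at each position take the longest
    dictionary entry that starts there, otherwise skip one character."""
    max_len = max(map(len, dictionary), default=0)
    keywords = []
    i = 0
    while i < len(query):
        for size in range(min(max_len, len(query) - i), 0, -1):
            candidate = query[i:i + size]
            if candidate in dictionary:
                keywords.append(candidate)
                i += size
                break
        else:
            i += 1  # no dictionary entry starts at this position
    return keywords


dictionary = {"student number", "09131011", "student", "course", "institute"}
query = ("find the institute of the course selected by "
         "the student whose student number is 09131011")
keywords = segment(query, dictionary)
print(keywords)
```

Because the longest entry wins, "student number" is extracted as one keyword rather than being cut down to "student". A production segmenter would also need to respect word boundaries in space-delimited languages.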
4) Construction of the search space, described with reference to Fig. 2: for each keyword obtained in step 3), all relational table names and column names corresponding to that word in the database can be obtained by querying the keyword index. According to step 1), a relational table name corresponds to a class in the ontology, and that class corresponds to a class node in the PCDO graph; a column name of a relational table corresponds to a data type property in the ontology, and that data type property corresponds to a property node in the PCDO graph. When a keyword corresponds to multiple relational table names and column names, it therefore corresponds to multiple nodes in the PCDO graph. In what follows, a node in the PCDO graph that corresponds to a keyword is called a mapping node of that keyword.
The Keyword attribute of a mapping node is set to the corresponding keyword. If a keyword has multiple mapping nodes, they are distinguished by the Value attribute, which takes a different number for each. The KeywordType attribute takes one of three values, "table", "column" or "value", determined from the keyword index: if the TableName value in the keyword index is "table", KeywordType is set to "table"; if the TableName value is "column", KeywordType is set to "column"; otherwise, KeywordType is set to "value".
The connected components containing all mapping nodes of all keywords are extracted from the PCDO graph built in step 1) to form the search space. Because the search space is necessarily a subset of the PCDO graph, locating the connected components of the mapping nodes through the keyword index effectively narrows the search scope.
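Extracting the search space amounts to collecting the connected components that contain at least one mapping node. The sketch below assumes the PCDO graph is given as an adjacency dictionary of node names; the "teacher" table is a hypothetical addition used only to show a component that is excluded.

```python
from collections import deque


def connected_component(adjacency, start):
    """Return the set of nodes reachable from start (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def build_search_space(adjacency, mapping_nodes):
    """Collect each connected component that holds a mapping node, once."""
    components, covered = [], set()
    for node in mapping_nodes:
        if node not in covered:
            component = connected_component(adjacency, node)
            covered |= component
            components.append(component)
    return components


adjacency = {
    "student": {"student_name", "curricula-variable"},
    "student_name": {"student"},
    "curricula-variable": {"student", "course"},
    "course": {"curricula-variable"},
    "teacher": {"teacher_name"},      # hypothetical table, not in the example
    "teacher_name": {"teacher"},
}
search_space = build_search_space(adjacency, ["student_name", "course"])
print(search_space)
```

The component containing "teacher" holds no mapping node, so it never enters the search space.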
5) Connected subgraph search, described with reference to Fig. 3: within the search space built in step 4), search for connected subgraphs containing all keywords. The main steps are as follows:
a) Randomly select an unprocessed connected component of the search space, find all mapping nodes in this connected component, and put them into a set X.
b) If set X contains n (n >= 2) mapping nodes Node_1, Node_2, ..., Node_n with the same Keyword attribute, then, according to their different Value attributes, expand set X into n sets X_1, X_2, ..., X_n, with X_i containing Node_i; then add all other mapping nodes of X, except Node_1, Node_2, ..., Node_n, to each of X_1, X_2, ..., X_n, and delete set X.
c) Repeat step b) on X_1, X_2, ..., X_n until the Keyword attributes of the mapping nodes within each set are all different, finally obtaining m sets that cannot be expanded further;
d) Select any one of the m sets, denoted W, and select any one of its mapping nodes as the start node of a breadth-first search (BFS);
e) During the BFS traversal, whenever a new mapping node is encountered, record the path from this node to the start node in the set Set, and delete the newly encountered mapping node from W;
f) Repeat e) until all mapping nodes in set W have been traversed. At this point, Set records the nodes and edges that connect all mapping nodes of W, that is, a path associating all mapping nodes of W has been found. This path is a subset of the PCDO graph built in step 1); such a connected path is hereinafter called a PCDO subgraph;
g) Repeat d) until all sets have been processed;
h) Repeat a) until all connected components of the search space have been processed;
i) Sort all PCDO subgraphs obtained in descending order of the number of keywords they contain; PCDO subgraphs containing the same number of keywords are further sorted in ascending order of the number of edges they contain. Finally, select the k highest-ranked PCDO subgraphs. The value of k should be chosen to suit the size of the particular database, or specified by the user; here it simply denotes a suitable number;
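Steps b) to i) can be sketched in Python as follows. This is a simplified illustration under several assumptions: the graph is an adjacency dictionary of node names, the set expansion of steps b) and c) is expressed as a Cartesian product that picks one mapping node per keyword, and the BFS visits the whole reachable region instead of stopping once every mapping node has been found. All function names are hypothetical.

```python
from collections import deque
from itertools import product


def candidate_sets(mapping):
    """Steps b)-c): one mapping node per keyword, in every combination."""
    keywords = sorted(mapping)
    return [dict(zip(keywords, combo))
            for combo in product(*(mapping[k] for k in keywords))]


def connect(adjacency, targets):
    """Steps d)-f): BFS from the first target, then collect the edges on
    the BFS paths back from every reachable target to the start node."""
    start = targets[0]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    covered, edges = 0, set()
    for target in targets:
        if target not in parent:
            continue                      # this keyword cannot be connected
        covered += 1
        node = target
        while parent[node] is not None:
            edges.add(frozenset((node, parent[node])))
            node = parent[node]
    return covered, edges


def top_k_subgraphs(adjacency, mapping, k):
    """Step i): most keywords first, then fewest edges; keep the best k."""
    ranked = []
    for choice in candidate_sets(mapping):
        covered, edges = connect(adjacency, list(choice.values()))
        ranked.append((covered, len(edges), choice, edges))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:k]


adjacency = {
    "student": {"student_name", "curricula-variable"},
    "student_name": {"student"},
    "curricula-variable": {"student", "course"},
    "course": {"curricula-variable"},
    "teacher": {"teacher_name"},          # hypothetical second component
    "teacher_name": {"teacher"},
}
# "Zhang San" is assumed to occur in two columns, so it has two mapping nodes.
mapping = {"Zhang San": ["student_name", "teacher_name"], "course": ["course"]}
best = top_k_subgraphs(adjacency, mapping, 1)[0]
print(best[0], best[1], best[2])
```

The candidate that maps "Zhang San" to "teacher_name" cannot reach "course", so it covers fewer keywords and ranks below the candidate that uses "student_name".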
6) Generate the SQL statements: each of the k PCDO subgraphs obtained in step 5) is converted into an SQL statement of the following form:
Select <query content>
From <data tables>
Where <query conditions>
Rules for converting a PCDO subgraph into SQL:
a) the select clause is filled with "*", indicating that all columns of the records that satisfy the query conditions are returned to the user;
b) the from clause is filled with the relational tables corresponding to all class nodes in the PCDO subgraph;
c) the where clause is filled with the foreign key relations corresponding to the object property edges of the PCDO subgraph (nothing is added if there are no object property edges);
d) the where clause is also filled with the values corresponding to the property nodes in the PCDO subgraph.
After the above SQL statements are generated, the database is queried through the database query interface, and the results are finally returned to the user.
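Rules a) to d) can be sketched as a small Python function. The function name, its argument format and the English table and column names are hypothetical stand-ins; the sketch also uses "?" placeholders for values instead of embedding them in the statement, which is a design choice of this example and not something the description specifies.

```python
def subgraph_to_sql(tables, joins, conditions):
    """Build an SQL statement from the parts of a PCDO subgraph.

    tables: table names of the class nodes                   (rule b)
    joins: [(left column, right column)] per object edge     (rule c)
    conditions: [(column, value)] per mapped property node   (rule d)
    """
    where = [f"{left} = {right}" for left, right in joins]
    params = []
    for column, value in conditions:
        where.append(f"{column} = ?")     # value is passed separately
        params.append(value)
    sql = "SELECT * FROM " + ", ".join(tables)   # rule a: select everything
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


sql, params = subgraph_to_sql(
    tables=["student", "enrollment"],
    joins=[("enrollment.student_id", "student.student_id")],
    conditions=[("student.student_id", "09131011")],
)
print(sql)
print(params)
```

Passing values as parameters keeps user input out of the SQL text; identifiers still come from the schema, so they are trusted here.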
The implementation of the present invention is described in detail below with reference to a simplified application example:
1) Convert the ontology into the PCDO graph: the relational database in this example records information about student course selection and contains the following:
Relational tables: student, curricula-variable, course, institute
Column names: student_name, student_student number, curricula-variable_course code name, curricula-variable_student number, course_department's code name, course_course code name, institute_department's code name, institute_department's remarks
Foreign keys: curricula-variable_student number = student_student number, curricula-variable_course code name = course_course code name, course_department's code name = institute_department's code name
The ontology extracted from the schema information of this relational database is therefore an ontology describing student course selection. It contains the following details:
Classes: student, curricula-variable, course, institute;
Data type properties: student_name, student_student number, curricula-variable_course code name, curricula-variable_student number, course_department's code name, course_course code name, institute_department's code name, institute_department's remarks;
Object properties: student_student number_curricula-variable_student number, curricula-variable_course code name_course_course code name, course_department's code name_institute_department's code name.
The ontology converted from the relational schema is as follows:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns="http://www.project.com/d2o_owl#"
    xml:base="http://www.project.com/d2o_owl">
  <owl:Ontology rdf:about=""/>
  <owl:ObjectProperty rdf:ID="curricula-variable_student number_student_student number">
    <rdfs:range rdf:resource="#student"/>
    <rdfs:domain rdf:resource="#curricula-variable"/>
  </owl:ObjectProperty>
  <owl:DatatypeProperty rdf:ID="curricula-variable_student number">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#curricula-variable"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="course_course code name">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#course"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="curricula-variable_course code name">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#curricula-variable"/>
  </owl:DatatypeProperty>
  <owl:Class rdf:ID="curricula-variable"/>
  <owl:DatatypeProperty rdf:ID="student_student number">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#student"/>
  </owl:DatatypeProperty>
  <owl:Class rdf:ID="course"/>
  <owl:DatatypeProperty rdf:ID="course_department's code name">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#course"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="student_name">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#student"/>
  </owl:DatatypeProperty>
  <owl:Class rdf:ID="student">
  </owl:Class>
  <owl:ObjectProperty rdf:ID="curricula-variable_course code name_course_course code name">
    <rdfs:range rdf:resource="#course"/>
    <rdfs:domain rdf:resource="#curricula-variable"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="course_department's code name_institute_department's code name">
    <rdfs:range rdf:resource="#institute"/>
    <rdfs:domain rdf:resource="#course"/>
  </owl:ObjectProperty>
  <owl:DatatypeProperty rdf:ID="institute_department's code name">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#institute"/>
  </owl:DatatypeProperty>
  <owl:DatatypeProperty rdf:ID="institute_department's remarks">
    <rdfs:range rdf:resource="http://www.project.com/d2o_owl#character string type"/>
    <rdfs:domain rdf:resource="#institute"/>
  </owl:DatatypeProperty>
  <owl:Class rdf:ID="institute"/>
</rdf:RDF>
The ontology is converted into the PCDO graph data structure according to the conversion rules of step 1), as shown in Fig. 4. Classes in the ontology are converted into class nodes of the PCDO graph, drawn as ellipses in the figure; data type properties in the ontology are converted into property nodes of the PCDO graph, drawn as rectangles in the figure; each property node is connected to its class node by a hasProperty edge; and object properties in the ontology are converted into edges connecting two class nodes.
2) Build the dedicated word segmentation dictionary and the keyword index. Part of the keyword index is as follows:
Here, in the key-value pair for the key "student", the TableName attribute is "table" and the ColumnName attribute is "student", indicating that "student" corresponds to a relational table named "student";
in the key-value pair for the key "student number", the TableName field is "column" and the ColumnName field is "student number", indicating that "student number" corresponds to a column name, namely the column named "student number";
for the key "09131011", the ColumnName field is "student number" and the TableName field is "student", indicating that "09131011" is a concrete value in the "student number" column of the "student" table.
3) Word segmentation: for example, the user enters the query "find the institute to which the courses selected by the student whose student number is 09131011 belong". Segmentation yields five meaningful keywords: student number, 09131011, student, course, institute.
4) Build the search space: using the keyword index obtained in 2), the mapping between keywords and nodes of the PCDO graph is obtained as follows:
Table 4: Mapping from keywords to PCDO graph nodes
From the mapping between keywords and PCDO graph nodes, the search space is constructed as shown in Fig. 5 (the nodes drawn in bold are the nodes to which keywords are mapped).
5) Connected subgraph search: using the search algorithm, find in the PCDO graph one or more connected subgraphs that link all keywords together. The search result is shown in Fig. 6.
6) Generate the SQL statement according to the rules for converting a PCDO subgraph into SQL:
The select clause is filled with "*", giving the select clause: Select *;
The from clause is filled with the relational table names corresponding to the class nodes, giving the from clause: From student, curricula-variable, course, institute;
The where clause is filled with the foreign keys corresponding to the object property edges. The three object property edges in Fig. 6, student_student number_curricula-variable_student number, curricula-variable_course code name_course_course code name and course_department's code name_institute_department's code name, are converted in turn, giving the where clause: Where student_student number = curricula-variable_student number and curricula-variable_course code name = course_course code name and course_department's code name = institute_department's code name
Finally, the property nodes are processed: the mapping node of "09131011" is "student_student number", which after conversion gives student_student number = "09131011"; this condition is added to the where clause.
The SQL statement finally generated is:
Select *
From student, curricula-variable, course, institute
Where student_student number = curricula-variable_student number and curricula-variable_course code name = course_course code name and course_department's code name = institute_department's code name and student_student number = "09131011"
After the SQL statement is generated, the database is queried through the database query interface, and the result is finally returned to the user.
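To make the final query step concrete, the sketch below runs a statement of the same shape against an in-memory SQLite database. The English table and column names (student, enrollment, course, department) and all row data other than the student number 09131011 and the name Zhang San are hypothetical stand-ins for the tables of the example; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical English-named version of the example schema, with sample rows.
conn.executescript("""
    CREATE TABLE student (student_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE enrollment (student_id TEXT, course_code TEXT);
    CREATE TABLE course (course_code TEXT PRIMARY KEY, dept_code TEXT);
    CREATE TABLE department (dept_code TEXT PRIMARY KEY, remarks TEXT);
    INSERT INTO student VALUES ('09131011', 'Zhang San'), ('09131012', 'Li Si');
    INSERT INTO enrollment VALUES ('09131011', 'C01'), ('09131012', 'C02');
    INSERT INTO course VALUES ('C01', 'D01'), ('C02', 'D02');
    INSERT INTO department VALUES ('D01', 'Computer Science'), ('D02', 'Mathematics');
""")

# Same structure as the generated statement: select everything, list the
# tables of the class nodes, join on foreign keys, filter on the keyword value.
sql = """
    SELECT *
    FROM student, enrollment, course, department
    WHERE student.student_id = enrollment.student_id
      AND enrollment.course_code = course.course_code
      AND course.dept_code = department.dept_code
      AND student.student_id = ?
"""
rows = conn.execute(sql, ("09131011",)).fetchall()
for row in rows:
    print(row)
```

Each returned row concatenates the columns of all four tables, so the department of the selected course appears at the end of the row.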