CN109033314A - Real-time query method and system for large-scale knowledge graphs under limited memory - Google Patents
- Publication number
- CN109033314A, CN201810787762.9A
- Authority
- CN
- China
- Prior art keywords
- index
- vocabulary
- intersection
- disk
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the technical field of data processing and provides a real-time query method and system for large-scale knowledge graphs under limited memory. The method comprises: processing and analyzing an original knowledge graph to obtain an inverted-file hash list; constructing a multi-level structure index based on the original knowledge graph; and parsing a query statement to obtain the target vocabulary, then looking up the triples corresponding to the target vocabulary according to the inverted-file hash list and the multi-level structure index to generate a result subgraph. The invention greatly improves knowledge graph query capability on a single machine and, even when memory is extremely limited, can provide a result set that meets both the user's time requirement and the user's accuracy requirement.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a real-time query method and system for large-scale knowledge graphs under limited memory.
Background
Since its birth, the World Wide Web has grown into a huge network whose nodes are individual web pages, linked to one another by hyperlinks. Building on this simple and open technology, modern search engines can find the pages relevant to a question in a vast cyberspace. However, with the development of the mobile Internet and the limited screen space of mobile devices, users expect a search engine to return accurate results directly, rather than having to look through the search results one by one. Merely storing web pages cannot satisfy this accuracy requirement.
To address this need, XML (Extensible Markup Language), RDF (Resource Description Framework), OWL (Web Ontology Language) and others were proposed for describing information on the network. XML adds tags to documents and data content to facilitate data exchange; RDF describes the semantic relations of network resources in the form of (subject, predicate, object) triples; OWL makes it possible to describe ontology concepts and has strong expressive power and interpretability. Building on these three ways of describing Internet information, the concept of the knowledge graph has been proposed in recent years. Entities and entity attributes in web pages are identified and stored in the knowledge graph; when a user initiates a search, the user's intent can be understood accurately from the known nodes of the knowledge graph and an accurate answer can be given.
At present there are three main approaches to storing and querying knowledge graphs in RDF triple form: a single huge triple table, multiple property-clustered tables, and multiple vertically partitioned tables. The single-table approach stores all triples in one huge three-column table; the main systems of this kind are RDF-3x and Hexastore. The property-clustered approach has two main types of table: tuple-attribute clustering tables and tables of objects with similar attributes. The vertically partitioned approach builds a separate two-column table for each attribute, storing subject and object. RDF storage systems based on these three approaches include Jena, Yars2, Sesame 2.0, SW-store, RDF-3x, x-RDF-3x, Hexastore and gStore.
Existing RDF storage and query systems such as Jena, Yars2 and Sesame 2.0 perform poorly on larger RDF data sets. SW-store, RDF-3x, x-RDF-3x and Hexastore handle larger RDF data sets by using a mapping dictionary, but they support only fixed SPARQL queries. Moreover, most current methods cannot quickly handle online updates of RDF data. For example, in Jena, which is based on multiple property-clustered tables, updating the attribute information of data in its data set requires re-clustering and rebuilding the property tables. In SW-store, an update has to rewrite many columns, so the update cost is also high; even with the "overflow table + batch write" scheme, it is hard to use in applications with strict real-time requirements. In addition, much RDF data tends to be non-strictly structured; for example, data of the same type do not all have the same attributes. This non-strict structure favors data integration but hampers many classic relational methods for accelerating data query processing. Although gStore solves part of these problems with its T-index method, the data set size that a single machine can support with the T-index structure is limited: it can only handle data management tasks for RDF knowledge graphs on the scale of one billion triples.
However, as human knowledge keeps growing, knowledge graphs grow accordingly, and their size already far exceeds one billion triples. The computing power of ordinary devices lags far behind this growth, so query processing on such graphs is becoming increasingly difficult for ordinary users. For example, Freebase is about 380 GB, while an ordinary user currently has about 8 GB of memory; an average PC user who queries it directly will generate a large number of I/O operations and waste a great deal of time. Most ordinary users, however, do not need highly precise results and only need the query program to provide an approximate answer. With the rise of approximate query processing techniques, more and more research results show that in most cases an approximate value satisfies the user's needs, substantially saves computing time and lowers the requirements on the computing device.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of one or more of the above defects of the prior art, a real-time query method and system for large-scale knowledge graphs under limited memory.
To solve the above technical problem, the present invention provides a real-time query method for large-scale knowledge graphs under limited memory, comprising:
processing and analyzing the original knowledge graph to obtain an inverted-file hash list;
constructing a multi-level structure index based on the original knowledge graph;
parsing a query statement to obtain the target vocabulary, and looking up the triples corresponding to the target vocabulary according to the inverted-file hash list and the multi-level structure index to generate a result subgraph.
Optionally, processing and analyzing the original knowledge graph to obtain the inverted-file hash list comprises:
extracting tuple information in offset-first form (an offset followed by words) from the original knowledge graph;
converting the extracted tuple information into word-first form (a word followed by offsets);
sorting the tuple information in word-first form by word to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted-file hash list.
Optionally, constructing the multi-level structure index based on the original knowledge graph comprises:
performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a simplified knowledge graph data classification result;
extracting the bottom-layer structure nodes based on the simplified knowledge graph data classification result;
further extracting from the simplified knowledge graph data classification result to build the upper-level structure index.
Optionally, the step of parsing the query statement to obtain the target vocabulary and looking up the triples corresponding to the target vocabulary according to the inverted-file hash list and the multi-level structure index to generate the result subgraph comprises:
receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the vocabulary set to be queried;
for each word in the vocabulary set, finding the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted-file hash list, and intersecting them to obtain the disk index intersection S;
judging whether the length of the disk index intersection S is less than the lower limit min:
if so, for each index position in S, adding the index and its location information to the result subgraph as a node;
otherwise, judging whether the length of S is greater than the upper limit max: if so, setting the sample size to max; otherwise setting the sample size to the product of the length of S and the sampling ratio δ and, if this sample size is less than the lower limit min, setting the sample size to min. After the sample size is determined, semi-random sampling is performed on S, which requires the information of the auxiliary sampling nodes superNode obtained in step S102. For each index obtained by sampling, the index and its location information in the multi-level structure index are added to the result subgraph.
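The sample-size rule described above can be restated compactly. The notation below is an assumption made for readability: n_min and n_max stand for the limits min and max, |S| for the length of the disk index intersection, and k for the number of indexes added to the result subgraph. The text does not say how the product δ|S| is rounded, so that is left open.

```latex
k =
\begin{cases}
|S|, & |S| < n_{\min} \quad \text{(no sampling; every index in } S \text{ is returned)} \\
n_{\max}, & |S| > n_{\max} \\
\max\bigl(n_{\min},\ \delta\,|S|\bigr), & n_{\min} \le |S| \le n_{\max}
\end{cases}
```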
The present invention also provides a real-time query system for large-scale knowledge graphs under limited memory, comprising: a hash list establishing unit, a multi-level index construction unit and a search unit;
the hash list establishing unit is configured to process and analyze the original knowledge graph to obtain an inverted-file hash list;
the multi-level index construction unit is configured to construct a multi-level structure index based on the original knowledge graph;
the query unit is configured to parse a query statement to obtain the target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted-file hash list and the multi-level structure index to generate a result subgraph.
Optionally, the hash list establishing unit is configured to perform the following steps:
extracting tuple information in offset-first form (an offset followed by words) from the original knowledge graph;
converting the extracted tuple information into word-first form (a word followed by offsets);
sorting the tuple information in word-first form by word to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted-file hash list.
Optionally, the multi-level index construction unit is configured to perform the following steps:
performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a simplified knowledge graph data classification result;
extracting the bottom-layer structure nodes based on the simplified knowledge graph data classification result;
further extracting from the simplified knowledge graph data classification result to build the upper-level structure index.
Optionally, the query unit is configured to perform the following steps:
receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the vocabulary set to be queried;
for each word in the vocabulary set, finding the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted-file hash list, and intersecting them to obtain the disk index intersection S;
judging whether the length of the disk index intersection S is less than the lower limit min:
if so, for each index position in S, adding the index and its location information to the result subgraph as a node;
otherwise, judging whether the length of S is greater than the upper limit max: if so, setting the sample size to max; otherwise setting the sample size to the product of the length of S and the sampling ratio δ and, if this sample size is less than the lower limit min, setting the sample size to min. After the sample size is determined, semi-random sampling is performed on S, which requires the information of the auxiliary sampling nodes superNode obtained in step S102. For each index obtained by sampling, the index and its location information in the multi-level structure index are added to the result subgraph.
The real-time query method and system for large-scale knowledge graphs under limited memory provided by the embodiments of the present invention have at least the following beneficial effects:
1. The present invention takes into account the relationship between the user's needs and the capability of the user's equipment. Through the inverted index and the structure index it improves the user's single-machine data processing capability and can find the user's result set within a very short time.
2. The present invention further integrates approximate query processing techniques: using ideas from the approximate query processing field, it extracts the subgraph structure after obtaining the large result set specified by the user. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns, according to the user's wishes, a result that the user can understand quickly.
Brief description of the drawings
Fig. 1 is a flow chart of the real-time query method for large-scale knowledge graphs under limited memory provided by Embodiment one of the present invention;
Fig. 2 is a schematic diagram of the principle of the present invention;
Figs. 3a, 3b and 3c are, respectively, a schematic diagram of the bottom-layer structure extracted by the present invention, a schematic diagram of the bottom-layer nodes and their relations, and a schematic diagram of the upper-layer nodes and their relations;
Fig. 4 is a schematic diagram of the real-time query system for large-scale knowledge graphs under limited memory provided by Embodiment five of the present invention;
In the figures: 401: hash list establishing unit; 402: multi-level index construction unit; 403: search unit.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment one
Fig. 1 is a flow chart of the real-time query method for large-scale knowledge graphs under limited memory provided by Embodiment one of the present invention; Fig. 2 is a schematic diagram of the principle of the present invention. As shown in the figures, the method provided by this embodiment may comprise the following steps:
Step S101: perform the inverted-file hash list establishing step, i.e. process and analyze the original knowledge graph to obtain the inverted-file hash list. Because the word repetition rate is high in ultra-large non-numeric knowledge graphs, an inverted file makes it possible to locate triples quickly by word; the inverted file is hashed to speed up word lookup and reduce file I/O operations.
Step S102: perform the multi-level structure index construction step, i.e. construct the multi-level structure index based on the original knowledge graph.
Step S103: at query time, execute the query statement according to the inverted-file hash list, the multi-level structure index and the original knowledge graph.
The key point of the present invention is to meet the average PC user's requirements on query time and precision using very little memory, that is, to query knowledge graphs in real time within limited memory space, avoiding the situation where a small user memory leads to a large number of I/O operations, low CPU utilization, excessive time spent reading files and extremely long waits for the user.
For current RDF-based knowledge graphs, the present invention adopts structure extraction and processes the data vertices layer by layer to obtain a simplified vertex structure. A hash structure is added to the inverted file design so that tuple lookup takes O(1) time. Combining the two structures, the user's result set can be found in O(1) time. By integrating approximate query processing techniques and using ideas from that field, the subgraph structure is extracted after the large result set specified by the user has been obtained. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns, according to the user's wishes, a result that the user can understand quickly.
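The idea of resolving a word to disk offsets through a hash structure and then reading only the matching triples can be illustrated with a minimal sketch. The tab-separated one-triple-per-line layout, the in-memory `io.BytesIO` stand-in for the disk file, and the names `build_offsets` and `read_triple` are assumptions made for this illustration and are not taken from the patent.

```python
import io

def build_offsets(triples):
    """Serialize triples one per line; return the bytes and each line's byte offset."""
    buf = io.BytesIO()
    offsets = []
    for s, p, o in triples:
        offsets.append(buf.tell())
        buf.write(f"{s}\t{p}\t{o}\n".encode("utf-8"))
    return buf.getvalue(), offsets

def read_triple(stream, offset):
    """Seek to a byte offset and read exactly one triple line."""
    stream.seek(offset)
    line = stream.readline().decode("utf-8").rstrip("\n")
    return tuple(line.split("\t"))

data, offsets = build_offsets([
    ("Alice", "won", "Award"),
    ("Bob", "won", "Award"),
])
# Hypothetical hashed inverted list: word -> byte offsets of triples containing it.
inverted = {"Alice": [offsets[0]], "Bob": [offsets[1]], "Award": list(offsets)}
stream = io.BytesIO(data)  # stands in for the knowledge graph file on disk
hits = [read_triple(stream, off) for off in inverted["Award"]]
```

The dictionary lookup takes expected constant time, and only the lines at the returned offsets are read, so memory use is bounded by the size of the result rather than by the size of the graph.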
The present invention overcomes the difficulty of extracting knowledge graph structure from non-strictly structured knowledge graphs and the time complexity of operating on large data sets, and can guarantee short offline data processing time and short online search time.
Embodiment two
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by Embodiment one, the process in step S101 of processing and analyzing the original knowledge graph to obtain the inverted-file hash list may be implemented as follows:
Step 1: extract tuple information in offset-first form from the original knowledge graph. The offset-first form is (offset, word, ..., word); that is, step 1 extracts tuple information of the form (offset, word, ..., word) from the original knowledge graph.
Step 2: convert the extracted tuple information into word-first form. The word-first form is (word, offset, ..., offset); that is, step 2 converts tuple information of the form (offset, word, ..., word) into the form (word, offset, ..., offset).
Step 3: sort the word-first tuple information (word, offset, ..., offset) by word to obtain the inverted file.
Step 3 comprises:
Step 3.1: merge the offset information of words repeated among adjacent 100,000 tuples;
Step 3.2: sort in memory in units of 100,000 tuples;
Step 3.3: merge-sort the files obtained above;
Step 3.4: obtain the sorted (word, offset, ..., offset) tuples.
Step 4: hash the resulting inverted file to obtain the inverted-file hash list, so as to improve subsequent lookup efficiency.
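Steps 1 to 4 can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the input tuples are already extracted, a Python `dict` plays the role of the hash structure, and the function names `invert` and `to_hash_list` are hypothetical. The chunking into units of 100,000 tuples is omitted here.

```python
from collections import defaultdict

def invert(offset_tuples):
    """Turn (offset, word, ..., word) tuples into sorted (word, offset, ..., offset) tuples."""
    postings = defaultdict(list)
    for offset, *words in offset_tuples:
        for word in set(words):          # a word repeated within one tuple counts once
            postings[word].append(offset)
    return [(word, *sorted(postings[word])) for word in sorted(postings)]

def to_hash_list(inverted_tuples):
    """Hash the sorted inverted file so a word resolves to its offsets in expected O(1)."""
    return {word: list(offsets) for word, *offsets in inverted_tuples}

extracted = [
    (0, "Alice", "won", "Award"),    # offset-first form from step 1
    (16, "Bob", "won", "Award"),
]
inverted_file = invert(extracted)    # steps 2 and 3
hash_list = to_hash_list(inverted_file)  # step 4
```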
The algorithm for building the inverted-file hash list is shown in Algorithm 1 below, where lines 1-11 correspond to steps 1 to 3 above. Lines 1-7 extract (v, p, ..., p) tuples from the original knowledge graph, i.e. the large-scale knowledge graph G; the number of tuples extracted into each inverted file does not exceed the preset quantity Max. When "list.addAndSort(extract(triple))" is executed to add an extracted tuple to the list, the (v, p, ..., p) tuple form must be converted into the (p, v, ..., v) form and sorted by the words in it. This yields a sorted result set within each range of Max tuples, which is written to a file to give an inverted file. Lines 8-11 obtain the number of sorted inverted files produced in the previous step. Lines 12-18 select a hash function and merge all the inverted files to obtain the inverted-file hash list fileList.
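The bounded-run sort followed by a merge described for Algorithm 1 resembles an external merge sort. The sketch below is only an approximation of that idea: sorted runs are kept as in-memory lists rather than written to files, the input is assumed to be already flattened into (word, offset) pairs, and `sorted_runs`, `merge_runs` and `max_tuples` are names chosen for this illustration (with `max_tuples` playing the role of Max).

```python
import heapq
from itertools import groupby, islice

def sorted_runs(pairs, max_tuples):
    """Cut the (word, offset) stream into runs of at most max_tuples and sort each run."""
    it = iter(pairs)
    while True:
        run = sorted(islice(it, max_tuples))
        if not run:
            return
        yield run

def merge_runs(runs):
    """k-way merge the sorted runs and group the offsets of each word."""
    merged = heapq.merge(*runs)
    return {word: [off for _, off in group]
            for word, group in groupby(merged, key=lambda pair: pair[0])}

pairs = [("won", 0), ("Award", 0), ("Alice", 0),
         ("won", 16), ("Bob", 16), ("Award", 16)]
runs = list(sorted_runs(pairs, max_tuples=4))  # each run fits the memory budget
file_list = merge_runs(runs)                   # word -> sorted offsets
```

Only one run needs to be held in memory while sorting, and `heapq.merge` consumes the runs lazily, which is what keeps the peak memory of the merge phase small when the runs are real files.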
Embodiment three
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by Embodiment two, the process in step S102 of constructing the multi-level structure index based on the original knowledge graph may be implemented as follows:
The present invention performs preliminary structure discovery on the original knowledge graph by separating out its ontology layer, and then builds the multi-level index structure, which comprises three parts: in-depth structural analysis of the knowledge graph, establishment of the knowledge graph storage node index, and establishment of the overall structure index.
(1) In-depth structural analysis of the knowledge graph: perform data classification, cleaning and simplified data representation on the preliminary structure discovery result to obtain the simplified knowledge graph data classification result. The simplified data representation is a transformation of the original RDF knowledge graph: the original knowledge graph contains much redundant information, and this redundancy is removed here.
(2) Establishment of the knowledge graph storage node index: extract the disk positions at which vertices first appear in the original knowledge graph (in principle, RDF triples with the same subject are stored adjacently on disk), and use quicksort to sort the tuples of these disk positions according to the ordering between nodes to obtain the knowledge graph storage node index; a node here is a disk position.
(3) Establishment of the overall structure index: further extract disk positions from the simplified knowledge graph data classification result to build the upper-level structure index, then combine the knowledge graph storage node index with the upper-level structure index to obtain the multi-level structure index.
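Part (2) can be illustrated with a small sketch that records the first disk position of each vertex and sorts the result. Several things here are assumptions: the scan is taken to yield (subject, offset) pairs, nodes are ordered by their names, and the function name `node_index` is hypothetical. Python's built-in `sorted` uses Timsort rather than quicksort, which changes nothing about the result.

```python
def node_index(subject_offsets):
    """Map each subject to the disk offset where it first appears, sorted by node."""
    first_seen = {}
    for subject, offset in subject_offsets:
        first_seen.setdefault(subject, offset)  # keep only the first occurrence
    return sorted(first_seen.items())

# Triples with the same subject are assumed to be stored adjacently on disk.
scan = [("Bob", 0), ("Bob", 20), ("Alice", 40), ("Alice", 61)]
index = node_index(scan)
```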
A knowledge graph developed from an ontology language has basic ontology concepts, including a set of real-world objects and a set of relations between real-world objects. Such a knowledge graph can easily be divided into an ontology (concept) layer and a fact (object) layer. Clearly, in a large-scale knowledge graph the ontology layer has many instances in the fact layer. Using this property, the present invention can readily extract the ontology layer of the knowledge graph with data mining techniques and then separate the ontology layer from the fact layer, completing construction of the multi-level structure index. The present invention can use a bottom-up method to layer the knowledge graph automatically. Of course, a critical step must be performed before layering: the knowledge graph cleaning operation, which uses certain encoding rules to reduce the redundancy in the knowledge graph. At the same time, the present invention extracts the leaf nodes of the knowledge graph and their disk position information as the bottom-layer structure of the multi-level index. These bottom-layer nodes are then used to extract the relation information between bottom-layer nodes and the upper-layer node information, further separating the knowledge graph. For example, the bottom-layer structure obtained by the present invention is shown in Fig. 3a; the next iteration yields the node relation information of this level (as shown in Fig. 3b) and the node information of the level above together with the relation information between the upper and lower levels (as shown in Fig. 3c).
In one embodiment of the present invention, the construction of the multi-level structure index may specifically comprise the following steps:
Step 1: extract the bottom-layer structure nodes of the large-scale knowledge graph G. Specifically, traverse each triple in G and judge whether the object of the triple is a leaf node; if so, add the subject of the triple and its location information to set N0, and add the subject and location information to the multi-level structure index as a node; otherwise, add the subject of the triple and its location information to set N1.
Step 2: construct the upper-layer node index of the current level's nodes and the association information of the current level's nodes, specifically:
While set N1 is detected to be non-empty, let S0 = N0 and S1 = N1, and reset N0 and N1 to empty sets; then traverse each (triple, position) in S1 and, for the current (triple, position):
if the object of the triple is in S0 and its subject is not in S0, extract (subject of the triple, position) and add it to N0, and extract (subject of the triple, position, position of the triple's object in S0) and add it to the multi-level structure index;
if the object of the triple is in S0 and its subject is also in S0, extract (position of the triple's subject in S0, position of the triple's object in S0) and add it to the multi-level structure index;
otherwise, add the (triple, position) to N1.
Step 3: extract the higher-level nodes (high-level node information) in the multi-level structure index.
Algorithm 2 below illustrates in detail how the multi-level structure index is extracted by an automated knowledge graph level mining method. Lines 1-7 of the algorithm extract the bottom-layer structure nodes of the large-scale knowledge graph G. In the loop that follows, the algorithm builds the structure index incrementally. Lines 11-13 construct the upper-layer node index of the current node level and the index relations between the two layers. Lines 14 and 15 construct the association information of the current level's nodes. Note that, to establish the association between the level index and the inverted-file hash list, both store nodes as key-value pairs: the "key" is the node's position on disk, and the "value" is the information needed by the various algorithms. The process above is a loop: N0 denotes the lower-layer nodes extracted and N1 the upper-layer nodes extracted, and in the second iteration they are assigned to S0 and S1. Finally, in line 18, the higher-level nodes (high-level node information) superNode are extracted from the resulting structure index to serve the subsequent search algorithm.
Embodiment four
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by Embodiment three, the process in step S103 of parsing the query statement to obtain the target vocabulary and looking up the triples corresponding to the target vocabulary according to the inverted-file hash list and the multi-level structure index to generate the result subgraph may be implemented as follows:
Step 1: receive the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Step 2: parse the query statement Q to obtain the vocabulary set to be queried;
Step 3: for each word in the vocabulary set, find the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted-file hash list fileList, and intersect them to obtain the disk index intersection S, where n is the number of words in the vocabulary set.
Step 4: judge whether the length of the disk index intersection S is less than the lower limit min:
if so, for each index position in S, add the index and its location information to the result subgraph as a node;
otherwise, judge whether the length of S is greater than the upper limit max: if so, set the sample size to max; otherwise set the sample size to the product of the length of S and the sampling ratio δ and, if this sample size is less than the lower limit min, set the sample size to min. After the sample size is determined, perform semi-random sampling on S, which requires the information of the auxiliary sampling nodes superNode obtained in step S102. For each index obtained by sampling, add the index and its location information in the multi-level structure index to the result subgraph.
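Steps 3 and 4 can be sketched as follows. The sketch makes several assumptions that go beyond the text: the product of the length and δ is rounded down, sampling is uniform random rather than the semi-random scheme guided by superNode (which is not specified in enough detail to reproduce), the per-word lookups run sequentially rather than in parallel, and the names `sample_size`, `query`, `min_n` and `max_n` are hypothetical.

```python
import random

def sample_size(length, min_n, max_n, delta):
    """Sample size for an intersection whose length is at least min_n."""
    if length > max_n:
        return max_n
    return max(min_n, int(length * delta))   # rounding down is an assumption

def query(words, hash_list, min_n, max_n, delta, rng):
    """Intersect the per-word disk index sets, then return all or a sample of them."""
    sets = [set(hash_list.get(word, ())) for word in words]
    s = sorted(set.intersection(*sets)) if sets else []
    if len(s) < min_n:
        return s                             # small result: keep every index
    return sorted(rng.sample(s, sample_size(len(s), min_n, max_n, delta)))

hash_list = {
    "a": list(range(100)),
    "b": list(range(0, 100, 2)),
    "c": [4, 8],
}
big = query(["a", "b"], hash_list, 5, 20, 0.5, random.Random(0))
small = query(["a", "b", "c"], hash_list, 5, 20, 0.5, random.Random(0))
```

Here the intersection of "a" and "b" has 50 entries, which exceeds the upper limit of 20, so 20 offsets are sampled; adding "c" shrinks the intersection below the lower limit, so both remaining offsets are returned unsampled.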
The present invention uses the outputs of steps S101 and S102 to serve the search. The inverted-file hash list is used to find the positions of the tuples that the user's query is looking for, and the multi-level index is used to find their adjacent vertex structure, achieving fast structural queries under limited memory. But using these results alone is far from enough; many problems remain in the search process, for example whether the query input by the user is a precise query, and how to handle the case where the result set obtained from the inverted-file hash list is huge. The present invention places no restrictions on the user's query, and it is exactly this unrestricted querying that makes imprecise queries possible. Imprecise queries cause the result set to be widely distributed, and in that case, even with the inverted-file hash list and the multi-level index, the user's memory may be unable to handle the query result. For a large-scale knowledge graph, suppose the user's query is highly precise, for example a query for an instance such as "Award winner" under the upper-layer node word 'Award'; then the present invention is sure to give an accurate and complete result efficiently. But what if the user wants to look at the word 'Award' itself? In this case, even if the result we return (the result that the user's memory has to process) contains no neighbor information, its size is still several times or even dozens of times the memory that the user makes available to the search algorithm, not to mention the describe statement in SPARQL. Moreover, this situation occurs very frequently in user queries: when the user knows little about the content being queried, the available means is to query the knowledge graph from the macroscopic to the microscopic. In short, in most cases the user cannot give a precise query statement. How to solve the problems caused by such imprecise query statements and achieve efficient querying in this situation therefore becomes the main problem that the query method of the present invention has to overcome.
To solve the problems discussed above and to balance query accuracy against query time, the present invention incorporates ideas from approximate query processing into the search method, that is, a semi-accurate result may be returned to the user. From a certain angle, the search algorithm of the present invention is very close to online querying in approximate query processing. However, in an approximate query processing system the user's query is directed at the entire data set, so the user must specify a sampling ratio. For every query, sampling is pushed down regardless of circumstances, so even a precise query statement is in fact executed through a sampler in an approximate query processing system. As the sampler is pushed down further, guaranteeing precision becomes increasingly complex and difficult.
Because of the inverted file hash list structure, not every query in the present invention requires a sampling operation, which guarantees absolute accuracy for a portion of the queries. When the user issues a fuzzy query, the user specifies a result graph size interval ([Min, Max]) and an expected sampling ratio (Eδ), and a semi-random sampler provides the precision guarantee. When the size of the result set obtained from the inverted file hash list is less than or equal to Min, no sampling is needed: the triple positions found by the query are assembled directly into a subgraph structure and returned to the user. When the size exceeds Max, the result set is semi-randomly sampled at a sampling ratio of Max ÷ length(results). When the result set size lies within the interval that the user specified, the result set is first semi-randomly sampled at the user's expected sampling ratio Eδ; if the sampled result size is less than Min, sampling is performed at an actual ratio of Min ÷ length(results); if the sampled result lies within the interval, the actual sampling ratio Aδ equals the user's expected sampling ratio Eδ. The result set size range [Min, Max] specified by the user is therefore absolute, and the algorithm strictly adheres to it, whereas the expected sampling ratio Eδ may change according to the actual situation, and the algorithm finally returns the actual sampling ratio Aδ. It is also worth noting that the precision guarantee is a very important metric in approximate query processing. The present invention guarantees result precision with a semi-random sampling function. "Semi-random" means that the superNode information obtained earlier is used to retain upper-layer nodes during sampling.
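One way to read "semi-random" is sketched below: entries that belong to the superNode set are retained first, and the rest of the sample is drawn uniformly at random. The source does not give the exact procedure, so the function name `semi_random_sample`, the representation of superNodes as a set of positions, and the fill-up strategy are assumptions made for illustration only.

```python
import random


def semi_random_sample(s, super_nodes, k, seed=None):
    """Draw k entries from s, keeping superNode entries first.

    s           -- list of result positions (the disk index intersection S)
    super_nodes -- set of positions marked as auxiliary sampling nodes
    k           -- sample size, assumed to satisfy k <= len(s)
    Returns the sample and the actual sampling ratio (A-delta).
    """
    rng = random.Random(seed)
    # Deterministic part: upper-layer (superNode) entries are retained.
    kept = [x for x in s if x in super_nodes]
    if len(kept) > k:
        # More superNodes than the budget allows: choose among them.
        kept = rng.sample(kept, k)
    # Random part: fill the remaining budget from ordinary entries.
    rest = [x for x in s if x not in super_nodes]
    kept += rng.sample(rest, min(k - len(kept), len(rest)))
    actual_ratio = len(kept) / len(s) if s else 0.0
    return kept, actual_ratio
```

Returning the actual ratio alongside the sample mirrors the statement that the algorithm reports Aδ, which may differ from the Eδ that the user asked for.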
The pseudocode for step S103 is shown in Algorithm 3 below. In line 1, the query statement entered by the user is parsed to find the target vocabulary. Then, in lines 2-6, the inverted file hash list obtained by Algorithm 2 and the target vocabulary from the previous step are used to locate all triples involved in the user's query and to determine how the results are distributed. Because the user does not know whether the query statement is precise and cannot accurately predict the size of the query result, the present invention requires the user to provide, before each query, the result set size range [Min, Max] and the expected sampling ratio Eδ, so that the result set fits in the user's memory and query efficiency is maintained. Accordingly, line 7 sets the result subset size range to [Min, Max]. In line 9, whether to sample and how to sample are decided from the distribution positions and the size of the result set obtained through the inverted index. Lines 10-11 and 20-21 then construct the subgraph structure. It is worth explaining that G* is a self-adjusting subgraph structure: each time a new node is added, G* automatically arranges the result set by level according to the index. In addition, the multilevel index is stored as key-value pairs, which means that extracting an entry from the multilevel structure index takes O(1) time, so it is clearly feasible to build the subgraph structure G* in O(1) time per node.
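The constant-time claim rests on the multilevel index being a key-value store. The sketch below uses a Python dict as a stand-in for that store and groups nodes by level as they are added; the class name `ResultSubgraph`, the mapping from position to level number, and the per-level lists are hypothetical and chosen only to illustrate the idea.

```python
class ResultSubgraph:
    """Toy stand-in for G*: nodes are filed under their level on insertion."""

    def __init__(self, multilevel_index):
        # multilevel_index maps a tuple position to its level in the
        # hierarchy (assumed layout; the source only says "key-value").
        self.index = multilevel_index
        self.levels = {}

    def add(self, position):
        # Dict lookup and list append are O(1) on average, so each node
        # is placed in the subgraph in constant time.
        level = self.index[position]
        self.levels.setdefault(level, []).append(position)


# Usage: three result positions spread over two levels.
index = {10: 0, 20: 1, 30: 1}
g = ResultSubgraph(index)
for pos in (10, 20, 30):
    g.add(pos)
```

After the loop, `g.levels` holds position 10 under level 0 and positions 20 and 30 under level 1, without any re-sorting of the result set.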
Embodiment five
As shown in Fig. 4, the real-time query system for a large-scale knowledge graph under limited memory provided by Embodiment 5 of the present invention may include a hash list establishing unit 401, a multilevel index construction unit 402, and a query unit 403.
The hash list establishing unit 401 is configured to process and analyze the original knowledge graph to obtain the inverted file hash list. The operations performed by the hash list establishing unit 401 are the same as step S101 of the foregoing method.
The multilevel index construction unit 402 is configured to build the multilevel structure index based on the original knowledge graph. The operations performed by the multilevel index construction unit 402 are the same as step S102 of the foregoing method.
The query unit 403 is configured to parse the query statement to obtain the target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate the result subgraph. The operations performed by the query unit 403 are the same as step S103 of the foregoing method.
Preferably, the hash list establishing unit 401 is configured to perform the following steps:
Extract tuple information in offset-then-vocabulary form from the original knowledge graph;
Convert the extracted tuple information into vocabulary-then-offset form;
Sort the tuple information in vocabulary-then-offset form by vocabulary to obtain an inverted file;
Hash the resulting inverted file to obtain the inverted file hash list.
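The four steps above can be illustrated with a small in-memory sketch. It assumes each extracted tuple is a pair of a tuple offset and a vocabulary string, and it uses a Python dict as the hash structure; the function name `build_inverted_hash_list` and this data layout are assumptions, since the source does not specify the on-disk format.

```python
from collections import defaultdict


def build_inverted_hash_list(offset_word_tuples):
    """Build a word -> sorted list of offsets mapping.

    offset_word_tuples -- iterable of (offset, word) pairs, where offset
                          is the position of the tuple that contains word
    """
    # Step 2: swap each pair into (word, offset) form.
    word_offset = [(word, offset) for offset, word in offset_word_tuples]
    # Step 3: sorting by word (then offset) yields the inverted file.
    word_offset.sort()
    # Step 4: hashing on the word gives direct access to its offsets.
    inverted = defaultdict(list)
    for word, offset in word_offset:
        inverted[word].append(offset)
    return dict(inverted)


# Usage: two tuples at offsets 0 and 40, three extracted occurrences.
D = build_inverted_hash_list([(0, "Award"), (0, "winner"), (40, "Award")])
```

Here `D["Award"]` lists both offsets and `D["winner"]` lists only the first, so a lookup for either word goes straight to the tuple positions.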
Preferably, the multilevel index construction unit 402 is configured to perform the following steps:
Perform data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a knowledge graph data classification and simplification result;
Extract the bottom-layer structure nodes based on the knowledge graph data classification and simplification result;
Perform further extraction on the knowledge graph data classification and simplification result to build the upper-layer structure index.
Preferably, the query unit 403 is configured to perform the following steps:
Receive the query statement Q entered by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Parse the query statement Q to obtain the vocabulary set to be queried;
For each vocabulary in the vocabulary set, look up the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list D, and take their intersection to obtain the disk index intersection S;
Determine whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
If so, for every index position in the disk index intersection S, the index and its location information are added to the result subgraph as a node;
Otherwise, determine whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples. If so, set the sample size to max; otherwise set the sample size to the product of the length of the disk index intersection S and the sampling ratio δ, and if that sample size is less than the lower limit min on the number of returned tuples, set the sample size to min. After the sample size is determined, perform semi-random sampling on the disk index intersection S, using the auxiliary sampling node (superNode) information obtained in step S102. Each sampled index, together with its location information in the multilevel structure index, is added to the result subgraph.
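The parallel lookup and intersection at the start of these steps might look like the sketch below. It treats the inverted file hash list D as a dict from vocabulary to a list of tuple positions and uses a thread pool for the parallel lookups; the function name `disk_index_intersection`, the dict representation, and the choice of threads are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def disk_index_intersection(words, inverted_hash_list):
    """Return the positions shared by every word in the vocabulary set."""
    if not words:
        return set()

    def lookup(word):
        # A word missing from the hash list has an empty index set.
        return set(inverted_hash_list.get(word, ()))

    # Look up S1..Sn in parallel, one task per vocabulary entry.
    with ThreadPoolExecutor() as pool:
        index_sets = list(pool.map(lookup, words))
    # S is the intersection of all per-word index sets.
    return set.intersection(*index_sets)
```

The length of the returned set is the quantity that is then compared against min and max to decide whether sampling is needed.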
It should further be noted that the real-time query system for a large-scale knowledge graph under limited memory provided by the embodiments of the present invention may be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, as shown in Fig. 4, the system in a logical sense is formed when the CPU of the device on which it resides reads the corresponding computer program instructions from non-volatile memory into main memory and runs them.
In summary, compared with the prior art, the present invention greatly improves single-machine knowledge graph query capability and can, when memory is extremely limited, provide a result set that meets both the user's time requirement and the user's accuracy requirement. Existing knowledge graph query systems are built around complete query processing and ignore the needs of individual users who query knowledge graphs in the current era of knowledge explosion; the results they find consume a large amount of memory and also exceed an ordinary user's ability to comprehend the data.
The present invention takes into account the relationship between the user's needs and the capability of the user's equipment: the inverted index and the structure index improve the user's single-machine data processing capability, and approximate query processing together with an automated structural understanding of the large-scale knowledge graph provides the user with a result set of suitable size.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified or equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A real-time query method for a large-scale knowledge graph under limited memory, characterized by comprising:
processing and analyzing an original knowledge graph to obtain an inverted file hash list;
building a multilevel structure index based on the original knowledge graph;
parsing a query statement to obtain target vocabulary, and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph.
2. The method according to claim 1, characterized in that the processing and analyzing of the original knowledge graph to obtain the inverted file hash list comprises:
extracting tuple information in offset-then-vocabulary form from the original knowledge graph;
converting the extracted tuple information into vocabulary-then-offset form;
sorting the tuple information in vocabulary-then-offset form by vocabulary to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted file hash list.
3. The method according to claim 1, characterized in that the building of the multilevel structure index based on the original knowledge graph comprises:
performing data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a knowledge graph data classification and simplification result;
extracting the bottom-layer structure nodes based on the knowledge graph data classification and simplification result;
performing further extraction on the knowledge graph data classification and simplification result to build the upper-layer structure index.
4. The method according to any one of claims 1 to 3, characterized in that the step of parsing the query statement to obtain the target vocabulary and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate the result subgraph comprises:
receiving the query statement Q entered by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the vocabulary set to be queried;
for each vocabulary in the vocabulary set, looking up the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list, and taking their intersection to obtain the disk index intersection S;
determining whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
if so, for every index position in the disk index intersection S, adding the index and its location information to the result subgraph as a node;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples; if so, setting the sample size to max; otherwise setting the sample size to the product of the length of the disk index intersection S and the sampling ratio δ, and if that sample size is less than the lower limit min on the number of returned tuples, setting the sample size to min; after the sample size is determined, performing semi-random sampling on the disk index intersection S, and adding each sampled index, together with its location information in the multilevel structure index, to the result subgraph.
5. A real-time query system for a large-scale knowledge graph under limited memory, characterized by comprising: a hash list establishing unit, a multilevel index construction unit, and a query unit;
the hash list establishing unit is configured to process and analyze an original knowledge graph to obtain an inverted file hash list;
the multilevel index construction unit is configured to build a multilevel structure index based on the original knowledge graph;
the query unit is configured to parse a query statement to obtain target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph.
6. The system according to claim 5, characterized in that the hash list establishing unit is configured to perform the following steps:
extracting tuple information in offset-then-vocabulary form from the original knowledge graph;
converting the extracted tuple information into vocabulary-then-offset form;
sorting the tuple information in vocabulary-then-offset form by vocabulary to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted file hash list.
7. The system according to claim 5, characterized in that the multilevel index construction unit is configured to perform the following steps:
performing data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a knowledge graph data classification and simplification result;
extracting the bottom-layer structure nodes based on the knowledge graph data classification and simplification result;
performing further extraction on the knowledge graph data classification and simplification result to build the upper-layer structure index.
8. The system according to any one of claims 5 to 7, characterized in that the query unit is configured to perform the following steps:
receiving the query statement Q entered by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the vocabulary set to be queried;
for each vocabulary in the vocabulary set, looking up the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list D, and taking their intersection to obtain the disk index intersection S;
determining whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
if so, for every index position in the disk index intersection S, adding the index and its location information to the result subgraph as a node;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples; if so, setting the sample size to max; otherwise setting the sample size to the product of the length of the disk index intersection S and the sampling ratio δ, and if that sample size is less than the lower limit min on the number of returned tuples, setting the sample size to min; after the sample size is determined, performing semi-random sampling on the disk index intersection S, and adding each sampled index, together with its location information in the multilevel structure index, to the result subgraph.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810787762.9A CN109033314B (en) | 2018-07-18 | 2018-07-18 | Real-time query method and system for large-scale knowledge graph under condition of limited memory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810787762.9A CN109033314B (en) | 2018-07-18 | 2018-07-18 | Real-time query method and system for large-scale knowledge graph under condition of limited memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109033314A true CN109033314A (en) | 2018-12-18 |
CN109033314B CN109033314B (en) | 2020-10-23 |
Family
ID=64643743
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810787762.9A Active CN109033314B (en) | 2018-07-18 | 2018-07-18 | Real-time query method and system for large-scale knowledge graph under condition of limited memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033314B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105653706A (en) * | 2015-12-31 | 2016-06-08 | 北京理工大学 | Multilayer quotation recommendation method based on literature content mapping knowledge domain |
CN105760468A (en) * | 2016-02-05 | 2016-07-13 | 大连大学 | Large-scale image querying system based on inverted position-sensitive Hash indexing in mobile environment |
US20160224637A1 (en) * | 2013-11-25 | 2016-08-04 | Ut Battelle, Llc | Processing associations in knowledge graphs |
CN105868313A (en) * | 2016-03-25 | 2016-08-17 | 浙江大学 | Mapping knowledge domain questioning and answering system and method based on template matching technique |
CN108256065A (en) * | 2018-01-16 | 2018-07-06 | 智言科技(深圳)有限公司 | Knowledge mapping inference method based on relationship detection and intensified learning |
Non-Patent Citations (4)
Title |
---|
GRAINGER T等: "The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain", 《 2016 IEEE 3RD INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS》 * |
JAYARAM N等: "Querying Knowledge Graphs by Example Entity Tuples", 《IEEE TRANSACTIONS ON KNOWLEDGE & DATA ENGINEERING》 * |
XIAOLONG WAN等: "LKAQ: Large-scale knowledge graph approximate query algorithm", 《INFORMATION SCIENCES》 * |
YANG SHENGQI: "Querying Large-scale Knowledge Graphs", 《DISSERTATIONS & THESES GRADWORKS》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110275894A (en) * | 2019-06-24 | 2019-09-24 | 恒生电子股份有限公司 | A kind of update method of knowledge mapping, device, electronic equipment and storage medium |
CN112445890A (en) * | 2019-08-27 | 2021-03-05 | 北京国双科技有限公司 | Data processing method based on contract knowledge graph and related device |
CN113010746A (en) * | 2021-03-19 | 2021-06-22 | 厦门大学 | Medical record sequence retrieval method and system based on subtree inverted index |
CN113010746B (en) * | 2021-03-19 | 2023-08-29 | 厦门大学 | Medical record graph sequence retrieval method and system based on sub-tree inverted index |
CN112905806A (en) * | 2021-03-25 | 2021-06-04 | 哈尔滨工业大学 | Knowledge graph materialized view generator and generation method based on reinforcement learning |
CN113094449A (en) * | 2021-04-09 | 2021-07-09 | 天津大学 | Large-scale knowledge map storage scheme based on distributed key value library |
CN113094449B (en) * | 2021-04-09 | 2023-04-18 | 天津大学 | Large-scale knowledge map storage method based on distributed key value library |
CN113254720A (en) * | 2021-05-06 | 2021-08-13 | 天津大学深圳研究院 | Hash sorting construction method in storage based on novel memory |
CN113486092A (en) * | 2021-07-30 | 2021-10-08 | 苏州工业职业技术学院 | Time graph approximate query method and device based on time constraint |
CN113486092B (en) * | 2021-07-30 | 2023-07-21 | 苏州工业职业技术学院 | Time constraint-based time chart approximate query method and device |
CN114911844A (en) * | 2022-05-11 | 2022-08-16 | 复旦大学 | Approximate query optimization system based on machine learning |
CN114911844B (en) * | 2022-05-11 | 2024-04-05 | 复旦大学 | Approximate query optimization system based on machine learning |
Also Published As
Publication number | Publication date |
---|---|
CN109033314B (en) | 2020-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |