CN109033314A - Real-time query method and system for large-scale knowledge graphs under limited memory - Google Patents

Publication number
CN109033314A
Authority
CN
China
Prior art keywords
index
vocabulary
intersection
disk
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810787762.9A
Other languages
Chinese (zh)
Other versions
CN109033314B (en
Inventor
王宏志
万晓珑
高宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201810787762.9A priority Critical patent/CN109033314B/en
Publication of CN109033314A publication Critical patent/CN109033314A/en
Application granted granted Critical
Publication of CN109033314B publication Critical patent/CN109033314B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the technical field of data processing and provides a real-time query method and system for large-scale knowledge graphs under limited memory. The method comprises: processing and analyzing an original knowledge graph to obtain an inverted file hash list; constructing a multilevel structure index based on the original knowledge graph; and parsing a query statement to obtain target vocabulary, then looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph. The invention greatly improves single-machine knowledge graph query capability and, even when memory is extremely limited, can provide a result set that meets both the user's time requirement and the user's accuracy requirement.

Description

Real-time query method and system for large-scale knowledge graphs under limited memory
Technical field
The present invention relates to the technical field of data processing, and in particular to a real-time query method and system for large-scale knowledge graphs under limited memory.
Background
Since its birth, the World Wide Web has grown into a huge network whose nodes are individual web pages, linked to one another by hyperlinks. Based on this simple and open technology, modern search engines can find the pages relevant to a question within the vast space of the Web. However, with the development of the mobile Internet and the limited screen space of mobile devices, users expect a search engine to return an accurate result directly rather than having to look through the search results one by one. Merely storing web pages cannot satisfy this demand for accuracy.
To meet this demand, XML (Extensible Markup Language), RDF (Resource Description Framework), OWL (Web Ontology Language) and similar formats were proposed for describing information on the network. XML adds tags to documents and data content to facilitate data exchange; RDF describes the semantic relations of network resources as (subject, predicate, object) triples; OWL makes it possible to describe ontological concepts and has very strong expressive and interpretive power. Building on these three ways of describing Internet information, the concept of the knowledge graph has been proposed in recent years. Entities and entity attributes in web pages are identified and stored in the knowledge graph; when a user initiates a search, the user's intent can be understood accurately from the nodes known in the knowledge graph, and an accurate answer can be given.
The main storage and query approaches for knowledge graphs in RDF triple form currently are: a single huge triple table, multiple property-clustered tables, and multiple vertically partitioned tables. The single-table approach stores all triples in one huge three-column table; the main systems of this kind are RDF-3x and Hexastore. The property-clustered approach uses two main types of table: tables that cluster tuple properties, and tables of objects with similar properties. The vertically partitioned approach builds a separate two-column table for each property, storing subject and object. RDF storage systems based on these three approaches include Jena, Yars2, Sesame 2.0, SW-store, RDF-3x, x-RDF-3x, Hexastore and gStore.
Existing RDF storage and query systems such as Jena, Yars2 and Sesame 2.0 perform poorly on larger RDF data sets. SW-store, RDF-3x, x-RDF-3x and Hexastore handle larger RDF data sets by using a mapping dictionary, but they support only a fixed SPARQL language. Moreover, most current methods cannot quickly handle online updates of RDF data. For example, in Jena, which is based on multiple property-clustered tables, updating the property information of data requires re-clustering and rebuilding the property tables on the data set. In SW-store, an update requires rewriting many columns, so updates are also quite expensive; even with the "overflow table + batch write" scheme, it is hard to use in applications with strict real-time requirements. In addition, much RDF data tends to be loosely structured; for example, data of the same type need not all have the same properties. This loose structure favors data integration but is a problem for many classical relational-style techniques for accelerating query processing. Although gStore solves part of these problems with its T-index method, the data set size that a single machine can support with the T-index structure is limited: it can only support data management tasks for RDF knowledge graphs on the scale of one billion triples.
However, as human knowledge is updated ever faster, knowledge graphs keep growing, and their size now far exceeds one billion tuples. The computing power of ordinary devices falls far behind the growth rate of knowledge graphs, so query processing on them becomes increasingly difficult for ordinary users. For example, Freebase is about 380 GB, while an ordinary user currently has about 8 GB of memory; an average PC user querying it directly will generate a large amount of I/O, wasting a great deal of the user's time. Most ordinary users, however, do not need a highly exact result and only need the query routine to give an approximate answer. With the rise of approximate query processing, more and more research results show that in most cases an approximate answer can satisfy the user while greatly saving computation time and lowering the requirements on computing equipment.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of one or more of the above defects in the prior art, a real-time query method and system for large-scale knowledge graphs under limited memory.
To solve the above technical problem, the present invention provides a real-time query method for large-scale knowledge graphs under limited memory, comprising:
Processing and analyzing the original knowledge graph to obtain an inverted file hash list;
Constructing a multilevel structure index based on the original knowledge graph;
Parsing a query statement to obtain target vocabulary, and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph.
Optionally, processing and analyzing the original knowledge graph to obtain the inverted file hash list comprises:
Extracting tuple information in offset-then-vocabulary form from the original knowledge graph;
Converting the extracted tuple information into vocabulary-then-offset form;
Sorting the tuple information in vocabulary-then-offset form by vocabulary to obtain inverted files;
Hashing the obtained inverted files to obtain the inverted file hash list.
Optionally, constructing the multilevel structure index based on the original knowledge graph comprises:
Performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a simplified knowledge graph data classification result;
Extracting the bottom-level structure nodes based on the simplified knowledge graph data classification result;
Performing further extraction on the simplified knowledge graph data classification result to build the upper-level structure index.
Optionally, the step of parsing the query statement to obtain target vocabulary and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph comprises:
Receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Parsing the query statement Q to obtain the set of vocabulary to be queried;
For each word in the vocabulary set, finding the corresponding disk index set in the inverted file hash list in parallel, giving {S1, S2, ..., Sn}, and taking their intersection to obtain the disk index intersection S;
Judging whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
If so, for every index position in the disk index intersection S, the index and its location information are added to the result subgraph as a node;
Otherwise, judging whether the length of S is greater than the upper limit max: if so, the sample size is set to max; if not, the sample size is set to the product of the length of S and the sampling ratio δ, and if that product is less than the lower limit min, the sample size is set to min. Once the sample size is determined, semi-random sampling is performed on S, using the information of the auxiliary sampling nodes superNode obtained in step S102. Each sampled index, together with its location information in the multilevel structure index, is added to the result subgraph.
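The sample-size rule in the preceding steps can be written compactly. Here |S| is the length of the disk index intersection, min and max (in italics) are the user-supplied limits, and k is the number of indexes returned. The symbol k is introduced only for this note, and the text does not state how a non-integer product |S|·δ is rounded, so none is shown:

$$
k =
\begin{cases}
|S| & \text{if } |S| < \mathit{min} \\
\mathit{max} & \text{if } |S| > \mathit{max} \\
\max\bigl(\mathit{min},\ |S|\,\delta\bigr) & \text{otherwise}
\end{cases}
$$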
The present invention also provides a real-time query system for large-scale knowledge graphs under limited memory, comprising a hash list establishing unit, a multilevel index construction unit and a query unit;
The hash list establishing unit is configured to process and analyze the original knowledge graph to obtain an inverted file hash list;
The multilevel index construction unit is configured to construct a multilevel structure index based on the original knowledge graph;
The query unit is configured to parse a query statement to obtain target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph.
Optionally, the hash list establishing unit is configured to perform the following steps:
Extracting tuple information in offset-then-vocabulary form from the original knowledge graph;
Converting the extracted tuple information into vocabulary-then-offset form;
Sorting the tuple information in vocabulary-then-offset form by vocabulary to obtain inverted files;
Hashing the obtained inverted files to obtain the inverted file hash list.
Optionally, the multilevel index construction unit is configured to perform the following steps:
Performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a simplified knowledge graph data classification result;
Extracting the bottom-level structure nodes based on the simplified knowledge graph data classification result;
Performing further extraction on the simplified knowledge graph data classification result to build the upper-level structure index.
Optionally, the query unit is configured to perform the following steps:
Receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Parsing the query statement Q to obtain the set of vocabulary to be queried;
For each word in the vocabulary set, finding the corresponding disk index set in the inverted file hash list in parallel, giving {S1, S2, ..., Sn}, and taking their intersection to obtain the disk index intersection S;
Judging whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
If so, for every index position in the disk index intersection S, the index and its location information are added to the result subgraph as a node;
Otherwise, judging whether the length of S is greater than the upper limit max: if so, the sample size is set to max; if not, the sample size is set to the product of the length of S and the sampling ratio δ, and if that product is less than the lower limit min, the sample size is set to min. Once the sample size is determined, semi-random sampling is performed on S, using the information of the auxiliary sampling nodes superNode obtained in step S102. Each sampled index, together with its location information in the multilevel structure index, is added to the result subgraph.
The real-time query method and system for large-scale knowledge graphs under limited memory provided by the embodiments of the present invention have at least the following beneficial effects:
1. The present invention balances the user's requirements against the capability of the user's equipment: the inverted index and the structure index improve the user's single-machine data processing capability, so that the user's result set can be found in a very short time.
2. The present invention further incorporates approximate query processing: using ideas from that field, it extracts a subgraph structure after obtaining the large result set the user specified. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns, according to the user's wishes, a result that the user can understand quickly.
Brief description of the drawings
Fig. 1 is a flow chart of the real-time query method for large-scale knowledge graphs under limited memory provided by embodiment one of the present invention;
Fig. 2 is a schematic diagram of the principle of the present invention;
Figs. 3a, 3b and 3c are, respectively, a schematic diagram of the bottom-level structure extracted by the present invention, a schematic diagram of the bottom-level nodes and their relations, and a schematic diagram of the upper-level nodes and their relations;
Fig. 4 is a schematic diagram of the real-time query system for large-scale knowledge graphs under limited memory provided by embodiment five of the present invention;
In the figures: 401: hash list establishing unit; 402: multilevel index construction unit; 403: query unit.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment one
Refer to Fig. 1, a flow chart of the real-time query method for large-scale knowledge graphs under limited memory provided by embodiment one of the present invention, and to Fig. 2, a schematic diagram of the principle of the present invention. As shown, the method provided by this embodiment may comprise the following steps:
Step S101: perform the inverted file hash list establishing step, i.e. process and analyze the original knowledge graph to obtain the inverted file hash list. Because vocabulary is repeated at a high rate in very large non-numeric knowledge graphs, inverted files make it possible to locate triples quickly by word; hashing the inverted files speeds up word lookup and reduces file I/O.
Step S102: perform the multilevel structure index construction step, i.e. construct the multilevel structure index based on the original knowledge graph.
Step S103: at query time, answer the query statement using the inverted file hash list, the multilevel structure index and the original knowledge graph.
The key point of the invention is to meet an average PC user's query-time and accuracy requirements using very little memory, i.e. to answer knowledge graph queries in real time within limited memory. This avoids the situation where, with little memory available, a large number of I/O operations are generated, CPU utilization is low, reading files takes too long, and the user waits a very long time.
For current knowledge graphs based on the RDF structure, the present invention uses structure extraction: the data vertices are processed in layers to obtain a simplified vertex structure. A hash data structure is added to the inverted file design, so that a tuple lookup can be carried out in O(1) time. Combining the two structures, the user's result set can be found in O(1) time. By incorporating approximate query processing and using ideas from that field, the invention extracts a subgraph structure after obtaining the large result set the user specified. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns, according to the user's wishes, a result that the user can understand quickly.
The present invention overcomes the difficulty of extracting knowledge graph structure from loosely structured knowledge graphs and the time complexity of operating on large data sets, and can guarantee short offline data processing time and short online search time.
Embodiment two
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by embodiment one, the process in step S101 of processing and analyzing the original knowledge graph to obtain the inverted file hash list may be implemented as follows:
Step 1: extract tuple information in offset-then-vocabulary form from the original knowledge graph. The offset-then-vocabulary form is (offset, word, ..., word); that is, step 1 extracts tuple information of the form (offset, word, ..., word) from the original knowledge graph.
Step 2: convert the extracted tuple information into vocabulary-then-offset form. The vocabulary-then-offset form is (word, offset, ..., offset); that is, step 2 converts tuple information of the form (offset, word, ..., word) into the form (word, offset, ..., offset).
Step 3: sort the tuple information in vocabulary-then-offset form, (word, offset, ..., offset), by word to obtain inverted files;
Step 3 comprises:
Step 3.1: merge the offset information of words repeated within each block of 100,000 adjacent tuples;
Step 3.2: sort in memory in units of 100,000 tuples;
Step 3.3: merge-sort the files obtained above;
Step 3.4: obtain the sorted (word, offset, ..., offset) tuples.
Step 4: hash the obtained inverted files to obtain the inverted file hash list, which improves the efficiency of subsequent lookups.
The algorithm for constructing the inverted file hash list is given as Algorithm 1; its lines 1-11 correspond to steps 1 to 3 above. Lines 1-7 extract (v, p, ..., p) tuples from the original knowledge graph, i.e. the large-scale knowledge graph G; the number of tuples extracted into each inverted file does not exceed a preset quantity Max. When "list.addAndSort(extract(triple))" is executed to add an extracted tuple to the list, the (v, p, ..., p) tuple form needs to be converted to the (p, v, ..., v) form and sorted by the vocabulary in it. This yields a sorted result set within each interval of Max tuples, which is written to a file to give an inverted file. Lines 8-11 give the number of sorted inverted files obtained in the previous step. Lines 12-18 obtain the inverted file hash list fileList by choosing a hash function and merging all inverted files.
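Steps 1 to 4 can be illustrated with a minimal Python sketch. It is not the patent's Algorithm 1: the function names are hypothetical, list positions stand in for disk offsets, the sorted runs are kept in memory rather than written to files, and a Python dict plays the role of the hash list.

```python
import heapq
from itertools import groupby, islice
from operator import itemgetter


def extract_offset_first(triples):
    """Step 1: yield (offset, word, ..., word); the offset is the list position (assumption)."""
    for offset, triple in enumerate(triples):
        yield (offset, *triple)


def invert_chunk(chunk):
    """Steps 2, 3.1, 3.2: flip one chunk into a sorted run of (word, [offset, ...])."""
    postings = {}
    for offset, *words in chunk:
        for word in words:
            postings.setdefault(word, []).append(offset)
    return sorted(postings.items())


def build_inverted_hash_list(triples, chunk_size=100_000):
    """Steps 3.3 to 4: merge the sorted runs, then load them into a hash table."""
    tuples = extract_offset_first(triples)
    runs = []
    while True:
        chunk = list(islice(tuples, chunk_size))
        if not chunk:
            break
        runs.append(invert_chunk(chunk))  # a real system would write each run to disk
    hash_list = {}
    merged = heapq.merge(*runs, key=itemgetter(0))  # k-way merge by word
    for word, group in groupby(merged, key=itemgetter(0)):
        offsets = {o for _, run_offsets in group for o in run_offsets}
        hash_list[word] = sorted(offsets)
    return hash_list
```

With the dict in place, finding every triple offset for a word is a single average-case O(1) lookup, for example `build_inverted_hash_list(triples)["Award"]`.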
Embodiment three
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by embodiment two, the process in step S102 of constructing the multilevel structure index based on the original knowledge graph may be implemented as follows:
The present invention first performs preliminary structure discovery on the original knowledge graph by separating out its ontology layer, and then builds the multilevel index structure in three parts: in-depth analysis of the knowledge graph structure, establishment of the knowledge graph storage node index, and establishment of the overall structure index.
(1) In-depth analysis of the knowledge graph structure: data classification, cleaning and simplified data representation are applied to the preliminary structure discovery result to obtain the simplified knowledge graph data classification result. Simplified data representation here is a transformation of the original RDF knowledge graph: the original knowledge graph contains a great deal of redundant information, and this step removes that redundancy.
(2) Establishment of the knowledge graph storage node index: extract the disk position at which each vertex first appears in the original knowledge graph (in principle, RDF triples with the same subject are stored adjacently on disk), and use quicksort to sort the tuples of these disk positions by the ordering relation between nodes, giving the knowledge graph storage node index. A node here is a disk position.
(3) Establishment of the overall structure index: further extraction of disk positions is performed on the simplified knowledge graph data classification result to build the upper-level structure index; the knowledge graph storage node index and the upper-level structure index are then combined to give the multilevel structure index.
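Part (2) can be sketched as follows, as a minimal illustration only. It assumes the graph fits in a Python list and that list positions stand in for disk positions; the function name `build_storage_node_index` is hypothetical, and Python's built-in sort is used in place of the quicksort named in the text.

```python
def build_storage_node_index(triples):
    """Map each subject vertex to the position where it first appears, sorted by vertex."""
    first_seen = {}
    for position, (subject, _predicate, _object) in enumerate(triples):
        # setdefault keeps the earliest position and ignores later repeats
        first_seen.setdefault(subject, position)
    # Sort the (vertex, position) pairs by vertex so the index can be binary-searched
    return sorted(first_seen.items())
```

Because triples sharing a subject are stored adjacently, the first position of a subject is enough to reach all of its triples with one sequential read.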
A knowledge graph developed from an ontology language has the basic ontology concepts, including a set of real-world objects and a set of relations between real-world objects. Such a knowledge graph can easily be divided into an ontology (concept) layer and a fact (object) layer. Clearly, in a large-scale knowledge graph the ontology layer has many instances in the fact layer. Exploiting this property, the present invention can use data mining techniques to extract the ontology layer of the knowledge graph, separate the ontology layer from the fact layer, and so complete the construction of the multilevel structure index. The invention can use a bottom-up method to layer the knowledge graph automatically. One critical step must be done before layering: cleaning the knowledge graph, using certain encoding rules to reduce its redundancy. At the same time, the invention extracts the leaf nodes of the knowledge graph and their disk position information as the bottom-level structure of the multilevel index. These bottom-level nodes are then used to extract the relation information among bottom-level nodes and the upper-level node information, separating the knowledge graph further. For example, the bottom-level structure obtained by the invention is shown in Fig. 3a; the next iteration of the loop yields the relation information among nodes of this level (Fig. 3b), and the node information of the level above together with the relation information between upper and lower nodes (Fig. 3c).
In one embodiment of the present invention, the construction of the multilevel structure index may comprise the following steps:
Step 1: extract the bottom-level structure nodes of the large-scale knowledge graph G. Specifically, traverse each triple in G and judge whether the object of the triple is a leaf node. If so, add the subject of the triple and its location information to the set N0, and add the subject and location information to the multilevel structure index as a node; otherwise, add the subject of the triple and its location information to the set N1.
Step 2: build the upper-level node index of the current level's nodes and the association information among the current level's nodes, specifically:
While the set N1 is not empty, let S0 = N0 and S1 = N1, and reset N0 and N1 to the empty set; traverse each (triple, position) in S1, and for the current (triple, position):
If the object of the triple is in S0 and its subject is not in S0, extract (subject of the triple, position) and add it to N0, and extract (subject of the triple, position, position in S0 of the triple's object) and add it to the multilevel structure index;
If the object of the triple is in S0 and its subject is also in S0, extract (position in S0 of the triple's subject, position in S0 of the triple's object) and add it to the multilevel structure index;
Otherwise, add the (triple, position) to N1;
Step 3: extract the higher-level nodes (high-level node information) from the multilevel structure index.
Algorithm 2 shows in detail how the multilevel structure index is extracted by automated mining of the knowledge graph hierarchy. Lines 1-7 of the algorithm extract the bottom-level structure nodes of the large-scale knowledge graph G. In the loop that follows, the algorithm builds the structure index progressively. Lines 11-13 build the upper-level node index of the current node level and the index relation between the two levels. Lines 14 and 15 build the association information among nodes of the current level. Note that, to establish the association between the level index and the inverted file hash list, both store nodes as key-value pairs: the "key" is the position of each node on disk, and the "value" is the information needed by the various algorithms. The process is a loop: N0 denotes the lower-level nodes extracted and N1 the upper-level nodes extracted, and in the second iteration they are assigned to S0 and S1. Finally, in line 18, the higher-level nodes (high-level node information) superNode are extracted from the resulting structure index to serve the subsequent search algorithm.
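Steps 1 and 2 can be sketched in Python as below. This is one reading of the steps rather than the patent's Algorithm 2, and several points are assumptions: a leaf is taken to be an object that never occurs as a subject, list positions stand in for disk positions, N1 holds (triple, position) pairs as step 2 expects, the entry tags "node", "upper" and "same" are invented labels, and the final no-progress guard is added so the loop always terminates. Step 3 (superNode extraction) is not specified in enough detail to sketch.

```python
def build_multilevel_index(triples):
    """Bottom-up layering of (subject, predicate, object) triples into index entries."""
    subjects = {s for s, _, _ in triples}
    index = []  # entries of the multilevel structure index
    n0 = {}     # node -> position, for the layer found in the current pass
    n1 = []     # (triple, position) pairs deferred to a later pass

    # Step 1: bottom-level structure nodes
    for pos, (s, p, o) in enumerate(triples):
        if o not in subjects:  # assumed leaf test
            if s not in n0:
                n0[s] = pos
                index.append(("node", s, pos))
        else:
            n1.append(((s, p, o), pos))

    # Step 2: climb one layer per pass
    while n1:
        s0, s1, n0, n1 = n0, n1, {}, []
        for (s, p, o), pos in s1:
            if o in s0 and s not in s0:
                n0.setdefault(s, pos)  # s belongs to the layer above
                index.append(("upper", s, pos, s0[o]))
            elif o in s0:
                index.append(("same", s0[s], s0[o]))  # link inside the current layer
            else:
                n1.append(((s, p, o), pos))
        if len(n1) == len(s1):  # nothing consumed in this pass: stop (added guard)
            break
    return index
```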
Embodiment four
On the basis of the real-time query method for large-scale knowledge graphs under limited memory provided by embodiment three, the process in step S103 of parsing the query statement to obtain target vocabulary and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multilevel structure index to generate a result subgraph may be implemented as follows:
Step 1: receive the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Step 2: parse the query statement Q to obtain the set of vocabulary to be queried;
Step 3: for each word in the vocabulary set, find the corresponding disk index set in the inverted file hash list fileList in parallel, giving {S1, S2, ..., Sn}, and take their intersection to obtain the disk index intersection S, where n is the number of words in the vocabulary set.
Step 4: judge whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
If so, for every index position in the disk index intersection S, the index and its location information are added to the result subgraph as a node;
Otherwise, judge whether the length of S is greater than the upper limit max: if so, the sample size is set to max; if not, the sample size is set to the product of the length of S and the sampling ratio δ, and if that product is less than the lower limit min, the sample size is set to min. Once the sample size is determined, semi-random sampling is performed on S, using the information of the auxiliary sampling nodes superNode obtained in step S102. Each sampled index, together with its location information in the multilevel structure index, is added to the result subgraph.
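Steps 3 and 4 can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: the hash list is a dict from word to a list of disk indexes, the product with δ is truncated to an integer, and plain uniform sampling replaces the semi-random sampling guided by superNode, whose details the text does not give. The names `sample_size`, `query`, `lo` and `hi` are hypothetical (`lo` and `hi` stand for min and max).

```python
import random


def sample_size(length, lo, hi, delta):
    """Number of indexes to return for an intersection of the given length."""
    if length < lo:
        return length              # small result: return everything
    if length > hi:
        return hi                  # huge result: cap at the upper limit
    wanted = max(lo, int(length * delta))
    return min(length, wanted)     # never ask for more than exists


def query(hash_list, words, lo, hi, delta, rng=None):
    """Intersect the per-word disk index sets, then sample from the intersection."""
    rng = rng or random.Random()
    index_sets = [set(hash_list.get(word, ())) for word in words]
    intersection = sorted(set.intersection(*index_sets)) if index_sets else []
    k = sample_size(len(intersection), lo, hi, delta)
    # Uniform sampling stands in for the semi-random scheme (assumption)
    return sorted(rng.sample(intersection, k))
```

Capping the sample at max bounds the memory needed for the result subgraph regardless of how broad the query is, which is the point of the approximate answer.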
The present invention uses the results of steps S101 and S102 to serve search. The inverted file hash list is used to find the positions of the tuples the user's query is looking for, and the multi-level index is used to find their adjacent vertex structure, which enables fast structural queries under limited memory. However, using these results alone is far from sufficient, and many problems remain in the search process: for example, whether the query input by the user is a precise query, and how to handle a huge result set returned by the inverted file hash list. The present invention places no restrictions on the queries the user may input, and it is exactly this unrestricted querying that allows imprecise queries to occur. An imprecise query leads to a widely distributed result set, and with a widely distributed result set, even though the present invention uses the inverted file hash list and the multi-level index, the user's memory may still be unable to handle the query result. For a large-scale knowledge graph, suppose the query input by the user is highly precise, for instance a query for the instance "Award winner" under the upper-layer vocabulary 'Award'; the present invention is then certain to return an accurate and complete result efficiently. But what if the user wishes to look up the vocabulary 'Award' itself? In that case, even if the result we return (the result the user's memory must process) contains no neighbor information, its size is still several times, or even tens of times, the memory the user has made available to the search algorithm, to say nothing of a DESCRIBE statement in SPARQL. Moreover, such situations occur very frequently in user queries: when users know little about the content being queried, the only means available to them is to query the knowledge graph from the macroscopic to the microscopic. In short, in most cases users cannot provide a precise query statement. How to achieve efficient queries in the presence of such imprecise query statements is therefore the main problem to be overcome by the query method of the present invention.
To solve the problems discussed above and to balance query accuracy against query time, the present invention incorporates ideas from approximate query processing into the search method; that is, a semi-accurate result can be returned to the user at search time. From a certain angle, the search algorithm of the present invention is very close to online querying in approximate query processing. In an approximate query processing system, however, the user's query is directed at the entire data set, so the user needs to specify a sampling ratio. For every query, the sampling operation is pushed down in some way and the query statement is revised; in fact, approximate query processing systems operate by means of samplers. As the sampler is pushed further down, guaranteeing precision becomes increasingly complex and difficult.
Because of the inverted file hash list structure in the present invention, not every query requires a sampling operation, which guarantees the absolute accuracy of a portion of queries. When the user performs a fuzzy query, we let the user specify a result graph size interval ([Min, Max]) and an expected sampling ratio (Eδ), and a semi-random sampler provides the precision guarantee. Clearly, when the size of the result set obtained from the inverted file hash list is less than or equal to Min, the results need not be sampled; we directly assemble the queried triple positions into a subgraph structure and present it to the user. When the size obtained exceeds Max, the result set is sampled semi-randomly at a sampling ratio of Max ÷ length(results). When the result set size lies within the interval specified by the user, we first sample the result set semi-randomly at the user's expected sampling ratio Eδ; if the size of the sampled result is less than Min, sampling is instead performed at an actual sampling ratio of Min ÷ length(results), and if the sampled result lies within the interval, the actual sampling ratio Aδ equals the user's expected sampling ratio Eδ. It can be seen that the result set size range [Min, Max] specified by the user is absolute, and the algorithm strictly works within it, whereas the expected sampling ratio Eδ specified by the user may change according to the actual situation, and the algorithm finally returns the actual sampling ratio Aδ. In addition, it is worth noting that precision guarantee is a very important measure in approximate query processing. The present invention uses a semi-random sampling function to guarantee the precision of our results. "Semi-random" means that the previously obtained superNode is used to retain upper-layer nodes during sampling.
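A minimal Python sketch of one possible semi-random sampler is given below. The passage only states that superNode information is used to retain upper-layer nodes, so the concrete rule used here (keep the superNode entries first, then fill the remaining slots uniformly at random) is an assumption, as are the function names and the `seed` parameter.

```python
import random


def semi_random_sample(results, super_nodes, k, seed=None):
    """Pick k entries from results, keeping entries listed in super_nodes first.

    results     -- set of disk offsets returned by the inverted file hash list
    super_nodes -- set of offsets of upper-layer nodes to retain (assumed form)
    k           -- sample size already clamped to the user's [Min, Max] range
    """
    rng = random.Random(seed)
    ordered = sorted(results)
    # Deterministic part: upper-layer nodes are retained, up to k of them.
    kept = [r for r in ordered if r in super_nodes][:k]
    # Random part: fill the remaining slots uniformly from the other entries.
    rest = [r for r in ordered if r not in super_nodes]
    kept += rng.sample(rest, min(k - len(kept), len(rest)))
    return kept


def actual_ratio(sample, results):
    """Actual sampling ratio A-delta reported back to the user."""
    return len(sample) / len(results) if results else 0.0
```

Reporting the ratio computed by `actual_ratio` rather than the requested Eδ mirrors the behaviour described above, in which the size range is absolute and the ratio adapts to it.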
The pseudocode of the specific implementation of step S103 is shown in Algorithm 3 below. In line 1, the query statement input by the user is parsed in order to find the target vocabulary of the query. Then, in lines 2-6, the inverted file hash list obtained by Algorithm 2 and the target vocabulary obtained in the previous step are used to locate all triples involved in the user's query and to determine how the results are distributed. Since the user does not know whether the specified query statement is precise and cannot accurately predict the size of the query result, the present invention requires the user to give, before each query, a result set size range [Min, Max] and an expected sampling ratio Eδ, so that the result set size suits the user's memory and query efficiency is guaranteed. Accordingly, in line 7, the result subset size range is set to [Min, Max]. In line 9, whether to sample and how to sample are decided according to the distribution of the result positions obtained through the inverted index and the size of the result set. Lines 10-11 and 20-21 then construct the subgraph structure. It is worth explaining that G* is a self-adjusting subgraph structure: each time a new node is added, G* automatically adjusts the result set according to the hierarchical index. Furthermore, the multi-level index is stored as key-value pairs, which means that the time complexity of retrieving an entry of the multi-level index is O(1), so constructing the subgraph structure G* within O(1) time is clearly feasible.
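The constant-time argument can be illustrated with a small Python sketch of a subgraph that attaches each new node to its upper-layer node as it is added. Algorithm 3 is not reproduced in this passage, so the class name `ResultSubgraph`, the `parent` field, and the shape of the key-value index are all hypothetical.

```python
class ResultSubgraph:
    """Sketch of a subgraph G* that adjusts itself on every insertion."""

    def __init__(self, level_index):
        # level_index: dict keyed by disk offset; each value holds the offset
        # of the node's upper-layer node under "parent" (assumed layout).
        self.level_index = level_index
        self.nodes = {}
        self.edges = set()

    def add(self, offset):
        """Add one sampled index and link it to its upper-layer node."""
        info = self.level_index.get(offset, {})   # average O(1) dict lookup
        self.nodes[offset] = info
        parent = info.get("parent")
        if parent is not None:
            # Pull in the upper-layer node so the new node is never orphaned.
            self.nodes.setdefault(parent, self.level_index.get(parent, {}))
            self.edges.add((parent, offset))
```

Each call to `add` performs a fixed number of dictionary and set operations, so the cost per inserted node does not grow with the size of the index.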
Embodiment five
As shown in Fig. 4, the real-time query system for a large-scale knowledge graph under limited memory provided in embodiment five of the present invention may include a hash list establishing unit 401, a multi-level index construction unit 402, and a query unit 403.
The hash list establishing unit 401 is configured to process and analyze the original knowledge graph to obtain the inverted file hash list. The operations performed by the hash list establishing unit 401 are the same as step S101 in the foregoing method.
The multi-level index construction unit 402 is configured to construct the multi-level structure index based on the original knowledge graph. The operations performed by the multi-level index construction unit 402 are the same as step S102 in the foregoing method.
The query unit 403 is configured to parse the query statement to obtain target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph. The operations performed by the query unit 403 are the same as step S103 in the foregoing method.
Preferably, the hash list establishing unit 401 is configured to perform the following steps:
Extract tuple information in offset-then-word form from the original knowledge graph;
Convert the extracted tuple information into word-then-offset form;
Sort the tuple information in word-then-offset form by word to obtain an inverted file;
Hash the resulting inverted file to obtain the inverted file hash list.
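The four steps above can be sketched in Python as follows. This is an illustration under assumptions rather than the patented procedure: each triple is taken to be one text line, the offset is the byte position of that line in the file, words are obtained by whitespace splitting and lowercasing, and the hash list is a Python dictionary. The function name is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def build_inverted_hash_list(lines):
    """Build a word -> sorted disk offsets mapping from triple lines.

    lines -- iterable of triple lines, each including its trailing newline
    """
    # 1. Extract (offset, word) tuples while scanning the file once.
    offset_word = []
    offset = 0
    for line in lines:
        for word in line.split():          # whitespace tokenisation is assumed
            offset_word.append((offset, word.lower()))
        offset += len(line.encode("utf-8"))
    # 2. Convert to (word, offset) form.
    word_offset = [(word, off) for off, word in offset_word]
    # 3. Sort by word, which yields the inverted file.
    word_offset.sort()
    # 4. Hash the inverted file: one dictionary entry per word.
    return {
        word: sorted({off for _, off in group})
        for word, group in groupby(word_offset, key=itemgetter(0))
    }
```

Storing offsets instead of the triples themselves keeps the structure small, so a query only has to read the few disk positions that the lookup returns.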
Preferably, the multi-level index construction unit 402 is configured to perform the following steps:
Perform data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a classified and simplified knowledge graph data result;
Extract the bottom-layer structure nodes based on the classified and simplified knowledge graph data result;
Further extract from the classified and simplified knowledge graph data result to build the higher-level structure index.
Preferably, the query unit 403 is configured to perform the following steps:
Receive the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
Parse the query statement Q to obtain the set of words to be queried;
For each word in the word set, find the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list, and take their intersection to obtain the disk index intersection S;
Determine whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
If so, for every index position in the disk index intersection S, add the index and its position information as a node to the result subgraph;
Otherwise, determine whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples. If so, set the sample size to max; otherwise, set the sample size to the product of the length of S and the sampling ratio δ, and if that sample size is less than the lower limit min, set the sample size to min. After the sample size is determined, perform semi-random sampling on the disk index intersection S, using the information of the auxiliary sampling nodes superNode obtained in step S102. Each sampled index, together with its position information in the multi-level structure index, is added to the structure subgraph.
It should further be noted that the real-time query system for a large-scale knowledge graph under limited memory provided in the embodiments of the present invention may be implemented in software, in hardware, or in a combination of software and hardware. Taking a software implementation as an example, as shown in Fig. 4, the system in a logical sense is formed by the CPU of the device on which it resides reading the corresponding computer program instructions from non-volatile memory into memory and running them.
In conclusion compared with prior art, the present invention greatly improves single machine knowledge mapping query capability, it can The result set for not only meeting user time demand but also meeting user's accuracy requirement is provided in the case where memory is extremely limited.It is existing Knowledge mapping inquiry system is to provide based on complete query processing ability, in the case of having ignored current this knowledge huge explosion The demand that personal user inquires knowledge mapping consumes the result that a large amount of memory headroom is found and has also exceeded ordinary user's Data understandability.
The present invention can take into account the relationship between the demand and UE capability of user, pass through inverted index and knot Structure index improves user's single machine data-handling capacity, by Approximate query processing technology and automation for knowing on a large scale The Structure Understanding for knowing map, provides the user with a suitable result set.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, rather than its limitations;Although Present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that: it still may be used To modify the technical solutions described in the foregoing embodiments or equivalent replacement;And these are modified or replaceed, not Depart from the spirit and scope of the technical scheme of various embodiments of the present invention the essence of corresponding technical solution.

Claims (8)

1. A real-time query method for a large-scale knowledge graph under limited memory, characterized by comprising:
processing and analyzing an original knowledge graph to obtain an inverted file hash list;
constructing a multi-level structure index based on the original knowledge graph;
parsing a query statement to obtain target vocabulary, and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph.
2. The method according to claim 1, characterized in that processing and analyzing the original knowledge graph to obtain the inverted file hash list comprises:
extracting tuple information in offset-then-word form from the original knowledge graph;
converting the extracted tuple information into word-then-offset form;
sorting the tuple information in word-then-offset form by word to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted file hash list.
3. The method according to claim 1, characterized in that constructing the multi-level structure index based on the original knowledge graph comprises:
performing data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a classified and simplified knowledge graph data result;
extracting the bottom-layer structure nodes based on the classified and simplified knowledge graph data result;
further extracting from the classified and simplified knowledge graph data result to build the higher-level structure index.
4. The method according to any one of claims 1 to 3, characterized in that the step of parsing the query statement to obtain target vocabulary, and looking up the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph, comprises:
receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the set of words to be queried;
for each word in the word set, finding the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list, and taking their intersection to obtain the disk index intersection S;
determining whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
if so, for every index position in the disk index intersection S, adding the index and its position information as a node to the result subgraph;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples; if so, setting the sample size to max; otherwise, setting the sample size to the product of the length of S and the sampling ratio δ, and if that sample size is less than the lower limit min, setting the sample size to min; after the sample size is determined, performing semi-random sampling on the disk index intersection S, and adding each sampled index, together with its position information in the multi-level structure index, to the structure subgraph.
5. A real-time query system for a large-scale knowledge graph under limited memory, characterized by comprising: a hash list establishing unit, a multi-level index construction unit, and a query unit;
the hash list establishing unit is configured to process and analyze an original knowledge graph to obtain an inverted file hash list;
the multi-level index construction unit is configured to construct a multi-level structure index based on the original knowledge graph;
the query unit is configured to parse a query statement to obtain target vocabulary, and to look up the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph.
6. The system according to claim 5, characterized in that the hash list establishing unit is configured to perform the following steps:
extracting tuple information in offset-then-word form from the original knowledge graph;
converting the extracted tuple information into word-then-offset form;
sorting the tuple information in word-then-offset form by word to obtain an inverted file;
hashing the resulting inverted file to obtain the inverted file hash list.
7. The system according to claim 5, characterized in that the multi-level index construction unit is configured to perform the following steps:
performing data classification, cleaning, and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a classified and simplified knowledge graph data result;
extracting the bottom-layer structure nodes based on the classified and simplified knowledge graph data result;
further extracting from the classified and simplified knowledge graph data result to build the higher-level structure index.
8. The system according to any one of claims 5 to 7, characterized in that the query unit is configured to perform the following steps:
receiving the query statement Q input by the user, the lower limit min on the number of returned tuples, the upper limit max on the number of returned tuples, and the sampling ratio δ;
parsing the query statement Q to obtain the set of words to be queried;
for each word in the word set, finding the corresponding disk index sets {S1, S2, ..., Sn} in parallel in the inverted file hash list, and taking their intersection to obtain the disk index intersection S;
determining whether the length of the disk index intersection S is less than the lower limit min on the number of returned tuples:
if so, for every index position in the disk index intersection S, adding the index and its position information as a node to the result subgraph;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples; if so, setting the sample size to max; otherwise, setting the sample size to the product of the length of S and the sampling ratio δ, and if that sample size is less than the lower limit min, setting the sample size to min; after the sample size is determined, performing semi-random sampling on the disk index intersection S, and adding each sampled index, together with its position information in the multi-level structure index, to the structure subgraph.
CN201810787762.9A 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory Active CN109033314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810787762.9A CN109033314B (en) 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory

Publications (2)

Publication Number Publication Date
CN109033314A true CN109033314A (en) 2018-12-18
CN109033314B CN109033314B (en) 2020-10-23

Family

ID=64643743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810787762.9A Active CN109033314B (en) 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory

Country Status (1)

Country Link
CN (1) CN109033314B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653706A (en) * 2015-12-31 2016-06-08 北京理工大学 Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN105760468A (en) * 2016-02-05 2016-07-13 大连大学 Large-scale image querying system based on inverted position-sensitive Hash indexing in mobile environment
US20160224637A1 (en) * 2013-11-25 2016-08-04 Ut Battelle, Llc Processing associations in knowledge graphs
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Mapping knowledge domain questioning and answering system and method based on template matching technique
CN108256065A (en) * 2018-01-16 2018-07-06 智言科技(深圳)有限公司 Knowledge mapping inference method based on relationship detection and intensified learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GRAINGER T等: "The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain", 《 2016 IEEE 3RD INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS》 *
JAYARAM N等: "Querying Knowledge Graphs by Example Entity Tuples", 《IEEE TRANSACTIONS ON KNOWLEDGE & DATA ENGINEERING》 *
XIAOLONG WAN等: "LKAQ: Large-scale knowledge graph approximate query algorithm", 《INFORMATION SCIENCES》 *
YANG SHENGQI: "Querying Large-scale Knowledge Graphs", 《DISSERTATIONS & THESES GRADWORKS》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275894A (en) * 2019-06-24 2019-09-24 恒生电子股份有限公司 A kind of update method of knowledge mapping, device, electronic equipment and storage medium
CN112445890A (en) * 2019-08-27 2021-03-05 北京国双科技有限公司 Data processing method based on contract knowledge graph and related device
CN113010746A (en) * 2021-03-19 2021-06-22 厦门大学 Medical record sequence retrieval method and system based on subtree inverted index
CN113010746B (en) * 2021-03-19 2023-08-29 厦门大学 Medical record graph sequence retrieval method and system based on sub-tree inverted index
CN112905806A (en) * 2021-03-25 2021-06-04 哈尔滨工业大学 Knowledge graph materialized view generator and generation method based on reinforcement learning
CN113094449A (en) * 2021-04-09 2021-07-09 天津大学 Large-scale knowledge map storage scheme based on distributed key value library
CN113094449B (en) * 2021-04-09 2023-04-18 天津大学 Large-scale knowledge map storage method based on distributed key value library
CN113254720A (en) * 2021-05-06 2021-08-13 天津大学深圳研究院 Hash sorting construction method in storage based on novel memory
CN113486092A (en) * 2021-07-30 2021-10-08 苏州工业职业技术学院 Time graph approximate query method and device based on time constraint
CN113486092B (en) * 2021-07-30 2023-07-21 苏州工业职业技术学院 Time constraint-based time chart approximate query method and device
CN114911844A (en) * 2022-05-11 2022-08-16 复旦大学 Approximate query optimization system based on machine learning
CN114911844B (en) * 2022-05-11 2024-04-05 复旦大学 Approximate query optimization system based on machine learning

Also Published As

Publication number Publication date
CN109033314B (en) 2020-10-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant