CN107122421A - Information retrieval method and device

Info

Publication number
CN107122421A
Authority
CN
China
Legal status
Pending
Application number
CN201710217499.5A
Other languages
Chinese (zh)
Inventor
杨硕
邹磊
Current Assignee
Peking University
Original Assignee
Peking University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

This application discloses an information retrieval method and device in the Internet field, intended to improve the accuracy of the results retrieved for a user's problem to be solved. The method includes: receiving an input problem to be solved; determining the technical field to which the problem to be solved belongs; determining, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects, each knowledge object being selected from a part of a problem object; and returning the target document. The application is used to answer problems to be solved.

Description

Information retrieval method and device
Technical field
This application relates to the Internet field, and in particular to an information retrieval method and device.
Background
With the rapid development of the Internet, users increasingly tend to obtain answers by asking questions online. After receiving a user's question, a search engine can search on one or more keywords that appear in the question and return the results that match those keywords.
However, understanding a human question is very difficult for a machine, so the results obtained in this way are often not what the user wanted, which leads to low retrieval accuracy.
Summary of the invention
The embodiments of this application provide an information retrieval method and device to improve the accuracy of the results retrieved for a user's problem to be solved. The technical solutions are as follows:
In one aspect, an information retrieval method is provided, the method including:
receiving an input problem to be solved;
determining the technical field to which the problem to be solved belongs;
determining, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects, each knowledge object being selected from a part of a problem object; and
returning the target document.
In another aspect, an information retrieval device is provided, the information retrieval device including:
an interface module, configured to receive an input problem to be solved; and
a processing module, configured to determine the technical field to which the problem to be solved belongs;
the processing module being further configured to determine, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects, each knowledge object being selected from a part of a problem object;
the interface module being further configured to return the target document.
The technical solutions provided by the embodiments of this application bring the following beneficial effects:
When retrieval is performed for a user's problem to be solved (that is, the user's question), not only one or more keywords in the problem but also the technical field of the problem is considered. By taking the technical field of the problem into account and using a field-specific knowledge base built in advance, the accuracy of the results retrieved for the user's problem to be solved can be greatly improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the four-layer knowledge graph for a particular technical field, provided by an embodiment of this application;
Fig. 2 is an exemplary diagram of the relations among problem nodes, knowledge nodes and document nodes, provided by an embodiment of this application;
Fig. 3 is a flowchart of an exemplary information retrieval method provided by an embodiment of this application;
Fig. 4 is a schematic diagram of an exemplary information retrieval method provided by an embodiment of this application;
Fig. 5 is a diagram of the relations between nodes, showing the random walk probabilities between them, provided by an embodiment of this application;
Fig. 6 is a structural block diagram of an exemplary information retrieval device provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The "electronic device" mentioned herein can include a smartphone, tablet computer, smart TV, e-book reader, MP3 player (Moving Picture Experts Group Audio Layer III), MP4 player (Moving Picture Experts Group Audio Layer IV), laptop computer, desktop computer, and so on. The "information retrieval device" mentioned herein can be one or more servers or the like.
Related information retrieval methods consider only the keywords that appear in a problem and often fail to understand the user's intent. To understand a problem, humans usually draw on basic knowledge of the relevant technical field. Take the problem "when the user tries to send some special forms in the outbox, the program gets stuck in a waiting state". We first notice "special forms" and "outbox", both of which are parts of the product Outlook, and can therefore infer that this is a problem arising in Outlook.
The analysis above shows that background knowledge of the technical field plays an important role in understanding a problem. This application helps machines understand user problems by building a knowledge base for a particular technical field.
The information retrieval method in this application is based on a knowledge base built in advance. The knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects. A problem object can be an individual problem to be solved that a user has input, a knowledge object can be selected from a part of such a problem, and a document object can be a document that solves such a problem.
To make the knowledge base easier to understand, the description below presents its parts and their relations in the form of a knowledge graph.
A technical problem usually consists of three parts: product, component and event words. In general, the knowledge graph in this application can include four layers: a concept layer, a product layer, a component layer and an event layer. Specifically:
Concept layer: each node represents a concept, and a concept stands for a group of products with similar functions. A concept is usually also a sub-concept of another concept.
Product layer: contains all products and their attributes. The product layer is the core of the whole knowledge graph; each of its nodes represents a specific product or an attribute of a product. Several kinds of product attributes can be predefined, such as version, language and runtime environment.
Component layer: a technical problem usually concerns some component of a product; the component layer contains the components of all products.
Event layer: once the product or component has been identified, the specific phenomenon of the problem still needs to be understood; the event layer contains the nouns, verbs, adjectives and other words that describe problem phenomena.
An example knowledge graph is shown in Fig. 1, where dashed lines divide it from top to bottom into four layers: concept layer, product layer, component layer and event layer.
The knowledge graph is built from a technical corpus; the construction method is described below.
Conceptual level and gas producing formation
Concepts and products are extracted from product information. For example, 6052 products were obtained in total, belonging to 214 different categories. Product attributes are extracted at the same time using predefined rules; for example, "Office Pro Win32IT" denotes a product named Office, with version Pro and language Italian, installed on a 32-bit Windows operating system.
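The text does not spell out the predefined rules, so the following is only a minimal sketch of what one such rule might look like. The regular expression, the `LANGUAGE_CODES` table and the output field names are all hypothetical and chosen to fit the single example given above.

```python
import re

# Hypothetical rule: "<name> <version> Win<bits><two-letter language code>".
PRODUCT_RULE = re.compile(
    r"^(?P<name>\S+)\s+(?P<version>\S+)\s+Win(?P<bits>32|64)\s*(?P<lang>[A-Z]{2})$"
)

# Illustrative subset of language codes; not taken from the source.
LANGUAGE_CODES = {"IT": "Italian", "EN": "English"}


def parse_product(label):
    """Split a product label into attributes, or return None if the rule does not apply."""
    match = PRODUCT_RULE.match(label)
    if match is None:
        return None
    code = match.group("lang")
    return {
        "name": match.group("name"),
        "version": match.group("version"),
        "language": LANGUAGE_CODES.get(code, code),
        "os": f"Windows {match.group('bits')}-bit",
    }


print(parse_product("Office Pro Win32IT"))
```

In practice a rule set of this kind would presumably contain many patterns, one per naming scheme found in the product information.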
Component layer
Components are extracted from the technical corpus and from users' problem logs. First, sequence labeling methods are used to identify the components mentioned in the corpus. The extracted phrases become nodes of the component layer, and the relation between a product and a component is measured by their PMI (pointwise mutual information) value. PMI is a common measure of the similarity between two phrases: if the PMI value of a component c and a product p exceeds a threshold, c is considered to be a component of p. PMI is defined as follows:
$$\mathrm{PMI}(p, c) = \log \frac{P(p, c)}{P(p)\,P(c)}$$
where the probabilities are estimated from counts in the corpus:
#(c) is the number of occurrences of c, #(p) is the number of occurrences of p, and #(p, c) is the number of co-occurrences of p and c.
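The PMI test can be sketched as follows. This is an illustration under assumptions: the normalizing `total` (the number of observation units, for example log entries) and the default `threshold` are not given in the text and are hypothetical.

```python
import math


def pmi(count_p, count_c, count_pc, total):
    """PMI of product p and component c, with probabilities estimated as count / total.

    `total` is an assumption of this sketch; the text only names the three counts.
    """
    if count_pc == 0:
        return float("-inf")  # never co-occur: weakest possible association
    p_p = count_p / total
    p_c = count_c / total
    p_pc = count_pc / total
    return math.log(p_pc / (p_p * p_c))


def is_component_of(count_p, count_c, count_pc, total, threshold=1.0):
    """Treat c as a component of p when their PMI exceeds the (hypothetical) threshold."""
    return pmi(count_p, count_c, count_pc, total) > threshold
```

A pair that always appears together scores high, while a pair whose co-occurrence is what independence would predict scores zero.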
Event layers
The event layer has two different kinds of edges, "EventWordOf" and "RelatedTo", which are discussed in turn. An EventWordOf edge connects a product with an action word; such relations are extracted using PMI, in the same way as for the component layer. In general, users describe the phenomenon of a problem with verbs, adjectives, adverbs, nouns and so on. Given a large-scale technical corpus, mature part-of-speech (POS) tagging methods are first used to label the part of speech of each word in the corpus. In addition, it is assumed that if two technical problems can be solved by the same document, they should be semantically very similar. For example, suppose document d can solve the following three technical problems:
q2: Outlook 2007 gets frozen.
q9: Outlook sending status remains for hours.
q15: Emails get stuck in outbox.
From this we can conclude that the three words "frozen", "remain" and "stuck" are semantically similar, so "RelatedTo" edges can be added between the event nodes corresponding to these three words.
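The RelatedTo construction can be sketched as below. The sketch assumes the event words of each problem have already been extracted (for example by POS tagging) and grouped by the document that solves the problems; the function name and the input shape are hypothetical.

```python
from itertools import combinations


def related_to_edges(event_words_by_document):
    """Connect every pair of event words whose problems are solved by the same document.

    event_words_by_document: {document_id: iterable of event words}.
    Returns a set of undirected edges stored as alphabetically ordered pairs.
    """
    edges = set()
    for words in event_words_by_document.values():
        for a, b in combinations(sorted(set(words)), 2):
            edges.add((a, b))
    return edges


edges = related_to_edges({"d": ["frozen", "remain", "stuck"]})
print(sorted(edges))
```

For the three problems q2, q9 and q15 above, this yields the three pairwise edges among "frozen", "remain" and "stuck".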
In order to return target documents associated with the user's problem to be solved, knowledge objects and document objects are associated with each other in this application. The document objects can be obtained from technical problem logs collected on the network.
An example of the connections among problem objects, knowledge objects and document objects is shown in Fig. 2. Each node in Fig. 2 represents one object: a problem object, a knowledge object or a document object. The problem to be solved q in Fig. 2 is: "some specific Custom forms get stuck in Outbox when users send it". Document d1 is "Description of 2007 Microsoft Office Suite SP2".
Fig. 2 contains three types of edges: edges from the problem node to knowledge nodes, edges between two knowledge nodes, and edges from knowledge nodes to document nodes. For a given problem node, the edges from that node to each knowledge node carry the same weight. The weight of an edge between two knowledge nodes can be expressed as a conditional probability: the weight from node x to node y is the probability that y occurs given that x occurs, that is:
$$w(x \to y) = P(y \mid x) = \frac{\#(x, y)}{\#(x)}$$
where #(x, y) is the number of co-occurrences of x and y, and #(x) is the number of occurrences of x.
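As a rough sketch of this weighting, the following counts occurrences and co-occurrences over a list of problems, each reduced to the set of knowledge nodes it mentions. That input representation and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def knowledge_edge_weights(problems):
    """Return {(x, y): #(x, y) / #(x)} for knowledge nodes that co-occur in a problem.

    problems: list of sets, each holding the knowledge nodes mentioned by one problem.
    """
    single = Counter()
    pair = Counter()
    for nodes in problems:
        single.update(nodes)
        # Ordered pairs, because the weight x -> y differs from y -> x.
        pair.update(permutations(sorted(nodes), 2))
    return {(x, y): n / single[x] for (x, y), n in pair.items()}


weights = knowledge_edge_weights([
    {"outlook", "outbox"},
    {"outlook", "form"},
    {"outlook", "outbox", "form"},
])
print(weights[("outbox", "outlook")], weights[("outbox", "form")])
```

Note that the weights are asymmetric: every problem mentioning "outbox" also mentions "outlook", but not the other way round.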
The weight of an edge from a knowledge node x to a document node d is given by the following formula:
$$w(x \to d) = \frac{\left|\{\, q \in QL(d) : x \in q \,\}\right|}{\left|QL(d)\right|}$$
where the numerator is the number of problems that can be solved by d and contain x, the denominator is the number of all problems that can be solved by d, and QL(d) denotes the set of all problems that can be solved by d.
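A minimal sketch of this weight, assuming QL(d) is available as a list of problems, each represented by the set of knowledge nodes it contains (a hypothetical representation):

```python
def knowledge_to_document_weight(x, solved_problems):
    """Fraction of the problems solved by d that contain knowledge node x.

    solved_problems plays the role of QL(d): a list of sets of knowledge nodes.
    """
    if not solved_problems:
        return 0.0  # d solves no logged problem, so no evidence links x to d
    hits = sum(1 for nodes in solved_problems if x in nodes)
    return hits / len(solved_problems)


ql_d = [{"outlook", "outbox"}, {"outlook", "form"}, {"excel"}, {"outlook"}]
print(knowledge_to_document_weight("outlook", ql_d))
```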
Once the knowledge base has been built in advance, information retrieval can be performed for the problem input by the user.
Referring to Fig. 3, an embodiment of the present invention provides an information retrieval method, the method including:
Step 31: receive an input problem to be solved.
The input problem to be solved can be a problem that the user inputs through an electronic device.
Step 32: determine the technical field to which the problem to be solved belongs.
In the embodiments of this application, the technical field to which the problem belongs can be determined from one or more keywords in the problem to be solved.
Step 33: according to a pre-established knowledge base for the technical field, determine a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects, each knowledge object being selected from a part of a problem object.
Step 34: return the target document.
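Steps 31 to 34 can be outlined as a small pipeline. This is only a sketch: the keyword-overlap vote in `detect_field`, the `field_keywords` table and the per-field `matchers` callables (standing in for the knowledge-base lookup of step 33) are all hypothetical.

```python
def detect_field(problem, field_keywords):
    """Step 32: pick the field whose keyword set overlaps the problem text most."""
    words = set(problem.lower().split())
    hits = {field: len(words & keywords) for field, keywords in field_keywords.items()}
    best = max(hits, key=hits.get, default=None)
    return best if best is not None and hits[best] > 0 else None


def retrieve(problem, field_keywords, matchers):
    """Steps 31 to 34: receive the problem, find its field, match and return documents."""
    field = detect_field(problem, field_keywords)
    if field is None:
        return []
    # Each matcher stands in for the knowledge-base lookup of one technical field.
    return matchers[field](problem)


field_keywords = {"email": {"outlook", "outbox"}, "spreadsheet": {"excel", "cell"}}
matchers = {"email": lambda p: ["d1"], "spreadsheet": lambda p: ["d9"]}
print(retrieve("Emails get stuck in outbox", field_keywords, matchers))
```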
In this application, the target document that matches the problem to be solved can be, for example, a target document that solves the problem to be solved, a target document that contains the problem to be solved, or a target document that contains one or more keywords of the problem to be solved.
In this application, returning the target document in step 34 may include: returning the title of the target document and/or returning content from the target document.
In the embodiments of this application, when retrieval is performed for a user's problem to be solved (that is, the user's question), not only one or more keywords in the problem but also the technical field of the problem is considered. By taking the technical field of the problem into account and using a field-specific knowledge base built in advance, the accuracy of the results retrieved for the user's problem to be solved can be greatly improved.
In the embodiments of this application, determining the target document in the technical field that matches the problem to be solved, as described in step 33, may include:
determining, according to the problem objects and knowledge objects in the knowledge base and the correspondence between the problem objects and the knowledge objects, the problems in the technical field that are similar to the problem to be solved;
determining the similarity score between each similar problem and the problem to be solved; and
determining, based on the similarity scores and the target document corresponding to each similar problem, the target document that matches the problem to be solved.
It should be understood that, based on the similarity scores and the target document corresponding to each similar problem, the embodiments of this application can directly select the target document of the similar problem with the highest similarity score as the target document that matches the problem to be solved. In this way a result can be returned to the user as quickly as possible, which suits scenarios where the user has strict requirements on speed.
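The fast path described above amounts to taking the maximum over the similar problems. A minimal sketch, in which the tuple layout and the example scores are hypothetical:

```python
def best_document(similar_problems):
    """Return the document of the similar problem with the highest similarity score.

    similar_problems: list of (problem_id, similarity_score, document_id) tuples.
    """
    if not similar_problems:
        return None
    _, _, document = max(similar_problems, key=lambda item: item[1])
    return document


candidates = [("q000", 0.9, "d1"), ("q001", 0.7, "d5"), ("q010", 0.4, "d2")]
print(best_document(candidates))
```

Because only one pass over the similar problems is needed, this variant trades ranking quality for response time.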
Alternatively, in this application, the target documents corresponding to the similar problems can be used as candidate documents, and determining the target document that matches the problem to be solved, based on the similarity scores and the target document corresponding to each similar problem, may include:
determining, based on the similarity scores, the similarity between the problem to be solved and each of the candidate documents; and
selecting, in descending order of similarity between the problem to be solved and the candidate documents, one or more candidate documents as the target documents that match the problem to be solved;
wherein the similarity between the problem to be solved and each of the candidate documents is determined as follows:
where q denotes the problem to be solved, d denotes a candidate document, score(q, d) denotes the similarity between the problem to be solved q and the candidate document d, #(d, C) denotes the total number of times d occurs in C, #(d, C_0) denotes the number of times d occurs in C_0, (q'_i, d) ∈ C_0 indicates that d can solve problem q'_i in C_0, and score(q'_i, q) denotes the similarity score between q'_i and q; C_0 denotes a subset of the problem log C, q' denotes a problem similar to the problem to be solved q, and
C_0 = {(q'_0, d'_0), (q'_1, d'_1), ..., (q'_m, d'_m)}, where q'_i denotes the i-th problem similar to q, m denotes the total number of problems similar to q, and d' denotes the target document corresponding to q'.
Fig. 4 shows the similar problems that belong to the same technical field as the problem to be solved, together with the target document corresponding to each similar problem. When determining the target document in the technical field that matches the problem to be solved, referring to Fig. 4, if problem q000 is the similar problem with the highest similarity score with respect to the problem to be solved, the document d1 corresponding to q000 can be used as the target document of the problem to be solved. Alternatively, d5 and d1 (merely as an example) can both be used as target documents of the problem to be solved, with d1 ranked before d5 in the returned results.
Alternatively, the ordering of the returned results may be determined from the similarity between the problem to be solved and each of the candidate documents. Accordingly, after the target documents are determined in step 33, the information retrieval method provided by the embodiments of this application may further include: calculating, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base; and reordering the multiple target documents based on the similarity between the problem to be solved and each document object in the knowledge base.
After the multiple target documents have been reordered, the target documents can be returned in the reordered sequence.
In the embodiments of this application, calculating the similarity between the problem to be solved and each document object in the knowledge base based on the random walk algorithm includes: selecting one or more nodes between the problem to be solved and the document objects on which to set an index, where the index of a node represents the similarity of that node to each document object in the knowledge base; and calculating, based on the indexes set for the one or more nodes, the similarity between the problem to be solved and each document object in the knowledge base.
One way to select the nodes on which to set an index is to select the frequent nodes on the paths, where a frequent node is a node whose product of in-degree and out-degree exceeds a threshold.
The random walk algorithm is a method for measuring the similarity between nodes. In general, if a walk starts at one node and moves at random according to the probability of each edge, the probability of reaching another node is the similarity between the starting node and that other node. The random walk similarity can be calculated as follows:
$$s(x, y) = \sum_{x' \in N(x)} T(x, x')\, s(x', y)$$
where s(x, y) is the random-walk-based similarity between node x and node y, N(x) denotes the set of all nodes connected to x, and T(x, x') denotes the probability of walking from node x to node x'.
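The recursive definition can be evaluated by repeated substitution. The sketch below makes two assumptions that the text does not state explicitly: document nodes are treated as end points of the walk, with s(d, d) = 1 and s(d', d) = 0 for any other document d', and a fixed number of sweeps is used instead of a convergence test. The graph and its probabilities are invented for illustration.

```python
def random_walk_similarity(transitions, target, documents, iterations=100):
    """Evaluate s(x, target) = sum over x' of T(x, x') * s(x', target) for every node x.

    transitions: {node: {neighbor: probability}}; documents: set of document nodes.
    """
    nodes = set(transitions) | set(documents)
    for neighbors in transitions.values():
        nodes |= set(neighbors)
    s = {n: 0.0 for n in nodes}
    s[target] = 1.0
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            if n in documents:
                updated[n] = 1.0 if n == target else 0.0  # assumed boundary condition
            else:
                updated[n] = sum(p * s[m] for m, p in transitions.get(n, {}).items())
        s = updated
    return s


graph = {
    "q": {"k1": 0.5, "k2": 0.5},
    "k1": {"d1": 0.8, "d2": 0.2},
    "k2": {"d1": 0.4, "k1": 0.6},
}
print(random_walk_similarity(graph, "d1", {"d1", "d2"})["q"])
```

Here s(k1, d1) = 0.8, s(k2, d1) = 0.4 + 0.6 × 0.8 = 0.88, and s(q, d1) = 0.5 × 0.8 + 0.5 × 0.88 = 0.84.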
In this application, normalized weights are used as transition probabilities. When similarity is calculated with the random walk algorithm, only the edges directed from knowledge nodes to document nodes are retained, and the edges from problem nodes to knowledge nodes are treated in the same way.
The random-walk-based similarity between a user problem node q and a document node d can be calculated in different ways. One way is based on sampling. Starting from the user problem node q, the walk moves at random to an adjacent node, using the edge weights as transition probabilities. If N walks are sampled and r of them end on document node d, the similarity between q and d is r/N. Experiments show that roughly 4,000,000 samples are needed before the node similarities converge, which means the sampling-based method is too time-consuming for online queries, where the system must respond in real time. Another way is to create a system of linear equations from the definition of random walk similarity and solve it to obtain the answer. Referring to Fig. 5, the line between each pair of nodes represents the probability of walking from one node to the adjacent node. Based on the values shown in Fig. 5, the following system of linear equations can be listed:
However, solving a system of linear equations is expensive: with Gaussian elimination the complexity is O(n^3), where n is the number of unknowns in the system. In the knowledge graph built here, the number of nodes is very large, so solving such a system is costly. To speed up the calculation, indexes can be built in advance on some of the nodes. For an indexed node, the index is a series of floating-point numbers representing the similarity of that node to every document. For example, if node x is indexed, the index of x has the form:
Index(x) = {s(x, d_0), s(x, d_1), ..., s(x, d_m)}
where m is the number of documents. Once an index has been built on a node, the similarity between that node and each document can be read directly.
For example, if indexes are built in advance on nodes v_5, v_8 and v_10, the values s(v_5, d_1) = 0.701, s(v_8, d_1) = 0.668 and s(v_10, d_1) = 0.642 can be read directly, and the system of linear equations above can be simplified as follows:
As this example shows, if indexes are built in advance on some nodes, the number of unknowns in the equations is greatly reduced (from 11 to 3).
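The effect of such an index can be sketched with a recursion that stops expanding as soon as it reaches an indexed node. The graph below is hypothetical (it is not the graph of Fig. 5); only the three index values are taken from the text. The sketch also assumes that there is no cycle among the non-indexed nodes.

```python
def similarity_with_index(node, target, transitions, index):
    """Random-walk similarity s(node, target) that reuses precomputed index entries."""
    if node == target:
        return 1.0
    if node in index:
        # Indexed node: its similarity to every document was materialized in advance.
        return index[node].get(target, 0.0)
    return sum(
        p * similarity_with_index(nxt, target, transitions, index)
        for nxt, p in transitions.get(node, {}).items()
    )


# Index values from the text; the edges out of q are invented for illustration.
index = {"v5": {"d1": 0.701}, "v8": {"d1": 0.668}, "v10": {"d1": 0.642}}
transitions = {"q": {"v5": 0.5, "v8": 0.3, "v10": 0.2}}
print(similarity_with_index("q", "d1", transitions, index))
```

Everything behind v5, v8 and v10 drops out of the calculation, which is the same reduction in unknowns that the simplified system of equations expresses.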
In the embodiments of this application, a greedy algorithm is proposed for selecting the nodes to materialize (that is, to index). In each round, the greedy algorithm selects a node that appears on many paths, because a frequent node covers more paths. The product in-degree × out-degree is used as the measure of frequency: the larger in-degree × out-degree is, the higher the frequency. In each round the node with the highest frequency is picked and added to the set of index nodes, and the frequencies of the other nodes are then recalculated; repeating this yields all the materialized nodes.
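A possible reading of this greedy procedure is sketched below. The text does not say how frequencies are recalculated after a node is chosen; this sketch assumes the chosen node's edges are removed from the graph before degrees are recomputed. The `budget` and `threshold` parameters and the alphabetical tie-break are likewise hypothetical.

```python
def select_index_nodes(edges, candidates, budget, threshold=0):
    """Greedily pick nodes with the largest in-degree * out-degree product.

    edges: set of (source, destination) pairs; candidates: nodes eligible for indexing.
    """
    remaining = set(edges)
    chosen = []
    while len(chosen) < budget:
        in_deg, out_deg = {}, {}
        for src, dst in remaining:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_deg[dst] = in_deg.get(dst, 0) + 1
        scores = {
            n: in_deg.get(n, 0) * out_deg.get(n, 0)
            for n in candidates
            if n not in chosen
        }
        if not scores:
            break
        best = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
        if scores[best] <= threshold:
            break
        chosen.append(best)
        # Assumed recalculation step: paths through the chosen node are now covered.
        remaining = {(s, d) for s, d in remaining if best not in (s, d)}
    return chosen


edges = {("q", "a"), ("q", "b"), ("a", "c"), ("b", "c"),
         ("c", "d1"), ("c", "d2"), ("a", "d1")}
print(select_index_nodes(edges, {"a", "b", "c"}, budget=3))
```

In this example c is picked first (2 × 2 = 4), then a (1 × 1 = 1 after c's edges are removed), and b is skipped because no path through it remains uncovered.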
The index-based way of calculating similarity provided by the embodiments of this application greatly reduces the amount of computation and improves computational efficiency. In addition, because the nodes to be indexed are selected by frequency, an index only needs to be set on a subset of the nodes rather than on all of them, which reduces the amount of computation further.
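For comparison with the index-based approach, the sampling-based estimate described earlier (N walks from q, of which r end on d, giving r/N) can be sketched as follows. The graph, the sample count and the explicit random generator argument are assumptions made so the example is reproducible; a walk is taken to end at the first node with no outgoing edges.

```python
import random


def sampled_similarity(start, target, transitions, samples, rng):
    """Estimate s(start, target) as the fraction of random walks that end on target."""
    hits = 0
    for _ in range(samples):
        node = start
        # Follow edges until a node without outgoing edges (a document) is reached.
        while transitions.get(node):
            neighbors = list(transitions[node])
            weights = [transitions[node][n] for n in neighbors]
            node = rng.choices(neighbors, weights=weights)[0]
        if node == target:
            hits += 1
    return hits / samples


graph = {
    "q": {"k1": 0.5, "k2": 0.5},
    "k1": {"d1": 0.8, "d2": 0.2},
    "k2": {"d1": 0.4, "k1": 0.6},
}
print(sampled_similarity("q", "d1", graph, 20000, random.Random(0)))
```

The estimate fluctuates around the exact value and tightens only slowly as the number of samples grows, which is why the text prefers precomputed indexes for online queries.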
Fig. 6 is a structural block diagram of an information retrieval device provided by an embodiment of this application. Referring to Fig. 6, the information retrieval device 600 provided by the embodiment of this application includes an interface module 601 and a processing module 602, wherein:
the interface module 601 is configured to receive an input problem to be solved;
the processing module 602 is configured to determine the technical field to which the problem to be solved belongs;
the processing module 602 is further configured to determine, according to a pre-established knowledge base for the technical field, a target document in the technical field that matches the problem to be solved, wherein the knowledge base includes problem objects, knowledge objects, document objects, the correspondence between the problem objects and the knowledge objects, and the correspondence between the knowledge objects and the document objects, each knowledge object being selected from a part of a problem object;
the interface module 601 is further configured to return the target document.
With the information retrieval device provided by the embodiments of this application, when retrieval is performed for a user's problem to be solved (that is, the user's question), not only one or more keywords in the problem but also the technical field of the problem is considered. By taking the technical field of the problem into account and using a field-specific knowledge base built in advance, the accuracy of the results retrieved for the user's problem to be solved can be greatly improved.
Optionally, the target document that matches the problem to be solved is a target document that solves the problem to be solved.
The interface module is specifically configured to: return the title of the target document and/or return content from the target document.
Optionally, the processing module 602 is specifically configured to:
determine, according to the problem objects and knowledge objects in the knowledge base and the correspondence between the problem objects and the knowledge objects, the problems in the technical field that are similar to the problem to be solved;
determine the similarity score between each similar problem and the problem to be solved; and
determine, based on the similarity scores and the target document corresponding to each similar problem, the target document that matches the problem to be solved.
Optionally, the target documents corresponding to the similar problems are used as candidate documents, and the processing module 602 is specifically configured to:
determine, based on the similarity scores, the similarity between the problem to be solved and each of the candidate documents; and
select, in descending order of similarity between the problem to be solved and the candidate documents, one or more candidate documents as the target documents that match the problem to be solved;
wherein the similarity between the problem to be solved and each of the candidate documents is determined as follows:
where q denotes the problem to be solved, d denotes a candidate document, score(q, d) denotes the similarity between the problem to be solved q and the candidate document d, #(d, C) denotes the total number of times d occurs in C, #(d, C_0) denotes the number of times d occurs in C_0, (q'_i, d) ∈ C_0 indicates that d can solve problem q'_i in C_0, and score(q'_i, q) denotes the similarity score between q'_i and q; C_0 denotes a subset of the problem log C, q' denotes a problem similar to the problem to be solved q, and
C_0 = {(q'_0, d'_0), (q'_1, d'_1), ..., (q'_m, d'_m)}, where q'_i denotes the i-th problem similar to q, m denotes the total number of problems similar to q, and d' denotes the target document corresponding to q'.
Alternatively, it is determined that after destination document, the processing module 602 is additionally operable to:
Based on Random Walk Algorithm, the problem to be solved and the phase of each document object in the knowledge base are calculated Like degree;
Based on the similarity of each document object in the problem to be solved and the knowledge base, to the multiple mesh Mark document is reordered.
Optionally, when calculating, based on the random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base, the processing module 602 is specifically configured to:
Select one or more nodes between the problem to be solved and the document objects and set an index for them, wherein the index of a node represents the similarity of that node to each document object in the knowledge base;
Based on the indexes set for the one or more nodes, calculate the similarity between the problem to be solved and each document object in the knowledge base.
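One way to read the indexing step is as a cache: the similarity of a selected node to every document object is computed once and looked up whenever a walk reaches that node. The Python sketch below is an assumption-laden illustration, not the method of the application. It assumes an acyclic graph, treats nodes without outgoing edges as document objects, and uses the probability that a forward random walk ends at a document as the similarity; `reach_probabilities` and `build_index` are hypothetical names.

```python
def reach_probabilities(graph, node, index):
    """Probability that a forward random walk from `node` ends at each document.

    graph -- dict: node -> list of out-neighbors (assumed acyclic); nodes
             without out-edges are treated as document objects
    index -- dict: node -> precomputed {document: probability}
    """
    if node in index:
        # Indexed node: reuse the stored similarities instead of walking on.
        return index[node]
    nbrs = graph.get(node, [])
    if not nbrs:
        return {node: 1.0}
    result = {}
    for nbr in nbrs:
        for doc, p in reach_probabilities(graph, nbr, index).items():
            result[doc] = result.get(doc, 0.0) + p / len(nbrs)
    return result


def build_index(graph, nodes):
    """Precompute the similarity of each selected node to every document."""
    return {n: reach_probabilities(graph, n, {}) for n in nodes}
```

With an index in place, a query only has to walk as far as the nearest indexed node on each path, which is where the saving would come from on a large knowledge base.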
Optionally, when selecting the nodes for which an index is set, the processing module 602 is specifically configured to:
Select frequent nodes on the paths and set an index for them, wherein a frequent node is a node whose product of in-degree and out-degree is greater than a threshold.
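The frequent-node criterion can be sketched as follows. The edge-list representation and the function name `frequent_nodes` are assumptions made for illustration; the application does not specify how the graph is stored or how the threshold is chosen.

```python
from collections import defaultdict


def frequent_nodes(edges, threshold):
    """Return the nodes whose in-degree times out-degree exceeds `threshold`.

    edges -- iterable of directed (source, target) pairs
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    nodes = set(in_deg) | set(out_deg)
    return {n for n in nodes if in_deg.get(n, 0) * out_deg.get(n, 0) > threshold}
```

A node with many incoming and many outgoing edges lies on many paths between problems and documents, so indexing it is likely to be reused by many queries.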
It should be noted that the division into the functional modules described above is only an example for the information retrieval device provided by the foregoing embodiment. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the information retrieval device may be divided into different functional modules to perform all or part of the functions described above. In addition, the information retrieval device provided by the foregoing embodiment and the information retrieval method embodiment belong to the same concept; for its specific implementation, refer to the method embodiment, which is not repeated here.
It should also be understood that the interface module 601 and the processing module 602 may be different modules in the same physical device. Depending on the application, the interface module 601 may be one or more physical devices distributed at different locations, and the processing module 602 may likewise be one or more physical devices distributed at different locations.
An embodiment of the present invention further provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the memory in the foregoing embodiment, or it may be a stand-alone computer-readable storage medium that is not assembled into a terminal. The computer-readable storage medium stores one or more programs, which are executed by one or more processors to perform the above information retrieval method.
Unless otherwise defined, technical or scientific terms used herein have the ordinary meaning understood by a person of ordinary skill in the art to which the application belongs. The words "first", "second", and similar terms used in the specification and claims of this patent application do not denote any order, quantity, or importance, and are used only to distinguish different components. Likewise, words such as "a" or "an" do not denote a limitation of quantity, but denote the presence of at least one. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, and may include electrical connections, whether direct or indirect.
A person of ordinary skill in the art will understand that all or part of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only example embodiments of the application and is not intended to limit the application. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the application shall fall within the protection scope of the application.

Claims (14)

1. An information retrieval method, characterized in that the method comprises:
Receiving an input problem to be solved;
Determining the technical field to which the problem to be solved belongs;
Determining, according to a pre-established knowledge base of the technical field, a destination document in the technical field that matches the problem to be solved, wherein the knowledge base comprises problem objects, knowledge objects, document objects, correspondences between the problem objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being a part selected from a problem object;
Returning the destination document.
2. The method according to claim 1, characterized in that the destination document that matches the problem to be solved is a destination document for solving the problem to be solved;
Returning the destination document comprises: returning the title of the destination document and/or returning content of the destination document.
3. The method according to claim 1, characterized in that determining the technical document in the technical field that matches the problem to be solved comprises:
Determining, according to the problem objects, the knowledge objects, and the correspondences between the problem objects and the knowledge objects in the knowledge base, problems in the technical field that are similar to the problem to be solved;
Determining a similarity score between each similar problem and the problem to be solved;
Determining, based on the similarity scores and the destination document corresponding to each similar problem, the destination document that matches the problem to be solved.
4. The method according to claim 3, characterized in that the destination document corresponding to each similar problem serves as a candidate document, and determining, based on the similarity scores and the destination document corresponding to each similar problem, the destination document that matches the problem to be solved comprises:
Determining, based on the similarity scores, the similarity between the problem to be solved and each of the candidate documents;
Selecting, in descending order of similarity between the problem to be solved and the candidate documents, one or more candidate documents as the destination documents that match the problem to be solved;
Wherein the similarity between the problem to be solved and each of the candidate documents is determined as follows:
$$\mathrm{score}(q,d) = \log \#(d,C) \times \sum_{(q'_i,d)\in C_0} \frac{\#(d,C_0)}{\#(d,C)\times i} \times \mathrm{score}(q'_i,q);$$
$q$ denotes the problem to be solved, $d$ denotes a candidate document, $\mathrm{score}(q,d)$ denotes the similarity between the problem to be solved $q$ and the candidate document $d$, $\#(d,C)$ denotes the total number of times $d$ occurs in $C$, $\#(d,C_0)$ denotes the number of times $d$ occurs in $C_0$, $(q'_i,d)\in C_0$ indicates that $d$ can solve problem $q'_i$ in $C_0$, and $\mathrm{score}(q'_i,q)$ denotes the similarity score between $q'_i$ and $q$; $C_0$ denotes a subset of the problem log $C$, $q'$ denotes a problem similar to the problem to be solved $q$, and
$C_0 = \{(q'_0,d'_0),(q'_1,d'_1),\ldots,(q'_m,d'_m)\}$, where $q'_i$ denotes the $i$-th problem similar to $q$, $m$ denotes the total number of problems similar to $q$, and $d'$ denotes the destination document corresponding to $q'$.
5. The method according to any one of claims 1-4, characterized in that, after the destination documents are determined, the method further comprises:
Calculating, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base;
Reordering the multiple destination documents based on the similarity between the problem to be solved and each document object in the knowledge base.
6. The method according to claim 5, characterized in that calculating, based on the random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base comprises:
Selecting one or more nodes between the problem to be solved and the document objects and setting an index for them, wherein the index of a node represents the similarity of that node to each document object in the knowledge base;
Calculating, based on the indexes set for the one or more nodes, the similarity between the problem to be solved and each document object in the knowledge base.
7. The method according to claim 6, characterized in that selecting the nodes for which an index is set comprises:
Selecting frequent nodes on the paths and setting an index for them, wherein a frequent node is a node whose product of in-degree and out-degree is greater than a threshold.
8. An information retrieval device, characterized in that the information retrieval device comprises:
An interface module, configured to receive an input problem to be solved;
A processing module, configured to determine the technical field to which the problem to be solved belongs;
The processing module is further configured to determine, according to a pre-established knowledge base of the technical field, a destination document in the technical field that matches the problem to be solved, wherein the knowledge base comprises problem objects, knowledge objects, document objects, correspondences between the problem objects and the knowledge objects, and correspondences between the knowledge objects and the document objects, each knowledge object being a part selected from a problem object;
The interface module is further configured to return the destination document.
9. The information retrieval device according to claim 8, characterized in that the destination document that matches the problem to be solved is a destination document for solving the problem to be solved;
The interface module is specifically configured to: return the title of the destination document and/or return content of the destination document.
10. The information retrieval device according to claim 8, characterized in that the processing module is specifically configured to:
Determine, according to the problem objects, the knowledge objects, and the correspondences between the problem objects and the knowledge objects in the knowledge base, problems in the technical field that are similar to the problem to be solved;
Determine a similarity score between each similar problem and the problem to be solved;
Determine, based on the similarity scores and the destination document corresponding to each similar problem, the destination document that matches the problem to be solved.
11. The information retrieval device according to claim 10, characterized in that the destination document corresponding to each similar problem serves as a candidate document, and the processing module is specifically configured to:
Determine, based on the similarity scores, the similarity between the problem to be solved and each of the candidate documents;
Select, in descending order of similarity between the problem to be solved and the candidate documents, one or more candidate documents as the destination documents that match the problem to be solved;
Wherein the similarity between the problem to be solved and each of the candidate documents is determined as follows:
$$\mathrm{score}(q,d) = \log \#(d,C) \times \sum_{(q'_i,d)\in C_0} \frac{\#(d,C_0)}{\#(d,C)\times i} \times \mathrm{score}(q'_i,q);$$
$q$ denotes the problem to be solved, $d$ denotes a candidate document, $\mathrm{score}(q,d)$ denotes the similarity between the problem to be solved $q$ and the candidate document $d$, $\#(d,C)$ denotes the total number of times $d$ occurs in $C$, $\#(d,C_0)$ denotes the number of times $d$ occurs in $C_0$, $(q'_i,d)\in C_0$ indicates that $d$ can solve problem $q'_i$ in $C_0$, and $\mathrm{score}(q'_i,q)$ denotes the similarity score between $q'_i$ and $q$; $C_0$ denotes a subset of the problem log $C$, $q'$ denotes a problem similar to the problem to be solved $q$, and
$C_0 = \{(q'_0,d'_0),(q'_1,d'_1),\ldots,(q'_m,d'_m)\}$, where $q'_i$ denotes the $i$-th problem similar to $q$, $m$ denotes the total number of problems similar to $q$, and $d'$ denotes the destination document corresponding to $q'$.
12. The information retrieval device according to any one of claims 8-11, characterized in that, after the destination documents are determined, the processing module is further configured to:
Calculate, based on a random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base;
Reorder the multiple destination documents based on the similarity between the problem to be solved and each document object in the knowledge base.
13. The information retrieval device according to claim 12, characterized in that, when calculating, based on the random walk algorithm, the similarity between the problem to be solved and each document object in the knowledge base, the processing module is specifically configured to:
Select one or more nodes between the problem to be solved and the document objects and set an index for them, wherein the index of a node represents the similarity of that node to each document object in the knowledge base;
Calculate, based on the indexes set for the one or more nodes, the similarity between the problem to be solved and each document object in the knowledge base.
14. The information retrieval device according to claim 13, characterized in that, when selecting the nodes for which an index is set, the processing module is specifically configured to:
Select frequent nodes on the paths and set an index for them, wherein a frequent node is a node whose product of in-degree and out-degree is greater than a threshold.
CN201710217499.5A 2017-04-05 2017-04-05 Information retrieval method and device Pending CN107122421A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710217499.5A CN107122421A (en) 2017-04-05 2017-04-05 Information retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710217499.5A CN107122421A (en) 2017-04-05 2017-04-05 Information retrieval method and device

Publications (1)

Publication Number Publication Date
CN107122421A true CN107122421A (en) 2017-09-01

Family

ID=59726211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710217499.5A Pending CN107122421A (en) 2017-04-05 2017-04-05 Information retrieval method and device

Country Status (1)

Country Link
CN (1) CN107122421A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1794240A (en) * 2006-01-09 2006-06-28 北京大学深圳研究生院 Computer information retrieval system based on natural speech understanding and its searching method
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field
CN102129477A (en) * 2011-04-23 2011-07-20 山东大学 Multimode-combined image reordering method
CN102779182A (en) * 2012-07-02 2012-11-14 吉林大学 Collaborative filtering recommendation method for integrating preference relationship and trust relationship
JP5697202B2 (en) * 2011-03-08 2015-04-08 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Method, program and system for finding correspondence of terms
CN106294505A (en) * 2015-06-10 2017-01-04 华中师范大学 A kind of method and apparatus feeding back answer
CN106372087A (en) * 2015-07-23 2017-02-01 北京大学 Information retrieval-oriented information map generation method and dynamic updating method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHUO YANG等: "Efficiently Answering Technical Questions - A Knowledge Graph Approach", 《PROCEEDINGS OF THE THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *
宋琛等: "基于随机游走相似度矩阵的改进标签传播算法", 《计算机应用与软件》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885842A (en) * 2017-11-10 2018-04-06 上海智臻智能网络科技股份有限公司 Method, apparatus, server and the storage medium of intelligent answer
CN107885842B (en) * 2017-11-10 2021-01-08 上海智臻智能网络科技股份有限公司 Intelligent question and answer method, device, server and storage medium

Similar Documents

Publication Publication Date Title
CN110046236B (en) Unstructured data retrieval method and device
WO2018049960A1 (en) Method and apparatus for matching resource for text information
US10997370B2 (en) Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
Bar-Yossef et al. Context-sensitive query auto-completion
CN110569496B (en) Entity linking method, device and storage medium
US9342590B2 (en) Keywords extraction and enrichment via categorization systems
CN109062994A (en) Recommended method, device, computer equipment and storage medium
US20160012122A1 (en) Automatically linking text to concepts in a knowledge base
US20120029908A1 (en) Information processing device, related sentence providing method, and program
JP6124917B2 (en) Method and apparatus for information retrieval
CN109948121A (en) Article similarity method for digging, system, equipment and storage medium
CN103455487B (en) The extracting method and device of a kind of search term
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
US10635733B2 (en) Personalized user-categorized recommendations
JP2009093650A (en) Selection of tag for document by paragraph analysis of document
US20180032608A1 (en) Flexible summarization of textual content
CN112732870B (en) Word vector based search method, device, equipment and storage medium
WO2020125015A1 (en) Contextualized merchant recall
He et al. Twitter summarization with social-temporal context
CN112487161A (en) Enterprise demand oriented expert recommendation method, device, medium and equipment
CN107765883A (en) The sort method and sequencing equipment of candidate's word of input method
US20220164546A1 (en) Machine Learning Systems and Methods for Many-Hop Fact Extraction and Claim Verification
CN112988971A (en) Word vector-based search method, terminal, server and storage medium
CN107122421A (en) Information retrieval method and device
Gupta et al. Songs recommendation using context-based semantic similarity between lyrics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170901