CN112732926A - Text retrieval method and device, computer equipment and storage medium - Google Patents

Text retrieval method and device, computer equipment and storage medium

Info

Publication number
CN112732926A
Authority
CN
China
Prior art keywords
document
node
neighbor
direct
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011637716.4A
Other languages
Chinese (zh)
Inventor
王昊
张乐情
罗水权
刘剑
李燕婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Asset Management Co Ltd
Original Assignee
Ping An Asset Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Asset Management Co Ltd filed Critical Ping An Asset Management Co Ltd
Priority to CN202011637716.4A priority Critical patent/CN112732926A/en
Publication of CN112732926A publication Critical patent/CN112732926A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The application relates to artificial intelligence and provides a text retrieval method, a text retrieval device, computer equipment and a storage medium. The method comprises: extracting an entity from a retrieval sentence; querying a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node; acquiring the document neighbor nodes of the query node to obtain direct documents; and obtaining a retrieval result according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. Because the knowledge graph is constructed with domain attributes as main nodes and the attribute values of the domain attributes are entities, associated entities are connected. After the document neighbor nodes of the entity are retrieved, a multi-order neighbor query is performed by traversing the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. This enlarges the query range, associates the retrieval result with other domain attributes, reflects the relevance among domain attributes, and can improve the accuracy of the retrieval result.

Description

Text retrieval method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of knowledge graph technology, and in particular, to a text retrieval method, apparatus, computer device, and storage medium.
Background
Text retrieval refers to searching a text collection according to text content, such as keywords or semantics, and returning the target content. Retrieval precision is reflected in the degree of matching between the search terms and the target content. With the rapid development of internet big data, users' requirements for retrieval precision keep rising.
A conventional retrieval method is usually implemented as follows: 1. a text database is prepared, which generally contains a number of texts, text segments or sentences; 2. after the texts are segmented into characters or words, an inverted index is built over those characters and words; 3. the text database is bucketed or hashed according to the inverted index. Results returned in this way are based on character frequency or word frequency, so they are rigid and do not match human intuition. For example, if a company's legal representative is used as the keyword, texts about the legal representative can be found, but texts about the company's main business products are not returned. From the standpoint of human intuition, however, the legal representative and the company's main business products are related, and so are the data about them. If only documents matching the keyword are returned, the retrieval result does not meet the user's retrieval needs.
That is, the conventional retrieval method suffers from the technical problem of low retrieval precision.
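The limitation of the conventional approach can be illustrated with a minimal sketch of an inverted index. The sample texts, the whitespace tokenizer and the function name `build_inverted_index` are assumptions made for illustration only; they are not part of the application.

```python
from collections import defaultdict

def build_inverted_index(texts):
    """Map each lowercase token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical sample texts about the same company.
texts = {
    "d1": "acme legal representative appointed",
    "d2": "acme main product launched",
}
index = build_inverted_index(texts)

# A keyword lookup returns only the texts that literally contain the keyword.
hits = sorted(index["representative"])
```

Here `hits` contains only `d1`: the text about the company's main product is never returned for this keyword, even though a human reader would consider it related.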
Disclosure of Invention
In view of the above, it is necessary to provide a text retrieval method, apparatus, computer device and storage medium capable of improving retrieval accuracy.
A text retrieval method, the method comprising:
acquiring a retrieval sentence input by a user;
performing entity recognition on the retrieval sentence and extracting an entity from the retrieval sentence;
querying a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node includes:
acquiring the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node;
and sorting the direct documents by number of occurrences from high to low to obtain the text retrieval result of the retrieval sentence.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node includes:
traversing the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node, and adding a preset recommendation score to a direct document each time a traversed document neighbor node is that direct document;
and sorting the direct documents by recommendation score from high to low, and taking the first K direct documents as the text retrieval result of the retrieval sentence.
In one embodiment, if the document neighbor node is a direct document, the preset recommendation score is added to the direct document according to an order weight, wherein the order weight is inversely related to the order.
In one embodiment, the knowledge graph is constructed as follows:
for at least two associated domain attributes, taking the attribute values of the domain attributes as main nodes, wherein main nodes whose domain attributes are associated are connected, and the attribute value of each domain attribute is an entity;
obtaining documents related to the entity from a text database, and associating the documents with the main node as document neighbor nodes of the main node;
and obtaining the knowledge graph from the main nodes and the document neighbor nodes of the main nodes.
In one embodiment, obtaining documents related to the entity from a text database, and associating the documents with the main node as document neighbor nodes of the main node includes:
analyzing each text in the database by a word frequency analysis method, obtaining the degree of relevance between the text and the entity according to how frequently the text contains the entity, and taking texts whose degree of relevance is greater than a threshold as document neighbor nodes of the corresponding main node.
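The word frequency analysis described above can be sketched as follows. The application does not specify the relevance formula, so the measure used here (occurrences of the entity name per 100 characters of text), the threshold value and the names `relevance` and `link_documents` are all assumptions for illustration.

```python
def relevance(text, entity):
    """Assumed measure: occurrences of the entity name per 100 characters of text."""
    if not text:
        return 0.0
    return 100.0 * text.count(entity) / len(text)

def link_documents(texts, entities, threshold):
    """Attach each text whose relevance exceeds the threshold to the entity's main node."""
    links = {entity: [] for entity in entities}
    for doc_id, text in texts.items():
        for entity in entities:
            if relevance(text, entity) > threshold:
                links[entity].append(doc_id)
    return links

# Hypothetical text database and entity.
texts = {
    "doc1": "Acme Tech announced that Acme Tech will expand.",
    "doc2": "Weather was mild this week.",
}
links = link_documents(texts, ["Acme Tech"], threshold=1.0)
```

With these sample texts only `doc1` becomes a document neighbor node of the "Acme Tech" main node; a production system would more likely use a normalized measure such as TF-IDF, which the application does not rule out.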
In one embodiment, the method further comprises:
returning the text retrieval result and knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1st- to Nth-order neighbor main nodes of the query node.
A text retrieval device, the device comprising:
a retrieval sentence acquisition module, configured to acquire a retrieval sentence input by a user;
an entity extraction module, configured to perform entity recognition on the retrieval sentence and extract an entity from the retrieval sentence;
a query node acquisition module, configured to query a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node;
a document acquisition module, configured to acquire the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and a retrieval module, configured to determine the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a retrieval sentence input by a user;
performing entity recognition on the retrieval sentence and extracting an entity from the retrieval sentence;
querying a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
A computer storage medium having a computer program stored thereon, the computer program implementing the following steps when executed by a processor:
acquiring a retrieval sentence input by a user;
performing entity recognition on the retrieval sentence and extracting an entity from the retrieval sentence;
querying a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
With the above text retrieval method, device, computer equipment and storage medium, an entity is extracted from the retrieval sentence, a pre-constructed knowledge graph is queried for the main node corresponding to the entity to obtain a query node, the document neighbor nodes of the query node are acquired to obtain direct documents, and the retrieval result is obtained according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. Because the knowledge graph is constructed with domain attributes as main nodes and the attribute values of the domain attributes are entities, associated entities are connected. After the document neighbor nodes of the entity are retrieved, a multi-order neighbor query is performed by traversing the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. This enlarges the query range, associates the retrieval result with other domain attributes, reflects the relevance among domain attributes, and can improve the accuracy of the retrieval result.
Drawings
FIG. 1 is a diagram of an application scenario of a text retrieval method in one embodiment;
FIG. 2 is a flow diagram that illustrates a method for text retrieval in one embodiment;
FIG. 3 is a diagram illustrating the structure of a knowledge-graph in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating the steps of constructing a knowledge-graph in another embodiment;
FIG. 5 is a block diagram showing the structure of a text retrieval device according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text retrieval method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A user inputs a retrieval sentence through the terminal 102, and the server 104 acquires the retrieval sentence input by the user; performs entity recognition on the retrieval sentence and extracts an entity from it; queries a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node; acquires the document neighbor nodes of the query node in the knowledge graph to obtain direct documents; and determines the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in FIG. 2, a text retrieval method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
Step 202, acquiring a retrieval sentence input by a user.
The retrieval sentence expresses the user's retrieval target and usually includes a retrieval keyword. For example, in the retrieval sentence "I want to listen to a song by Zhou So-and-so", "Zhou So-and-so" is the keyword.
Step 204, performing entity recognition on the retrieval sentence and extracting an entity from the retrieval sentence.
Specifically, a named entity recognition method is used to recognize and extract the entities in the retrieval sentence. Named Entity Recognition (NER), also called "proper name recognition", refers to recognizing entities with specific meaning in text, mainly including person names, place names, organization names and other proper nouns. It generally comprises two parts: (1) identifying entity boundaries; (2) determining entity categories (person name, place name, organization name, or other).
Specifically, after the backend receives the retrieval sentence input by the user, named entity recognition is performed on it, and the recognized categories are the domain attribute categories in the domain database, such as company name, product name and person name. For example, if the user inputs the retrieval sentence "Introduction to Ping An of China", the entity recognized is "Ping An of China".
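As a rough illustration of step 204, the sketch below extracts entities by dictionary lookup. This is a deliberately simplified stand-in: the application calls for a named entity recognition method, which in practice is typically a trained model, and the gazetteer contents and the function name `extract_entities` are hypothetical.

```python
# Hypothetical gazetteer keyed by domain attribute category.
GAZETTEER = {
    "company name": ["A Certain Technology Company", "A Certain Insurance Company"],
    "person name": ["Zhang San", "Li Si"],
}

def extract_entities(sentence):
    """Return (entity, category) pairs for every gazetteer entry found in the sentence."""
    found = []
    for category, names in GAZETTEER.items():
        for name in names:
            if name in sentence:
                found.append((name, category))
    return found

entities = extract_entities("Introduction to A Certain Technology Company")
```

The returned category corresponds to the domain attribute category, which is what allows the entity to be matched against main nodes in the next step.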
Step 206, querying a pre-constructed knowledge graph for the main node corresponding to the entity to obtain a query node, wherein the knowledge graph is constructed with domain attributes as main nodes, the attribute value of each domain attribute is an entity, and the neighbor nodes of a main node comprise document neighbor nodes related to the content of the main node.
Specifically, a domain attribute is an abstract generalization that can be used to describe a domain, that is, a feature that characterizes an object along one dimension. Taking company information as an example, the company name and the legal representative are both features, describing a company along the name dimension and the legal representative dimension respectively. The attribute value of a domain attribute is an entity with specific domain content. For example, if the name of a company is "a certain technology company", then "company name" is the domain attribute and the entity "a certain technology company" is the attribute value of that domain attribute.
The neighbor nodes of a main node comprise other main nodes whose domain attributes are associated with it and document neighbor nodes related to the content of the main node. As shown in FIG. 3, taking the main node "a certain technology company" as an example, the domain attribute to which it belongs is the company name, and the domain attributes associated with the company name include the main product name, the legal representative and the subsidiary name; the domain attributes associated with the subsidiary name include the subsidiary's legal representative.
Still taking the main node "a certain technology company" as an example, its neighbor main nodes include the company's legal representative "Zhang San", the company's main product "a certain insurance", and the associated subsidiary "a certain insurance company"; the neighbor main nodes of "a certain insurance company" include the subsidiary's legal representative "Li Si". Each main node has document neighbor nodes.
Querying the pre-constructed knowledge graph for the main node corresponding to the entity specifically means matching the entity against the attribute values of the main nodes in the knowledge graph and taking the matched main node as the query node. For example, if the extracted entity is "a certain technology company", the corresponding main node is taken as the query node.
Step 208, acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents.
Specifically, the document neighbor nodes of the query node are taken as the direct documents. The neighbor nodes of a main node include documents related to the content of the main node; that is, the document nodes among the neighbor nodes of a main node are the documents related to its content. Taking the main node "a certain technology company" in FIG. 3 as an example, its document neighbor nodes are document 1, document 2 and document 7, which are documents whose content relates to that company. Document 1, document 2 and document 7 are therefore taken as the direct documents.
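Steps 206 and 208 can be sketched together as a lookup by attribute value followed by a read of the matched node's document neighbors. The dictionary layout and the function names are assumptions; only the node names and the documents of "a certain technology company" and "a certain insurance company" follow the FIG. 3 description.

```python
# Main nodes keyed by attribute value; each records its domain attribute and document neighbors.
MAIN_NODES = {
    "a certain technology company": {
        "attribute": "company name",
        "documents": ["document 1", "document 2", "document 7"],
    },
    "a certain insurance company": {
        "attribute": "subsidiary name",
        "documents": ["document 1", "document 5"],
    },
}

def find_query_node(entity):
    """Step 206: match the entity against main-node attribute values."""
    return entity if entity in MAIN_NODES else None

def direct_documents(query_node):
    """Step 208: the document neighbor nodes of the query node are the direct documents."""
    if query_node is None:
        return []
    return list(MAIN_NODES[query_node]["documents"])

query_node = find_query_node("a certain technology company")
direct = direct_documents(query_node)
```

An entity that matches no main node yields no query node and therefore no direct documents; how such a case should be handled is not stated in the application.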
Step 210, determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
A neighbor main node is a neighbor node whose node type is main node. The knowledge graph is constructed with domain attributes as main nodes, and the neighbor nodes of a main node include both the documents related to the content of the main node, namely document nodes, and the main nodes of other domain attributes associated with the main node. As shown in FIG. 3, with "a certain technology company" as the query node, documents 1, 2 and 7 are its document neighbor nodes, and "Zhang San", "a certain insurance" and "a certain insurance company" are its neighbor main nodes.
A 1st-order neighbor node is a neighbor node of a given node, and an Nth-order neighbor node is a neighbor node of an (N-1)th-order neighbor node; for example, a 2nd-order neighbor node is a neighbor node of a 1st-order neighbor node. The document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node are the nodes of document type among the neighbor nodes of those main nodes. In FIG. 3, with "a certain technology company" as the query node, "Zhang San", "a certain insurance" and "a certain insurance company" are its 1st-order neighbor main nodes; taking the neighbor main node "a certain insurance company" as an example, document 1 and document 5 are its document neighbor nodes. "Li Si" is a neighbor main node of "a certain insurance company", and "a certain insurance company" is a 1st-order neighbor main node of the query node, so "Li Si" is a 2nd-order neighbor main node of the query node "a certain technology company". Since the document neighbor nodes of "Li Si" are document 1 and document 6, the document neighbor nodes of the 2nd-order neighbor main node of the query node are document 1 and document 6.
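The notion of 1st- to Nth-order neighbor main nodes can be sketched as a breadth-first search over the main-node edges. The adjacency below follows the FIG. 3 description; the function name and the choice to count each main node only at its shortest distance (so the query node is never its own 2nd-order neighbor) are assumptions.

```python
from collections import deque

# Undirected main-node edges reconstructed from the FIG. 3 description.
MAIN_EDGES = {
    "a certain technology company": ["Zhang San", "a certain insurance", "a certain insurance company"],
    "Zhang San": ["a certain technology company"],
    "a certain insurance": ["a certain technology company"],
    "a certain insurance company": ["a certain technology company", "Li Si"],
    "Li Si": ["a certain insurance company"],
}

def neighbor_main_nodes_by_order(query_node, max_order):
    """Return {order: [main nodes]} for orders 1..max_order using breadth-first search."""
    by_order = {order: [] for order in range(1, max_order + 1)}
    seen = {query_node}
    queue = deque([(query_node, 0)])
    while queue:
        node, order = queue.popleft()
        if order == max_order:
            continue  # do not expand beyond the requested order
        for neighbor in MAIN_EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                by_order[order + 1].append(neighbor)
                queue.append((neighbor, order + 1))
    return by_order

orders = neighbor_main_nodes_by_order("a certain technology company", 2)
```

For the FIG. 3 graph this places "Zhang San", "a certain insurance" and "a certain insurance company" at order 1 and "Li Si" at order 2, matching the description above.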
Specifically, if a direct document appears among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node, the direct document is also related to the content of those neighbor main nodes. That is, the document relates both to the main node corresponding to the entity and to other main nodes whose domain attributes are associated with that main node. Taking FIG. 3 as an example, document 1, document 2 and document 7 are the direct documents of the query node "a certain technology company"; document 1, document 2, document 3, document 4 and document 5 are the document neighbor nodes of the 1st-order neighbor main nodes, where document 1 is a document neighbor node of all three 1st-order neighbor main nodes, and documents 2, 3, 4 and 5 are each a document neighbor node of one 1st-order neighbor main node. Documents 1 and 6 are the document neighbor nodes of the 2nd-order neighbor main node. Thus, among the document neighbor nodes of the 1st-order neighbor main nodes, document 1 appears 3 times and document 2 appears 1 time; among the document neighbor nodes of the 2nd-order neighbor main node, document 1 appears 1 time. Document 7 appears 0 times among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes.
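The counting in the FIG. 3 example can be reproduced with a short sketch. The text states only that documents 2, 3 and 4 each belong to one 1st-order neighbor main node, so their assignment to "Zhang San" and "a certain insurance" below is an assumption, as is the function name `count_occurrences`.

```python
DIRECT_DOCUMENTS = ["document 1", "document 2", "document 7"]

# Document neighbor nodes of the neighbor main nodes, grouped by order (N = 2).
# The placement of documents 2, 3 and 4 is assumed; it does not affect the totals.
DOCUMENTS_BY_ORDER = {
    1: {
        "Zhang San": ["document 1", "document 2"],
        "a certain insurance": ["document 1", "document 3", "document 4"],
        "a certain insurance company": ["document 1", "document 5"],
    },
    2: {
        "Li Si": ["document 1", "document 6"],
    },
}

def count_occurrences(direct_documents, documents_by_order):
    """Count how often each direct document appears among the neighbors' document nodes."""
    counts = {doc: 0 for doc in direct_documents}
    for nodes in documents_by_order.values():
        for documents in nodes.values():
            for doc in documents:
                if doc in counts:  # non-direct documents (3, 4, 5, 6) are ignored
                    counts[doc] += 1
    return counts

counts = count_occurrences(DIRECT_DOCUMENTS, DOCUMENTS_BY_ORDER)
```

The result is 4 for document 1 (3 at order 1 plus 1 at order 2), 1 for document 2 and 0 for document 7.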
In a specific application, the direct documents can be sorted by number of occurrences to obtain the text retrieval result.
With the above text retrieval method, an entity is extracted from the retrieval sentence, a pre-constructed knowledge graph is queried for the main node corresponding to the entity to obtain a query node, the document neighbor nodes of the query node are acquired to obtain direct documents, and the retrieval result is obtained according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. Because the knowledge graph is constructed with domain attributes as main nodes and the attribute values of the domain attributes are entities, associated entities are connected. After the document neighbor nodes of the entity are retrieved, a multi-order neighbor query is performed by traversing the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node. This enlarges the query range, associates the retrieval result with other domain attributes, reflects the relevance among domain attributes, and can improve the accuracy of the retrieval result.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node includes: acquiring the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node; and sorting the direct documents by number of occurrences from high to low to obtain the text retrieval result of the retrieval sentence.
Specifically, the direct documents may be sorted by their number of occurrences among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node, and the text retrieval result of the retrieval sentence is determined from the direct documents in that order, from top to bottom. Taking FIG. 3 as an example, the direct document "document 1" appears 4 times among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes, the direct document "document 2" appears 1 time, and the direct document "document 7" appears 0 times. Sorting the three gives the order document 1, document 2, document 7, and the sorted documents are returned as the text retrieval result of the retrieval sentence.
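The ranking in this embodiment reduces to a descending sort on the occurrence counts, as in the sketch below. The function name is hypothetical, and keeping tied documents in their original order is an assumption, since the application does not say how ties are broken.

```python
def rank_by_occurrences(counts):
    """Sort direct documents by occurrence count, highest first (ties keep input order)."""
    return sorted(counts, key=counts.get, reverse=True)

# Occurrence counts from the FIG. 3 example, deliberately given out of order.
counts = {"document 7": 0, "document 1": 4, "document 2": 1}
ranked = rank_by_occurrences(counts)
```

Here `ranked` is document 1, document 2, document 7, the order given in the example above.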
Specifically, the more times a direct document appears among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes, the more domain attributes it is related to and the more entity content of those domain attributes it covers; such a document is more relevant and better meets the user's retrieval needs.
In another embodiment, determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node includes: traversing the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node, and adding a preset recommendation score to a direct document each time a traversed document neighbor node is that direct document; and sorting the direct documents by recommendation score from high to low, and taking the first K direct documents as the text retrieval result of the retrieval sentence.
Specifically, starting from the query node, its 1st- to Nth-order neighbor main nodes are found and their document neighbor nodes are returned; if a document neighbor node is a direct document, a preset recommendation score is added to that direct document. For example, if document 1, document 2 and document 7 are the direct documents of the query node "a certain technology company", the direct document "document 1" appears 4 times among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes, the direct document "document 2" appears 1 time, and the direct document "document 7" appears 0 times. Each time a direct document is encountered during traversal, the preset recommendation score is added to it.
In one embodiment, the preset recommendation score is fixed. For example, one point is added each time the direct document is encountered during traversal.
In one embodiment, if the document neighbor node is a direct document, the preset recommendation score is added to the direct document according to an order weight, wherein the order weight is inversely related to the order: the larger the order, the smaller the order weight, and the smaller the order, the larger the order weight. Here the order is the order of the 1st- to Nth-order neighbor main node. Because associated main nodes are connected in the knowledge graph, a smaller order generally indicates a closer relation to the query node. In general, therefore, the documents of neighbor main nodes that are closer to the query node are more relevant to the query node, and this embodiment gives a larger weight to the document neighbor nodes of lower-order neighbor main nodes. For example, if the direct document "document 1" appears 3 times among the document neighbor nodes of the 1st-order neighbor main nodes, with the 1st-order weight set to a, and 1 time among the document neighbor nodes of the 2nd-order neighbor main node, with the 2nd-order weight set to b, the recommendation score of document 1 is 3a + b. The direct documents are then sorted by recommendation score from high to low, and the first K direct documents are taken as the text retrieval result of the retrieval sentence.
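The order-weighted score and the top-K selection can be sketched as follows. The weight values a = 1.0 and b = 0.5 are assumptions chosen only to satisfy the requirement that the weight decreases with order, and the function names are hypothetical.

```python
def recommendation_scores(occurrences_by_order, order_weights):
    """Score each direct document as the sum over orders of weight * occurrences."""
    return {
        doc: sum(order_weights[order] * times for order, times in per_order.items())
        for doc, per_order in occurrences_by_order.items()
    }

def top_k(scores, k):
    """Return the k highest-scoring direct documents."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Occurrences per order from the FIG. 3 example: {document: {order: times}}.
OCCURRENCES = {
    "document 1": {1: 3, 2: 1},
    "document 2": {1: 1, 2: 0},
    "document 7": {1: 0, 2: 0},
}
ORDER_WEIGHTS = {1: 1.0, 2: 0.5}  # assumed values for a and b, with a > b

scores = recommendation_scores(OCCURRENCES, ORDER_WEIGHTS)
results = top_k(scores, 2)
```

With these weights document 1 scores 3a + b = 3.5, document 2 scores 1.0 and document 7 scores 0, so the top two results are document 1 and document 2. Setting every weight to 1 recovers the fixed-score embodiment.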
In a specific application, taking both search performance and retrieval effect into account, N may be set to 2, that is, only the document neighbor nodes of the 1st- and 2nd-order neighbor main nodes are searched.
In another embodiment, as shown in FIG. 4, the manner of constructing the knowledge-graph includes:
S402, according to at least two associated domain attributes, taking the attribute values of the domain attributes as main nodes, wherein main nodes whose domain attributes are associated are connected, and each attribute value of a domain attribute is an entity.
A domain attribute is an abstract summary used to describe a domain, and represents a characteristic of an object along one dimension. Taking company information as an example, the company name and the legal representative are both features that describe a company, from the name dimension and the legal-representative dimension respectively. The attribute value of a domain attribute is an entity, that is, a specific piece of domain content. For example, if the name of a company is "a technology company", this corresponds to the domain attribute "company name", and the entity "a technology company" is the attribute value of that domain attribute. Associated domain attributes provide a basis for expanding the query range.
A domain data set is acquired from domain-related channels, and a domain knowledge graph is constructed. Each node of the knowledge graph is a domain entity, and domain entities that have a relationship are connected by edges. Taking company information as an example, the entities of the knowledge graph include company names, the companies' main products, their legal representatives, and the like, wherein a company's main products and legal representative are connected to the company name.
S404, obtaining the document related to the entity from the text database, and associating the document to the main node to be used as the document neighbor node of the main node.
Specifically, the document neighbor node stores only the document ID and not the document content, which reduces the size of the knowledge graph and improves retrieval efficiency.
Specifically, documents related to the entity are obtained from the text database, for example, for a node of "a technology company", documents related to the entity can be obtained from the text database as its document neighbor nodes.
Specifically, each text in the database is analyzed by a word frequency analysis method, the relevance between the text and the entity is obtained according to how frequently the entity occurs in the text, and each text whose relevance is greater than a threshold is used as a document neighbor node of the corresponding main node.
The word frequency analysis method may be TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining. TF is the term frequency and IDF is the inverse document frequency. The relevance between a text and an entity is obtained from the frequency with which the entity occurs in the text. Generally, the more frequently an entity appears in a text, the greater the relevance of the text to the entity. Each text whose relevance is greater than the threshold is used as a document neighbor node of the main node corresponding to the entity.
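A minimal sketch of the thresholding step is given below. It assumes that each text has already been segmented into tokens and that the entity corresponds to a single token; the smoothed IDF formula, the threshold value and the names `tf_idf` and `document_neighbors` are choices made for this illustration and are not specified in the text.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of one term in one document, with a smoothed IDF (assumed variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens) if doc_tokens else 0.0
    df = sum(1 for tokens in corpus.values() if term in tokens)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # always positive
    return tf * idf


def document_neighbors(entity, corpus, threshold):
    """Return the IDs of texts whose relevance to the entity exceeds the threshold."""
    return [
        doc_id
        for doc_id, tokens in corpus.items()
        if tf_idf(entity, tokens, corpus) > threshold
    ]


# Hypothetical pre-tokenized text database.
corpus = {
    "doc1": ["acme", "releases", "acme", "phone"],
    "doc2": ["market", "report", "mentions", "acme"],
    "doc3": ["weather", "forecast", "today", "sunny"],
}

neighbors = document_neighbors("acme", corpus, threshold=0.5)
print(neighbors)
```

Only the returned document IDs would be attached to the main node, in line with storing IDs rather than content in the graph.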
S406, obtaining the knowledge graph according to the main node and the document neighbor node of the main node.
Specifically, the knowledge graph is obtained according to the connection relationships between the main nodes and the connection relationships between each main node and its document neighbor nodes.
In this embodiment, the relations between texts and entities and the relations between entities are mined in advance, and the knowledge graph is constructed to provide a basis for text retrieval.
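Steps S402 to S406 can be sketched with a simple in-memory structure. The class `KnowledgeGraph`, the function `build_graph`, the record layout with a `company_name` key, and the sample values are hypothetical and serve only to illustrate the description; a production system would more likely use a graph database.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    # main node -> set of connected main nodes (entities)
    main_edges: dict = field(default_factory=lambda: defaultdict(set))
    # main node -> set of document IDs (document neighbor nodes)
    doc_edges: dict = field(default_factory=lambda: defaultdict(set))

    def connect(self, entity_a, entity_b):
        """Connect two main nodes whose domain attributes are associated."""
        self.main_edges[entity_a].add(entity_b)
        self.main_edges[entity_b].add(entity_a)

    def attach_document(self, entity, doc_id):
        """Store only the document ID, not the content."""
        self.doc_edges[entity].add(doc_id)


def build_graph(records, related_docs):
    graph = KnowledgeGraph()
    # S402: attribute values become main nodes, linked to the company name.
    for record in records:
        company = record["company_name"]
        for attribute, value in record.items():
            if attribute != "company_name":
                graph.connect(company, value)
    # S404: attach related documents as document neighbor nodes.
    for entity, doc_ids in related_docs.items():
        for doc_id in doc_ids:
            graph.attach_document(entity, doc_id)
    # S406: the graph is the combination of both kinds of edges.
    return graph


records = [
    {
        "company_name": "a technology company",
        "legal_representative": "Person X",
        "main_product": "Product Y",
    }
]
related_docs = {
    "a technology company": ["doc1", "doc2"],
    "Person X": ["doc1"],
}
graph = build_graph(records, related_docs)
print(sorted(graph.main_edges["a technology company"]))
```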
In another embodiment, after the step of obtaining the text retrieval result, the method further includes: and returning the text retrieval result and the knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1 to N-order neighbor main nodes of the query node.
Specifically, in this embodiment, in addition to the text of the query result, the knowledge graph information is also returned to the user, namely the query node, its neighbor nodes, and the relationships between them. For example, when documents related to a certain technology company are returned and displayed, fields such as the company's legal representative, its main products and its associated subsidiaries are displayed at the same time, which increases the interpretability of the returned documents.
It should be understood that although the steps in the flowcharts of fig. 2 and 4 are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a text retrieval apparatus including:
a retrieval sentence obtaining module 502, configured to obtain a retrieval sentence input by a user;
an entity extraction module 504, configured to perform entity recognition on the retrieval sentence and extract an entity in the retrieval sentence;
a query node obtaining module 506, configured to query, according to the entity, the corresponding main node in a pre-constructed knowledge graph to obtain a query node; the knowledge graph is constructed by taking a domain attribute as a main node, the attribute value of the domain attribute is an entity, and the neighbor nodes of the main node comprise document neighbor nodes related to the content of the main node;
a document obtaining module 508, configured to obtain the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
a retrieval module 510, configured to determine a text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
The text retrieval device extracts entities aiming at the retrieval sentences, queries the main nodes corresponding to the pre-constructed knowledge graph to obtain query nodes, obtains document neighbor nodes of the query nodes to obtain direct documents, and obtains retrieval results according to the occurrence frequency of the direct documents in the document neighbor nodes of the 1-N-order neighbor main nodes of the query nodes. Because the domain attributes are constructed as the main nodes in the knowledge graph, and the attribute values of the domain attributes are entities, the associated entities are connected, after the document neighbor nodes of the entities are retrieved, the multi-order neighbor query is carried out by traversing the document neighbor nodes of the 1 to N-order neighbor main nodes of the query nodes, so that the range of the multi-order query is enlarged, the retrieval result can be associated with other domain attributes, the relevance among the domain attributes is embodied, and the accuracy of the retrieval result can be improved.
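The cooperation of the modules can be sketched end to end as follows. This is a simplified illustration under stated assumptions: entity recognition is replaced by a plain substring match against the known main nodes (the text does not specify the recognition method here), and the function names, parameters and sample data are hypothetical.

```python
def extract_entities(sentence, known_entities):
    """Stand-in for entity recognition: substring match against main nodes."""
    return [entity for entity in known_entities if entity in sentence]


def retrieve(sentence, main_edges, doc_edges, max_order=2, top_k=3):
    """Rank the direct documents of each query node by occurrence count."""
    results = {}
    for query in extract_entities(sentence, main_edges):
        counts = {doc: 0 for doc in doc_edges.get(query, ())}  # direct documents
        seen, frontier = {query}, [query]
        for _ in range(max_order):  # orders 1..max_order
            next_frontier = []
            for node in frontier:
                for neighbor in main_edges.get(node, ()):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
            for neighbor in next_frontier:
                for doc in doc_edges.get(neighbor, ()):
                    if doc in counts:
                        counts[doc] += 1
            frontier = next_frontier
        ranked = sorted(counts, key=counts.get, reverse=True)
        results[query] = ranked[:top_k]
    return results


# Hypothetical graph: one company with a legal representative and a product.
main_edges = {
    "Acme Tech": ["Person X", "Product Y"],
    "Person X": ["Acme Tech"],
    "Product Y": ["Acme Tech"],
}
doc_edges = {
    "Acme Tech": ["doc1", "doc2", "doc7"],
    "Person X": ["doc2"],
    "Product Y": ["doc2", "doc1"],
}

results = retrieve("news about Acme Tech", main_edges, doc_edges, top_k=2)
print(results)
```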
In another embodiment, the retrieval module is configured to obtain the occurrence number of the direct document in a document neighbor node of a 1 to N-th order neighbor master node of the query node; and sequencing the direct documents according to the occurrence times from high to low to obtain the text retrieval result of the retrieval sentence.
In another embodiment, the search module is configured to traverse document neighbor nodes of 1 to N-order neighbor master nodes of the query node, and if the document neighbor nodes are the direct documents, add a preset recommendation score to the direct documents; and sequencing the direct documents from high to low according to the recommendation score, and taking the first K direct documents as text retrieval results of the retrieval sentences.
If the document neighbor node is the direct document, adding a preset recommendation score to the direct document according to the order weight; wherein the order weight is inversely related to the order.
In another embodiment, the text retrieval apparatus further includes:
a main node module, configured to take, according to at least two associated domain attributes, the attribute values of the domain attributes as main nodes, wherein main nodes whose domain attributes are associated are connected, and each attribute value of a domain attribute is an entity;
the document neighbor node module is used for acquiring documents related to the entity from the text database, and associating the documents to the main node to be used as the document neighbor nodes of the main node;
and the construction module is used for obtaining the knowledge graph according to the main node and the document neighbor node of the main node.
In another embodiment, the document neighbor node module is configured to analyze each text in the database by using a word frequency analysis method, obtain a correlation degree between the text and the entity according to a word frequency containing relationship between the text and the entity, and use the text with the correlation degree greater than a threshold value as the document neighbor node corresponding to the master node.
In another embodiment, the text retrieval apparatus further includes:
and the information processing module is used for returning the text retrieval result and the knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1-to-N-order neighbor main nodes of the query node.
For the specific limitations of the text retrieval device, reference may be made to the above limitations of the text retrieval method, which are not repeated here. The modules in the text retrieval device can be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store text data and a knowledge-graph. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text retrieval method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory storing a computer program and a processor implementing the following steps when the processor executes the computer program:
acquiring a retrieval sentence input by a user;
carrying out entity identification on the retrieval sentence, and extracting an entity in the retrieval sentence;
querying, according to the entity, the corresponding main node in a pre-constructed knowledge graph to obtain a query node; the knowledge graph is constructed by taking a domain attribute as a main node, the attribute value of the domain attribute is an entity, and the neighbor nodes of the main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct document according to the occurrence number of the direct document in the document neighbor nodes of the 1 to N-order neighbor host nodes of the query node includes:
acquiring the occurrence frequency of the direct document in a document neighbor node of a 1-N order neighbor main node of the query node;
and sequencing the direct documents according to the occurrence times from high to low to obtain the text retrieval result of the retrieval sentence.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct document according to the occurrence number of the direct document in the document neighbor nodes of the 1 to N-order neighbor host nodes of the query node includes:
traversing the document neighbor nodes of the 1-N order neighbor main nodes of the query node, and if the document neighbor nodes are the direct documents, adding a preset recommendation score to the direct documents;
and sequencing the direct documents from high to low according to the recommendation score, and taking the first K direct documents as text retrieval results of the retrieval sentences.
In one embodiment, if the document neighbor node is the direct document, adding a preset recommendation score to the direct document according to the order weight; wherein the order weight is inversely related to the order.
In one embodiment, the manner of constructing the knowledge-graph includes:
according to at least two associated domain attributes, taking an attribute value of the domain attribute as a main node, wherein the domain attribute has associated main node connection, and the attribute value of the domain attribute is an entity;
obtaining a document related to an entity from a text database, and associating the document to a main node to be used as a document neighbor node of the main node;
and obtaining the knowledge graph according to the main node and the document neighbor node of the main node.
In one embodiment, obtaining a document associated with an entity from a text database, and associating the document with a host node as a document neighbor node of the host node includes:
and analyzing each text in the database by using a word frequency analysis method, obtaining the correlation degree of the text and the entity according to the word frequency containing relation of the text and the entity, and taking the text with the correlation degree larger than a threshold value as a document neighbor node of the corresponding main node.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
and returning the text retrieval result and the knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1 to N-order neighbor main nodes of the query node.
In one embodiment, a computer storage medium is provided, having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of:
acquiring a retrieval sentence input by a user;
carrying out entity identification on the retrieval sentence, and extracting an entity in the retrieval sentence;
querying, according to the entity, the corresponding main node in a pre-constructed knowledge graph to obtain a query node; the knowledge graph is constructed by taking a domain attribute as a main node, the attribute value of the domain attribute is an entity, and the neighbor nodes of the main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct document according to the occurrence number of the direct document in the document neighbor nodes of the 1 to N-order neighbor host nodes of the query node includes:
acquiring the occurrence frequency of the direct document in a document neighbor node of a 1-N order neighbor main node of the query node;
and sequencing the direct documents according to the occurrence times from high to low to obtain the text retrieval result of the retrieval sentence.
In one embodiment, determining the text retrieval result of the retrieval sentence from the direct document according to the occurrence number of the direct document in the document neighbor nodes of the 1 to N-order neighbor host nodes of the query node includes:
traversing the document neighbor nodes of the 1-N order neighbor main nodes of the query node, and if the document neighbor nodes are the direct documents, adding a preset recommendation score to the direct documents;
and sequencing the direct documents from high to low according to the recommendation score, and taking the first K direct documents as text retrieval results of the retrieval sentences.
In one embodiment, if the document neighbor node is the direct document, adding a preset recommendation score to the direct document according to the order weight; wherein the order weight is inversely related to the order.
In one embodiment, the manner of constructing the knowledge-graph includes:
according to at least two associated domain attributes, taking an attribute value of the domain attribute as a main node, wherein the domain attribute has associated main node connection, and the attribute value of the domain attribute is an entity;
obtaining a document related to an entity from a text database, and associating the document to a main node to be used as a document neighbor node of the main node;
and obtaining the knowledge graph according to the main node and the document neighbor node of the main node.
In one embodiment, obtaining a document associated with an entity from a text database, and associating the document with a host node as a document neighbor node of the host node includes:
and analyzing each text in the database by using a word frequency analysis method, obtaining the correlation degree of the text and the entity according to the word frequency containing relation of the text and the entity, and taking the text with the correlation degree larger than a threshold value as a document neighbor node of the corresponding main node.
In one embodiment, the computer program, when executed by the processor, further implements the following steps:
and returning the text retrieval result and the knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1 to N-order neighbor main nodes of the query node.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of the technical features should be considered to be within the scope of this specification as long as it involves no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of text retrieval, the method comprising:
acquiring a retrieval sentence input by a user;
carrying out entity identification on the retrieval sentence, and extracting an entity in the retrieval sentence;
querying, according to the entity, the corresponding main node in a pre-constructed knowledge graph to obtain a query node; the knowledge graph is constructed by taking a domain attribute as a main node, the attribute value of the domain attribute is an entity, and the neighbor nodes of the main node comprise document neighbor nodes related to the content of the main node;
acquiring the document neighbor nodes of the query node in the knowledge graph to obtain direct documents;
and determining the text retrieval result of the retrieval sentence from the direct documents according to the number of occurrences of each direct document among the document neighbor nodes of the 1st- to Nth-order neighbor main nodes of the query node.
2. The method of claim 1, wherein determining the text retrieval result of the retrieval sentence from the direct document according to the number of occurrences of the direct document in document neighbor nodes of a 1 to N-th order neighbor host node of the query node comprises:
acquiring the occurrence frequency of the direct document in a document neighbor node of a 1-N order neighbor main node of the query node;
and sequencing the direct documents according to the occurrence times from high to low to obtain the text retrieval result of the retrieval sentence.
3. The method of claim 1, wherein determining the text retrieval result of the retrieval sentence from the direct document according to the number of occurrences of the direct document in document neighbor nodes of a 1 to N-th order neighbor host node of the query node comprises:
traversing the document neighbor nodes of the 1-N order neighbor main nodes of the query node, and if the document neighbor nodes are the direct documents, adding a preset recommendation score to the direct documents;
and sequencing the direct documents from high to low according to the recommendation score, and taking the first K direct documents as text retrieval results of the retrieval sentences.
4. The method according to claim 3, wherein if the document neighbor node is the direct document, adding a preset recommendation score to the direct document according to an order weight; wherein the order weight is inversely related to the order.
5. The method of claim 1, wherein the manner in which the knowledge-graph is constructed comprises:
according to at least two associated domain attributes, taking an attribute value of the domain attribute as a main node, wherein the domain attribute has associated main node connection, and the attribute value of the domain attribute is an entity;
obtaining a document related to an entity from a text database, and associating the document to a main node to be used as a document neighbor node of the main node;
and obtaining the knowledge graph according to the main node and the document neighbor node of the main node.
6. The method of claim 5, wherein obtaining documents related to an entity from a text database, and associating the documents to a primary node as document neighbor nodes of the primary node, comprises:
and analyzing each text in the database by using a word frequency analysis method, obtaining the correlation degree of the text and the entity according to the word frequency containing relation of the text and the entity, and taking the text with the correlation degree larger than a threshold value as a document neighbor node of the corresponding main node.
7. The method of claim 1, further comprising:
and returning the text retrieval result and the knowledge graph information to the user terminal, wherein the knowledge graph information comprises the query node and the 1 to N-order neighbor main nodes of the query node.
8. A text retrieval apparatus, the apparatus comprising:
the retrieval sentence acquisition module is used for acquiring a retrieval sentence input by a user;
the entity extraction module is used for carrying out entity identification on the retrieval sentence and extracting an entity in the retrieval sentence;
the query node acquisition module is used for querying a main node corresponding to a pre-constructed knowledge graph according to the entity to obtain a query node; the knowledge graph is constructed by taking a domain attribute as a main node, the attribute value of the domain attribute is an entity, and the neighbor nodes of the main node comprise document neighbor nodes related to the content of the main node;
the document acquisition module is used for acquiring document neighbor nodes of the query nodes in the knowledge graph to obtain direct documents;
and the retrieval module is used for determining the text retrieval result of the retrieval sentence from the direct document according to the occurrence frequency of the direct document in the document neighbor nodes of the 1-N-order neighbor host nodes of the query node.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011637716.4A 2020-12-31 2020-12-31 Text retrieval method and device, computer equipment and storage medium Pending CN112732926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011637716.4A CN112732926A (en) 2020-12-31 2020-12-31 Text retrieval method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011637716.4A CN112732926A (en) 2020-12-31 2020-12-31 Text retrieval method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112732926A true CN112732926A (en) 2021-04-30

Family

ID=75608881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011637716.4A Pending CN112732926A (en) 2020-12-31 2020-12-31 Text retrieval method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112732926A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140046934A1 (en) * 2012-08-08 2014-02-13 Chen Zhou Search Result Ranking and Presentation
CN105095433A (en) * 2015-07-22 2015-11-25 百度在线网络技术(北京)有限公司 Recommendation method and device for entities
CN110321408A (en) * 2019-05-30 2019-10-11 重庆金融资产交易所有限责任公司 Searching method, device, computer equipment and the storage medium of knowledge based map
CN111177532A (en) * 2019-12-02 2020-05-19 平安资产管理有限责任公司 Vertical search method, device, computer system and readable storage medium
CN111984694A (en) * 2020-07-17 2020-11-24 北京欧应信息技术有限公司 Orthopedics search engine system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination