CN114218472A

CN114218472A - Intelligent search system based on knowledge graph

Info

Publication number: CN114218472A
Application number: CN202111540151.2A
Authority: CN
Inventors: 陈杨; 肖创柏
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2022-03-22

Abstract

The invention discloses an intelligent search system based on a knowledge graph, comprising a data management module, a data processing module, a natural language processing service module, a knowledge graph construction module and an information retrieval module, which are connected in parallel. Three NLP services, namely triple extraction, named entity recognition and semantic matching, are implemented on a BERT pre-trained model and provided through the Python-based Flask framework. A method is developed for constructing a knowledge graph from data in a specific field, together with a method for training the triple extraction model a second time, which reduces the workload of manually labeling training data: a triple extraction model for the data set is trained while the original domain-specific data set is manually labeled as little as possible. As a result, the search engine can understand the user's intent to some extent, making the enterprise-level search engine more intelligent.

Description

Intelligent search system based on knowledge graph

Technical Field

The invention relates to an intelligent search system realized based on a knowledge graph for enterprise data, and belongs to the technical field of computers.

Background

After years of development, general-purpose search engine technology has matured; well-known search engines such as Google, Baidu and Bing have made remarkable progress in knowledge-based search. The knowledge graph is a concept proposed by Google in 2012 that aims to improve search results by describing the entities and concepts of the real world and the relationships between them. Once a semantic network has been formed by constructing a knowledge graph, the search engine can, to a certain extent, understand the user's intent from the entities and relations in the graph and retrieve the information the user actually needs. Knowledge-graph-based search returns more accurate and concise results: knowledge search usually maps the user's search statement to a structured query statement whose final target is an entity in a knowledge base. Because that entity carries abundant related information, an accurate and concise search result can be returned to the user as long as the entity is located correctly.

Search engine technology is applied not only to general search over the whole internet, as with Baidu and Google, but also to personalized search engines for enterprise-level data. At present, most enterprise-level search engines are still at a relatively primitive stage: they only process the enterprise data, build an index and store the data, and users search by keyword and receive the documents that contain the keywords entered. Such engines cannot understand the semantics of a user's search, which reduces interactivity between user and search engine, lowers search accuracy, and fails to meet users' growing search requirements. The continuous development of knowledge graph technology and natural language processing technology in recent years offers a new direction for search engines. A knowledge graph describes and organizes knowledge with a graph model: by extracting and storing entities and relations, it forms a network of triples, from which a knowledge base for a specific field can be built to a certain extent, and that knowledge base provides more complex semantic relations among the data. It is therefore highly desirable to apply knowledge graph technology to enterprise-level search engines. Searching on the basis of a domain-specific knowledge base achieves a degree of natural language understanding; constructing and applying the knowledge graph improves the accuracy of search results and their relevance to the user's input, and provides a better search service for data in the specific field.

Disclosure of Invention

The invention aims to provide an intelligent search system based on a knowledge graph as a solution for natural language search: a system, designed and developed for enterprise-level data in a specific field, that implements natural language search on the basis of a knowledge graph.

The invention develops the intelligent search system by applying natural language processing technologies such as triple extraction, named entity recognition and semantic matching, in combination with the search engine solr and the graph database nebula-graph, and using Java, Python, the Spring Boot framework and the like.

To achieve this purpose, the technical scheme adopted by the invention is an intelligent search system based on a knowledge graph, divided from a functional perspective into five sub-modules: a data management module, a data processing module, a natural language processing service module, a knowledge graph construction module and an information retrieval module. The five modules are connected in parallel, as shown in fig. 1. In the natural language processing service module, the invention implements three NLP services on a BERT pre-trained model: triple extraction (used for constructing the knowledge graph), named entity recognition and semantic matching. These services are provided to the system as web service interfaces through the Python-based Flask framework, with the returned results encapsulated; wherever natural language processing is needed, the system calls the corresponding interface and then parses and processes the result. In the knowledge graph construction part, a method is developed for constructing a knowledge graph from data in a specific field, together with a method for training the triple extraction model a second time. This reduces the workload of manually labeling training data: a triple extraction model for the data set is trained while the original domain-specific data set is manually labeled as little as possible. In the natural language search part, semantic search based on named entity recognition and template matching is implemented and then improved by a semantic matching method between sentences and relation words, so that the enterprise-level search engine can understand the user's natural language search request.

Fig. 2 shows the architecture of the whole system, with the five main modules marked. Solr is a search engine and can be regarded as a database in a broad sense. In the invention, the Java client provided by solr is used to create collections. Each collection stores enterprise data in a specific field; a collection that stores enterprise data is called a collection to be queried. Data stored in solr is displayed in JSON format on the solr admin interface. A collection stores a number of documents (doc); each document is one piece of data and has a number of fields, and the id field serves as the unique identifier of the data within the collection.

nebula-graph is the graph database product adopted in this work. A nebula-graph instance consists of one or more graph spaces (space), and the graph spaces are physically isolated from one another, so a user can store different data sets in different graph spaces of the same instance; the space name uniquely identifies a data set. Each space corresponds to one collection and stores one class of entity-relation data, namely the triple information (the knowledge) extracted from the data of that collection, forming a semantic knowledge base for the collection. A schema configuration needs to be defined for each space; the schema of nebula-graph is shown in table 1.

TABLE 1 graph space configuration of nebula-graph

A data management module: in the invention, the creation and deletion of solr collections, the addition, deletion and modification of collection fields, the creation and deletion of nebula-graph spaces and the like are implemented by the data management module. The data management module manages the data and basic configuration of the whole system and mainly implements the following four functions: solr data management, nebula-graph data management, triple ontology schema configuration management, and natural language question template configuration management, as shown in fig. 3.

The solr data management module is responsible for creating, configuring and deleting collections, and for configuring, adding and deleting collection fields. Creating or deleting a collection is paired with creating or deleting a nebula-graph space: the create and delete methods of the space are called inside the create and delete methods of the collection, and the space has the same name as the collection. The space and the collection together form the data set that the user searches, where the collection is the original data set to be queried and the space stores the knowledge base extracted from the collection data.
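The pairing of collection and space described above can be sketched as follows. This is a minimal in-memory illustration, not the system's actual code: the class `DatasetRegistry` and its method names are hypothetical, and the two sets merely stand in for real calls to the solr and nebula-graph clients.

```python
class DatasetRegistry:
    """In-memory stand-in for the paired collection / space lifecycle."""

    def __init__(self):
        self.collections = set()  # stands in for solr collections
        self.spaces = set()       # stands in for nebula-graph spaces

    def _create_space(self, name):
        self.spaces.add(name)

    def _delete_space(self, name):
        self.spaces.discard(name)

    def create_collection(self, name):
        if name in self.collections:
            raise ValueError(f"collection {name!r} already exists")
        self.collections.add(name)
        # The space shares the collection's name and is created in the same call.
        self._create_space(name)

    def delete_collection(self, name):
        self.collections.discard(name)
        # Deleting the collection also removes its knowledge base.
        self._delete_space(name)


registry = DatasetRegistry()
registry.create_collection("contracts")
```

Keeping the space operations private and calling them only from the collection methods mirrors the rule that a space never exists without its collection.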

The nebula-graph data management module is responsible for creating and deleting spaces of the graph database nebula-graph and for managing the schema configuration of each space, namely creating and deleting tags (point types), edge types, tag indexes and edge indexes.

The triple schema configuration management module: a triple consists of a subject, a predicate and an object, i.e. a head entity, a relation and a tail entity; a triple schema specifies the type of the subject, the predicate, and the type of the object. This module serves the construction of the knowledge graph. The invention performs triple extraction on the data of a collection to construct the corresponding knowledge graph, and the extraction is carried out by calling a web interface provided by the natural language processing service module. Extraction is based on a BERT pre-trained language model: the model parameters are adapted to the downstream task and a triple extraction model is then trained. Training requires a schema configuration and training data labeled according to that schema. The schema configuration is stored in a data set that corresponds to the data set to be queried and carries the suffix "_schema". The module adds, modifies and deletes schemas, audits and re-labels the data of the collection to be queried according to the schema, and writes the configuration into the training data used to train the triple extraction model. Triple ontology configuration management is shown in fig. 3.
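To make the distinction between a triple and a triple schema concrete, the following sketch models both and checks a labeled triple against the configured schemas. The class names, the `conforms` helper and the example values are illustrative assumptions, not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripleSchema:
    """One schema entry: subject type, predicate, object type."""
    subject_type: str
    predicate: str
    object_type: str


@dataclass(frozen=True)
class Triple:
    """One extracted or labeled triple with its entity types."""
    subject: str
    subject_type: str
    predicate: str
    object: str
    object_type: str


def conforms(triple, schemas):
    """Return True if the triple's types and predicate match a configured schema."""
    key = TripleSchema(triple.subject_type, triple.predicate, triple.object_type)
    return key in schemas


# Hypothetical schema configuration for one collection.
SCHEMAS = {TripleSchema("Company", "founded_by", "Person")}

ok = conforms(Triple("Acme", "Company", "founded_by", "Alice", "Person"), SCHEMAS)
bad = conforms(Triple("Acme", "Company", "located_in", "Paris", "City"), SCHEMAS)
```

A check of this kind could be used during auditing to reject labels that fall outside the configured schema.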

Natural language question template management: for a collection to be queried whose schemas have been configured, a relation (predicate) can be obtained from each schema, and question templates can be defined for all relations (predicates) in the knowledge graph. The template management part is responsible for adding, deleting and modifying these templates; the question templates are stored in the corresponding collection with the suffix "_template".

A data processing module: in this work, data is stored in two places, namely the search engine solr and the graph database nebula-graph.

The module is responsible for storing both parts of the data: adding, deleting and modifying data in solr collections, and inserting, deleting and updating triple entities and relations in nebula-graph.

Raw data needs some processing before it is stored in a solr collection. The module implements three processing steps, namely short text filtering, text replacement and segmentation, and sentence segmentation, and finally indexes the processed data into the solr collection. This data processing stage is extensible: further requirements can be met by adding processing modules, as shown in fig. 4. The triple data corresponding to a collection to be queried is stored in the collection with the suffix name 'extract'; the triple data in that collection, namely the entities and relations, is stored in nebula-graph, where the relations connect the nodes to form the knowledge graph.
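An extensible processing chain of this kind can be sketched as a list of functions applied in order. The function names, the minimum length and the punctuation-based splitting rule below are assumptions chosen for illustration; the source does not specify how each step is implemented.

```python
import re


def filter_short(texts, min_len=5):
    """Drop texts shorter than min_len characters (assumed cutoff)."""
    return [t for t in texts if len(t.strip()) >= min_len]


def replace_text(texts, replacements):
    """Apply literal string replacements to every text."""
    out = []
    for t in texts:
        for old, new in replacements.items():
            t = t.replace(old, new)
        out.append(t)
    return out


def split_sentences(texts):
    """Split after sentence-ending punctuation (Latin or Chinese)."""
    sentences = []
    for t in texts:
        parts = re.split(r"(?<=[.!?。！？])\s*", t)
        sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


def process(texts, steps):
    """Run the texts through each processing step in order."""
    for step in steps:
        texts = step(texts)
    return texts


raw = ["ok", "Solr  stores documents. Nebula stores triples."]
steps = [
    filter_short,
    lambda ts: replace_text(ts, {"  ": " "}),
    split_sentences,
]
result = process(raw, steps)
```

Adding a new requirement then amounts to appending another function to `steps`, which is one way to read the extensibility the text describes.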

A natural language processing service module: the module is written in Python and implements four functions: triple extraction, named entity recognition, semantic matching between a sentence and a relation word, and semantic matching between two sentences. Each function is encapsulated as an interface and provided as a web service through the Flask framework for the Spring Boot project to call, as shown in fig. 5.

Triple extraction: this part implements the triple extraction function. A triple extraction model is first trained and stored on a server; code is then written to provide it externally as a web service through the Flask framework. The input is a set of short texts, List<String> text. The output contains, for each input short text, the text itself and the triple information extracted from it, namely the subject, the object, their types, and the predicate. The returned result is encapsulated in JSON format.
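The request and response shape described above can be sketched with the standard library alone. The Flask route wiring is omitted; a handler like `handle_extract` could be registered as a route in a real service. The JSON field names (`texts`, `code`, `data`, `triples`) and the rule-based `fake_model` are assumptions standing in for the trained BERT-based extractor.

```python
import json


def fake_model(text):
    """Hypothetical stand-in for the trained extractor.

    Recognizes only the pattern "<subject> is located in <object>".
    """
    marker = " is located in "
    if marker not in text:
        return []
    subject, obj = text.split(marker, 1)
    return [{
        "subject": subject,
        "subject_type": "Organization",
        "predicate": "located_in",
        "object": obj.rstrip("."),
        "object_type": "Place",
    }]


def handle_extract(request_body, model=fake_model):
    """Take a JSON body with a list of short texts, return a JSON string."""
    texts = json.loads(request_body)["texts"]
    results = [{"text": t, "triples": model(t)} for t in texts]
    # ensure_ascii=False keeps non-ASCII (e.g. Chinese) text readable.
    return json.dumps({"code": 0, "data": results}, ensure_ascii=False)


body = json.dumps({"texts": ["Acme is located in Beijing.", "Hello there."]})
response = json.loads(handle_extract(body))
```

Each entry of `data` pairs the input text with its triples, so a caller can tell which texts yielded nothing.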

Named entity recognition: this part implements the named entity recognition function. A named entity recognition model is first trained and stored on a server; code is then written to provide it as a web service interface through the Flask framework for the Spring Boot project to call. The input is a set of short texts, and the returned result is the named entities, encapsulated in JSON format.

Semantic matching: this part is divided into semantic matching between a sentence and a relation word and semantic matching between two sentences. A model is trained for each, the trained models are stored on a server, and code is written to provide them as web services through the Flask framework, to be called during natural language search. Semantic matching between a sentence and a relation word aims to find, among the relations of a given entity in the graph database, the one most similar to the sentence. Semantic matching between two sentences aims to find the question template most relevant to the user's input sentence for the subsequent search; details are given in the natural language search section. Input: a List<String> text set in which each text consists of two parts separated by "#". The output parameter "prob" is a score reflecting how well the two parts match; a threshold is set in the program, and the two parts are considered a match if the score exceeds the threshold.
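The "#"-separated input format and the threshold rule can be illustrated as follows. The helper names and the threshold value 0.5 are assumptions; the scores passed to `best_match` would come from the matching service's "prob" output, which is not modeled here.

```python
def build_pairs(sentence, candidates):
    """Join the sentence with each candidate using '#', as the interface expects."""
    return [f"{sentence}#{c}" for c in candidates]


def best_match(scored, threshold=0.5):
    """Pick the highest-scoring candidate whose score exceeds the threshold.

    scored is a list of (pair_text, prob) tuples. Returns (candidate, prob),
    or None when no pair exceeds the threshold.
    """
    passing = [(text, prob) for text, prob in scored if prob > threshold]
    if not passing:
        return None
    text, prob = max(passing, key=lambda item: item[1])
    # rsplit keeps any '#' inside the sentence intact; this assumes the
    # candidate itself contains no '#'.
    return text.rsplit("#", 1)[1], prob


pairs = build_pairs("Who founded Acme?", ["founded_by", "located_in"])
winner = best_match([(pairs[0], 0.92), (pairs[1], 0.31)])
```

Returning `None` rather than the top-scoring pair when nothing clears the threshold lets the caller fall back to ordinary keyword search.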

A knowledge graph construction module:

The purpose of the module is to perform triple extraction on the data of a solr collection and store the extracted entity and relation data in nebula-graph as the knowledge graph that supports natural language search.

The knowledge graph construction process comprises the following steps:

Step 1: label triple data for the data in the solr collection to be queried.

Step 2: train the triple extraction model.

Step 3: call the triple extraction interface of the natural language processing service module, and store the extracted results in the corresponding solr collection.

Step 4: audit the extracted triples, i.e. the triple data stored in the corresponding solr collection.

Step 5: store the audited data in the corresponding space of the graph database as the knowledge base for the collection data.

An information retrieval module: the module retrieves data from the data stores (solr and nebula-graph) and is divided into a common retrieval module and a natural language search module.

The common retrieval module is implemented with keyword matching and the query parsers packaged with solr.

The natural language search module handles the scenario in which the user enters natural language. The unstructured natural language query statement entered by the user must be converted into a structured query statement for the graph database (the invention uses nebula-graph, so the statement is converted into structured NGQL) and executed in the corresponding knowledge base (space). The query returns entity information, which is one part of the result for the user. The entity information is then searched in the corresponding collection, and the search result found there is the other part. The two search results together form the query result returned to the user.

Drawings

FIG. 1 is a diagram of an intelligent search system and its sub-modules.

Fig. 2 is a system architecture diagram.

Fig. 3 is a management diagram of triple ontology configuration.

Fig. 4 is a block diagram of data processing.

FIG. 5 is a block diagram of a natural language processing service.

FIG. 6 is a diagram of a data management module and its sub-modules.

FIG. 7 is a knowledge graph construction flow diagram.

Fig. 8 is a flow chart of natural language search.

Detailed Description

In order to further clarify the objects, technical solutions and advantages of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The intelligent search system consists of several modules: a data management module, a data processing module, a natural language processing service module, a knowledge graph construction module and an information retrieval module. The overall architecture of the system is shown in fig. 2.

In the present invention, the entire system architecture is described by the following steps according to fig. 2.

Step 1: data management

The data management module and its sub-module diagram are shown in fig. 6.

First, create solr collections through the solr data management module, each collection representing enterprise data in a specific field;

after a collection is created, configure its fields;

create the graph database space corresponding to the collection, and configure the tags (point types) and edge types of the space;

configure the attributes of each tag and edge type;

create single-attribute and joint indexes on the attributes of the tags and edge types;

configure the triple ontology, i.e. the schema (type of subject, relation, and type of object), used by the knowledge graph to be built from the data of the collection to be queried, and store the configured schema in the data set that corresponds to the data set to be queried and has the suffix "_schema".
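The space, tag, edge type and index configuration listed above might translate into nGQL statements along the following lines. This is a sketch that assumes nGQL 3.x syntax; the tag name `kg_entity`, the edge type `kg_relation`, their properties and the index names are hypothetical and are not taken from table 1.

```python
def schema_statements(space, tag="kg_entity", edge="kg_relation"):
    """Return nGQL statements that set up one graph space for a collection."""
    return [
        # The space carries the same name as the solr collection.
        f"CREATE SPACE IF NOT EXISTS {space} (vid_type = FIXED_STRING(64));",
        f"USE {space};",
        # Tag (point type) and edge type with their attributes.
        f"CREATE TAG IF NOT EXISTS {tag} (name string, entity_type string);",
        f"CREATE EDGE IF NOT EXISTS {edge} (name string);",
        # Single-attribute index on the tag.
        f"CREATE TAG INDEX IF NOT EXISTS {tag}_name_idx ON {tag} (name(64));",
        # Joint index over two tag attributes.
        f"CREATE TAG INDEX IF NOT EXISTS {tag}_name_type_idx "
        f"ON {tag} (name(64), entity_type(32));",
        # Index on the edge type.
        f"CREATE EDGE INDEX IF NOT EXISTS {edge}_name_idx ON {edge} (name(64));",
    ]


statements = schema_statements("contracts")
```

The statements are only built as strings here; executing them would require a nebula-graph client session, which is outside this sketch.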

step 2: data processing

Before data is indexed into a solr collection, it is processed according to actual requirements. The module implements three processing steps, namely short text filtering, text replacement and segmentation, and sentence segmentation; for different types of data, further requirements can be met by adding processing modules.

And step 3: knowledge graph construction

The system knowledge graph construction process is shown in fig. 7. The method comprises the following steps:

1. configure a triple schema for the collection to be extracted;

2. train a triple extraction model on an open-source training set;

3. split the data of the collection to be extracted into sentences;

4. extract triples by calling the triple extraction service of the natural language processing service module, and store the result in the collection with the suffix name 'extraction';

5. audit the extracted triples, cleaning and re-labeling them, and write the re-labeled training data for the collection into a training file (the purpose of this audit is to re-label triples as training data for the collection to be queried);

6. train the triple extraction model again for the collection to be queried, and store the triple extraction model in the corresponding collection;

7. audit the data (the purpose of this audit is to store the data in the graph database space);

8. store the data in the graph database nebula-graph.

Because the invention constructs the knowledge graph from the data of a particular collection to be queried, and that data usually comes from enterprise customers or from a specific field, a model trained on an open-source triple extraction training set does not necessarily extract well from it. The usual approach is to have dedicated personnel manually label the specific data set according to the configured triple schema; the system writes the labeled training data into the training set file, calls the extraction service of the natural language processing service module to extract triples, and then audits and stores them. For this situation the invention adopts a small improvement: training is performed twice, which reduces manual data labeling and improves the efficiency of the whole process.

After the second training, the data of the data set to be queried can be extracted again with the model and audited again. The purpose of this audit is to store the data that passes it in the nebula-graph space as the final triple data.
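The two-round training flow can be illustrated with a deliberately simplified toy in which a "model" is just a dictionary from keyword to predicate. Everything here (the `train`, `extract` and `review` functions and the example data) is a hypothetical stand-in meant only to show the order of operations; it says nothing about how the BERT-based model is actually trained.

```python
def train(base_rules, labelled):
    """Toy training: copy the base rules and add keyword -> predicate labels."""
    rules = dict(base_rules)
    for keyword, predicate in labelled:
        rules[keyword] = predicate
    return rules


def extract(rules, sentences):
    """Toy extraction: emit (sentence, keyword, predicate) for each keyword hit."""
    return [(s, k, p) for s in sentences for k, p in rules.items() if k in s]


def review(candidates, fixes):
    """Toy audit: a reviewer corrects predicates instead of labeling from scratch."""
    return [(k, fixes.get(k, p)) for _, k, p in candidates]


open_source = [("founded", "founder"), ("based in", "location")]
sentences = ["Acme was founded by Alice.", "Acme is based in Beijing."]

base = train({}, open_source)                # first training: open-source data
candidates = extract(base, sentences)        # machine extraction on domain data
labels = review(candidates, {"based in": "headquarters"})  # audit and re-label
domain = train(base, labels)                 # second training: domain labels
final = extract(domain, sentences)           # re-extraction before the last audit
```

The point of the flow is that the reviewer only corrects machine-produced candidates in the middle step, rather than labeling the domain data set by hand from the start.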

And 4, step 4: natural language search

A natural language search flow diagram is shown in fig. 8.

Through the steps above, a knowledge graph for the collection to be queried has been constructed, and natural language search is performed on the basis of this knowledge graph. The query sentence entered by the user is unstructured natural language, whereas the knowledge graph is implemented in the graph database nebula-graph, so the query sentence must be structured before a result can be returned. In this system, the essence of natural language search is to convert the user's unstructured natural language query sentence into a structured query statement against the knowledge graph. The conversion is performed in the following steps.

1. Obtain the subject of the text through named entity recognition;

Named entity recognition is the first step in converting an unstructured query statement into a structured one; it obtains the key entity information, namely the subject, of the user's input statement.

2. Obtain the relation (predicate) through semantic matching;

first, for the subject obtained in step 1, construct an NGQL query statement that does not specify an edge type (i.e. does not specify a relation), so as to obtain from the graph database all relation words (predicates) associated with the subject;

combine each relation word with the user's input sentence to obtain a List<String> text set;

call the relation semantic matching interface of the natural language processing service module with this set as the parameter; a matching score is returned for each pair of relation and query sentence. A relation whose score reaches a certain threshold is considered to reflect the user's query sentence, and the relation word with the highest score is taken as the predicate;

construct a structured NGQL query statement for nebula-graph from the obtained subject and predicate, and query the tail entity, i.e. the object, in the nebula-graph space; this entity is one part of the returned result.

3. Use the obtained tail entity as the query keyword to search the solr collection to be queried; the search result found there is the other part of the returned result.
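The three steps above can be sketched end to end with the external services injected as plain functions. The nGQL strings assume nGQL 3.x syntax, that the subject name doubles as the vertex ID, and that each relation is its own edge type; none of this is confirmed by the source. `fake_run_ngql`, `GRAPH` and the lambdas in the usage example are hypothetical stubs, and quoting of special characters in the subject is not handled.

```python
def relations_query(subject):
    """nGQL without a specific edge type: list all relations of the subject."""
    return f'GO FROM "{subject}" OVER * YIELD DISTINCT type(edge) AS predicate;'


def object_query(subject, predicate):
    """nGQL with subject and predicate fixed: fetch the tail entities."""
    return f'GO FROM "{subject}" OVER {predicate} YIELD dst(edge) AS object;'


def natural_language_search(question, recognize, run_ngql, score, search_solr,
                            threshold=0.5):
    subject = recognize(question)                       # step 1: NER
    predicates = run_ngql(relations_query(subject))     # step 2: candidates
    scored = [(p, score(f"{question}#{p}")) for p in predicates]
    scored = [(p, s) for p, s in scored if s >= threshold]
    if not scored:
        return None
    predicate = max(scored, key=lambda item: item[1])[0]
    objects = run_ngql(object_query(subject, predicate))
    # Step 3: each tail entity becomes a keyword query against solr.
    documents = [doc for obj in objects for doc in search_solr(obj)]
    return {"entities": objects, "documents": documents}


# Toy graph and stub executor standing in for a nebula-graph session.
GRAPH = {"Acme": {"founded_by": ["Alice"], "located_in": ["Beijing"]}}


def fake_run_ngql(statement):
    subject = statement.split('"')[1]
    edge = statement.split(" OVER ")[1].split(" ")[0]
    edges = GRAPH.get(subject, {})
    return list(edges) if edge == "*" else edges.get(edge, [])


result = natural_language_search(
    "Who founded Acme?",
    recognize=lambda q: "Acme",
    run_ngql=fake_run_ngql,
    score=lambda pair: 0.9 if pair.endswith("#founded_by") else 0.2,
    search_solr=lambda keyword: [f"doc about {keyword}"],
)
```

The returned dictionary keeps the two parts of the answer separate: the entities from the knowledge graph and the documents retrieved from solr.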

In conclusion, by studying and applying the BERT model, the invention implements NLP downstream tasks such as triple extraction and named entity recognition. By applying NLP technology and a knowledge graph constructed from enterprise data, i.e. data in a specific field, the search engine can understand the user's intent to a certain extent, natural language search is realized, and the enterprise-level search engine becomes more intelligent.

Claims

1. An intelligent search system based on a knowledge graph, characterized in that: the system comprises a data management module, a data processing module, a natural language processing service module, a knowledge graph construction module and an information retrieval module; the data management module, the data processing module, the natural language processing service module, the knowledge graph construction module and the information retrieval module are connected in parallel; in the natural language processing service module, three NLP services are implemented on a BERT pre-trained model, namely triple extraction, named entity recognition and semantic matching; the services are provided to the system as web service interfaces through the Python-based Flask framework, with the returned results encapsulated, and the corresponding interface is called wherever natural language processing is needed and the result is parsed and processed; on this basis an improvement is made by providing a semantic matching method between sentences and relation words, so that the enterprise-level search engine can understand the user's natural language search request.

2. The intelligent search system based on a knowledge graph of claim 1, wherein: solr is a search engine, and collections are created through the Java client provided by solr; each collection stores enterprise data in a specific field, and a collection storing enterprise data is called a collection to be queried; data stored in solr is displayed in JSON format on the solr admin interface; a collection stores a plurality of documents, each document is one piece of data and has a plurality of fields, and the id field serves as the unique identifier of the data within the collection; nebula-graph is used as the graph database product; a nebula-graph instance consists of one or more graph spaces, the graph spaces are physically isolated from one another, and users store different data sets in different graph spaces of the same instance; the space name uniquely identifies a data set; each space corresponds to one collection and stores one class of entity-relation data, namely the triple information, i.e. the knowledge, extracted from the data of that collection.

3. The intelligent search system based on a knowledge graph according to claim 2, wherein: the data management module: solr and nebula-graph both have data and basic configuration to be managed; the creation and deletion of solr collections, the addition, deletion and modification of collection fields, and the creation and deletion of nebula-graph spaces are implemented by the data management module; the data management module manages the data and basic configuration of the whole system and implements the following four functions: solr data management, nebula-graph data management, triple ontology schema configuration management, and natural language question template configuration management.

4. The intelligent search system based on a knowledge graph of claim 1, wherein: the solr data management module is responsible for creating, configuring and deleting collections, and for configuring, adding and deleting collection fields; creating or deleting a collection is paired with creating or deleting a nebula-graph space, that is, the create and delete methods of the space are called inside the create and delete methods of the collection, and the space has the same name as the collection; the space and the collection together form the data set that the user searches, where the collection is the original data set to be queried and the space stores the knowledge base extracted from the collection data.

5. The intelligent search system based on a knowledge graph according to claim 4, wherein: the nebula-graph data management module is responsible for creating and deleting spaces of the graph database nebula-graph and for managing the schema configuration of each space, namely creating and deleting tags (point types), edge types, tag indexes and edge indexes;

the triple schema configuration management module: a triple consists of a subject, a predicate and an object, namely a head entity, a relation and a tail entity, and a triple schema specifies the type of the subject, the predicate, and the type of the object; the module serves the construction of the knowledge graph; based on a BERT pre-trained language model, the model parameters are adapted to the downstream task and a triple extraction model is then trained, the training requiring a schema configuration and training data labeled according to that schema; the schema configuration is stored in a data set that corresponds to the data set to be queried and carries the suffix "_schema"; the module adds, modifies and deletes schemas, audits and re-labels the data of the collection to be queried according to the schema, and writes the configuration into the training data used to train the triple extraction model.

6. The intelligent search system based on a knowledge graph according to claim 3, wherein: the natural language question template management module: for the schemas configured for the collection to be queried, the relation, namely the predicate, is obtained from each schema, and question matching templates are obtained according to all the relations in the knowledge graph; the template management part is responsible for adding, deleting and modifying the matching templates, and the question templates are stored in the corresponding collection with the prefix name of 'template'.

7. The intelligent search system based on a knowledge graph according to claim 1, wherein: the data processing module: data is stored in two parts, namely the search engine solr and the graph database nebula-graph; the data storage module is responsible for storing these two parts of data, for adding, deleting and modifying the data in the solr collection, and for inserting, deleting and updating the triple entities and relations in nebula-graph;

the original data needs to be processed before being stored in the solr collection; the module implements three processing modules for the data, namely short text filtering, text replacement, and paragraph and sentence segmentation, and finally indexes the processed data into the solr collection; this data processing part is extensible, and further requirements are met by adding a processing module; the triple data corresponding to the collection to be queried is stored in the collection with the suffix name 'extract'; the triple data in that collection, namely the entities and relations, is stored in nebula-graph, and connections are established between nodes through the relations to form a knowledge graph.
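The extensible processing chain described above can be sketched as a list of interchangeable processors. The function names, the minimum-length threshold and the sentence-ending punctuation set are assumptions chosen for illustration, not values taken from the claim.

```python
import re
from typing import Callable, Dict, List

# A processor takes a list of texts and returns a new list of texts.
Processor = Callable[[List[str]], List[str]]


def filter_short(min_len: int) -> Processor:
    """Short text filtering: drop texts shorter than min_len characters."""
    def run(texts: List[str]) -> List[str]:
        return [t for t in texts if len(t.strip()) >= min_len]
    return run


def replace_text(mapping: Dict[str, str]) -> Processor:
    """Text replacement: apply each old -> new substitution in order."""
    def run(texts: List[str]) -> List[str]:
        out = []
        for t in texts:
            for old, new in mapping.items():
                t = t.replace(old, new)
            out.append(t)
        return out
    return run


def split_sentences() -> Processor:
    """Sentence segmentation on common Western and Chinese end punctuation."""
    pattern = re.compile(r"[^.!?。！？]+[.!?。！？]?")

    def run(texts: List[str]) -> List[str]:
        out = []
        for t in texts:
            for piece in pattern.findall(t):
                piece = piece.strip()
                if piece:
                    out.append(piece)
        return out
    return run


def run_pipeline(texts: List[str], processors: List[Processor]) -> List[str]:
    """Apply the processors in order; new requirements add a new processor."""
    for processor in processors:
        texts = processor(texts)
    return texts
```

Because every processor has the same signature, adding a further processing module amounts to appending one more function to the list before the data is indexed.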

8. The intelligent search system based on a knowledge graph according to claim 1, wherein: the natural language processing service module: the module is written in Python and implements four functions, namely triple extraction, named entity recognition, semantic matching between a sentence and relation words, and semantic matching between two sentences; the four functions are each encapsulated as an interface and provided as web services through the Flask framework, to be called by the springboot project;

triple extraction: the module implements the triple extraction function; a triple extraction model is first trained and the trained model is stored on a server; code is then written to provide a web service externally through the Flask framework; the input is a set of short texts, List<String> text, and the output is the triples corresponding to each input short text, including the text from which they were extracted and the triple information corresponding to the text, namely the subject, subject type, object, object type and predicate; the returned result is encapsulated in JSON format;

named entity recognition: the module implements the named entity recognition function; a named entity recognition model is first trained and the trained model is stored on a server; code is then written to provide a web service interface through the Flask framework, to be called by the springboot project; the input of the interface is a set of short texts, the returned result is the named entities, and the result is encapsulated in JSON format;

semantic matching: this part is divided into semantic matching between a sentence and relation words and semantic matching between two sentences; a model is trained for each, the trained models are stored on a server, code is written, and web services are provided through the Flask framework, which can be called during natural language search.
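The input and output shape of the triple extraction interface can be sketched as follows. This is a minimal sketch that omits the web layer: the JSON field names and the `toy_extract` function are assumptions for illustration, and `toy_extract` is a lookup table standing in for the trained BERT-based model. In the described system, a function such as `build_response` would be wrapped by a Flask route.

```python
import json
from typing import Callable, Dict, List

# Hypothetical triple record: subject, subject type, predicate, object, object type.
TripleDict = Dict[str, str]


def toy_extract(text: str) -> List[TripleDict]:
    """Stand-in for the trained model: returns a fixed answer for one sentence."""
    known = {
        "Alice works for Acme.": [{
            "subject": "Alice",
            "subject_type": "Person",
            "predicate": "works_for",
            "object": "Acme",
            "object_type": "Company",
        }]
    }
    return known.get(text, [])


def build_response(texts: List[str],
                   extract: Callable[[str], List[TripleDict]]) -> str:
    """Run extraction on every short text and encapsulate the result as JSON."""
    results = []
    for text in texts:
        results.append({"text": text, "triples": extract(text)})
    # ensure_ascii=False keeps non-ASCII text (for example Chinese) readable.
    return json.dumps({"results": results}, ensure_ascii=False)
```

Each element of the response pairs the source text with its triples, so the caller can store both in the corresponding collection.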

9. The intelligent search system based on a knowledge graph according to claim 1, wherein: the knowledge graph construction module performs the following construction process:

step 1: annotating triple data for the data in the solr collection to be queried;

step 2: training a triple extraction model;

step 3: calling the triple extraction interface of the natural language processing service module to perform extraction, and storing the extracted results in the corresponding solr collection;

step 4: reviewing the extracted triples, namely the triples stored in the extraction collection corresponding to the collection;
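The four steps above can be sketched as a single orchestration function. The function `build_extract_collection` and its callable parameters are hypothetical placeholders: in the described system the annotation and review are performed by a person, and `train` would produce the BERT-based extraction model rather than a lookup.

```python
from typing import Callable, Dict, List, Tuple

TripleT = Tuple[str, str, str]           # (subject, predicate, object)
Labeled = Tuple[str, List[TripleT]]      # (document text, annotated triples)
Extractor = Callable[[str], List[TripleT]]


def build_extract_collection(
    documents: List[str],
    annotate: Callable[[str], Labeled],
    train: Callable[[List[Labeled]], Extractor],
    review: Callable[[TripleT], bool],
) -> Dict[str, List[TripleT]]:
    """Run annotation, training, extraction and review in order."""
    # Step 1: annotate triple data for the documents in the collection.
    labeled = [annotate(doc) for doc in documents]
    # Step 2: train the triple extraction model on the annotated data.
    extract = train(labeled)
    # Step 3: extract triples from every document and keep them per document.
    extracted = {doc: extract(doc) for doc in documents}
    # Step 4: review the extracted triples and keep only the approved ones.
    return {doc: [t for t in triples if review(t)]
            for doc, triples in extracted.items()}
```

Passing the steps in as callables keeps the sketch independent of any particular model or storage backend.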

10. The intelligent search system based on a knowledge graph according to claim 1, wherein: the information retrieval module is used for retrieving data in the database, and is divided into a common retrieval module and a natural language search module.