CN116523041A - Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment - Google Patents


Publication number
CN116523041A
CN116523041A CN202310497796.5A CN202310497796A CN116523041A CN 116523041 A CN116523041 A CN 116523041A CN 202310497796 A CN202310497796 A CN 202310497796A CN 116523041 A CN116523041 A CN 116523041A
Authority
CN
China
Prior art keywords
equipment
equipment field
knowledge graph
knowledge
constructing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310497796.5A
Other languages
Chinese (zh)
Inventor
程渤
郭霄
李翔
顾文渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310497796.5A
Publication of CN116523041A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph construction method, a retrieval method, a system and electronic equipment for the equipment field, and relates to the technical field of knowledge graphs. The knowledge graph construction method for the equipment field comprises the following steps: constructing an equipment field ontology based on an equipment field original data set; based on the equipment field original data set and the equipment field ontology, merging entity extraction and relation extraction into a single information extraction task and outputting triplet data in an end-to-end manner; and constructing an equipment field knowledge graph according to the triplet data. The invention can build a corresponding equipment knowledge system and, by combining knowledge extraction, information retrieval and other technologies, provide more intelligent and personalized services, so as to promote the digital and visual development of the equipment field.

Description

Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment
Technical Field
The invention relates to the technical field of knowledge graphs, and in particular to a knowledge graph construction method, a retrieval method, a system and electronic equipment for the equipment field.
Background
A knowledge graph is a technique that organizes entities, concepts and relationships into a structured knowledge network. It integrates and represents multi-source heterogeneous knowledge so that a machine can understand and reason over human knowledge, which in turn enables applications in artificial intelligence fields such as semantic understanding, natural language processing and question answering. Knowledge graph construction must solve problems such as multi-source knowledge extraction and knowledge storage, and must also take into account the consistency, completeness and reliability of the knowledge.
A domain knowledge graph is the application of the knowledge graph to a specific domain. It aims to construct a graph rich in the knowledge and semantic relations of that domain and is used to support various intelligent applications in the domain. Constructing a domain knowledge graph involves knowledge extraction, knowledge storage and other technologies. Knowledge extraction refers to extracting domain-related entities and relations from unstructured data, and knowledge storage refers to persisting knowledge extracted from different data sources in a unified knowledge graph form. Domain knowledge graph retrieval needs to combine techniques such as semantic matching and path search to retrieve related entities and knowledge from the graph quickly and accurately, thereby providing accurate answers and services for users. Because entities in a domain knowledge graph are linked by various complex relationships, traditional retrieval based on keyword matching cannot meet these requirements. Research on and application of domain knowledge graph retrieval are therefore of great significance for promoting the intelligent development of various domains.
A large amount of unstructured data has accumulated in the equipment field, and it contains a great deal of useful information. However, existing equipment data is poorly organized and scattered across various databases and websites. Faced with equipment information at this scale, practitioners in the field often have to spend a great deal of time and effort researching and reading data, which makes it difficult to obtain key information in real time.
Disclosure of Invention
The invention aims to provide a knowledge graph construction method, a retrieval method, a system and electronic equipment in the equipment field, which are used for constructing a corresponding equipment knowledge system and providing more intelligent and personalized services by combining knowledge extraction, information retrieval and other technologies so as to promote the digital and visual development of the equipment field.
In order to achieve the above object, the present invention provides the following solutions:
In a first aspect, the present invention provides a method for constructing a knowledge graph in the equipment field, including:
constructing an equipment field original data set;
constructing an equipment field ontology based on the equipment field original data set;
based on the equipment field original data set and the equipment field ontology, merging entity extraction and relation extraction into a single information extraction task, and outputting triplet data in an end-to-end manner;
and constructing an equipment field knowledge graph according to the triplet data.
In a second aspect, the invention provides a retrieval method based on the equipment field knowledge graph, comprising:
according to search keywords, performing information retrieval on the equipment field knowledge graph determined in the first aspect, using an information retrieval strategy based on node matching and query expansion; node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; query expansion refers to expanding over the entities in the equipment field knowledge graph to find entities similar to the search keywords.
In a third aspect, the present invention provides a knowledge graph construction system for the equipment field, including:
a data set construction module, used for constructing an equipment field original data set;
an equipment field ontology construction module, used for constructing an equipment field ontology based on the equipment field original data set;
a triplet data extraction module, used for merging entity extraction and relation extraction into a single information extraction task based on the equipment field original data set and the equipment field ontology, and outputting triplet data in an end-to-end manner;
and an equipment field knowledge graph construction module, used for constructing an equipment field knowledge graph according to the triplet data.
In a fourth aspect, the present invention provides a retrieval system based on the equipment field knowledge graph, including:
an information retrieval module, used for performing information retrieval, according to search keywords, on the equipment field knowledge graph determined in the first aspect, using an information retrieval strategy based on node matching and query expansion; node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; query expansion refers to expanding over the entities in the equipment field knowledge graph to find entities similar to the search keywords.
In a fifth aspect, the present invention provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform the equipment field knowledge graph construction method according to the first aspect.
In a sixth aspect, the present invention provides an electronic device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to perform the retrieval method based on the equipment field knowledge graph according to the second aspect.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention implements a complete domain knowledge graph application workflow covering file parsing, ontology modeling, knowledge extraction, graph construction and information retrieval, has excellent data processing and semantic association capabilities, and provides an excellent data visualization function.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a knowledge graph construction method for the equipment field according to an embodiment of the present invention;
Fig. 2 is a flow chart of a retrieval method based on the equipment field knowledge graph according to an embodiment of the present invention;
Fig. 3 is a schematic flow chart of an automatic construction and retrieval method for knowledge graphs in the equipment field according to an embodiment of the present invention;
Fig. 4 is a flow chart of equipment field ontology construction according to an embodiment of the present invention;
Fig. 5 is a block diagram of the joint knowledge extraction algorithm for the equipment field according to an embodiment of the present invention;
Fig. 6 is a flow chart of equipment field knowledge graph construction according to an embodiment of the present invention;
Fig. 7 is a structure diagram of the information retrieval strategy based on node matching and query expansion according to an embodiment of the present invention;
Fig. 8 is a flow chart of knowledge retrieval based on a domain knowledge graph according to an embodiment of the present invention;
Fig. 9 is an overall architecture diagram of the domain knowledge graph construction and retrieval system according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a knowledge graph construction method, a retrieval method, a system and electronic equipment for the equipment field, which mainly comprise the following steps: visually managing and exporting the schema-layer ontology of the domain graph; performing format verification, cleaning, format conversion and storage on the uploaded text information; extracting triples from the processed text using an end-to-end joint knowledge extraction framework; combining the extracted triples into a knowledge graph and storing it with a persistence mechanism; and retrieving the knowledge graph nodes and path information stored in the system using an information retrieval strategy based on node matching and query expansion. The invention uses the knowledge graph as an information management tool and uses visualization to help display the associations among different equipment entities, thereby providing a more efficient data management scheme for intelligent construction in the equipment field.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, the method for constructing a knowledge graph in an equipment field provided in this embodiment includes:
Step 101: constructing an equipment field original data set to provide data input for the knowledge extraction algorithm.
In this embodiment, step 101 specifically includes: crawling web page data about professional weapon equipment using crawler technology, and storing the crawled data as JSON files to build the equipment field original data set.
The detailed process of this step is as follows:
(1) Crawler technology is used to crawl web pages about professional weapon equipment and collect their text data. The data acquisition flow comprises: acquiring equipment hierarchy information, determining an initial link queue, traversing the link queue, parsing the page structure, extracting the introductory description of each piece of equipment with regular expressions, and storing the candidate equipment data set.
(2) The unstructured text data is cleaned. Data cleaning comprises: filtering duplicate data, unifying data formats and descriptive information, and removing meaningless noise characters.
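The cleaning step in (2) can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the record fields `name` and `description`, the noise pattern, and the use of Unicode NFKC normalization to unify character forms are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed noise pattern: leftover HTML tags and ASCII control characters.
_NOISE = re.compile(r"<[^>]+>|[\x00-\x08\x0b\x0c\x0e-\x1f]")
_SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Unify character forms, strip noise characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width to half-width
    text = _NOISE.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def clean_records(records):
    """Clean every record and drop empty or duplicate ones, keeping order."""
    seen = set()
    cleaned = []
    for rec in records:
        item = {
            "name": clean_text(rec["name"]),
            "description": clean_text(rec["description"]),
        }
        key = (item["name"], item["description"])
        if item["description"] and key not in seen:
            seen.add(key)
            cleaned.append(item)
    return cleaned
```

Duplicates are detected only after normalization, so two records that differ merely in markup or spacing collapse into one.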
Step 102: constructing an equipment field ontology based on the equipment field original data set; that is, an equipment field ontology is designed for the constructed equipment field original data set, providing schema-layer information for entity and relation extraction.
In this embodiment, step 102 specifically includes: constructing the equipment field ontology by combining a top-down ontology modeling scheme and a bottom-up ontology modeling scheme.
Top-down ontology modeling scheme: the schema-layer ontology of the equipment field is designed by summarizing prior industry knowledge of the equipment field, and the data in the equipment field original data set is then extracted from the top down according to this schema-layer ontology. That is, the acquired entity and relation data are cleaned and filled into the data layer according to the defined schema, finally forming a high-quality knowledge graph.
Bottom-up ontology modeling scheme: the data in the equipment field original data set is organized and cleaned, and the equipment field ontology is extracted from the bottom up. That is, the collected text is organized and cleaned, entity and relation extraction is performed to obtain triplet data, the data is then summarized and generalized, and concepts are gradually abstracted from the bottom layer to design the ontology model of the knowledge base.
Step 103: based on the equipment field original data set and the equipment field ontology, merging entity extraction and relation extraction into a single information extraction task, and outputting triplet data in an end-to-end manner.
In this embodiment, step 103 specifically includes: based on the equipment field original data set and the equipment field ontology, using a joint knowledge extraction algorithm based on a Seq-to-Seq framework and a RoBERTa model to merge entity extraction and relation extraction into a single information extraction task, and outputting triplet data in an end-to-end manner.
The detailed process of this step is as follows:
(1) The original sentence is passed through an improved bidirectional encoder representation model and output as a sequence of encoded vectors of equal length. Each vector is formed by superposing the word vector of a word, the text vector of the sentence containing the word, and the position vector of the word in the sentence.
(2) The output encoded vector sequence is passed through a normalization layer that identifies and labels the subject entity. The normalization layer consists of two binary classifiers implemented with a half-pointer, half-labeling activation function, and yields the encoded vector carrying the semantic-space features of the subject entity.
(3) The encoded vector carrying the semantic-space features of the subject entity is concatenated with the original encoded vector sequence and, with the subject entity's encoded vector as the condition, input into a conditional normalization layer. For each relation and attribute in the constructed equipment field ontology, the same half-pointer, half-labeling structure predicts the head pointer and tail pointer vector representations of the corresponding object entity.
Step 104: constructing an equipment field knowledge graph according to the triplet data.
In this embodiment, step 104 specifically includes: mapping the entities and relations in the triplet data to nodes and edges in the graph using Cypher syntax, and importing the constructed equipment field knowledge graph into a Neo4j graph database for storage.
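As an illustration of mapping a triple to nodes and an edge, the sketch below builds a parameterized Cypher `MERGE` statement for one (subject, predicate, object) triple. The node label `Entity`, the relationship type `RELATION` and the property name `name` are assumptions for the example; the source does not specify the graph schema.

```python
def triple_to_cypher(subject: str, predicate: str, obj: str):
    """Return a parameterized Cypher statement and its parameters for one triple."""
    # MERGE creates the node or edge only if it does not already exist,
    # so importing the same triple twice does not duplicate data.
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[r:RELATION {name: $predicate}]->(o)"
    )
    params = {"subject": subject, "predicate": predicate, "object": obj}
    return query, params


query, params = triple_to_cypher("J-16 fighter", "country of production", "China")
```

The predicate is stored as a relationship property because Cypher does not allow relationship types to be passed as query parameters. The returned pair could then be handed to a Neo4j client, for example `session.run(query, params)` in the official Python driver.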
Example 2
As shown in Fig. 2, this embodiment provides an information retrieval method based on a domain knowledge graph, including:
Step 201: according to search keywords, performing information retrieval on the equipment field knowledge graph determined in Example 1, using an information retrieval strategy based on node matching and query expansion. Node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; it addresses incomplete user expression at the semantic level and improves retrieval precision. Query expansion refers to expanding over the entities in the equipment field knowledge graph to find entities similar to the search keywords; it makes full use of the links between pieces of knowledge in the equipment graph and widens the search range to return more triples, improving retrieval recall.
The node matching method comprises the following steps:
This embodiment provides a scheme for constructing a strongly connected synonym graph for the equipment field, in which subject and object entities linked by relations such as code name, alias and translation are added to the connected graph. On this basis, relations between words with low surface-level (literal) similarity can be identified, and the graph is dynamically updated according to user input.
This embodiment also builds an inverted index on the entity name attribute, so that entity names containing the search term can be located quickly during node matching. The index can also be used for pruning, which improves the efficiency of node and path queries.
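A minimal sketch of an inverted index over entity names is shown below. The source does not say how names are tokenized, so indexing by lowercase character bigrams is an assumption chosen for the example; the function names are likewise hypothetical.

```python
from collections import defaultdict


def build_inverted_index(entity_names):
    """Map each lowercase character bigram to the entity names containing it."""
    index = defaultdict(set)
    for name in entity_names:
        key = name.lower()
        for i in range(len(key) - 1):
            index[key[i:i + 2]].add(name)
    return index


def match_nodes(index, term):
    """Return entity names containing the search term, pruning via the index."""
    key = term.lower()
    grams = [key[i:i + 2] for i in range(len(key) - 1)]
    if not grams:
        return set()  # terms shorter than two characters are not indexed here
    # Only names that share every bigram of the term remain candidates.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all bigrams is necessary but not sufficient, so verify the substring.
    return {name for name in candidates if key in name.lower()}
```

Intersecting the posting sets prunes most entities before any string comparison is made, which is the efficiency gain the paragraph above describes.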
After the matching nodes are determined, the ontology type information of the corresponding entities is used to narrow the candidate entity set for subsequent query expansion, which reduces the workload of semantic comparison.
Query expansion uses two methods: one based on the graph structure and one based on semantic matching.
Query expansion based on the graph structure obtains the neighbor nodes that lie one or more hops away from the matching nodes in the equipment field knowledge graph, and uses attribute and relation information to make the retrieval more complete. Query expansion based on semantic matching expands the search keywords according to the semantic similarity between two words; the semantic matching score is computed with algorithms such as the Jaccard coefficient, attribute similarity and edit distance.
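Two of the measures named above, the Jaccard coefficient and edit distance, can be sketched as follows. How the patent combines them is not stated, so the weighted mix in `match_score` and its default weight are assumptions; attribute similarity is omitted because it depends on the graph's attribute schema.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over the character sets of two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_score(a: str, b: str, w_jaccard: float = 0.5) -> float:
    """Weighted mix of Jaccard and normalized edit similarity (assumed weights)."""
    longest = max(len(a), len(b)) or 1
    edit_sim = 1.0 - edit_distance(a, b) / longest
    return w_jaccard * jaccard(a, b) + (1.0 - w_jaccard) * edit_sim
```

Candidate entities could then be ranked by `match_score` against the search keyword, keeping those above a chosen cutoff.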
Example 3
This embodiment provides an automatic construction and retrieval method for an equipment field knowledge graph. Fig. 3 shows the overall flow of the method, which comprises the following steps:
S1, data set construction: the equipment field is taken as an example to describe how the data set is constructed, specifically as follows:
Because of the sensitivity and confidentiality of the equipment field, no complete equipment field data set is currently publicly available, and related information is scarce. After investigation, this embodiment selects weapon-equipment websites with a high degree of structure, reliability and comprehensive information, such as the weapon encyclopedia and world military network websites.
This embodiment uses crawler technology to crawl web page data about professional weapon equipment as an important data source for constructing the equipment field knowledge graph. For data crawling, the URL of the website's home page is first parsed to obtain weapon classification information, from which an initial URL queue is determined. The queue is then traversed to obtain the links of all equipment nodes. Finally, data is fetched from the page behind each link, the introductory description of the weapon equipment is extracted by parsing the page structure with tools such as regular expressions, and the result is stored as JSON files to build the equipment field original data set.
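The queue traversal and regular-expression extraction described above can be sketched as follows. The page markup (`<div class="intro">`), the `fetch` callback and the record fields are hypothetical, since the structure of the crawled sites is not given; network access is left to the caller so the sketch stays self-contained.

```python
import json
import re
from collections import deque

# Assumed markup: links in href attributes, introductions in a div.intro element.
LINK_RE = re.compile(r'href="([^"]+)"')
INTRO_RE = re.compile(r'<div class="intro">(.*?)</div>', re.S)


def crawl(start_url, fetch):
    """Traverse a URL queue breadth-first and collect introduction texts.

    `fetch` is a caller-supplied function that maps a URL to its HTML string.
    """
    queue, seen, records = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        intro = INTRO_RE.search(html)
        if intro:
            records.append({"url": url, "description": intro.group(1).strip()})
        for link in LINK_RE.findall(html):
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return records


def to_json(records) -> str:
    """Serialize records as JSON, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

In practice `fetch` would wrap an HTTP client; for testing it can simply look pages up in a dictionary.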
S2, ontology modeling: the equipment field ontology is designed. The specific flow is shown in Fig. 4 and can be divided into seven steps, as follows:
First, determine the research area and content. Taking the equipment field as an example, this embodiment makes the research task explicit and refines it: to provide users with a scheme for constructing an equipment field knowledge graph, so that a more suitable ontology covering the application requirements can be built.
Second, reuse existing domain ontologies. This embodiment surveys existing prior knowledge systems and related equipment field information, and draws on the classification system of the weapon encyclopedia to complete the hierarchical division of the equipment ontology, which reduces time cost and helps guarantee the quality of the final ontology.
Third, list the core concepts and elements of the domain. In this embodiment, core concepts are defined as ontology classes such as aircraft and missile weapons, and elements are defined as properties such as range and maximum speed. The technical terms of the equipment field knowledge graph are listed here; the specific classification system and attribute relations are refined in the following steps.
Fourth, establish the classification system. Prior knowledge of the equipment field is integrated and a hierarchy of entity types is built, in which correct hypernym-hyponym relations must be ensured. The ontology hierarchy and type structure of the entities in the equipment field are built from the top down according to the hierarchical relations between the entities.
Fifth, define the attributes and relations of the ontology. This step defines the attributes of the ontology in each category and constructs its associations with other ontologies. Taking a pistol as an example, its attributes include magazine capacity, firing performance, etc., which describe characteristics of the equipment field concept itself.
Sixth, define constraints on attributes and relations. Constraining the attributes and relations of the ontology ensures the robustness and standardization of the information and avoids abnormal values. For example, relations such as country of production and research and development unit are used only to represent related concepts in the equipment field and cannot be used in an unrelated domain knowledge graph, and specifications are defined for measurement attributes such as the length, width and height of the equipment to keep the data consistent.
Seventh, build and revise the ontology. In this embodiment, existing equipment industry information standards, encyclopedia hierarchy information and the like are first summarized from the top down to form the foundational concepts of the domain. The ontology of the equipment field knowledge graph is then modeled according to the requirements of the knowledge graph construction flow. Finally, because a manually constructed ontology schema is incomplete, the ontology is refined and corrected from the bottom up according to the high-frequency domain terms in the data set.
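The classification system and the relation constraints from the fourth and sixth steps can be illustrated with a small sketch. The type names, the parent links and the domain and range assigned to each relation are all hypothetical; only the relation names "country of production" and "research and development unit" come from the text.

```python
# Hypothetical miniature ontology; each type maps to its parent type (or None).
ONTOLOGY = {
    "types": {
        "Equipment": None,
        "Aircraft": "Equipment",
        "Pistol": "Equipment",
        "Country": None,
        "Organization": None,
    },
    # relation name -> (required subject type, required object type)
    "relations": {
        "country of production": ("Equipment", "Country"),
        "research and development unit": ("Equipment", "Organization"),
    },
}


def is_a(entity_type, ancestor):
    """Walk the parent chain to test the hypernym-hyponym relation."""
    while entity_type is not None:
        if entity_type == ancestor:
            return True
        entity_type = ONTOLOGY["types"].get(entity_type)
    return False


def validate(subject_type, relation, object_type):
    """Check a triple's types against the relation's domain and range."""
    if relation not in ONTOLOGY["relations"]:
        return False
    domain, range_ = ONTOLOGY["relations"][relation]
    return is_a(subject_type, domain) and is_a(object_type, range_)
```

A check of this kind rejects triples that use an equipment relation between unrelated types, which is one way to keep abnormal values out of the graph.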
S3, knowledge extraction: this embodiment provides a joint knowledge extraction algorithm based on a Seq-to-Seq framework and a RoBERTa model, whose structure is shown in Fig. 5. The details are as follows:
This embodiment borrows the modeling idea of the decoder in the Seq-to-Seq model to merge the two subtasks of entity recognition and relation classification into a single end-to-end problem, and defines a probabilistic model for triple extraction in the entity-relation extraction model, as shown in the following formula:
$$P(s, p, o) = P(s)\,P(o \mid s)\,P(p \mid s, o)$$
First, a sentence is input into the RoBERTa pre-trained language model, which superposes the word vector of each word, the text vector of the sentence containing the word, and the position vector of the word in the sentence, in order to strengthen the contextual features of the input sentence and help resolve textual ambiguity. The original sentence sequence is output as an encoded vector sequence of equal length, in which each output vector serves as the representation of the corresponding word in the semantic space and contains features of both the current word and the whole input sentence.
The encoded vector sequence is then input into a LayerNormalization layer, which serves as the labeling layer for the subject entity. It consists of two binary classifiers, each using a half-pointer, half-labeling structure implemented with a sigmoid activation function. The two classifiers detect whether the current token is the start position or the end position of the subject entity, respectively; a position is marked 1 if so and 0 otherwise. This step maps the semantic space into the partition space required for the subject entity, yielding the head pointer and tail pointer vector representations of the subject entity in the input text.
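To make the start and end labeling concrete, the sketch below decodes entity spans from two per-token probability sequences such as the two sigmoid classifiers would produce. The 0.5 threshold and the rule of pairing each start with the nearest end at or after it are assumptions; the source does not specify the decoding rule.

```python
def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each predicted start with the nearest predicted end at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))  # inclusive (start, end) token indices
    return spans


tokens = ["J-16", "fighter", "was", "developed", "in", "China"]
start_probs = [0.9, 0.1, 0.0, 0.1, 0.0, 0.8]
end_probs = [0.2, 0.95, 0.0, 0.0, 0.0, 0.7]
entities = [" ".join(tokens[s:e + 1]) for s, e in decode_spans(start_probs, end_probs)]
```

Because every token gets an independent start score and end score, several entities can be recovered from one sentence, including single-token ones where start and end coincide.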
Taking an entity 'fighter-16 fighter' as an example, the head pointer and the tail pointer of the main entity obtained in the previous step are used to obtain the coded vector with the semantic space characteristics of the main entity 'fighter-16 fighter' from the coded vector sequence output by the Roberta pre-training model, then the coded vector is spliced with the multiplexed Roberta coded vector sequence, the coded vector of the main entity is taken as a condition to be input to a conditional layernormalization layer together, and the next prediction process is similar to the process of extracting the main entity. Aiming at each relation information in the constructed relation and attribute list of the equipment domain ontology, a half pointer-half labeling structure is adopted to predict the head pointer and tail pointer vector representation of the corresponding guest entity, and the relation is labeled while the guest entity indication is completed. Finally, the end-to-end joint extraction model combines the extracted guest entity with the host entity and the relationship as a group of triples, and outputs the triples in the form of SPO, and the finally outputted triples list is [ Jian-16 fighter, research and development unit, shenyang plane company ], [ Jian-16 fighter, produced country, china ].
This embodiment adopts a RoBERTa knowledge extraction method based on the Seq-to-Seq framework. Through the half-pointer, half-labeling sigmoid activation function, the method converts the extraction of entities and relations into a combination of several binary classification tasks based on sequence labeling. Viewed this way, the loss during training of the knowledge extraction model can be computed with the binary cross-entropy loss function shown in the following formula.
$$\mathrm{loss} = -y\log p - (1-y)\log(1-p)$$
Here y is the classification label used in entity extraction, and p is the probability predicted by the model for the positive label (y = 1). When a sigmoid activation function is used for entity extraction, a class imbalance arises because the entities to be extracted are far fewer than the non-target tokens. This embodiment therefore proposes optimizing the loss function by raising the probability value to a power, as shown in the following formula. This pushes the probability value closer to 0, so that the initial state of the probability is closer to the ideal state and model convergence is accelerated.
$$\mathrm{loss} = -y\log p^{n} - (1-y)\log\left(1-p^{n}\right)$$
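The effect of the power on the loss can be checked numerically with a short sketch. The function name `powered_bce`, the clamping constant `eps`, and the value n = 2 are assumptions chosen for illustration; the embodiment does not state a value for n.

```python
import math

def powered_bce(y: int, p: float, n: float = 1.0, eps: float = 1e-12) -> float:
    """Binary cross-entropy computed on p**n instead of p.

    With n = 1 this is the ordinary binary cross-entropy.
    eps only guards against log(0) and is not part of the formula.
    """
    q = p ** n
    q = min(max(q, eps), 1.0 - eps)
    return -y * math.log(q) - (1 - y) * math.log(1.0 - q)

# An untrained sigmoid output typically sits near 0.5.
p0 = 0.5
plain = powered_bce(0, p0)         # negative label, ordinary loss
powered = powered_bce(0, p0, n=2)  # negative label, p squared (assumed n)
```

For a negative label at p = 0.5, the powered loss is smaller than the ordinary one because 0.5 squared is already closer to 0, which matches the stated aim of starting nearer the ideal state when negatives dominate.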
As shown in fig. 6, the knowledge extraction described in this embodiment may be divided into two parts, an offline part and an online part:
the offline portion includes the data set acquisition, ontology modeling, knowledge extraction model training, model packaging, and knowledge extractor setup described above. The online portion includes the file upload, data processing, entity relationship extraction services and triplet creation described above.
It should be noted that in this embodiment, the offline portion of the system uses a hot-start scheme, and the server loads the triplet extraction model at start-up, so that the result can be quickly processed and returned when the knowledge extraction request is received.
S4, knowledge storage. To facilitate subsequent data management and retrieval, the acquired entity-relation data must be persisted to a database, specifically as follows:
The data to be saved in this embodiment mainly comprises two kinds: the processing results and structured path information, which are stored on the server after data cleaning has finished, and the extracted triplet data. For the latter, this embodiment selects the Neo4j graph database, which offers high query efficiency and a mature development ecosystem, as the knowledge graph persistence tool. After the triplet data are acquired, the entities and relations are mapped to nodes and edges in the graph using Cypher syntax, and the constructed equipment domain knowledge graph is imported into Neo4j for storage.
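As a minimal sketch of the mapping from triplets to nodes and edges, the following builds a parameterized Cypher `MERGE` statement and one parameter set per triplet. The node label `Entity`, the relationship type `RELATION`, and the property name `name` are hypothetical choices; the embodiment does not disclose its graph schema. The predicate is stored as a relationship property because Cypher does not allow the relationship type itself to be passed as a parameter.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# MERGE creates the node or edge only if it does not exist yet,
# so importing the same triplet twice does not duplicate data.
MERGE_TRIPLE = (
    "MERGE (s:Entity {name: $subject}) "
    "MERGE (o:Entity {name: $object}) "
    "MERGE (s)-[r:RELATION {name: $predicate}]->(o)"
)

def to_cypher_params(triples: List[Triple]) -> List[Dict[str, str]]:
    """Build one parameter dictionary per triplet for MERGE_TRIPLE."""
    return [{"subject": s, "predicate": p, "object": o} for s, p, o in triples]

triples = [
    ("Jian-16 fighter", "research and development unit", "Shenyang Aircraft Company"),
    ("Jian-16 fighter", "produced country", "China"),
]
params = to_cypher_params(triples)
```

Each dictionary in `params` could then be submitted together with `MERGE_TRIPLE` through whatever Neo4j client the deployment uses.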
S5, information retrieval. FIG. 7 shows the structure of the information retrieval strategy based on node matching and query expansion, which specifically comprises the following:
Node matching refers to determining whether a search keyword can be mapped onto an entity in the graph. In this embodiment, if the entity denoted by the search keyword already exists in the equipment domain knowledge graph, the keyword is node-matched to that entity. Conversely, if the search term cannot be matched to any entity name in the index file of the graph, information retrieval is performed with the query expansion strategy set forth below.
Query expansion refers to expanding an entity to related entities and finding similar ones, so as to avoid search failures caused by imprecise user input. The key to query expansion in this embodiment is to combine a query expansion method based on graph structure with one based on semantic matching. Performing structure expansion and computing semantic matching scores between the search term and the graph entities can effectively improve the performance and search quality of the knowledge graph retrieval system.
The semantic matching score between the search term and a graph entity is calculated as follows:
In this embodiment, the semantic matching score between a search term and a graph entity combines the Jaccard coefficient with the edit distance, as shown in the following formula.
$$\mathrm{Sim}(w,e) = \alpha\,J(w,e) + (1-\alpha)\,\mathrm{Sim}_d(w,e)$$
In the formula, α is a weight coefficient, w is the search term, e is the graph entity name, Sim(w, e) is the semantic matching score between the search term and the graph entity, J(w, e) is the Jaccard coefficient between the search term and the graph entity, and Sim_d(w, e) is the textual similarity computed from the edit distance between the search term and the graph entity.
The Jaccard coefficient measures the similarity and difference between finite sample sets. To compute the Jaccard coefficient between the search term w and the graph entity e, both are treated as sets of characters: the number of identical characters between w and e is counted first, and the Jaccard coefficient is then computed from this count and the Size function, which denotes the number of distinct characters in a string.
Sim_d(w, e) is regarded as the text matching score between the query term and the graph node. It is computed from the quantities maxlen(w, e) and Dis(w, e).
Here maxlen(w, e) is the length of the longer of the search term w and the entity name e, and is used to normalize the lengths of the two strings. Dis(w, e) reflects the degree of difference between the two: it is the Levenshtein edit distance between the search term w and the entity name e, i.e. the minimum number of operations needed to convert the string w into e, and it can be computed with a dynamic programming algorithm.
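The score Sim(w, e) can be sketched as follows. The formulas for J(w, e) and Sim_d(w, e) are not legible in this text, so the sketch assumes the standard definitions: the Jaccard coefficient as shared characters divided by the size of the union of the two character sets, and Sim_d as one minus the Levenshtein distance divided by maxlen(w, e). The function names and the default α = 0.5 are likewise assumptions.

```python
def jaccard(w: str, e: str) -> float:
    """Shared distinct characters over distinct characters in either string."""
    a, b = set(w), set(e)
    if not a and not b:
        return 1.0
    same = len(a & b)
    return same / (len(a) + len(b) - same)

def levenshtein(w: str, e: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(e) + 1))
    for i, cw in enumerate(w, start=1):
        curr = [i]
        for j, ce in enumerate(e, start=1):
            cost = 0 if cw == ce else 1
            curr.append(min(prev[j] + 1,          # delete from w
                            curr[j - 1] + 1,      # insert into w
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def edit_similarity(w: str, e: str) -> float:
    """Assumed form of Sim_d: 1 - Dis(w, e) / maxlen(w, e)."""
    longest = max(len(w), len(e))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(w, e) / longest

def semantic_score(w: str, e: str, alpha: float = 0.5) -> float:
    """Sim(w, e) = alpha * J(w, e) + (1 - alpha) * Sim_d(w, e)."""
    return alpha * jaccard(w, e) + (1 - alpha) * edit_similarity(w, e)
```

The Jaccard term ignores character order while the edit-distance term is order-sensitive, so the weight α trades tolerance for reordered input against tolerance for small typos.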
The matching score between graph entities is calculated as follows:
In this embodiment, entity attributes are used as the indicator for the semantic matching score between entities. Specifically, the degree of association between two entities is measured by the size of the attribute set they share, which better explores the semantic relations between entities and improves the accuracy of information search. The more attributes two entities have in common, the higher their similarity. Sim(e_1, e_2) denotes the semantic matching score between two different entities e_1 and e_2.
Here A(e) denotes the attribute set of entity e, and Sim(e_1, e_2), the semantic matching score between entities e_1 and e_2, is calculated from the proportion of shared attributes among all attributes of the entities. For example, if two equipment entities in the weapon domain knowledge graph have many identical parameters in their attribute sets, such as caliber, rate of fire, and range, the semantic similarity between them is high.
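A minimal sketch of the attribute-based score follows. The formula itself is not legible in this text; the sketch assumes that "the proportion of shared attributes among all attributes" means the size of the intersection of A(e_1) and A(e_2) divided by the size of their union. The attribute names in the example are illustrative.

```python
from typing import Set

def attribute_similarity(attrs1: Set[str], attrs2: Set[str]) -> float:
    """Assumed form of Sim(e1, e2): |A(e1) & A(e2)| / |A(e1) | A(e2)|."""
    union = attrs1 | attrs2
    if not union:
        return 0.0  # no attributes at all: treat as not comparable
    return len(attrs1 & attrs2) / len(union)

# Two hypothetical equipment entities described by their attribute names.
gun_a = {"caliber", "rate of fire", "range", "weight"}
gun_b = {"caliber", "rate of fire", "range", "crew"}
score = attribute_similarity(gun_a, gun_b)
```

The two entities share three of five distinct attributes, giving a score of 0.6.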
Combining the two algorithms, the overall graph retrieval strategy is shown in FIG. 8 and specifically comprises:
First, node matching is performed on the query keyword. Node matching may either succeed or fail, and to optimize the search results this embodiment designs a separate information retrieval strategy for each case, depending on whether a matching entity can be obtained from the equipment domain knowledge graph.
The node matching failure scenario specifically comprises:
Using the proposed calculation that combines the Jaccard coefficient and the edit distance, the search keyword is compared for fuzzy semantic similarity with every node in the graph in turn, and after sorting the top-scoring node is taken. If its matching score exceeds the set threshold, query expansion in this scenario is complete: the entity is taken as the node linked to the query keyword, and that node is returned to the user together with all triplet data formed with its set of directly associated nodes.
The node matching success scenario specifically comprises the following steps:
In this scenario, a keyword w entered by the user can be mapped into the equipment domain knowledge graph and is resolved through the index to the corresponding graph node e ∈ E. Once the entity node e is determined, it serves as the starting point for query expansion, so that related entities and their relations can be retrieved. First, the entity node e is looked up in the synonym table; if a synonymous entity is found, the corresponding query expansion triplet information is returned to the user, otherwise the next level of computation proceeds. Second, the type of the matched entity node is obtained from the knowledge graph, and the candidate entity set is reduced to the entity nodes of that type. The search keyword is expanded by searching for entities that are semantically similar to the entity e; if the matching score exceeds a preset threshold, the entity is added to the synonym table, further extending the relevance of the search keyword. Then the sets of nodes directly associated with the entity e and with its similar entities are obtained from the knowledge graph. Finally, when node matching succeeds, the knowledge-graph-based information retrieval scheme of this embodiment returns the entity e corresponding to the search keyword, its synonymous or similar nodes e', and all triplet information formed by these entities and their directly associated node sets.
In summary, the knowledge retrieval process first performs first-layer query expansion: semantic matching scores between the search term and the graph entities are computed from the edit distance and the Jaccard coefficient to obtain the list of entities produced by fuzzy matching. After the user selects the desired entity, node matching is performed: the address information of the entity is looked up through the inverted name index built in the knowledge extraction module, locating the entity that actually exists in the graph. The entity set to be searched is then reduced according to the type attribute of the entity and second-layer query expansion is performed: semantic matching scores between graph entities are computed from attribute similarity, and if the attribute similarity exceeds a preset threshold, the system regards the entity as an expanded entity and adds it to the synonym table, which is thereby updated dynamically.
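The two-layer flow above can be sketched in simplified form. This is not the embodiment's implementation: the synonym table and the user selection step are omitted, `difflib.SequenceMatcher` stands in for the combined Jaccard and edit-distance score, and the thresholds, the function names, and the entity "Fighter X" are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

def text_score(w: str, e: str) -> float:
    # Stand-in scorer (assumption); the embodiment combines
    # the Jaccard coefficient with the edit distance instead.
    return SequenceMatcher(None, w, e).ratio()

def attr_score(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def neighbours(entity: str, triples: List[Triple]) -> List[Triple]:
    """All triplets in which the entity is subject or object."""
    return [t for t in triples if entity in (t[0], t[2])]

def retrieve(keyword: str, triples: List[Triple], types: Dict[str, str],
             attrs: Dict[str, Set[str]], text_threshold: float = 0.6,
             attr_threshold: float = 0.5) -> List[Triple]:
    entities = set(types)
    if keyword in entities:            # node matching succeeded
        matched = keyword
    else:                              # first-layer query expansion
        best = max(entities, key=lambda e: text_score(keyword, e), default=None)
        if best is None or text_score(keyword, best) < text_threshold:
            return []
        matched = best
    # Second-layer expansion: same-type entities with similar attributes.
    similar = [e for e in entities
               if e != matched and types[e] == types[matched]
               and attr_score(attrs[matched], attrs[e]) >= attr_threshold]
    result: List[Triple] = []
    for ent in [matched] + sorted(similar):
        for t in neighbours(ent, triples):
            if t not in result:
                result.append(t)
    return result

triples = [("Jian-16 fighter", "produced country", "China"),
           ("Fighter X", "produced country", "China")]
types = {"Jian-16 fighter": "aircraft", "Fighter X": "aircraft", "China": "country"}
attrs = {"Jian-16 fighter": {"wingspan", "max speed", "crew"},
         "Fighter X": {"wingspan", "max speed", "range"},
         "China": set()}

exact = retrieve("Jian-16 fighter", triples, types, attrs)
fuzzy = retrieve("Jian16 fighter", triples, types, attrs)
```

In this toy graph the misspelled query "Jian16 fighter" resolves to the same node as the exact query, and both return the triplets of the matched entity plus those of the same-type entity with sufficiently similar attributes.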
S6, graph visualization. This function mainly facilitates interaction between the user and the platform in this embodiment, and specifically comprises:
In this embodiment, Vue.js is selected as the basic development framework of the whole front end, and the D3.js visualization library is adopted to achieve a dynamic, three-dimensional visual display of the knowledge graph of the weaponry field. D3.js is highly flexible and extensible, and supports the display of many data formats and chart types. For knowledge graph visualization in the weaponry field, D3.js provides a force-directed graph module to display the relations between entities, and its simulated random particle motion makes the graph more dynamic and interactive. In addition, the system provides a parameter adjustment mechanism based on centripetal force, positioning force, and collision, to better fit the visual display of graph data in various scenarios. To meet the interaction needs of domain personnel, this embodiment implements basic interaction functions through canvas operations, element operations, and data operations. Specifically, canvas operations support moving, zooming, and the like; elements such as nodes and edges support style configuration, focus highlighting, mouse dragging, type changing, and the like; data operations support adding, deleting, modifying, and querying nodes and edges. In addition, interaction in the search scenario is achieved through operations such as path locking and focused display.
First, knowledge graph visualization requires a background canvas, and an <svg> tag is typically used to generate the graph container. This embodiment then uses the select function provided by D3.js to select the generated graph container and uses the force-layout function to set the data format to force-directed graph data, so that entities are better distributed in the graph container by adjusting the center position and the repulsion strength of the layout. To let the user resize the graph with the mouse wheel, the zoom function provided by D3.js may be used to change the zoom attribute of the graph.
Next, this embodiment uses the data function provided by D3.js to set the triplet information in the graph container and binds the entity data and the relation data separately, so as to realize graph visualization. Entity data generate <circle> and <text> tags, and relation data generate <line> and <text> tags. Through these tags, nodes, edges, and their name attributes are combined into the weaponry knowledge graph visualization. When the mouse hovers over a node, the graph visualization interface also provides a knowledge card that displays more attribute information. An advantage of D3.js is that it generates element tags on the visual interface from the data and dynamically updates the interface as the data change.
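Before entity and relation data can be bound to tags, the triplets have to be arranged as node and edge lists. The following sketch shows one plausible back-end conversion into a JSON payload for a force-directed layout. The payload field names `nodes`, `links`, `name`, and `label` are assumptions; the embodiment does not specify the transfer format beyond JSON.

```python
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

def triples_to_graph(triples: List[Triple]) -> Dict[str, list]:
    """Deduplicate entities into nodes and turn each triplet into a link."""
    ids: Dict[str, int] = {}
    nodes: List[dict] = []
    links: List[dict] = []
    for s, p, o in triples:
        for name in (s, o):
            if name not in ids:
                ids[name] = len(nodes)  # node id equals its list index
                nodes.append({"id": ids[name], "name": name})
        links.append({"source": ids[s], "target": ids[o], "label": p})
    return {"nodes": nodes, "links": links}

triples = [
    ("Jian-16 fighter", "research and development unit", "Shenyang Aircraft Company"),
    ("Jian-16 fighter", "produced country", "China"),
]
graph = triples_to_graph(triples)
payload = json.dumps(graph, ensure_ascii=False)
```

Each node would correspond to one <circle> with a <text> label, and each link to one <line> with a <text> label, as described above.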
In addition, the whole information interaction page is designed and written with the Vue.js framework. Several reusable sub-interfaces, including page management and the function selection menu bar, are defined under the common directory so that other pages can focus on developing their function display. Three sub-pages, gBuild, gModel, and gSearch, are placed under the pages directory; they correspond to the three use cases in the system requirements analysis and to the three controllers at the back-end Controller level. This separation of front-end and back-end logic greatly reduces the coupling of the system. To standardize data transmission between the front end and the back end, data are transmitted in JSON format through GET and POST requests. For result display, the page framework is designed first, and the data are filled in at rendering time.
Example four
In order to execute the method corresponding to the above embodiment to achieve the corresponding functions and technical effects, a knowledge graph construction system in the equipment field is provided below.
The embodiment provides a knowledge graph construction system in equipment field, which comprises:
The data set construction module is used for constructing an equipment field original data set.
The equipment field ontology construction module is used for constructing an equipment field ontology based on the equipment field original data set.
The triplet data extraction module is used for merging entity extraction and relation extraction into a single information extraction task for processing, based on the equipment field original data set and the equipment field ontology, and for outputting triplet data in an end-to-end manner.
The equipment field knowledge graph construction module is used for constructing an equipment field knowledge graph according to the triplet data.
Example five
In order to execute the method corresponding to the above embodiment to achieve the corresponding functions and technical effects, a retrieval system based on the knowledge graph of the equipment field is provided below.
The retrieval system based on the knowledge graph in the equipment field provided by the embodiment comprises:
the information retrieval module is used for retrieving information from the equipment field knowledge graph determined in the first embodiment, according to the search keywords and based on the information retrieval strategy of node matching and query expansion; the node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; the query expansion refers to expanding entities in the equipment field knowledge graph to find entities similar to the search keywords.
Example six
The embodiment provides an equipment field knowledge graph automatic construction and retrieval system, which comprises:
The data processing module is used for receiving files in txt, Word, and pdf formats uploaded by the user, and for cleaning and preprocessing the text content.
The ontology construction module is used for providing a visual tool for designing a knowledge graph mode layer after analyzing and arranging equipment field data by a user and supporting ontology management and ontology export functions.
The knowledge extraction module is used for receiving the text output by the data processing module, packaging the knowledge extraction model, realizing the joint extraction of the entity relationship by means of interface call, and transmitting the triplet data to the knowledge storage module.
And the knowledge storage module is used for responding to a map construction request of a user, integrating the extracted triplet data into a knowledge map, mapping the entity and the relation into nodes and edges in the map, and storing by using a persistence means.
And the knowledge retrieval module is used for responding to the node query and path query requests of the users, calling the corresponding retrieval service according to the retrieval type and the keyword parameters in the requests, generating query sentences through node matching and query expansion capability, and realizing the corresponding query by depending on the graph database.
The knowledge visualization module is an interactive interface and a main communication channel between a user and a system, bears the responsibility of transmitting a user request and responding to a rear-end result, and presents the triplet data as a graphic structure of nodes and edges in an interactive and visual mode. Wherein fig. 9 shows the overall architecture of such a system.
To facilitate development and subsequent maintenance, improve the stability and efficiency of the system, and reduce the coupling between modules, this embodiment divides the system into a data layer, a model layer, a service layer, and a presentation layer. The core functions of the system are implemented by service logic and model code at the back end, and a visual operation interface is provided as a unified access point.
The data layer mainly comprises data crawled from a network information source in the selected equipment field and military document data in a format of txt, doc, pdf and the like uploaded by a user, and the original unstructured text needs to be converted into structured triplet data to be stored in a graph database so as to provide data support for an upper layer.
The model layer mainly comprises a knowledge extraction model based on RoBERTa and Seq-to-Seq and an information retrieval model based on a domain knowledge graph, trains and constructs a corresponding model by accessing triplet information in the data layer, and provides an interactive interface for the service layer in an API (application program interface) mode.
The service layer integrates four functional modules: data processing, ontology modeling, knowledge extraction, and knowledge retrieval. Data processing covers cleaning and preprocessing of the original files in the data layer; ontology modeling provides the user with the ability to design the pattern layer; knowledge extraction and retrieval involve calling and wrapping the model layer through Jython. The service layer is mainly responsible for information processing within the layer, calling the model layer, and returning results to the presentation layer.
As an interactive interface between a user and a system, the main responsibility of the presentation layer is to provide a concise and clear operation interface so that the user can conveniently construct and retrieve the knowledge graph. The layer covers a plurality of functional modules such as ontology modeling, knowledge extraction and the like, and provides comprehensive knowledge map management service for users. The presentation layer encapsulates the request into a JSON format, submits the JSON format to the service layer, returns result information in real time, and can visually display information in the map.
Example seven
The present embodiment provides an electronic device including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to execute the method for constructing a knowledge graph in an equipment domain according to the first embodiment.
Alternatively, the electronic device may be a server.
In addition, the present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the equipment domain knowledge graph construction method of the first embodiment.
Example eight
The present embodiment provides an electronic device including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to cause the electronic device to execute the method for searching a knowledge graph based on the equipment domain according to the second embodiment.
Alternatively, the electronic device may be a server.
In addition, the present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for searching for a knowledge graph based on the equipment domain according to the second embodiment.
For the case where subjects in the constructed equipment field original data set overlap, the invention merges entity recognition and relation classification into a scheme that emphasizes the main entity representation and uses a half-pointer, half-labeling sequence extraction task; the invention also discloses a knowledge graph retrieval strategy based on node matching and query expansion, which combines the synonym connected graph with attribute similarity to compute semantic matching scores in different scenarios, providing more diverse and accurate matching; the invention realizes a complete domain knowledge graph application flow covering file parsing, ontology modeling, knowledge extraction, graph construction, and information retrieval, has strong data processing and semantic association capability, and provides a data visualization function.
In this specification, the embodiments are described progressively; each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.
The principles and embodiments of the present invention have been described herein with reference to specific examples, and the description of the embodiments is intended only to assist in understanding the method of the present invention and its core ideas; meanwhile, those of ordinary skill in the art may, in light of the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the foregoing, the content of this description should not be construed as limiting the invention.

Claims (10)

1. The knowledge graph construction method in the equipment field is characterized by comprising the following steps of:
constructing an equipment field original data set;
constructing an equipment field ontology based on the equipment field original data set;
based on the equipment field original data set and the equipment field ontology, merging entity extraction and relation extraction into a single information extraction task for processing, and outputting triplet data in an end-to-end manner;
And constructing a knowledge graph of the equipment field according to the triplet data.
2. The method for constructing an equipment domain knowledge graph according to claim 1, wherein the constructing an equipment domain raw data set specifically comprises:
capturing webpage data of professional weapon equipment by crawler technology, and storing the captured data as JSON format files to establish the equipment field original data set.
3. The method for constructing an equipment domain knowledge graph according to claim 1, wherein the constructing an equipment domain ontology based on the equipment domain raw data set specifically comprises:
adopting a top-down ontology modeling scheme and a bottom-up ontology modeling scheme to construct an equipment field ontology;
the top-down ontology modeling scheme: summarizing and designing the equipment field ontology of the pattern layer by means of prior industry knowledge of the equipment field, and performing top-down extraction on the data in the equipment field original data set according to the designed pattern-layer ontology, so as to construct the equipment field ontology;
the bottom-up ontology modeling scheme: organizing, cleaning, and extracting the data in the equipment field original data set from the bottom up, so as to model the equipment field ontology.
4. The method for constructing an equipment domain knowledge graph according to claim 1, wherein the merging, based on the equipment field original data set and the equipment field ontology, of entity extraction and relation extraction into a single information extraction task for processing, and the outputting of triplet data in an end-to-end manner, specifically comprises:
based on the equipment field original data set and the equipment field ontology, adopting a knowledge joint extraction algorithm based on the Seq-to-Seq framework and the RoBERTa model to merge entity extraction and relation extraction into a single information extraction task for processing, and outputting the triplet data in an end-to-end manner.
5. The method for constructing an equipment domain knowledge graph according to claim 1, wherein the constructing the equipment domain knowledge graph according to the triplet data specifically comprises:
and mapping the entities and the relations in the triple data into nodes and edges in the atlas by using a Cypher grammar, and importing the constructed equipment field knowledge atlas into a Neo4j graph database for storage.
6. The retrieval method based on the knowledge graph in the equipment field is characterized by comprising the following steps of:
according to the search keywords, based on the information retrieval strategy of node matching and query expansion, carrying out information retrieval on the equipment domain knowledge graph determined by any one of claims 1-5; the node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; the query expansion refers to expanding entities in the equipment field knowledge graph to find entities similar to the search keywords.
7. The knowledge graph construction system in the equipment field is characterized by comprising:
the data set construction module is used for constructing an original data set in the equipment field;
the equipment field ontology construction module is used for constructing an equipment field ontology based on the equipment field original data set;
the triplet data extraction module is used for merging entity extraction and relation extraction into a single information extraction task for processing, based on the equipment field original data set and the equipment field ontology, and for outputting triplet data in an end-to-end manner;
and the equipment field knowledge graph construction module is used for constructing an equipment field knowledge graph according to the triplet data.
8. The retrieval system based on the knowledge graph in the equipment field is characterized by comprising:
the information retrieval module is used for retrieving information from the equipment domain knowledge graph determined by any one of claims 1-5 based on the information retrieval strategy of node matching and query expansion according to the search keywords; the node matching refers to judging whether the search keywords can be mapped to entities in the equipment field knowledge graph; the query expansion refers to expanding entities in the equipment field knowledge graph to find entities similar to the search keywords.
9. An electronic device comprising a memory and a processor, the memory being configured to store a computer program, the processor being configured to run the computer program to cause the electronic device to perform the equipment domain knowledge graph construction method according to any one of claims 1 to 5.
10. An electronic device comprising a memory for storing a computer program and a processor that runs the computer program to cause the electronic device to perform the retrieval method based on the equipment domain knowledge graph according to claim 6.
CN202310497796.5A 2023-05-06 2023-05-06 Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment Pending CN116523041A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310497796.5A CN116523041A (en) 2023-05-06 2023-05-06 Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310497796.5A CN116523041A (en) 2023-05-06 2023-05-06 Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment

Publications (1)

Publication Number Publication Date
CN116523041A true CN116523041A (en) 2023-08-01

Family

ID=87391780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310497796.5A Pending CN116523041A (en) 2023-05-06 2023-05-06 Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment

Country Status (1)

Country Link
CN (1) CN116523041A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093219A (en) * 2023-10-20 2023-11-21 成都华栖云科技有限公司 Visualization method based on data source, electronic equipment and storage medium
CN117093219B (en) * 2023-10-20 2023-12-26 成都华栖云科技有限公司 Visualization method based on data source, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110489395B (en) Method for automatically acquiring knowledge of multi-source heterogeneous data
CN110399457B (en) Intelligent question answering method and system
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US20210097089A1 (en) Knowledge graph building method, electronic apparatus and non-transitory computer readable storage medium
CN110990590A (en) Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
EP1672537A2 (en) Data semanticizer
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN117235281B (en) Multi-element data management method and system based on knowledge graph technology
CN111651447B (en) Intelligent construction life-span data processing, analyzing and controlling system
CN114218472A (en) Intelligent search system based on knowledge graph
CN111026941A (en) Intelligent query method for demonstration and evaluation of equipment system
CN113449066B (en) Method, processor and storage medium for storing cultural relic data by using knowledge graph
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN115759037A (en) Intelligent auditing frame and auditing method for building construction scheme
CN117709866A (en) Method and system for generating bidding document and computer readable storage medium
CN114117000A (en) Response method, device, equipment and storage medium
CN113704420A (en) Method and device for identifying role in text, electronic equipment and storage medium
CN115438195A (en) Construction method and device of knowledge graph in financial standardization field
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116523041A (en) Knowledge graph construction method, retrieval method and system for equipment field and electronic equipment
CN114911893A (en) Method and system for automatically constructing knowledge base based on knowledge graph
CN114117242A (en) Data query method and device, computer equipment and storage medium
CN113377739A (en) Knowledge graph application method, knowledge graph application platform, electronic equipment and storage medium
CN115270777A (en) Contract document information extraction method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination