CN114417012A

CN114417012A - Method for generating knowledge graph and electronic equipment

Info

Publication number: CN114417012A
Application number: CN202210067605.7A
Authority: CN
Inventors: 王伟印; 张晓程
Original assignee: Shanghai Hongji Information Technology Co Ltd
Current assignee: Shanghai Hongji Information Technology Co Ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-04-29

Abstract

The invention discloses a method for generating a knowledge graph, which comprises the following steps: defining a knowledge structure; converting the existing document and/or the incremental document into one or more of structured data, semi-structured data and unstructured data according to the knowledge structure; and integrating one or more entities, relations and attributes in the structured data, the semi-structured data and the unstructured data to obtain the knowledge graph. According to the technical scheme, the required knowledge structure is defined, the existing document and the incremental document are converted to obtain one or more of structured data, semi-structured data and unstructured data, the knowledge graph is obtained by using one or more of the structured data, the semi-structured data and the unstructured data, and the generation efficiency and accuracy of the knowledge graph can be effectively improved. The invention also includes an electronic device comprising a processor for executing one or more computer program instructions to implement the method described above.

Description

Method for generating knowledge graph and electronic equipment

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a method for generating a knowledge graph and electronic equipment.

Background

A knowledge graph describes knowledge resources and their carriers using visualization technology, and mines, analyzes, constructs, draws and displays knowledge and the relationships among knowledge resources and their carriers. With the development of information technology, providing different knowledge resources to users in the form of a knowledge graph has become a new way of delivering knowledge. Such a scheme begins by acquiring data from a data source and generating a knowledge graph.

When a company or department has accumulated technical documents over many years (PPT, Word, Excel, pictures, PDF, etc.), these documents usually need to be converted into a knowledge graph so that the company's knowledge and experience are accumulated in structured form. For example, the Chinese patent with publication number CN110795923A, published on February 14, 2020 and entitled "Automatic generation system and generation method of technical documents based on natural language processing", proposes a technical scheme comprising a BOE subsystem, a SOW subsystem and an authority management subsystem in a parallel relationship: the BOE subsystem comprises an NLP platform and a data management module, the SOW subsystem comprises a SOW template generation module and a template management module, and the authority management subsystem provides three levels of authority management. That scheme combines services with artificial intelligence to meet the needs of digital management based on knowledge and experience, establishes a knowledge base of equipment technical specifications using natural language processing and knowledge graph technology, and generates documents automatically using intelligent text processing.

Another scheme is proposed in the Chinese patent application with publication number CN12486919A, published on March 12, 2021 and entitled "Document management method, system and storage medium". It comprises: collecting and cleaning document data; performing semantic analysis on the cleaned document data to generate a document, an abstract, keywords and association relations; and inputting the document, the abstract, the keywords and the association relations into a constraint model to generate a knowledge graph. Because semantic analysis of the cleaned document data yields the document, the abstract, the keywords and the association relations, which are then input into the established constraint model, a knowledge graph can be generated and the user's document management requirements are met.

However, these technical schemes have the following problems: the whole process relies on deep learning and machine learning models, so accuracy cannot be guaranteed; some of these models are large because they have many parameters; and when a knowledge graph must be formed from a new document, the complete knowledge graph generation process still has to be executed, so efficiency is low.

Disclosure of Invention

1. Problems to be solved

Aiming at the problems of low accuracy, low efficiency and the like in the prior art, the invention provides a method for generating a knowledge graph and electronic equipment.

2. Technical scheme

In order to solve the problems, the technical scheme adopted by the invention is as follows: a method of generating a knowledge-graph, comprising the steps of:

defining a knowledge structure;

converting the existing document and/or the incremental document into one or more of structured data, semi-structured data and unstructured data according to the knowledge structure;

and integrating the entities, the relations and the attributes of one or more of the structured data, the semi-structured data and the unstructured data to obtain the knowledge graph. According to the technical scheme, the required knowledge structure is defined, the existing document and the incremental document are converted to obtain one or more of structured data, semi-structured data and unstructured data, the knowledge graph is obtained by using one or more of the structured data, the semi-structured data and the unstructured data, and the generation efficiency and accuracy of the knowledge graph can be effectively improved.

Further, the defining a knowledge structure comprises:

acquiring the existing document;

defining the knowledge structure according to the existing document and preset requirements;

the preset requirements comprise the entities, relations and attributes required for generating the knowledge graph; the knowledge structure includes the entities, relations and attributes of the knowledge structure. In this technical scheme, the knowledge structure is defined by presetting the entities, relations and attributes required by the knowledge graph, so that the knowledge structure better meets the user's requirements, and the structured data, semi-structured data and unstructured data generated according to the knowledge structure are better suited to the knowledge graph.
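As an illustration only, a knowledge structure of this kind could be written down as a small schema object. The sketch below is an assumption, not the patent's implementation: the class names `EntityType`, `RelationType` and `KnowledgeStructure`, and the experiment-report entities used to populate them, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EntityType:
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class RelationType:
    name: str
    source: str  # name of the source entity type
    target: str  # name of the target entity type


@dataclass
class KnowledgeStructure:
    entities: dict[str, EntityType] = field(default_factory=dict)
    relations: list[RelationType] = field(default_factory=list)

    def add_entity(self, name, attributes=()):
        self.entities[name] = EntityType(name, list(attributes))

    def add_relation(self, name, source, target):
        # A relation may only connect entity types that are already defined.
        for endpoint in (source, target):
            if endpoint not in self.entities:
                raise ValueError(f"unknown entity type: {endpoint}")
        self.relations.append(RelationType(name, source, target))


# Hypothetical preset requirements for an experiment report.
structure = KnowledgeStructure()
structure.add_entity("experimenter", ["name", "duty", "email"])
structure.add_entity("instrument", ["model"])
structure.add_entity("reagent", ["concentration"])
structure.add_relation("uses_instrument", "experimenter", "instrument")
structure.add_relation("uses_reagent", "experimenter", "reagent")
```

Rejecting relations whose endpoints are undefined keeps the schema self-consistent before any data is converted against it.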

Further, before converting the existing document and the incremental document into one or more of structured data, semi-structured data and unstructured data according to the knowledge structure, the method further comprises: generating an incremental document according to the knowledge structure, specifically:

constructing a relational database according to the knowledge structure; the primary key of the relational database is the attribute of a knowledge structure, and the foreign key is the relation and the attribute of the knowledge structure;

receiving data input by a user;

and generating the incremental document according to the data input by the user and the relational database.
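The three steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: it follows the detailed description, where the entity serves as the primary key and attributes and relations are further columns; the table name, column names and the plain-text output format are hypothetical, and the relation column simply stores the related entity's name rather than a declared foreign key.

```python
import sqlite3

# Hypothetical knowledge structure for one entity type: its attributes and
# its relations to other entities become the remaining columns.
ENTITY = "experimenter"
ATTRIBUTES = ["duty", "email"]
RELATIONS = ["uses_instrument"]


def build_database(conn):
    # The entity name is the primary key; attributes and relations are columns.
    columns = ", ".join(f"{c} TEXT" for c in ATTRIBUTES + RELATIONS)
    conn.execute(f"CREATE TABLE {ENTITY} (name TEXT PRIMARY KEY, {columns})")


def record_input(conn, user_input):
    # Store one form submission; fields the user left out are stored as NULL.
    fields = ["name"] + ATTRIBUTES + RELATIONS
    placeholders = ", ".join("?" for _ in fields)
    conn.execute(
        f"INSERT INTO {ENTITY} ({', '.join(fields)}) VALUES ({placeholders})",
        [user_input.get(f) for f in fields],
    )


def generate_document(conn, title):
    # Render the stored rows as a plain-text incremental document.
    lines = [title]
    query = f"SELECT name, duty, uses_instrument FROM {ENTITY} ORDER BY name"
    for name, duty, instrument in conn.execute(query):
        lines.append(f"{name} ({duty}) uses {instrument}.")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
build_database(conn)
record_input(conn, {"name": "A", "duty": "test group leader",
                    "email": "a@example.com", "uses_instrument": "microscope"})
document = generate_document(conn, "Test report")
```

Because the document is rendered from the same rows that are stored, the generated text and the structured data cannot drift apart.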

Further, before integrating the entities, relationships and attributes of one or more of the structured data, semi-structured data and unstructured data, the method further comprises:

processing one or more of the structured data, the semi-structured data and the unstructured data by using a natural language processing technology according to the entities, relations and attributes of the knowledge structure, specifically: performing entity recognition on the entities, relationship extraction on the relationships, and attribute extraction on the attributes;

processing, in the same way, one or more of the structured data, the semi-structured data and the unstructured data converted from the existing document, specifically: performing entity recognition on the entities in the data, relationship extraction on the relationships in the data, and attribute extraction on the attributes in the data;

acquiring first knowledge content according to the entities, relations and attributes of the knowledge structure, wherein the first knowledge content is the entities, relationships and attributes in the structured data converted from the incremental document. This technical scheme processes the structured data, semi-structured data and unstructured data to obtain entities, relationships and attributes that meet the requirements of the knowledge graph.

Further, the integrating the entities, relationships, and attributes of one or more of the structured data, semi-structured data, unstructured data comprises:

performing entity unification, entity disambiguation and reference resolution on the entities, relationships and attributes obtained after processing with the natural language processing technology, to obtain second knowledge content. This technical scheme uses natural language processing technology to perform entity unification, entity disambiguation and reference resolution, so that the obtained entities, relationships and attributes are more accurate.

Further, the integrating the entities, relationships, and attributes of one or more of the structured data, the semi-structured data, and the unstructured data to obtain the knowledge graph includes:

and storing one or two of the first knowledge content and the second knowledge content in a graph database to complete the construction of the knowledge graph.

Further, said storing one or both of said first knowledge content, second knowledge content into a graph database comprises:

and when one or two of the first knowledge content and the second knowledge content lack the entity, the relation or the attribute, acquiring the missing entity, the relation or the attribute through external data.

Further, the method further comprises:

using the first knowledge content as annotation data for the natural language processing technology, and training a model used in the natural language processing technology. Using one or more of the generated structured data, semi-structured data and unstructured data as annotation data saves manual annotation time, improves efficiency and can effectively improve accuracy.
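One way such annotation data might be produced is to project already-known entities onto tokenized text. The sketch below assumes BIO tagging, a common sequence-labelling format that the patent itself does not name; the function `bio_labels`, the entity type names and the example sentence are all hypothetical.

```python
def bio_labels(tokens, entities):
    """Label tokens with BIO tags from known (surface tokens, type) entities."""
    labels = ["O"] * len(tokens)
    for surface, entity_type in entities:
        n = len(surface)
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            untagged = all(label == "O" for label in labels[start:start + n])
            # Only tag spans that match exactly and are not already labelled.
            if window == surface and untagged:
                labels[start] = f"B-{entity_type}"
                for i in range(start + 1, start + n):
                    labels[i] = f"I-{entity_type}"
    return labels


tokens = "Zhang used the confocal microscope on the mouse".split()
# Hypothetical first knowledge content: entities already known from structured data.
first_knowledge = [
    (["Zhang"], "EXPERIMENTER"),
    (["confocal", "microscope"], "INSTRUMENT"),
    (["mouse"], "SUBJECT"),
]
labels = bio_labels(tokens, first_knowledge)
```

The resulting token and label pairs could then serve as training examples for an entity recognition model without a manual annotation pass.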

The present invention also includes an electronic device comprising: a memory and a processor;

the memory is configured to store one or more computer program instructions;

the processor is configured to execute the one or more computer program instructions to perform the steps of the above method.

3. Advantageous effects

Compared with the prior art, the invention has the beneficial effects that:

(1) the knowledge graph is connected to the document generation system, so that the knowledge corresponds fully to the knowledge structure, giving high accuracy;

(2) the data generated by the document generation system can be used as training data to improve the accuracy of the original document processing model;

(3) compared with conventional template-based document generation methods, the method has a higher degree of standardization and higher efficiency;

(4) the document generation system required by the invention can be built through a low-code system, and has better universality.

Drawings

FIG. 1 is a schematic illustration of an input interface of a document creation system of the present invention;

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a schematic diagram of the structure of the knowledge graph generated by the present invention (when structured data is generated);

FIG. 4 is a schematic diagram of the structure of the knowledge graph generated by the present invention (when semi-structured data or unstructured data is generated);

FIG. 5 is the knowledge structure of an experimental report in an embodiment of the present invention.

Detailed Description

The invention is further described with reference to specific examples.

First, some terms involved in the present invention are explained:

Knowledge graph: a structured semantic knowledge base used to describe concepts in the physical world and their mutual relationships. By effectively processing and integrating the data of complex documents, the data is converted into simple, clear triples of "entity, relationship, entity", and a large amount of knowledge is finally aggregated, enabling rapid response and reasoning over the knowledge.

The basic unit of the knowledge graph is the "Entity-Relationship-Entity" triple. An entity is something distinguishable that exists independently; it is the most basic element of the knowledge graph, and different relationships exist between different entities. A relationship connects different entities: nodes in the knowledge graph are linked through relationships to form one large graph. An attribute is a property of an entity, and an entity can be described by several attributes.
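To make the triple model concrete, the following minimal sketch holds triples and attributes in plain Python structures and indexes them as an adjacency list. The entity and relationship names are hypothetical examples, and an in-memory dictionary stands in for a real graph database.

```python
from collections import defaultdict

# Each triple is (head entity, relationship, tail entity).
triples = [
    ("experimenter A", "uses_instrument", "microscope"),
    ("experimenter A", "tests", "mouse"),
    ("experimenter B", "tests", "mouse"),
]

# Attributes describe a single entity rather than linking two entities.
attributes = {
    "experimenter A": {"duty": "group leader"},
    "mouse": {"state": "healthy"},
}

# Index the triples as an adjacency list so neighbours can be looked up.
graph = defaultdict(list)
for head, relation, tail in triples:
    graph[head].append((relation, tail))


def neighbours(entity, relation):
    """Return the entities reached from `entity` through `relation`."""
    return [tail for rel, tail in graph[entity] if rel == relation]
```

For instance, `neighbours("experimenter A", "tests")` follows the "tests" relationship from experimenter A and returns the entities at the other end.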

The raw data types of the knowledge graph include:

Structured data: data that exists in a fixed format within a record or file, generally including RDDs (Resilient Distributed Datasets) and tabular data. Structured data can be represented and stored in two-dimensional form using a relational database. Its general characteristics are: the data is organized in rows, one row represents the information of one entity, and every row has the same attributes. It may be data in a relational database or an object-oriented database.

Semi-structured data: data that lies between fully structured data and unstructured data, such as XML, JSON and encyclopedia entries. It is a form of structured data in which entities belonging to the same class may have different attributes and, even when they are grouped together, the order of the attributes is not important. Examples include log files, XML documents, JSON documents, Email, etc.

Unstructured data: information that does not have a predefined data model or is not organized in a predefined manner. Unstructured data generally refers to textual data, but it also contains much information such as times and numbers. Examples include office documents in all formats, text, pictures, XML, HTML, various types of reports, images, audio/video information, and so forth.

Information extraction: extracting entities, attributes and the interrelations among entities from various types of data sources, and forming an ontological knowledge representation on that basis. It specifically comprises entity recognition, attribute extraction and relationship extraction.

Entity extraction, also known as Named Entity Recognition (NER), refers to the automatic recognition of named entities in a text dataset. The quality (accuracy and recall) of entity extraction greatly influences the efficiency and quality of subsequent knowledge acquisition, so it is the most basic and critical part of information extraction.

Relationship extraction: entity extraction on a text corpus yields a series of discrete named entities. To obtain semantic information, the association relationships between the entities must also be extracted from the corpus; linking entities (concepts) through these relationships forms a net-like knowledge structure. Relationship extraction technology addresses the basic problem of how to extract the relationships between entities from a text corpus.

Attribute extraction: the goal of attribute extraction is to collect the attribute information of a particular entity from different information sources. For example, information such as a public figure's nickname, birthday, nationality and educational background can be obtained from publicly available online information. Attribute extraction technology collects such information from various data sources to give a complete picture of an entity's attributes.

Entity unification: the same entity may be expressed in different ways, and it is sometimes necessary to unify these expressions into one. Unifying entities reduces the difficulty of some NLP tasks. A common application scenario is the construction of a knowledge graph, in which place names, company names, technical terms and the like need to be unified.

Entity disambiguation: a technique for resolving the ambiguity caused by entities that share the same name. Through entity disambiguation, entity links can be established accurately according to the current context.

Reference resolution (also known as coreference resolution, entity matching, or entity synonymy): mainly used to solve the problem of several references corresponding to the same entity object. Within a session, multiple references may point to the same entity object; coreference resolution techniques associate or merge these references with the correct entity object.

Structured data, semi-structured data and unstructured data are the raw data types of the knowledge graph, so acquiring them is very important, and the invention focuses on how to acquire them. In general, they can be obtained from original documents of various forms, including Word, PDF, PPT, Excel, pictures and text. For pictures, text is extracted by OCR (Optical Character Recognition); for Word, PPT and Excel, the formatted document is parsed to obtain its content; a PDF may be parsed directly or may require OCR to extract the text. In other words, the various original documents are first converted into text, and the text is then converted into structured data, semi-structured data or unstructured data using technologies such as ETL (Extract-Transform-Load, a data warehouse technology that describes the process of extracting, transforming and loading data from a source to a destination) and data annotation.
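The last stage of this conversion, from extracted text to data, might look like the sketch below. It is a simplified stand-in rather than the patent's ETL process: it assumes the text has already been extracted (OCR and document parsing are out of scope), that knowledge-structure fields appear as `key: value` lines, and that everything else is kept as unstructured text. The field names and the function `text_to_data` are hypothetical.

```python
import re

# Hypothetical field names taken from a knowledge structure.
FIELDS = {"experimenter", "instrument", "reagent"}
KEY_VALUE = re.compile(r"^\s*(\w+)\s*:\s*(.+?)\s*$")


def text_to_data(text):
    """Split extracted text into a structured record and an unstructured remainder."""
    record, remainder = {}, []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match and match.group(1).lower() in FIELDS:
            # Transform: normalise the key to lower case, keep the value as found.
            record[match.group(1).lower()] = match.group(2)
        elif line.strip():
            # Lines that do not map to a known field stay as free text.
            remainder.append(line.strip())
    return record, " ".join(remainder)


text = """Experimenter: A
Instrument: microscope
The sample was observed for two hours."""
record, free_text = text_to_data(text)
```

Here `record` holds the fields that fit the knowledge structure, while `free_text` would still need NLP processing before it can contribute to the graph.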

Where there is no original document, the present invention may use a document generation system to obtain the document, structured data, semi-structured data or unstructured data directly. Before the document generation system is used to generate a document (i.e. a document in the same format as an original document, such as Word, PPT, PDF, Excel, a picture or text) and one or more of structured data, semi-structured data or unstructured data, a knowledge structure must be defined according to the user's preset requirements. Specifically, it must be determined which content needs to be extracted: which entities, which relationships, which attributes, and so on. The whole knowledge structure is defined according to the requirements. For an experiment record, for example, the entities include at least experimenters, experimental subjects, experimental instruments and experimental reagents, and the attributes may be the attributes of the people, the state of the experimental subjects, specific information about the experimental instruments, specific parameters of the experimental reagents, and so on. For an automobile performance test, the entities are the various automobile parts, the attributes are the parameters of the parts, and the relationships are the connections between the parts. Thus, once the subject matter is determined, the entities, relationships and attributes are determined, and so is the content of the knowledge structure. After the knowledge structure is determined, the document generation system can be used to generate structured data, semi-structured data or unstructured data according to the content of the knowledge structure.

With the document generation system, a document, or one or more of structured data, semi-structured data and unstructured data, can be generated from a few input parameters without an original document. Taking the generation of a test document as an example, as shown in fig. 1: the PPT title is entered in the input interface of the document generation system, so that the resulting document is in PPT format and is named after the entered title. Next, the personnel involved in the test are entered, including their names, duties and e-mail addresses; for example, person A, whose duty is leader of the test group, and person B, whose duty is leader of the test execution group. The problems to be tested are then described step by step, either by uploading pictures or by entering text. When all the content is complete, a PPT document is generated that describes the whole test process, including the testers, the problems tested and the test results. At the same time, structured data can be generated, such as a table listing the personnel who took part in the experiment, the problems of the experiment and the experiment results; semi-structured data or unstructured data can of course also be generated. The generated structured data, semi-structured data or unstructured data can be used as input to the knowledge graph.

It should be noted that when the input content is accurate and correct, structured data is obtained according to the defined knowledge structure, and this structured data can be used directly by the knowledge graph without further processing by NLP technology, as shown in fig. 3. When unstructured or semi-structured data is generated from an existing document, or when it is generated because the input data is redundant or not accurate enough, the semi-structured or unstructured data needs to be processed. The specific processing steps are: performing entity recognition on the entities, relationship extraction on the relationships and attribute extraction on the attributes in the data, and obtaining first knowledge content according to the entities, relationships and attributes of the knowledge structure, where the first knowledge content refers to the entities, relationships and attributes in the structured data converted from the incremental document; then performing entity unification, entity disambiguation and reference resolution on the entities, relationships and attributes obtained after processing with natural language processing technology, to obtain second knowledge content, as shown in fig. 4.

In general, the data obtained from an existing document or from directly entered content (i.e., an incremental document) consists of structured data, semi-structured data and unstructured data in various combinations: structured with semi-structured data, structured with unstructured data, semi-structured with unstructured data, or all three together. In some cases only semi-structured data or only unstructured data is obtained. Structured data alone is obtained only in two cases: when an incremental document is processed and the input content is completely accurate; or when an existing document is processed, the knowledge structure is very simple, and technologies such as OCR, ETL, document parsing and data annotation can cover all the key points required by the knowledge structure, in which case the structured data can be obtained directly from the existing document. The first knowledge content and the second knowledge content are both content processed by NLP technology; in a specific implementation, one or both of them are stored in the graph database to complete the construction of the knowledge graph.
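The branching described above, where accurate structured data bypasses NLP and everything else is processed first, can be sketched as follows. This is an illustrative assumption about how the decision could be coded; the column-to-relationship mapping, the function names and the sample rows are hypothetical.

```python
# Hypothetical mapping from table columns to relationships in the knowledge structure.
COLUMN_TO_RELATION = {"instrument": "uses_instrument", "reagent": "uses_reagent"}


def rows_to_triples(rows, key="experimenter"):
    """Turn structured rows into triples without any NLP step."""
    triples = []
    for row in rows:
        for column, relation in COLUMN_TO_RELATION.items():
            # Empty cells produce no triple.
            if row.get(column):
                triples.append((row[key], relation, row[column]))
    return triples


def needs_nlp(kind, input_is_accurate):
    """Only accurate structured data can bypass NLP processing."""
    return not (kind == "structured" and input_is_accurate)


rows = [{"experimenter": "A", "instrument": "microscope", "reagent": "saline"},
        {"experimenter": "B", "instrument": "centrifuge", "reagent": ""}]
direct_triples = [] if needs_nlp("structured", True) else rows_to_triples(rows)
```

Semi-structured or unstructured input, or structured input flagged as inaccurate, would instead be routed to entity recognition, relationship extraction and attribute extraction.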

To ensure the accuracy of the data, entity recognition, relationship extraction and attribute extraction may be performed on the structured data, semi-structured data or unstructured data obtained in the above steps using NLP (Natural Language Processing) technology, so as to obtain the entities, relationships and attributes required by the knowledge graph. The entities, relationships and attributes then need to be integrated. The methods used include entity unification (given two entities, judging whether they point to the same entity), entity disambiguation (determining the specific meaning of an entity) and reference resolution (coreference resolution: determining what a mention in the text refers to), so as to obtain correct knowledge content. For example, if an entity is unambiguous, such as mouse, rabbit or fruit fly, no processing is required and the correct content is obtained directly; but if one entity appears under several names, such as "family creatures", "family creatures limited" and "family", these inputs must be processed with NLP techniques to obtain the exact entity name. Attributes and relationships require similar processing. The processed data can be used directly as input to obtain the final knowledge graph.
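A very simple form of entity unification is a lookup from known name variants to one canonical name. The sketch below uses such an alias table as a stand-in for the NLP models the text refers to; the company names, the table contents and the function names are hypothetical.

```python
# Hypothetical alias table: every known surface form maps to one canonical name.
ALIASES = {
    "acme bio": "Acme Biotech Co., Ltd.",
    "acme biotech": "Acme Biotech Co., Ltd.",
    "acme biotech co., ltd.": "Acme Biotech Co., Ltd.",
}


def unify(name):
    """Return the canonical entity name, or the cleaned input if unknown."""
    cleaned = " ".join(name.split())  # collapse repeated whitespace
    return ALIASES.get(cleaned.lower(), cleaned)


def unify_triples(triples):
    """Apply unification to both ends of each triple and drop duplicates."""
    seen, result = set(), []
    for head, relation, tail in triples:
        triple = (unify(head), relation, unify(tail))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


raw = [("Acme Bio", "supplies", "saline"),
       ("acme  biotech", "supplies", "saline")]
unified = unify_triples(raw)
```

After unification the two raw triples collapse into one, which is the effect the integration step aims for before data is written to the graph.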

The process of generating a knowledge graph according to the present invention, covering both the case where an original document exists and the case where it does not, is described in full below in conjunction with FIGS. 2 and 3:

(1) Defining a knowledge structure according to requirements, specifically: defining which content is to be extracted from an original document, such as which entities, relationships and attributes need to be extracted. The whole knowledge structure is defined according to requirements; in general, the content mainly covered by the original document determines the scope of the knowledge structure. It should be noted that an attribute of the knowledge structure is an attribute of an entity in the knowledge structure, and a relationship of the knowledge structure is a relationship between entities in the knowledge structure.

(2) Converting the existing original documents into text. This requires document parsing technology: for pictures, text is extracted using OCR (Optical Character Recognition); for Word, PPT and Excel, the formatted document is parsed to obtain its content; a PDF may be parsed directly or may require OCR to extract the text. Once the various original documents have been converted into text, the text is converted into structured data, semi-structured data or unstructured data using technologies such as ETL (Extract-Transform-Load, a data warehouse technology that describes the process of extracting, transforming and loading data from a source to a destination) and data annotation.

(3) To further improve the accuracy of the data, entity recognition, relationship extraction and attribute extraction are performed on the structured data, semi-structured data or unstructured data obtained in step (2) using NLP (Natural Language Processing) to obtain entities, relationships and attributes. Which content is extracted is determined by the knowledge structure defined in step (1).

(4) After the entities, relationships and attributes are obtained, they are integrated. The methods used include entity unification (given two entities, judging whether they point to the same entity), entity disambiguation (determining the specific meaning of an entity) and reference resolution (determining what a mention in the text refers to), so as to obtain correct knowledge content.

(5) Storing the result obtained in the above steps in a graph database to obtain the knowledge graph.

(6) Under the condition that an original document (which can be called stock data) does not exist, a knowledge structure can be defined according to requirements, namely what kind of knowledge graph needs to be obtained, entities, relations and attributes of the corresponding knowledge structure are defined, and then the document or structured data, semi-structured data and unstructured data are generated through a document generation system, and the specific implementation process is as follows:

1) defining a relational database according to the step (1), wherein the primary key of the database is the entity of the knowledge structure in the step (1), and other columns of the database can be the attribute of the entity, the relationship between the entity and other entities, and the like;

2) after the relational database is completed according to the knowledge structure definition, the specific meaning of the input box can be defined in the input interface of the document generation system shown in fig. 1, for example, the specific meaning of the input box 1 representing the entity, the specific meaning of the input box 2 representing the attribute 1 of the entity, the specific meaning of the input box 3 representing the attribute 2 …, and the like are related to the definition of the knowledge structure. In the input process, the situation of input redundancy or inaccurate information may exist, for example, a plurality of contents with different names and constant specific meanings such as the science biology and the science are input, and at the moment, the information can be standardized and then put in storage by using the technologies such as entity identification, relation extraction and the like so as to ensure the accuracy of data;

3) After the data are recorded in the database, the structured data can be filled into designed PPT, Word, or PDF templates and the like to obtain the required document. The part of the document generation system that corresponds to the knowledge structure thus completes the structuring of knowledge while generating the document, so the resulting structured data, semi-structured data, and unstructured data can be imported directly into the structured database. The structured database serves two functions: on the one hand, it stores the results (structured data, semi-structured data, and unstructured data) generated by the document generation system; on the other hand, the data stored in it can be used as labeled data for further processing of original documents, which saves the manual labeling step and improves efficiency. Meanwhile, the structured data, semi-structured data, and unstructured data can also be imported into a graph database to complete the construction of the knowledge graph. The graph database is distinct from the structured database and is used for storing and displaying the knowledge graph.
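As a non-limiting illustration of sub-steps 1) to 3), the sketch below defines a relational table whose primary key is the entity, stores one set of input-box values, and fills a template from the stored row. The table name `experiment`, its column names, and the plain-text template (standing in for a PPT, Word, or PDF template) are assumptions made for this example only.

```python
import sqlite3
from string import Template

# In-memory relational database. The entity is the primary key and the
# remaining columns hold its attributes and relationships (names hypothetical).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """
    CREATE TABLE experiment (
        experiment_id TEXT PRIMARY KEY,  -- entity
        purpose       TEXT,              -- attribute 1
        temperature_c REAL,              -- attribute 2
        experimenter  TEXT               -- relationship to another entity
    )
    """
)


def save_record(record: dict) -> None:
    """Store one set of input-box values as a structured row."""
    conn.execute(
        "INSERT INTO experiment VALUES "
        "(:experiment_id, :purpose, :temperature_c, :experimenter)",
        record,
    )


# Plain-text stand-in for a designed PPT, Word, or PDF template.
REPORT_TEMPLATE = Template(
    "Experiment $experiment_id\n"
    "Purpose: $purpose\n"
    "Temperature: $temperature_c C\n"
    "Experimenter: $experimenter\n"
)


def render_report(experiment_id: str) -> str:
    """Fill the template from the stored row for the given entity."""
    row = conn.execute(
        "SELECT * FROM experiment WHERE experiment_id = ?", (experiment_id,)
    ).fetchone()
    return REPORT_TEMPLATE.substitute(dict(row))
```

Because the document is rendered from the stored row, the generated document and its structured data correspond one-to-one by construction.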

(7) Both stock data (i.e., original documents) and incremental data (i.e., data generated in the absence of an original document) may need to be supplemented with external data to complete the knowledge structure. This supplementation is carried out during the conversion of structured, semi-structured, or unstructured data into graph data. In general, when the content of the original document cannot completely cover the content required by the knowledge structure in step (1), data needs to be acquired externally. For example, if three attributes of an entity are defined in step (1) but the existing original document, or the document generated by the document generation system, provides only two of them, external data is needed to supply the missing attribute. After the missing attribute is obtained, it is added to the corresponding fields in step (3) and step (6) and is simultaneously synchronized to the graph database, so as to ensure the integrity of the knowledge structure. Similarly, if a relationship or entity defined in step (1) is missing, it may also be obtained from external data to ensure the integrity of the knowledge structure.
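As a non-limiting illustration of the supplementation described in step (7), the sketch below detects which required attributes are absent from a record and fills only those from an external source. The attribute names and the rule that existing values take precedence over external values are assumptions made for this example.

```python
# Attributes required by the knowledge structure (hypothetical names).
REQUIRED_ATTRIBUTES = ("purpose", "temperature_c", "experiment_date")


def missing_attributes(record: dict) -> list:
    """List the required attributes that the record does not provide."""
    return [key for key in REQUIRED_ATTRIBUTES if record.get(key) is None]


def supplement(record: dict, external: dict) -> dict:
    """Return a copy of the record with gaps filled from external data.

    Values already present in the record are kept; only missing
    attributes are taken from the external source.
    """
    completed = dict(record)
    for key in missing_attributes(record):
        if key in external:
            completed[key] = external[key]
    return completed
```

The completed record can then be written back to the structured data and synchronized to the graph database.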

(8) When the number of original documents is large, the overall system may impose an accuracy requirement on the entity identification, attribute extraction, relationship extraction, and other models in step (2), for example an accuracy of more than 90%. To reach that accuracy, a large amount of labeled data would ordinarily have to be produced from the original documents, which is inefficient. In the invention, the document generation system structures the user input and generates the required document. The document generated by the document generation system has the same form as an original document, i.e., a format such as PPT, Word, or PDF; the difference is that an original document already exists, for example in a library or an enterprise, whereas the generated document is produced from content input by the user when no original document exists. Each generated document corresponds one-to-one with its structured data, semi-structured data, or unstructured data, and this correspondence can be used as training data to improve the accuracy of the models that process the original documents. Specifically, the one-to-one pairs of generated documents and their structured, semi-structured, or unstructured data serve as labeled data. Because such pairs have an effect similar to that of labeled data produced from the original documents, they can further improve the accuracy of the NLP models and the processing of the original documents, which is particularly suitable where the number of original documents is large. In other words, the original documents need not be labeled first: the structured data generated by the document generation system is used to train the models that then process the original documents, which reduces a large amount of data labeling work and improves the accuracy of the overall scheme in knowledge processing.
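As a non-limiting illustration of how a generated document and its structured data can serve as labeled data, the sketch below locates each structured value in the generated text and emits character spans with labels. The function name `label_document`, the span format, and the use of exact string matching are assumptions made for this example; the annotation format required by a given NLP model may differ.

```python
def label_document(text: str, record: dict) -> list:
    """Pair a generated document with its structured record.

    Returns (start, end, label) character spans, sorted by position,
    for every record value that occurs verbatim in the text.
    """
    spans = []
    for label, value in record.items():
        if value is None:
            continue
        needle = str(value)
        start = text.find(needle)
        if start != -1:
            spans.append((start, start + len(needle), label))
    return sorted(spans)
```

No manual annotation is involved: the labels come from the structured record that the document was generated from.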

The invention also includes a system for generating a knowledge graph, comprising a document conversion module and an integration module. The document conversion module is used for converting an original document or received content into structured data, semi-structured data, and/or unstructured data according to a knowledge structure; the integration module is used for integrating the entities, relationships, and attributes to obtain the knowledge graph.

In a specific implementation, in order to improve accuracy, the system further includes an entity identification module configured to process the entities, relationships, and attributes in the structured data, semi-structured data, and unstructured data. Specifically, entity identification is performed on the entities, relationship extraction on the relationships, and attribute extraction on the attributes, so as to obtain more accurate entities, relationships, and attributes.

When an entity, relationship, or attribute is missing from the document, structured data, semi-structured data, or unstructured data generated from an original document or received content, the missing entity, relationship, or attribute needs to be obtained from external data. Therefore, in a specific implementation, an external data acquisition module is also needed to obtain data externally and supplement the missing entity, relationship, or attribute, so as to ensure the integrity of the knowledge graph.

The present invention also includes a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method of generating a knowledge graph. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.

These computer program instructions may also be stored in a memory and executed by a processor to implement the method for generating a knowledge graph. The electronic device comprises the memory and the processor.

Specifically, referring to fig. 5, experiment reports are taken as an example: a knowledge graph is generated from existing experiment reports (corresponding to original documents), and subsequently generated experiment reports (i.e., reports generated without an original document) must be fused into that knowledge graph. The specific steps are as follows:

(1) The knowledge structure is defined according to requirements. Here the following content needs to be extracted: experiment ID, experiment purpose, experiment conditions (including experiment temperature, experiment reagents, etc.), experiment date, experiment steps, experiment conclusion, experimenters, and so on, thus forming the knowledge structure shown in fig. 3.
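As a non-limiting illustration, the knowledge structure of this example can be written down as a typed record. The class and field names below are assumptions chosen to mirror the listed items; fig. 3 is not reproduced here and may organize them differently.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ExperimentConditions:
    temperature_c: Optional[float] = None
    reagents: List[str] = field(default_factory=list)


@dataclass
class ExperimentReport:
    experiment_id: str  # the entity that the other items attach to
    purpose: Optional[str] = None
    conditions: ExperimentConditions = field(default_factory=ExperimentConditions)
    date: Optional[str] = None
    steps: List[str] = field(default_factory=list)
    conclusion: Optional[str] = None
    experimenters: List[str] = field(default_factory=list)


def empty_fields(report: ExperimentReport) -> list:
    """Names of top-level fields that have not been filled yet."""
    return [
        f.name
        for f in fields(report)
        if getattr(report, f.name) in (None, [], "")
    ]
```

A helper such as `empty_fields` makes it easy to see which parts of the knowledge structure a given report does not yet cover.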

(2) Existing experiment reports come in file formats such as PDF, Word, picture, and PPT. Pictures need to have their text extracted by OCR technology; Word, PPT, and Excel files need to be parsed to obtain their contents; PDF files are either parsed directly or also processed with OCR. That is, all existing documents are first converted into text, and the text is then converted into structured data, semi-structured data, or unstructured data using technologies such as ETL and data labeling.
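As a non-limiting illustration, the choice between parsing and OCR can be sketched as a lookup on the file extension. The extension list and route labels are assumptions made for this example, and the actual parsing and OCR tools are deliberately left out.

```python
from pathlib import Path

# Route labels only; no parsing or OCR library is invoked in this sketch.
PARSE = "parse"
OCR = "ocr"
PARSE_THEN_OCR = "parse, fall back to ocr"

# Hypothetical mapping from file extension to text-extraction route.
ROUTES = {
    ".png": OCR,
    ".jpg": OCR,
    ".jpeg": OCR,
    ".docx": PARSE,
    ".pptx": PARSE,
    ".xlsx": PARSE,
    ".pdf": PARSE_THEN_OCR,
}


def extraction_route(filename: str) -> str:
    """Return how the text of the given document should be obtained."""
    suffix = Path(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"unsupported document format: {filename!r}") from None
```

Whatever the route, the result is plain text that the later steps convert into structured, semi-structured, or unstructured data.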

(3) After the structured data, semi-structured data, or unstructured data are obtained, the following processing is performed on them: entity identification (identifying specific entities such as the experiment ID, experiment temperature, reagent names, experiment date, and experimenters), relationship extraction (for example, which personnel participated in the experiment and who was responsible for it), and attribute extraction (for example, the attributes of an experimental reagent). The content to be extracted is determined by the knowledge structure of step (1); that is, the knowledge structure determines which content is extracted.
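As a non-limiting illustration of how the knowledge structure determines what is extracted, the sketch below uses regular expressions as a simple stand-in for trained entity identification models. The patterns, the `EXP-` identifier format, and the field names are assumptions made for this example.

```python
import re

# Hypothetical patterns; a trained NLP model would normally take their place.
PATTERNS = {
    "experiment_id": re.compile(r"\bEXP-\d+\b"),
    "temperature_c": re.compile(r"(-?\d+(?:\.\d+)?)\s*°C"),
    "experiment_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(text: str, wanted: list) -> dict:
    """Extract only the items that the knowledge structure asks for."""
    found = {}
    for name in wanted:
        match = PATTERNS[name].search(text)
        if match:
            # Use the capture group when the pattern has one, else the match.
            found[name] = match.group(match.lastindex or 0)
    return found
```

Items not listed in `wanted` are ignored even when a pattern for them exists, which mirrors the role of the knowledge structure in step (1).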

(4) After the entities, relationships, and attributes are obtained, they are integrated. This mainly involves entity unification (different names and short names of the same entity, such as "science biology" and "science", need to be unified into one name, here "science"), entity disambiguation (distinguishing entities that have the same name but different meanings), and reference resolution (identifying referring words and determining the content they refer to), so as to obtain the correct knowledge content.

(5) By storing the results obtained above in a graph database, the knowledge graph derived from the existing experiment reports is obtained.
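As a non-limiting illustration of step (5), the sketch below turns one processed experiment report into subject-predicate-object triples and stores them in a small in-memory structure. The class `TripleStore` is only a stand-in for a graph database, and the predicate names and record keys are assumptions made for this example.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory stand-in for a graph database."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record one (subject, predicate, object) edge."""
        self._edges[subject].add((predicate, obj))

    def neighbors(self, subject, predicate=None):
        """Objects linked from subject, optionally filtered by predicate."""
        return sorted(
            obj
            for pred, obj in self._edges.get(subject, ())
            if predicate in (None, pred)
        )


def load_report(store, report):
    """Convert one integrated report record into graph edges."""
    exp = report["experiment_id"]
    store.add(exp, "has_purpose", report["purpose"])
    for person in report["experimenters"]:
        store.add(exp, "performed_by", person)
    for reagent in report["reagents"]:
        store.add(exp, "uses_reagent", reagent)
```

In a deployment, the same triples would be written to the graph database of choice instead of to this stand-in.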

For subsequently added experiment reports, the report can be produced directly by a generation system rather than written manually. The specific steps are as follows:

The document generation system shown in fig. 1 is used. It is understood that, in a specific implementation, the system may be modified according to the specific form or content of the documents. Since this embodiment needs to generate experiment reports, the document generation system shown in fig. 1 is modified accordingly to obtain an experiment report generation system, which must completely cover the content of the knowledge structure in the above steps, including experiment ID, experiment purpose, experiment conditions (including experiment temperature, experiment reagents, etc.), experiment date, experiment steps, experiment conclusion, experimenters, and so on. When a new experiment report needs to be written, only the corresponding content needs to be input as required by the experiment report generation system, and the required Word, PPT, PDF, picture, or other file is generated without manual writing.

When the experiment report generation system generates an experiment report, it also structures the knowledge, yielding structured data, semi-structured data, or unstructured data corresponding to the knowledge structure. These data can be imported directly into the structured database and used as labeled data for retraining and optimizing the models in step (3) and step (4), thereby improving the accuracy of the overall scheme in knowledge processing; meanwhile, they can be imported into the graph database to complete the construction of the knowledge graph.

Claims

1. A method of generating a knowledge graph, characterized by: the method comprises the following steps:

defining a knowledge structure;

and integrating the entities, the relations and the attributes of one or more of the structured data, the semi-structured data and the unstructured data to obtain the knowledge graph.

2. The method of generating a knowledge-graph of claim 1 wherein: the defining a knowledge structure comprises:

acquiring the existing document;

the preset requirements comprise entities, relations and attributes required for generating the knowledge graph; the knowledge structure includes entities, relationships, and attributes of the knowledge structure.

3. The method of generating a knowledge-graph of claim 2 wherein: before converting the existing document and the incremental document into one or more of structured data, semi-structured data and unstructured data according to the knowledge structure, the method further comprises the following steps: generating an incremental document according to the knowledge structure, specifically:

receiving data input by a user;

4. The method of generating a knowledge-graph of claim 2 wherein: before integrating the entities, relationships and attributes of one or more of the structured data, semi-structured data and unstructured data, the method further comprises the following steps:

5. The method of generating a knowledge-graph of claim 2 wherein: before integrating the entities, relationships and attributes of one or more of the structured data, semi-structured data and unstructured data, the method further comprises the following steps:

acquiring first knowledge content according to the entity, the relation and the attribute of the knowledge structure; wherein the first knowledge content is entities, relationships and attributes in the structured data of the incremental document transformation.

6. A method of generating a knowledge-graph as claimed in claim 4 or 5, wherein: the integrating the entities, relationships and attributes of one or more of the structured data, semi-structured data, unstructured data comprises:

and carrying out entity unification, entity disambiguation and reference resolution on the entities, the relations and the attributes obtained after the processing by using the natural language processing technology to obtain second knowledge content.

7. The method of generating a knowledge-graph of claim 6 wherein: the integrating the entities, relationships and attributes of one or more of the structured data, the semi-structured data and the unstructured data to obtain the knowledge graph comprises:

8. The method of generating a knowledge-graph of claim 7 wherein: said storing one or both of said first knowledge content, second knowledge content into a graph database comprises:

9. The method of generating a knowledge-graph of claim 5 wherein: the method further comprises the following steps:

and taking the first knowledge content as the marking data of the natural language processing technology, and training a model in the natural language processing technology.

10. An electronic device, comprising: a memory and a processor;

the memory is to store one or more computer program instructions;

the processor is to execute the one or more computer program instructions to: performing the steps of the method of any one of claims 1-9.