CN116860997A - Knowledge graph construction method, device, equipment and storage medium

Info

Publication number: CN116860997A
Application number: CN202310826006.3A
Authority: CN
Inventors: 周雷皓
Original assignee: Beijing Qingsongchou Information Technology Co ltd
Current assignee: Beijing Qingsongchou Information Technology Co ltd
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-10-10

Abstract

The embodiment of the application provides a method, a device, equipment and a storage medium for constructing a knowledge graph, and relates to the technical field of knowledge graphs. The method comprises the following steps: determining a target industry according to the map construction requirement; acquiring characteristic training data corresponding to the target industry, and training a preselected basic large language model based on the characteristic training data to obtain a customized large language model; extracting information from the target text data by using the customized large language model, and generating a target knowledge base based on the extracted information; and constructing a knowledge graph applicable to the target industry based on the target knowledge base. According to the embodiment of the application, the specific training data is obtained according to actual requirements to perform customized training on the large language model, so that the information extraction effect on text data in specific industries can be improved, and the applicability of the constructed knowledge graph in application in the industries is improved.

Description

Knowledge graph construction method, device, equipment and storage medium

Technical Field

The application relates to the technical field of knowledge graphs, in particular to a method, a device, equipment and a storage medium for constructing a knowledge graph.

Background

In the prior art, knowledge graph construction generally relies on manual labeling and vertical-domain models. However, these methods have limitations when processing industry-specific data. For example, manual labeling is costly and inefficient, and vertical-domain models generalize poorly, so it is difficult to meet the requirements of multiple business directions. Therefore, an improved knowledge graph construction method is needed for industries such as fundraising, insurance and health.

Disclosure of Invention

The embodiment of the application aims to provide a method, a device, equipment and a storage medium for constructing a knowledge graph, which are used for improving the applicability of the knowledge graph in processing specific industry data.

In a first aspect, an embodiment of the present application provides a method for constructing a knowledge graph, including:

determining a target industry according to the map construction requirement;

acquiring characteristic training data corresponding to the target industry, and training a preselected basic large language model based on the characteristic training data to obtain a customized large language model; wherein the characteristic training data comprises a unique vocabulary of the target industry;

extracting information from the target text data by using the customized large language model, and generating a target knowledge base based on the extracted information; the information extraction comprises entity extraction, relation extraction and attribute extraction;

and constructing a knowledge graph applicable to the target industry based on the target knowledge base.

In the embodiment of the application, the target industry is determined according to the current map construction requirement, and the characteristic training data of that industry is obtained to perform customized training on the large language model, so that the information extraction effect on text data of the specific industry can be improved, and the applicability of the constructed knowledge graph in that industry is improved.

In some possible embodiments, the characteristic training data further includes model architecture adjustment parameters corresponding to the target industry; the training of the pre-selected basic large language model based on the characteristic training data comprises:

and performing model architecture adjustment on the basic large language model based on the model architecture adjustment parameters.

In the embodiment of the application, besides model training based on the industry specific vocabulary, the model is subjected to architecture adjustment by acquiring the model architecture adjustment parameters, so that the model which is more suitable for the target industry can be trained, the accuracy and coverage rate of information extraction are improved, and the applicability of the constructed knowledge graph is further improved.

In some possible embodiments, the characteristic training data further comprises a specific loss function corresponding to the target industry; the training of the pre-selected basic large language model based on the characteristic training data comprises:

and performing performance evaluation and model tuning on the basic large language model based on the pre-acquired verification data set and the specific loss function.

In the embodiment of the application, the model training process can be accurately guided by acquiring the specific loss function corresponding to the industry, so that the information extraction model which is more suitable for the target industry is acquired, the accuracy and coverage rate of information extraction are improved, and the applicability of the constructed knowledge graph is further improved.

In some possible embodiments, before the extracting information from the target text data using the customized large language model, the method further includes:

respectively extracting test information from the test text data based on a plurality of preset extraction strategies by utilizing the customized large language model;

determining extraction performance corresponding to the extraction strategies based on the extracted test information, and determining an extraction strategy with optimal extraction performance as a target extraction strategy;

the information extraction of the target text data by using the customized large language model is specifically as follows:

and extracting information from the target text data based on the target extraction strategy by utilizing the customized large language model.

In the embodiment of the application, a plurality of extraction strategies are pre-configured, information extraction is respectively carried out based on each extraction strategy, and then the optimal extraction strategy is screened according to the extraction performance corresponding to each extraction strategy obtained through evaluation and is used as the strategy for finally carrying out information extraction, so that the accuracy and coverage rate of information extraction can be improved, and the applicability of constructing and obtaining a knowledge graph is further improved.

In some possible embodiments, the generating the target knowledge base based on the extracted information includes:

inquiring and judging whether a target knowledge base corresponding to the target industry exists in a pre-configured database;

if yes, integrating the extracted information into the target knowledge base;

if not, generating a target knowledge base based on the extracted information.

In the embodiment of the application, after the information is extracted, it is determined whether a knowledge base for the same field already exists. If it does, the newly extracted information is integrated directly into that target knowledge base, so that the information integrity of the knowledge base corresponding to the target industry can be improved, and the applicability of the constructed knowledge graph is further improved.

In some possible embodiments, after the knowledge-graph applicable to the target industry is constructed based on the target knowledge base, the method further includes:

monitoring a knowledge base information updating event in real time, and fusing newly-added information to the target knowledge base based on the knowledge base information updating event;

reconstructing a knowledge graph applicable to the target industry based on the updated target knowledge base.

In the embodiment of the application, through monitoring the change of the knowledge base information in real time, when a user expands and optimizes the knowledge base, the newly added information can be fused to the knowledge base, and a new knowledge graph is constructed based on the fused knowledge base, so that the automatic update and supplement of the knowledge graph are realized, and the applicability of constructing the knowledge graph is further improved.

In a second aspect, an embodiment of the present application provides a knowledge graph construction apparatus, including:

the industry determining module is used for determining a target industry according to the map construction requirement;

the model training module is used for acquiring characteristic training data corresponding to the target industry, and training a preselected basic large language model based on the characteristic training data to obtain a customized large language model; wherein the characteristic training data comprises a unique vocabulary of the target industry;

the information extraction module is used for extracting information from the target text data by utilizing the customized large language model and generating a target knowledge base based on the extracted information; the information extraction comprises entity extraction, relation extraction and attribute extraction;

and the map construction module is used for constructing a knowledge map applicable to the target industry based on the target knowledge base.

In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor executes the program to implement the method according to any embodiment of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method according to any of the embodiments of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any of the embodiments of the first aspect.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should not be regarded as limiting its scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of a knowledge graph construction method according to an embodiment of the present application;

Fig. 2 is a schematic flow chart of data collection and preprocessing according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of customized large language model training according to an embodiment of the present application;

Fig. 4 is a schematic flow chart of entity, relationship and attribute extraction according to an embodiment of the present application;

Fig. 5 is a schematic flow chart of knowledge graph construction according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a knowledge graph construction device according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

As shown in fig. 1, the embodiment of the present application provides a method for constructing a knowledge graph, which may include the steps of:

S1, determining a target industry according to the map construction requirement;

S2, acquiring characteristic training data corresponding to the target industry, and training a preselected basic large language model based on the characteristic training data to obtain a customized large language model; wherein the characteristic training data comprises a unique vocabulary of the target industry;

S3, extracting information from the target text data by utilizing the customized large language model, and generating a target knowledge base based on the extracted information; the information extraction comprises entity extraction, relation extraction and attribute extraction;

S4, constructing a knowledge graph applicable to the target industry based on the target knowledge base.

It should be noted that model training information bases corresponding to a plurality of specific industries can be constructed in advance. Before the knowledge graph is constructed, the industry or field to which the knowledge graph is to be applied is determined according to the construction requirement, and the characteristic training data of the corresponding target industry is then obtained. It can be understood that the map construction requirement can be a predefined instruction that has a fixed correspondence with a target industry, or it can be free-form construction requirement information. In the latter case, the construction requirement information can be analyzed by a preset analysis model to obtain keywords, and the corresponding target industry is determined according to the association between the keywords and the industries.
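The two forms of requirement described above can be sketched as follows. This is a minimal illustration only: the keyword table, the instruction table and the `determine_target_industry` function are hypothetical, and simple keyword counting stands in for the preset analysis model, whose form the application does not specify.

```python
from typing import Optional

# Hypothetical keyword-to-industry association table (not from the application).
INDUSTRY_KEYWORDS = {
    "fundraising": ["donation", "crowdfunding", "fundraising"],
    "insurance": ["policy", "premium", "insurance"],
    "health": ["diagnosis", "hospital", "treatment"],
}

# Hypothetical predefined instructions with a fixed industry correspondence.
PREDEFINED_INSTRUCTIONS = {"BUILD_INSURANCE_KG": "insurance"}


def determine_target_industry(requirement: str) -> Optional[str]:
    """Return the target industry for a construction requirement, or None."""
    # Case 1: the requirement is a predefined instruction.
    if requirement in PREDEFINED_INSTRUCTIONS:
        return PREDEFINED_INSTRUCTIONS[requirement]
    # Case 2: free-form text; count keyword hits per industry.
    text = requirement.lower()
    scores = {
        industry: sum(text.count(keyword) for keyword in keywords)
        for industry, keywords in INDUSTRY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: the industry cannot be determined.
    return best if scores[best] > 0 else None
```

Returning `None` when no keyword matches leaves it to the caller to ask for a clearer requirement rather than guessing an industry.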

It should be noted that, the model training information base corresponding to each specific industry can be adjusted according to actual needs to adapt to the change of industry knowledge. In the process of constructing the knowledge graph, after the characteristic training data corresponding to the target industry are acquired, the basic large language model can be trained based on the characteristic training data to obtain a customized large language model, and further information extraction can be carried out on text data of the target industry, and the knowledge graph suitable for the target industry is constructed.

It can be appreciated that, in the training process of the large language model, the data source can be pre-collected original text data corresponding to the target industry. The collected original text data can be de-duplicated and cleaned, for example by removing duplicate data, correcting spelling errors and repairing garbled characters. The cleaned data is then structured, for example by dividing it into fields such as text, title and keywords. The data is labeled with entities, relationships and attributes to form a reference data set for training and verifying the model. Finally, the reference data set is divided into a training set, a verification set and a test set, which are used for training, verifying and testing the customized model respectively.
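The final division of the reference data set might look like the sketch below. The 80/10/10 ratios, the fixed seed and the `split_dataset` name are assumptions chosen for illustration; the application does not prescribe any split proportions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    records: Sequence[T],
    train_ratio: float = 0.8,  # assumed ratio, not from the application
    val_ratio: float = 0.1,    # assumed ratio, not from the application
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle records and split them into training, verification and test lists."""
    if not 0 < train_ratio + val_ratio < 1:
        raise ValueError("train_ratio + val_ratio must lie strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test
```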

In the embodiment of the application, the target industry is determined according to the current map construction requirement, and the characteristic training data of that industry is obtained to perform customized training on the large language model. Compared with a general pre-trained model or a vertical-domain model, the customized model of the embodiment of the application can improve the information extraction effect on text data of the specific industry and improve the applicability of the constructed knowledge graph in that industry.

It should be noted that, besides performing customized training on the model based on the industry-specific vocabulary, the model architecture adjustment parameters can be preconfigured for a specific industry, and the model architecture can be adjusted based on the model architecture adjustment parameters during model training, so that a customized information extraction model more suitable for the target industry can be trained.
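One way to picture the combination of industry vocabulary and architecture adjustment parameters is as an override applied to a base configuration. The sketch below is an assumption throughout: `ModelConfig`, its three fields and `customize_config` are hypothetical names, and the application does not state which architecture parameters are adjusted.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical configuration of the basic large language model."""
    vocab_size: int
    num_layers: int
    hidden_size: int


def customize_config(
    base: ModelConfig,
    industry_terms: Iterable[str],
    base_vocab: Set[str],
    adjustments: Dict[str, int],
) -> ModelConfig:
    """Grow the vocabulary by the unseen industry terms and apply the
    preconfigured architecture adjustments (which must not set vocab_size)."""
    # dict.fromkeys removes duplicate terms while keeping their order.
    new_terms = [t for t in dict.fromkeys(industry_terms) if t not in base_vocab]
    return replace(base, vocab_size=base.vocab_size + len(new_terms), **adjustments)
```

Because the dataclass is frozen, the base configuration stays untouched and each industry gets its own derived configuration.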

It should be noted that the loss function is used to evaluate the difference between the model's output and the true value. In order to train a customized model better adapted to the target industry, a specific loss function is pre-configured and used for performance evaluation and model tuning of the basic model, so as to obtain a customized information extraction model better adapted to the target industry.
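The application does not disclose the form of the specific loss function. Purely as an illustration, the sketch below uses a class-weighted cross-entropy, a common choice when some label types (for example rare industry entities) should count more than others. The function name and the weighting scheme are assumptions.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of the negative log-probability of each true label."""
    if not labels:
        raise ValueError("at least one example is required")
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        # Clamp the probability to avoid log(0).
        total += -w * math.log(max(p[y], 1e-12))
        weight_sum += w
    return total / weight_sum
```

With all weights equal, this reduces to the ordinary mean cross-entropy; raising the weight of a class makes errors on that class dominate the value used for tuning.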

In the embodiment of the application, a plurality of information extraction strategies can be configured that differ in extraction order, extraction type, extraction threshold and the like. Before a strategy is applied to the actual information extraction operation, information is extracted from test text data under each strategy, and the extraction results are evaluated along set dimensions, such as extraction duration and extraction quantity. A different weight can be configured for each dimension and the dimensions evaluated in combination, so that the extraction strategy with the best performance is selected as the target extraction strategy finally used for information extraction.
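The weighted, multi-dimension comparison of strategies can be sketched as below. The min-max normalization, the dimension names `duration_s` and `extracted`, and the `select_strategy` function are assumptions; the application only says that dimensions are weighted and evaluated in combination.

```python
from typing import Dict, FrozenSet


def select_strategy(
    metrics: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    lower_is_better: FrozenSet[str] = frozenset({"duration_s"}),
) -> str:
    """Return the strategy name with the highest weighted, normalized score."""
    scores = {name: 0.0 for name in metrics}
    for dim, weight in weights.items():
        values = [m[dim] for m in metrics.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for name, m in metrics.items():
            # Min-max normalize to [0, 1]; ties on a dimension score 0.5.
            norm = 0.5 if span == 0 else (m[dim] - lo) / span
            if dim in lower_is_better:
                norm = 1.0 - norm  # e.g. a shorter duration is better
            scores[name] += weight * norm
    return max(scores, key=scores.get)
```

Changing the weights changes the winner: a configuration that favors extraction quantity may pick a slower strategy, while one that favors duration may pick a faster strategy that extracts less.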

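In some possible embodiments, generating the target knowledge base based on the extracted information includes:

querying a pre-configured database to determine whether a target knowledge base corresponding to the target industry exists;

if yes, integrating the extracted information into the target knowledge base;

if not, generating a target knowledge base based on the extracted information.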

In the embodiment of the application, when the target knowledge base is generated from the extracted information, the database is first queried to determine whether a knowledge base corresponding to the target industry already exists. If so, the currently extracted information is integrated directly into the existing target knowledge base; if not, a new target knowledge base is created.
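The query-then-integrate-or-create logic is small enough to sketch directly. Here the pre-configured database is modeled as an in-memory dictionary keyed by industry, and extracted information as (head, relation, tail) triples; both representations and the `upsert_knowledge_base` name are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # assumed (head, relation, tail) representation


def upsert_knowledge_base(
    database: Dict[str, Set[Triple]],
    industry: str,
    extracted: Iterable[Triple],
) -> bool:
    """Integrate extracted triples into the industry's knowledge base.

    Returns True if a knowledge base for the industry already existed,
    False if a new one was created.
    """
    existed = industry in database
    if existed:
        # Integrate into the existing knowledge base; duplicates are ignored.
        database[industry].update(extracted)
    else:
        # No knowledge base yet: create one from the extracted information.
        database[industry] = set(extracted)
    return existed
```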

It should be noted that, besides extracting information through the information extraction model, a user may extend and optimize the knowledge base in other ways. By monitoring knowledge base information update events in real time, the knowledge base can be updated with the newly added information, so that a complete and up-to-date knowledge graph can be constructed according to actual requirements.
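A simple way to react to knowledge base update events is a listener that fires whenever genuinely new information is added. The sketch below assumes an in-process callback mechanism; the class and function names are hypothetical, and a real deployment might instead use database triggers or a message queue, which the application does not specify.

```python
from typing import Any, Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


class MonitoredKnowledgeBase:
    """Knowledge base that notifies listeners when new triples are fused in."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()
        self._listeners: List[Callable[[Set[Triple]], None]] = []

    def on_update(self, listener: Callable[[Set[Triple]], None]) -> None:
        self._listeners.append(listener)

    def add(self, triples: Iterable[Triple]) -> None:
        new = set(triples) - self.triples
        if not new:
            return  # nothing new, so no update event is raised
        self.triples |= new
        for listener in self._listeners:
            listener(new)


def attach_rebuild(
    kb: MonitoredKnowledgeBase,
    build: Callable[[Set[Triple]], Any],
) -> Dict[str, Any]:
    """Rebuild the graph with `build` on every update; return a holder dict."""
    holder = {"graph": build(kb.triples)}

    def _rebuild(_new: Set[Triple]) -> None:
        holder["graph"] = build(kb.triples)

    kb.on_update(_rebuild)
    return holder
```

The holder always contains a graph built from the latest knowledge base, whether the new information came from the extraction model or from a manual edit.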

Referring to Figs. 2 to 5, a method for constructing a knowledge graph according to an embodiment of the present application is illustrated below with a specific example. Taking the fundraising, insurance and health industries as examples, the embodiment of the application can be implemented through the following steps:

1. Referring to Fig. 2, data collection and preprocessing:

a. Text data relating to the fundraising, insurance and health industries is collected from a variety of channels, such as networks, databases and API interfaces.

b. The collected original text data is de-duplicated and cleaned, for example by removing duplicate data, correcting spelling errors and repairing garbled characters.

c. The cleaned data is structured, for example by dividing it into fields such as text, title and keywords.

d. The data is manually annotated with entities, relationships and attributes to form a reference data set for training and verifying models.

Collecting and preprocessing text data for the fundraising, insurance and health industries in this way improves the quality of the training data.
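Steps b and c above can be sketched as a small cleaning pass. The sketch assumes each raw document is a dictionary with optional `title` and `text` keys, treats the Unicode replacement character as the residue of garbled text, and omits spelling correction and manual annotation, which need external resources. All function names are hypothetical.

```python
import re
import unicodedata
from typing import Dict, Iterable, List


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop decoding residue and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\ufffd", "")  # replacement characters left by bad decoding
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def preprocess(raw_docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Clean, structure and de-duplicate raw documents (by cleaned body text)."""
    seen = set()
    records = []
    for doc in raw_docs:
        record = {
            "title": clean_text(doc.get("title", "")),
            "text": clean_text(doc.get("text", "")),
        }
        # Skip empty documents and exact duplicates of an earlier body text.
        if not record["text"] or record["text"] in seen:
            continue
        seen.add(record["text"])
        records.append(record)
    return records
```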

2. Referring to Fig. 3, customized large language model training:

a. An appropriate large language model is selected as the base model.

b. The preprocessed reference data set is divided into a training set, a verification set and a test set, which are used for training, verifying and testing the model respectively.

c. Characteristic training data corresponding to the characteristics of the fundraising, insurance and health industries is acquired to perform customized training on the large language model, for example by adding industry-specific vocabulary, adjusting the model architecture and adopting a specific loss function.

d. The optimal model parameters and training strategy are found through performance evaluation and model tuning on the verification set.
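Performance evaluation on the verification set needs a concrete metric. The sketch below reads the accuracy and coverage rate mentioned in this application as precision and recall over extracted items, which is an interpretive assumption, and adds F1 as a single tuning target. The function name is hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(
    predicted: Set[Hashable],
    gold: Set[Hashable],
) -> Tuple[float, float, float]:
    """Compare extracted items with the annotated verification items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Candidate parameter settings can then be compared by computing these values on the same verification set and keeping the setting with the best score.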

3. Referring to Fig. 4, entity, relationship and attribute extraction:

a. The custom-trained large language model is applied to text data from the fundraising, insurance and health industries.

b. Entity extraction is performed on the text by using the model; the extracted entities include fundraising project entities, insurance purchase entities and the like.

c. Relation extraction is performed on the extracted entities, for example donation relationships and family relationships among the entities.

d. Attribute extraction is performed on the extracted entities, for example the amount of funds raised and the term of an insurance purchase.

e. The extraction method with the best performance is selected by comparing different algorithms and parameter settings.
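The three kinds of extraction result can be held in simple typed records. The sketch below assumes the customized model is prompted to answer in JSON with `entities`, `relations` and `attributes` keys; that output format, the record classes and `parse_extraction` are all assumptions, since the application does not describe the model's output format.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


@dataclass(frozen=True)
class Relation:
    head: str
    label: str
    tail: str


@dataclass(frozen=True)
class Attribute:
    entity: str
    key: str
    value: str


def parse_extraction(
    model_output: str,
) -> Tuple[List[Entity], List[Relation], List[Attribute]]:
    """Parse assumed JSON model output into entities, relations and attributes."""
    data = json.loads(model_output)
    entities = [Entity(e["name"], e["type"]) for e in data.get("entities", [])]
    known = {e.name for e in entities}
    # Keep only relations and attributes that refer to extracted entities.
    relations = [
        Relation(r["head"], r["label"], r["tail"])
        for r in data.get("relations", [])
        if r["head"] in known and r["tail"] in known
    ]
    attributes = [
        Attribute(a["entity"], a["key"], str(a["value"]))
        for a in data.get("attributes", [])
        if a["entity"] in known
    ]
    return entities, relations, attributes
```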

4. Referring to Fig. 5, knowledge graph construction:

a. The extracted entities, relations and attributes are integrated into a knowledge base to construct a basic knowledge graph structure.

b. The knowledge graph is expanded and optimized, for example by adding new entity types and relationship types.

c. Automatic updating and supplementing of the knowledge graph is realized by fusion with the existing knowledge base.

d. A knowledge graph reflecting the characteristics of the fundraising, insurance and health industries is constructed, providing data support and a decision basis for enterprises.
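Step a above, assembling a basic graph structure, can be sketched with plain dictionaries. The node and edge layout and the `build_knowledge_graph` name are assumptions for illustration; a production system would more likely write to a graph database, which the application does not name.

```python
from typing import Any, Dict, Iterable, Tuple


def build_knowledge_graph(
    entities: Iterable[Tuple[str, str]],         # (name, type)
    relations: Iterable[Tuple[str, str, str]],   # (head, label, tail)
    attributes: Iterable[Tuple[str, str, str]],  # (entity, key, value)
) -> Dict[str, Any]:
    """Assemble nodes with attributes and labeled edges between known nodes."""
    graph: Dict[str, Any] = {"nodes": {}, "edges": []}
    for name, entity_type in entities:
        graph["nodes"][name] = {"type": entity_type, "attributes": {}}
    for entity, key, value in attributes:
        if entity in graph["nodes"]:
            graph["nodes"][entity]["attributes"][key] = value
    for head, label, tail in relations:
        # Edges are only added between entities that exist as nodes.
        if head in graph["nodes"] and tail in graph["nodes"]:
            graph["edges"].append((head, label, tail))
    return graph
```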

Compared with the prior art, the large language model is custom-trained for the characteristics of the fundraising, insurance and health industries so as to fully mine the characteristics of the industry data, which improves the accuracy and coverage rate of entity, relationship and attribute extraction. In addition, the extraction method based on the custom-trained large language model can automatically identify and extract the relevant entities, relationships and attributes from text data of the fundraising, insurance and health industries. Compared with a general pre-trained model or a vertical-domain model, it achieves higher accuracy and coverage rate in extraction, thereby improving the applicability of the knowledge graph in specific industries.

Referring to fig. 6, fig. 6 is a block diagram illustrating a knowledge graph construction apparatus according to some embodiments of the present application. It should be understood that the knowledge graph construction apparatus corresponds to the above-described method embodiment of fig. 1, and is capable of performing the steps involved in the above-described method embodiment, and specific functions of the knowledge graph construction apparatus may be referred to the above description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy.

The knowledge graph construction apparatus of fig. 6 includes at least one software function module which can be stored in a memory in the form of software or firmware or solidified in the knowledge graph construction apparatus, and includes:

an industry determination module 610, configured to determine a target industry according to a map construction requirement;

the model training module 620 is configured to obtain characteristic training data corresponding to the target industry, and train a pre-selected basic large language model based on the characteristic training data to obtain a customized large language model; wherein the characteristic training data comprises a unique vocabulary of the target industry;

the information extraction module 630 is configured to extract information from the target text data by using the customized large language model, and generate a target knowledge base based on the extracted information; the information extraction comprises entity extraction, relation extraction and attribute extraction;

and the map construction module 640 is used for constructing a knowledge map applicable to the target industry based on the target knowledge base.
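The division into four software function modules can be pictured as a pipeline of pluggable callables. The sketch below is only one possible arrangement under assumed interfaces: the class name, field names and callable signatures are hypothetical, and the mapping to modules 610 to 640 is indicated in comments.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class KnowledgeGraphBuilder:
    determine_industry: Callable[[str], str]              # industry determination module 610
    train_model: Callable[[str], Callable[[str], Any]]    # model training module 620
    build_knowledge_base: Callable[[Iterable[Any]], Any]  # part of information extraction module 630
    build_graph: Callable[[Any], Any]                     # map construction module 640

    def run(self, requirement: str, texts: Iterable[str]) -> Any:
        """Run the four modules in sequence on the given texts."""
        industry = self.determine_industry(requirement)
        extract = self.train_model(industry)  # returns the customized extractor
        knowledge_base = self.build_knowledge_base(extract(t) for t in texts)
        return self.build_graph(knowledge_base)
```

Keeping each module behind a callable makes it possible to swap one implementation, for example a different base model, without touching the others.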

It can be understood that the embodiment of the device is corresponding to the embodiment of the method of the present application, and the device for constructing a knowledge graph provided by the embodiment of the present application can implement the method for constructing a knowledge graph provided by any one of the embodiments of the method of the present application.

It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding procedure in the foregoing method for the specific working procedure of the apparatus described above, and this will not be repeated here.

As shown in fig. 7, some embodiments of the present application provide an electronic device 700, the electronic device 700 comprising: the memory 710, the processor 720, and a computer program stored on the memory 710 and executable on the processor 720, wherein the processor 720 can implement the method of any embodiment as included in the knowledge graph construction method described above when reading the program from the memory 710 via the bus 730 and executing the program.

Processor 720 may process digital signals and may include various computing structures, such as a complex instruction set computer architecture, a reduced instruction set computer architecture, or an architecture that implements a combination of instruction sets. In some examples, processor 720 may be a microprocessor.

Memory 710 may be used for storing instructions to be executed by processor 720 or data related to execution of the instructions. Such instructions and/or data may include code to implement some or all of the functions of one or more of the modules described in embodiments of the present application. The processor 720 of the disclosed embodiments may be configured to execute instructions in the memory 710 to implement the methods shown above. Memory 710 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.

Some embodiments of the application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the method embodiment.

Some embodiments of the application also provide a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference is made to the description of the method embodiments for relevant points.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, of the flowcharts and block diagrams in the figures that illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

The above description is only an example of the present application and is not intended to limit the scope of the present application; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A knowledge graph construction method, characterized by comprising the following steps:

determining a target industry according to the map construction requirement;

2. The knowledge-graph construction method according to claim 1, wherein the characteristic training data further includes model architecture adjustment parameters corresponding to the target industry; the training of the pre-selected basic large language model based on the characteristic training data comprises:

3. The knowledge-graph construction method according to claim 1, wherein the characteristic training data further includes a specific loss function corresponding to the target industry; the training of the pre-selected basic large language model based on the characteristic training data comprises:

4. The knowledge-graph construction method according to claim 1, further comprising, before the information extraction of the target text data using the customized large language model:

5. The knowledge-graph construction method according to claim 1, wherein the generating a target knowledge base based on the extracted information comprises:

if yes, integrating the extracted information into the target knowledge base;

if not, generating a target knowledge base based on the extracted information.

6. The knowledge-graph construction method according to claim 1, characterized by further comprising, after the knowledge-graph applicable to the target industry is constructed based on the target knowledge base:

7. A knowledge graph construction device, characterized by comprising:

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for constructing a knowledge graph according to any one of claims 1-6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the method of constructing a knowledge-graph according to any one of claims 1-6.

10. A computer program product, characterized in that the computer program product comprises a computer program, wherein the computer program, when executed by a processor, implements the method for constructing a knowledge-graph according to any one of claims 1-6.