CN117150050A - Knowledge graph construction method and system based on large language model - Google Patents

Knowledge graph construction method and system based on large language model

Info

Publication number
CN117150050A
CN117150050A (Application No. CN202311423122.7A)
Authority
CN
China
Prior art keywords
knowledge
text
language model
entity
cot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311423122.7A
Other languages
Chinese (zh)
Other versions
CN117150050B (en)
Inventor
赵策
王亚
屠静
苏岳
万晶晶
李伟伟
孙岩
颉彬
周勤民
张玥
潘亮亮
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuoshi Future Beijing technology Co ltd
Original Assignee
Zhuoshi Future Beijing technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuoshi Future Beijing technology Co ltd filed Critical Zhuoshi Future Beijing technology Co ltd
Priority to CN202311423122.7A, granted as CN117150050B
Publication of CN117150050A
Application granted
Publication of CN117150050B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/042Knowledge-based neural networks; Logical representations of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a knowledge graph construction method and system based on a large language model, belonging to the technical field of text processing. The method comprises the following steps: performing text clustering on knowledge text data to obtain a knowledge text data set T containing a plurality of different text types; submitting the knowledge text data set T to a first HDFS for distributed file storage; extracting the knowledge text types from the first HDFS in order of their length, and performing knowledge entity recognition on the extracted knowledge text types with a preset large language model CoT to obtain the association information of each knowledge entity; submitting the association information of each knowledge entity to a second HDFS for distributed file storage; and constructing graph node links among the knowledge entities according to the association information of the knowledge entities stored in the second HDFS to obtain a knowledge graph. The invention can adapt to the language processing and storage requirements of massive knowledge text data, and can handle knowledge graph construction for large-scale sets of text types.

Description

Knowledge graph construction method and system based on large language model
Technical Field
The invention relates to the technical field of text processing, in particular to a knowledge graph construction method and system based on a large language model.
Background
A Knowledge Graph is a framework that visually displays the core structure, development history, frontier fields and overall knowledge of a discipline. It presents a complex knowledge domain through data mining, information processing, knowledge measurement and graph drawing, reveals the dynamic development laws of the knowledge domain, and provides a practical and valuable reference for discipline research. In library and information science, the knowledge graph is also called knowledge domain visualization or knowledge domain mapping: a family of graphs showing the development process and structural relationships of knowledge, in which visualization techniques are used to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelationships between items of knowledge.
The basic building block of a knowledge graph is the entity-relation-entity triple; entities and their attribute-value pairs are connected to one another through relations to form a networked knowledge structure. The general flow is as follows:
extract entities and entity relations from the knowledge text data, and establish a knowledge network graph between the entities according to the extracted relations.
However, the traditional knowledge graph construction process mainly handles a single text, or only two or three texts, and is therefore only suitable for extraction over small-scale data sets. Processing a data set containing more than two types of text data is very laborious, and such a process cannot quickly adapt to or handle knowledge graph construction over large-scale sets of text types. For a large-scale data set containing multiple text types, constructing a knowledge graph with the traditional single-entity extraction method is slow, can only process one type at a time, and results in a long knowledge graph generation cycle. It is therefore ill-suited to the current demands of big data development.
Moreover, when facing a large text data set, the traditional knowledge graph construction method lacks the data storage capacity required for a big-data graph, which easily leads to insufficient memory and machine lock-up.
Disclosure of Invention
The embodiment of the invention provides a knowledge graph construction method and a knowledge graph construction system based on a large language model, which can adapt to language processing and storage functions of massive knowledge text data and process knowledge graph construction of large-scale text types. The technical scheme is as follows:
in one aspect, a knowledge graph construction method based on a large language model is provided, and the method is applied to electronic equipment and comprises the following steps:
acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
carrying out text clustering on the preprocessed knowledge text data to obtain a knowledge text data set T containing a plurality of different text types; wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, ...}.
Submitting the knowledge text data set T to a first HDFS for distributed file storage; wherein HDFS denotes the Hadoop Distributed File System;
according to the length of the knowledge text type, extracting corresponding knowledge text types from the first HDFS in order, and carrying out knowledge entity identification on the extracted knowledge text types by adopting a preset large language model CoT to obtain associated information of each knowledge entity;
submitting the associated information of each knowledge entity to a second HDFS, and storing distributed files;
submitting the associated information of each knowledge entity to a knowledge graph construction module, and constructing graph node links among the knowledge entities by the knowledge graph construction module according to the associated information of each knowledge entity stored in the second HDFS to obtain a knowledge graph.
Further, the text clustering of the preprocessed knowledge text data to obtain a knowledge text data set T containing a plurality of different text types includes:
constructing a support vector machine, and deploying the support vector machine on a background server;
the preprocessed knowledge text data is sent to the background server to serve as a text clustering sample, and the background server forwards the text clustering sample to the support vector machine to perform text clustering;
the support vector machine performs text structure recognition and clustering processing on the samples by using a support vector clustering algorithm to obtain a plurality of knowledge text types with different text types and outputs the knowledge text types;
and the background server gathers the knowledge text types of a plurality of different text types to obtain the knowledge text data set T.
Further, submitting the knowledge text data set T to a first HDFS, and performing distributed file storage includes:
calculating the text type length of each item of the knowledge text type in the knowledge text data set T, and marking the calculated length value on each item of the knowledge text type;
arranging each knowledge text type in the knowledge text data set T in descending order of its length value, and rearranging the knowledge text data set T accordingly;
traversing all storage nodes of a first HDFS, checking available storage nodes, and sequentially storing all knowledge text types in the rearranged knowledge text data set T in the storage nodes of the first HDFS according to a rearrangement sequence;
and sending the storage addresses of the knowledge text data blocks to a background server.
Further, the sequentially extracting the corresponding knowledge text types from the first HDFS according to the length of the knowledge text types includes:
retrieving each knowledge text type in order from the rearranged knowledge text data set T according to its length value, and sending it to the large language model CoT.
Further, the constructing step of the large language model CoT includes:
acquiring training data for training a large language model CoT, wherein the training data comprises text data of different text types/structures;
selecting a GPT natural language processing model, and learning and training a knowledge entity, an association relation of the knowledge entity and an attribute of the knowledge entity in the training data;
when training reaches a preset optimization iteration training condition, stopping training, and generating the large language model CoT;
testing the large language model CoT by using the obtained test set, and judging whether the prediction accuracy of the large language model CoT meets the standard;
and after reaching the standard, optimally training the large language model CoT by using a real-time knowledge text sample, and deploying the large language model CoT on a background server after the optimization training is finished.
Further, the step of performing knowledge entity recognition on the extracted knowledge text type by using a preset large language model CoT, and obtaining association information of each knowledge entity includes:
inputting the knowledge text types which are sequentially called into the large language model CoT;
carrying out knowledge entity identification on the knowledge text type by using the large language model CoT to obtain each knowledge entity contained in the knowledge text type;
extracting the association relation between the knowledge entities according to the context of the knowledge entities, and extracting the attribute information of the knowledge entities;
and outputting the association information formed by the association relation and the attribute information of each knowledge entity.
Further, after knowledge entity identification is performed on the extracted knowledge text type by adopting a preset large language model CoT to obtain associated information of each knowledge entity, the method further comprises:
and carrying out knowledge entity identification on the knowledge text type by using a graph neural network GNN to obtain the associated information of each knowledge entity, and carrying out contrast verification and result correction on the associated information obtained by using the graph neural network GNN and the associated information obtained by using the large language model CoT.
Further, submitting the association information of each knowledge entity to a knowledge graph construction module, wherein the knowledge graph construction module constructs graph node links between each knowledge entity according to the association information of each knowledge entity stored in the second HDFS, and obtaining a knowledge graph includes:
a knowledge graph construction tool TopBraid Composer is deployed in advance on the knowledge graph construction module;
transmitting the association information of the knowledge entities to the TopBraid Composer, and reading the association relation between the knowledge entities and the attribute information of the knowledge entities in the association information by the TopBraid Composer;
distributing corresponding map nodes for the knowledge entities, and establishing map links among the map nodes according to the read association relation among the knowledge entities and the attribute information of the knowledge entities to obtain a knowledge map;
and storing the knowledge graph into a dynamic database Nosql for supporting dynamic application of the knowledge graph.
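The graph-construction step above (allocate a graph node per knowledge entity, then link nodes according to the read association relations and attributes) can be sketched in miniature. The patent delegates this work to TopBraid Composer and stores the result in a NoSQL database; the dict-based graph below is only an illustration of the node/edge bookkeeping, not that tool's API, and the entity names used are hypothetical.

```python
# Illustrative sketch of graph node linking: one node per knowledge
# entity (carrying its attributes), one edge per association relation.
# This is a stand-in for the TopBraid Composer step, not its real API.
def build_graph(association_info):
    graph = {"nodes": {}, "edges": []}
    for item in association_info:
        # allocate map nodes and attach attribute information
        for entity, attrs in item["attributes"].items():
            graph["nodes"].setdefault(entity, {}).update(attrs)
        # establish map links from the association relations
        for head, rel, tail in item["relations"]:
            graph["nodes"].setdefault(head, {})
            graph["nodes"].setdefault(tail, {})
            graph["edges"].append((head, rel, tail))
    return graph

# Hypothetical association-information packet for one knowledge text type
kg = build_graph([{
    "attributes": {"Alice": {"role": "engineer"}},
    "relations": [("Alice", "works_for", "AcmeCorp")],
}])
```

In a real deployment the resulting graph would then be persisted to the NoSQL store so that dynamic applications of the knowledge graph can query it.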
In one aspect, a knowledge graph construction system based on a large language model is provided, including:
the acquisition module is used for acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
the clustering module is used for carrying out text clustering on the preprocessed knowledge text data to obtain a knowledge text data set T containing a plurality of different text types; wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, ...}.
The first storage module is used for carrying out distributed file storage on the knowledge text data set T;
the recognition module is used for orderly extracting the corresponding knowledge text types from the first storage module according to the length of the knowledge text types, and carrying out knowledge entity recognition on the extracted knowledge text types by adopting a preset large language model CoT to obtain the associated information of each knowledge entity;
the second storage module is used for carrying out distributed file storage on the associated information of each knowledge entity;
and the construction module is used for constructing graph node links among the knowledge entities according to the associated information of the knowledge entities stored in the second storage module to obtain a knowledge graph.
In one aspect, an electronic device is provided, where the electronic device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the knowledge graph construction method based on a large language model.
In one aspect, a computer readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the knowledge graph construction method based on a large language model.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
1) The knowledge graph construction method based on the large language model CoT uses strong natural language processing capability to convert knowledge text data into knowledge graph form, so that knowledge can be better understood and organized. It is expected to provide a more powerful tool for knowledge management and application in various fields, and can quickly and efficiently generate high-capacity, wide-coverage knowledge graphs with a shortened generation cycle;
2) Clustering and distributed storage of the knowledge text data are used to store massive knowledge text data in clusters, and the association information of each knowledge entity extracted by the large language model CoT is stored in a distributed manner. This solves the problem that the prior art cannot adapt to the language processing and storage requirements of massive knowledge samples, settles the storage and data-processing flow for knowledge samples, adapts to the storage and retrieval of association information for massive knowledge samples, and quickly adapts to and handles knowledge graph construction for large-scale sets of text types;
drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a knowledge graph construction method based on a large language model according to an embodiment of the present invention;
FIG. 2 is a detailed flow chart of a knowledge graph construction method based on a large language model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a distributed file storage flow according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
As shown in fig. 1 and fig. 2, an embodiment of the present invention provides a knowledge graph construction method based on a large language model, where the method may be implemented by an electronic device, and the electronic device may be a terminal or a server, and the method includes:
s101, acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
in this embodiment, a large amount of knowledge text data may be collected as a sample for constructing a knowledge graph from a plurality of data sources, such as text documents, web pages, databases, log files, social media, and other channels, using tools such as web crawlers or APIs, where the knowledge text data includes: entity, relationship, and attribute information of the knowledge graph.
In this embodiment, the preprocessing is mainly data cleaning: duplicate, invalid or erroneous data is removed, data with inconsistent formats is processed, and format standardization is performed to ensure data quality.
S102, carrying out text clustering on the preprocessed knowledge text data to obtain a knowledge text data set T of different text types (or structures); wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, ...}. The method specifically comprises the following steps:
a1, constructing a support vector machine (SVC), and deploying the support vector machine in a background server;
a2, the preprocessed knowledge text data is sent to the background server to serve as a text clustering sample, and the background server forwards the knowledge text data to the support vector machine to perform text clustering;
a3, the support vector machine utilizes a support vector clustering algorithm to perform text structure recognition and clustering processing on the samples to obtain a plurality of knowledge text types with different text types and output the knowledge text types;
in this embodiment, in order to improve the efficiency of text storage and recognition of the samples, the samples may be classified. If the samples come from an enterprise, they may be collected from the enterprise's log database, document database, technical document repository, etc., and the collected samples are sent to the background server for clustering.
In this embodiment, a supervised learning approach may be used to classify the samples, for example using a generalized linear classifier to perform nonlinear classification of massive knowledge text data. Specifically, a support vector machine employing a support vector clustering algorithm is used: a support vector machine corresponding to the clustering algorithm is constructed and deployed on the background server to perform the clustering operation.
In this embodiment, the support vector machine performs text structure recognition and clustering processing on the samples, and performs sample classification, so as to classify the input knowledge text data into sample sets of different text types, i.e., knowledge text types of independent text types.
And A4, the background server gathers the outputted knowledge text types of a plurality of different text types to obtain the knowledge text data set T.
In this embodiment, the support vector machine outputs each sample set, and the background server gathers the sample sets to form the knowledge text data set T. At this point, the order of the knowledge text types within the knowledge text data set T is arbitrary, so to facilitate distributed storage, ordered distributed storage is performed using the HDFS.
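The clustering flow of steps A1-A4 can be sketched in miniature. The patent deploys a support vector machine running a support vector clustering algorithm on a background server; as a hedged stand-in, the sketch below groups texts by a simple bag-of-words cosine threshold, purely to illustrate how knowledge text data is partitioned into the knowledge text types that are then gathered into T. The threshold value and the sample texts are illustrative assumptions.

```python
# Minimal stand-in for the clustering step (S102): partition knowledge
# texts into type groups by bag-of-words cosine similarity. A real
# implementation would use the support vector clustering algorithm.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_texts(texts, threshold=0.5):
    """Assign each text to the first cluster whose centroid is similar enough."""
    clusters = []  # each item: {"centroid": Counter, "members": [str]}
    for text in texts:
        vec = Counter(text.lower().split())
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["members"].append(text)
                c["centroid"].update(vec)
                break
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    # the gathered result plays the role of T = {type 1, type 2, ...}
    return [c["members"] for c in clusters]

T = cluster_texts([
    "server log error disk full",
    "server log error memory low",
    "contract clause party obligations",
])
```

Here the two log-like texts fall into one knowledge text type and the contract text into another, which is the shape of the set T that the background server hands to the first HDFS.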
S103, submitting the knowledge text data set T to a first HDFS for distributed file storage; wherein HDFS denotes the Hadoop Distributed File System; as shown in fig. 3, this specifically includes the following steps:
b1, calculating the text type length of each item of the knowledge text type in the knowledge text data set T, and marking the calculated length value on each item of the knowledge text type;
in this embodiment, the text type length of a knowledge text type refers to the number of characters it can store (the capacity of one knowledge text type). A certain storage space is allocated in advance for each knowledge text type according to its length value (the storage capacities of the storage nodes in the HDFS may differ), which avoids the time cost of capacity matching during storage. Accordingly, when designing database tables or data models, field-length limitations need to be considered for reasonable storage of the text.
B2, arranging all the knowledge text types in the knowledge text data set T in descending order of length value, and rearranging the knowledge text data set T;
in this embodiment, the text type length of each knowledge text type in the knowledge text data set T may be read automatically by the background server from the file attributes of the knowledge text type; the knowledge text types are then arranged in descending order of length value, and the knowledge text data set T is reordered. After rearrangement, the item with the largest text type length in the original knowledge text data set T, say knowledge text type 3, is placed in the first position, replacing knowledge text type 1; the other knowledge text types are ranked in the same way. If two length values are equal, their relative order does not matter.
In this embodiment, the larger the length value of the text type, the more preferentially processed, the preferentially distributed stored and preferentially used for the subsequent CoT identification.
B3, traversing all storage nodes of the first HDFS, checking available storage nodes, and sequentially storing all knowledge text types in the rearranged knowledge text data set T in the storage nodes of the first HDFS according to a rearrangement sequence;
in this embodiment, in order to improve text storage efficiency and allow the knowledge text types to be retrieved and processed from the first HDFS in order, the knowledge text types in the knowledge text data set T are stored in an ordered, distributed manner. Specifically: after sorting, each storage node of the first HDFS is traversed, available storage nodes are checked, and all knowledge text types in the rearranged knowledge text data set T are stored, in the rearranged order, on storage nodes of the first HDFS whose storage capacity has been matched in advance.
In this embodiment, the first HDFS may perform distributed storage processing on each item of the knowledge text type in the knowledge text data set T by using a batch processing function, such as Apache Spark, and store each item of the knowledge text type onto an idle storage node.
And B4, transmitting the storage addresses of the knowledge text data blocks to a background server.
In this embodiment, the storage addresses of the knowledge text data blocks are sent to the background server, so that the background server can conveniently call the data of each knowledge text type according to the address response.
S104, sequentially extracting the corresponding knowledge text types from the first HDFS according to the length of the knowledge text types, and carrying out knowledge entity identification on the extracted knowledge text types by adopting a preset large language model CoT to obtain the associated information of each knowledge entity;
in this embodiment, the knowledge text types are fetched in order from the rearranged knowledge text data set T according to their length values and sent to the large language model CoT.
In this embodiment, the step of constructing the large language model CoT includes:
c1, acquiring training data for training a large language model CoT, wherein the training data comprises text data of different text types/structures;
in this embodiment, the training data may be structured, semi-structured, or unstructured data, such as text, audio, video, graphics, and the like.
C2, selecting a GPT natural language processing model, and learning and training the knowledge entity, the association relation of the knowledge entity and the attribute of the knowledge entity in the training data;
in this embodiment, the training of the large language model CoT may follow the training procedures of existing deep learning techniques, such as CNNs.
In this embodiment, a large language model suitable for CoT, such as GPT4, is selected to learn and identify the knowledge entities in the knowledge text data, the association relations of the knowledge entities, and the attributes of the knowledge entities, extracting the knowledge entities, their attributes, and the association relations contained in the text data.
C3, stopping training when training reaches preset optimization iteration training conditions, and generating the large language model CoT;
c4, testing the large language model CoT by using the obtained test set, and judging whether the prediction accuracy of the large language model CoT meets the standard or not;
in this embodiment, the Accuracy metric may be used to evaluate the accuracy of the large language model CoT generated by training; if the accuracy reaches 0.95, training is stopped.
And C5, optimally training the large language model CoT by using a real-time knowledge text sample after reaching the standard, and deploying the large language model CoT on a background server after the optimization training is finished.
In this embodiment, on-site knowledge text data may also be used to perform real-time optimization training of the large language model CoT, so that it learns the text features of the deployment site in real time. The optimization data set is determined according to the knowledge sources and the type of knowledge graph required by the user.
In this embodiment, performing knowledge entity identification on the extracted knowledge text types with the preset large language model CoT to obtain the associated information of each knowledge entity may specifically include the following steps:
D1, inputting the sequentially retrieved knowledge text types into the large language model CoT;
D2, performing knowledge entity identification on each knowledge text type with the large language model CoT to obtain the knowledge entities contained in that knowledge text type;
D3, extracting the association relations among the knowledge entities from their context, and extracting the attribute information of each knowledge entity;
D4, outputting the associated information formed by the association relations and the attribute information of each knowledge entity.
In this embodiment, for each extracted knowledge text type, the large language model CoT outputs the identified knowledge entities, together with each entity's attribute information and the association relations among the entities, as the associated information corresponding to that knowledge text type.
In this embodiment, the large language model CoT can identify the entities, relations, and attributes contained in the text of each knowledge text type, convert the text data into nodes and edges of the knowledge graph, map the concepts in the text into the graph structure, and determine the relations between entities from the text context.
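Steps D1 through D4 can be sketched as a prompt-and-parse routine. `call_llm` below is a hypothetical callable wrapping the deployed CoT model — the patent does not fix this interface — and the JSON reply shape is likewise an assumption:

```python
import json

PROMPT_TEMPLATE = (
    "Identify every knowledge entity in the text below, each entity's attributes, "
    "and the relations between entities. Answer as JSON with the keys "
    "'entities', 'relations', and 'attributes'.\n\nText: {text}"
)


def extract_association_info(knowledge_text, call_llm):
    """D1-D4: feed one knowledge text type to the model and parse its association info."""
    reply = call_llm(PROMPT_TEMPLATE.format(text=knowledge_text))  # D1/D2
    data = json.loads(reply)
    return {
        "entities":   data.get("entities", []),    # D2: entities found in the text
        "relations":  data.get("relations", []),   # D3: e.g. ["A", "founder_of", "B"]
        "attributes": data.get("attributes", {}),  # D3: entity -> {attribute: value}
    }
```

A stub such as `lambda prompt: '{"entities": ["A"], "relations": [], "attributes": {}}'` exercises the parsing path without a live model.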
In this embodiment, the output may be packaged: a data packet containing the associated information is produced for each knowledge text type, and the data packet for each knowledge entity is submitted to the second HDFS for distributed storage.
In this embodiment, after the preset large language model CoT has performed knowledge entity identification on the extracted knowledge text types and the associated information of each knowledge entity has been obtained, the method further includes:
performing knowledge entity identification on the knowledge text types with a graph neural network GNN to obtain the associated information of each knowledge entity, and performing contrast verification and result correction between the associated information obtained by the graph neural network GNN and the associated information obtained by the large language model CoT.
As shown in fig. 2, to further improve the accuracy of the text recognition result, in this embodiment the graph neural network GNN on the background server may also perform knowledge entity identification on the knowledge text types, and the GNN recognition result is used to verify the accuracy of the large language model CoT's recognition of each knowledge text type.
In this embodiment, the GNN's knowledge entity recognition of a knowledge text type may follow the text-recognition process of the large language model CoT. The background server compares the outputs of the two models to determine whether the recognition result of the large language model CoT differs greatly from that of the graph neural network GNN; if it does, the GNN's recognition result may be used to correct the associated information output by the large language model CoT; otherwise no correction is made. An administrator may decide whether to intervene in the correction as needed.
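The comparison in this step can be made concrete by measuring the overlap between the two models' triple sets. The 0.8 threshold and the merge-as-correction policy below are illustrative choices, not prescribed by the patent:

```python
def jaccard(a, b):
    """Overlap of two triple collections; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cross_verify(cot_triples, gnn_triples, threshold=0.8):
    """Keep the CoT output when the two models agree closely; otherwise merge in
    the GNN result as a correction and flag it for possible administrator review."""
    if jaccard(cot_triples, gnn_triples) >= threshold:
        return sorted(set(cot_triples)), False               # no large difference
    return sorted(set(cot_triples) | set(gnn_triples)), True  # corrected, flagged
```

The boolean flag models the administrator-intervention decision point described above.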
S105, submitting the associated information of each knowledge entity to a second HDFS for distributed file storage;
In this embodiment, the specific storage manner of "submitting the associated information of each knowledge entity to the second HDFS for distributed file storage" may follow the above scheme of "submitting the knowledge text data set T to the first HDFS for distributed file storage".
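The length-ordered placement referenced here (and spelled out for the first HDFS) can be sketched as follows; the round-robin node assignment is only a stand-in for HDFS's actual block-placement policy, and the address format is illustrative:

```python
def store_by_length(text_types, storage_nodes):
    """Tag each knowledge text type with its length, arrange longest-first,
    and assign each item to an available storage node in rearranged order.
    Returns {name: (node, rank)} as the recorded storage addresses."""
    ranked = sorted(text_types.items(), key=lambda kv: len(kv[1]), reverse=True)
    return {
        name: (storage_nodes[rank % len(storage_nodes)], rank)  # round-robin placement
        for rank, (name, _text) in enumerate(ranked)
    }
```

The returned address map corresponds to the storage addresses sent back to the background server, from which the recognition module later retrieves the text types in the same length order.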
S106, submitting the associated information of each knowledge entity to a knowledge graph construction module, and constructing graph node links among the knowledge entities by the knowledge graph construction module according to the associated information of each knowledge entity stored in the second HDFS to obtain a knowledge graph; the method specifically comprises the following steps:
E1, pre-deploying the knowledge graph construction tool TopBraid Composer on the knowledge graph construction module;
In this embodiment, TopBraid Composer is a knowledge graph construction tool based on Semantic Web technology. It can allocate a graph node to each knowledge entity, attach each entity's attributes to its node, and establish graph links between the corresponding nodes according to the relations among the knowledge entities, linking the nodes into a graph network.
E2, sending the associated information of the knowledge entities to TopBraid Composer, which reads the association relations among the knowledge entities and the attribute information of each knowledge entity from the associated information;
E3, allocating a corresponding graph node to each knowledge entity, and establishing graph links among the nodes according to the read association relations and attribute information, thereby obtaining the knowledge graph;
E4, storing the knowledge graph in a dynamic NoSQL database to support dynamic applications of the knowledge graph.
In this embodiment, constructing the knowledge graph mainly includes:
Entity identification: identifying entities (such as person names, place names, and organization names) from the knowledge text data with the large language model CoT, taking the identified entities as graph nodes, and assigning each entity a unique identifier;
Relation extraction: identifying relations between entities from the textual context, e.g. "A is the founder of B", and defining the relation type;
Attribute extraction: extracting the entities' attribute information from the text, such as a person's birthday or a place's longitude and latitude, and adding information such as description, category, and attribute values to the entities.
In this embodiment, a NoSQL database offers high-performance reads and writes and supports multiple data models, so it can efficiently handle the various types of textual knowledge in the knowledge graph and serve dynamic information; a NoSQL database is therefore selected for storage.
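Storing such a graph in a document-oriented NoSQL database reduces to serializing nodes and edges as JSON documents. The document shapes below are illustrative, and persisting them with a real client (e.g. pymongo against a MongoDB instance) is left as an assumption:

```python
import json


def to_documents(graph):
    """Flatten a {'nodes': ..., 'edges': ...} graph into JSON-safe documents
    of the kind a document database would store."""
    docs = [{"_id": nid, "kind": "node", **attrs} for nid, attrs in graph["nodes"].items()]
    docs += [{"kind": "edge", "src": s, "rel": r, "dst": d} for s, r, d in graph["edges"]]
    return [json.loads(json.dumps(d)) for d in docs]  # round-trip guarantees JSON-safety
```

Keeping nodes and edges as separate document kinds lets the dynamic applications mentioned above query either independently.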
The knowledge graph construction method based on a large language model has the following advantages:
1) By exploiting the strong natural language processing capability of the large language model CoT, knowledge text data can be converted into knowledge graph form, so that knowledge is better understood and organized. The method promises a stronger tool for knowledge management and application in many fields, can quickly and efficiently generate high-capacity, wide-coverage knowledge graphs, and shortens the construction cycle;
2) Clustering and distributed storage of knowledge text data allow massive knowledge text data to be stored by cluster, while the associated information of each knowledge entity extracted by the large language model CoT is stored in distributed fashion. This overcomes the inability of the prior art to handle the language processing and storage of massive knowledge samples, resolves the storage and data-processing flow for knowledge samples, accommodates the storage and retrieval of associated information for massive knowledge samples, and quickly adapts to knowledge graph construction over large-scale text types;
The knowledge graph construction method based on a large language model provided by this embodiment can be customized to specific requirements and application scenarios, helping organizations better organize and understand large amounts of information and knowledge, and supporting various intelligent applications.
The invention also provides a specific implementation of a knowledge graph construction system based on a large language model. Because this system corresponds to the specific implementation of the knowledge graph construction method described above, and can achieve the purpose of the invention by executing the flow steps of that method, the explanations given for the method implementation also apply to the system implementation provided by the invention and are not repeated below.
The embodiment of the invention also provides a knowledge graph construction system based on the large language model, which comprises the following steps:
the acquisition module is used for acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
the clustering module is used for performing text clustering on the preprocessed knowledge text data to obtain a plurality of knowledge text data sets T with different text types; wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, …};
The first storage module is used for carrying out distributed file storage on the knowledge text data set T; the first storage module is a first HDFS, which represents a Hadoop distributed file system;
the recognition module is used for orderly extracting the corresponding knowledge text types from the first storage module according to the length of the knowledge text types, and carrying out knowledge entity recognition on the extracted knowledge text types by adopting a preset large language model CoT to obtain the associated information of each knowledge entity;
the second storage module is used for carrying out distributed file storage on the associated information of each knowledge entity; the second storage module is a second HDFS;
and the construction module is used for constructing graph node links among the knowledge entities according to the associated information of the knowledge entities stored in the second storage module to obtain a knowledge graph.
The knowledge graph construction system based on the large language model provided by the embodiment of the invention has at least the following beneficial effects:
1) By exploiting the strong natural language processing capability of the large language model CoT, knowledge text data can be converted into knowledge graph form, so that knowledge is better understood and organized. The system promises a stronger tool for knowledge management and application in many fields, can quickly and efficiently generate high-capacity, wide-coverage knowledge graphs, and shortens the construction cycle;
2) Clustering and distributed storage of knowledge text data allow massive knowledge text data to be stored by cluster, while the associated information of each knowledge entity extracted by the large language model CoT is stored in distributed fashion. This overcomes the inability of the prior art to handle the language processing and storage of massive knowledge samples, resolves the storage and data-processing flow for knowledge samples, accommodates the storage and retrieval of associated information for massive knowledge samples, and quickly adapts to knowledge graph construction over large-scale text types;
The knowledge graph construction system based on a large language model provided by this embodiment can be customized to specific requirements and application scenarios, helping organizations better organize and understand large amounts of information and knowledge, and supporting various intelligent applications.
Fig. 4 is a schematic structural diagram of an electronic device 600 according to an embodiment of the present invention. The electronic device 600 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) 601 and one or more memories 602, where the memory 602 stores at least one instruction that is loaded and executed by the processor 601 to implement the above knowledge graph construction method based on a large language model.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the above knowledge graph construction method based on a large language model. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device comprising that element.
References in the specification to "one embodiment," "an example embodiment," "some embodiments," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the relevant art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
The invention is intended to cover any alternatives, modifications, equivalents, and variations that fall within its spirit and scope. In the preceding description of preferred embodiments of the invention, specific details are set forth to provide a thorough understanding of the invention; however, the invention can be fully understood by those skilled in the art without some of these details. In other instances, well-known methods, procedures, flows, components, circuits, and the like have not been described in detail so as not to unnecessarily obscure aspects of the present invention.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the embodiments described above may be implemented by a program that instructs associated hardware, and the program may be stored on a computer readable storage medium, such as: ROM/RAM, magnetic disks, optical disks, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (9)

1. The knowledge graph construction method based on the large language model is characterized by comprising the following steps of:
acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
carrying out text clustering on the preprocessed knowledge text data to obtain a plurality of knowledge text data sets T with different text types; wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, …};
Submitting the knowledge text data set T to a first HDFS, and storing distributed files; wherein, HDFS represents a Hadoop distributed file system;
according to the length of the knowledge text types, sequentially extracting the corresponding knowledge text types from the first HDFS, and carrying out knowledge entity identification on the extracted knowledge text types by adopting a preset large language model CoT to obtain the associated information of each knowledge entity;
submitting the associated information of each knowledge entity to a second HDFS, and storing distributed files;
submitting the associated information of each knowledge entity to a knowledge graph construction module, and constructing graph node links among the knowledge entities by the knowledge graph construction module according to the associated information of each knowledge entity stored in the second HDFS to obtain a knowledge graph.
2. The knowledge graph construction method based on the large language model according to claim 1, wherein the text clustering the preprocessed knowledge text data to obtain a plurality of knowledge text data sets T of different text types includes:
constructing a support vector machine, and deploying the support vector machine on a background server;
the preprocessed knowledge text data is sent to the background server to serve as a text clustering sample, and the background server forwards the text clustering sample to the support vector machine to perform text clustering;
the support vector machine performs text structure recognition and clustering processing on the samples by using a support vector clustering algorithm to obtain a plurality of knowledge text types with different text types and outputs the knowledge text types;
and the background server gathers the knowledge text types of a plurality of different text types to obtain the knowledge text data set T.
3. The knowledge graph construction method based on a large language model according to claim 1, wherein submitting the knowledge text data set T to a first HDFS for distributed file storage comprises:
calculating the text type length of each item of the knowledge text type in the knowledge text data set T, and marking the calculated length value on each item of the knowledge text type;
sequentially arranging the knowledge text types of each item in the knowledge text data set T according to the sequence from large to small by the length values, and rearranging the knowledge text data set T;
traversing all storage nodes of a first HDFS, checking available storage nodes, and sequentially storing all knowledge text types in the rearranged knowledge text data set T in the storage nodes of the first HDFS according to a rearrangement sequence;
and sending the storage addresses of the knowledge text data blocks to a background server.
4. The knowledge graph construction method based on a large language model according to claim 3, wherein the sequentially extracting the corresponding knowledge text types from the first HDFS according to the lengths of the knowledge text types comprises:
and sequentially and orderly retrieving each item of knowledge text type from the rearranged knowledge text data set T according to the length value of the knowledge text type, and sending the knowledge text type to the large language model CoT.
5. The knowledge graph construction method based on the large language model according to claim 1, wherein the large language model CoT construction step includes:
acquiring training data for training a large language model CoT, wherein the training data comprises text data of different text types/structures;
selecting a GPT natural language processing model, and learning and training a knowledge entity, an association relation of the knowledge entity and an attribute of the knowledge entity in the training data;
when training reaches a preset optimization iteration training condition, stopping training, and generating the large language model CoT;
testing the large language model CoT by using the obtained test set, and judging whether the prediction accuracy of the large language model CoT meets the standard;
and after reaching the standard, optimally training the large language model CoT by using a real-time knowledge text sample, and deploying the large language model CoT on a background server after the optimization training is finished.
6. The knowledge graph construction method based on a large language model according to claim 1, wherein the knowledge entity identification is performed on the extracted knowledge text type by using a preset large language model CoT, and obtaining the associated information of each knowledge entity comprises:
inputting the knowledge text types which are sequentially called into the large language model CoT;
carrying out knowledge entity identification on the knowledge text type by using the large language model CoT to obtain each knowledge entity contained in the knowledge text type;
extracting the association relation between the knowledge entities according to the context of the knowledge entities, and extracting the attribute information of the knowledge entities;
and outputting the association information formed by the association relation and the attribute information of each knowledge entity.
7. The knowledge graph construction method based on a large language model according to claim 1, wherein after knowledge entity identification is performed on the extracted knowledge text type by using a preset large language model CoT, and associated information of each knowledge entity is obtained, the method further comprises:
and carrying out knowledge entity identification on the knowledge text type by using a graph neural network GNN to obtain the associated information of each knowledge entity, and carrying out contrast verification and result correction on the associated information obtained by using the graph neural network GNN and the associated information obtained by using the large language model CoT.
8. The knowledge graph construction method based on the large language model according to claim 1, wherein submitting the associated information of each knowledge entity to a knowledge graph construction module, the knowledge graph construction module constructing graph node links between each knowledge entity according to the associated information of each knowledge entity stored in the second HDFS, and obtaining a knowledge graph comprises:
a knowledge graph construction tool TopBraid Composer is deployed in advance on the knowledge graph construction module;
transmitting the association information of the knowledge entities to the TopBraid Composer, and reading the association relation between the knowledge entities and the attribute information of the knowledge entities in the association information by the TopBraid Composer;
distributing corresponding map nodes for the knowledge entities, and establishing map links among the map nodes according to the read association relation among the knowledge entities and the attribute information of the knowledge entities to obtain a knowledge map;
and storing the knowledge graph into a dynamic database Nosql for supporting dynamic application of the knowledge graph.
9. A knowledge graph construction system based on a large language model, comprising:
the acquisition module is used for acquiring knowledge text data for constructing a knowledge graph and preprocessing the knowledge text data;
the clustering module is used for carrying out text clustering on the preprocessed knowledge text data to obtain a plurality of knowledge text data sets T with different text types; wherein T = {knowledge text type 1, knowledge text type 2, knowledge text type 3, …};
The first storage module is used for carrying out distributed file storage on the knowledge text data set T;
the recognition module is used for orderly extracting the corresponding knowledge text types from the first storage module according to the length of the knowledge text types, and carrying out knowledge entity recognition on the extracted knowledge text types by adopting a preset large language model CoT to obtain the associated information of each knowledge entity;
the second storage module is used for carrying out distributed file storage on the associated information of each knowledge entity;
and the construction module is used for constructing graph node links among the knowledge entities according to the associated information of the knowledge entities stored in the second storage module to obtain a knowledge graph.
CN202311423122.7A 2023-10-31 2023-10-31 Knowledge graph construction method and system based on large language model Active CN117150050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311423122.7A CN117150050B (en) 2023-10-31 2023-10-31 Knowledge graph construction method and system based on large language model

Publications (2)

Publication Number Publication Date
CN117150050A true CN117150050A (en) 2023-12-01
CN117150050B CN117150050B (en) 2024-01-26

Family

ID=88912455


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117391192A (en) * 2023-12-08 2024-01-12 杭州悦数科技有限公司 Method and device for constructing knowledge graph from PDF by using LLM based on graph database
CN117494806A (en) * 2023-12-28 2024-02-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Relation extraction method, system and medium based on knowledge graph and large language model
CN117592562A (en) * 2024-01-18 2024-02-23 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106168965A (en) * 2016-07-01 2016-11-30 竹间智能科技(上海)有限公司 Knowledge mapping constructing system
CN106815307A (en) * 2016-12-16 2017-06-09 中国科学院自动化研究所 Public Culture knowledge mapping platform and its use method
CN107480125A (en) * 2017-07-05 2017-12-15 重庆邮电大学 A kind of relational links method of knowledge based collection of illustrative plates
CN110162639A (en) * 2019-04-16 2019-08-23 深圳壹账通智能科技有限公司 Knowledge figure knows the method, apparatus, equipment and storage medium of meaning
CN113268609A (en) * 2021-06-22 2021-08-17 中国平安人寿保险股份有限公司 Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN113468107A (en) * 2021-09-02 2021-10-01 阿里云计算有限公司 Data processing method, device, storage medium and system
CN114399006A (en) * 2022-03-24 2022-04-26 山东省计算中心(国家超级计算济南中心) Multi-source abnormal composition image data fusion method and system based on super-calculation
WO2022116417A1 (en) * 2020-12-03 2022-06-09 平安科技(深圳)有限公司 Triple information extraction method, apparatus, and device, and computer-readable storage medium
CN115344712A (en) * 2022-08-17 2022-11-15 河北工业大学 Carbon standard knowledge graph construction method based on fusion text
WO2023004807A1 (en) * 2021-07-30 2023-02-02 西门子股份公司 Knowledge management system, method and apparatus, electronic device, and storage medium
WO2023093355A1 (en) * 2021-11-25 2023-06-01 支付宝(杭州)信息技术有限公司 Data fusion method and apparatus for distributed graph learning
CN116795973A (en) * 2023-08-16 2023-09-22 腾讯科技(深圳)有限公司 Text processing method and device based on artificial intelligence, electronic equipment and medium
CN116932733A (en) * 2023-04-07 2023-10-24 北京百度网讯科技有限公司 Information recommendation method and related device based on large language model
CN116955857A (en) * 2022-11-16 2023-10-27 腾讯科技(深圳)有限公司 Data processing method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant