CN117454884A - Method, system, electronic device and storage medium for correcting historical character information - Google Patents


Info

Publication number: CN117454884A (application CN202311760431.3A)
Granted publication: CN117454884B
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 杨子昭
Current and original assignee: Shanghai Mido Technology Co ltd
Priority: CN202311760431.3A, filed by Shanghai Mido Technology Co ltd
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)

Classifications

    • G06F 40/232 — Handling natural language data; natural language analysis; orthographic correction, e.g. spell checking or vowelisation
    • G06F 16/367 — Information retrieval; creation of semantic tools, e.g. ontology or thesauri; ontology
    • G06F 18/21 — Pattern recognition; design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 40/279 — Natural language analysis; recognition of textual entities


Abstract

The application provides a method, a system, an electronic device and a storage medium for correcting historical figure information, wherein the method comprises the following steps: identifying text to be corrected with a pre-trained relation extraction model to obtain a model recognition result; judging whether the model recognition result includes triple information; if so, inputting the model recognition result into a pre-constructed knowledge graph and correcting it with the knowledge graph; otherwise, correcting the text to be corrected based on LangChain, a large language model and a knowledge graph library. The method adopts automation, reducing the time and workload of manual review and improving error-correction efficiency. Error correction based on LangChain, a large language model and a knowledge graph library combines domain knowledge with context information, improving the pertinence, accuracy and reliability of correction. The knowledge graph introduces additional context information and associated knowledge, improving the ability to correct historical-figure text that lacks structural information and avoiding large numbers of false positives and false negatives.

Description

Method, system, electronic device and storage medium for correcting historical character information
Technical Field
The application belongs to the technical field of natural language processing, and particularly relates to a method, a system, an electronic device and a storage medium for correcting historical figure information.
Background
Historical figures are important objects of historical research and cultural inheritance, and accurate information about them is critical to understanding historical events and the influence of those figures. However, owing to the complexity and diversity of historical literature, historical figure information contains many errors and inaccuracies. These errors may stem from faulty records in historical documents, the misleading influence of received ideas, or the negligence of editors. Erroneous historical figure information negatively affects academic research, education and cultural inheritance, and reduces the accuracy and credibility of the related fields. Correcting historical figure information is therefore necessary: it improves the accuracy of historical research and protects the integrity of historical and cultural heritage.
However, existing methods for correcting historical figure information have the following technical defects: 1) manual review is time-consuming and labour-intensive, and prone to omissions and errors; 2) large-scale collection and collation of historical figure information requires a great deal of time, manpower and material resources, and is likewise prone to omissions and errors; 3) general-purpose models cannot accurately identify historical figure information and must be optimised with domain knowledge from the relevant fields; 4) context-based error correction is prone to failure when relevant structural information is absent.
Therefore, how to provide a method, a system, an electronic device and a storage medium that improve the efficiency and accuracy of correcting historical figure information is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a method, a system, an electronic device and a storage medium for correcting historical figure information, so as to improve the efficiency and accuracy of such correction.
In a first aspect, the present application provides a method for correcting historical figure information, including: identifying text to be corrected with a pre-trained relation extraction model to obtain a model recognition result; judging whether the model recognition result includes triple information; if so, inputting the model recognition result into a pre-constructed knowledge graph and correcting it with the knowledge graph; otherwise, correcting the text to be corrected based on LangChain, a large language model and a knowledge graph library.
In one implementation of the first aspect, the training method of the relation extraction model includes: collecting first labelled samples related to historical figures; training a few-shot learning model with the first labelled samples to learn the relations among the entities they contain; predicting unlabelled samples containing unknown entities and entity relations with the trained few-shot learning model to obtain second labelled samples; manually correcting the prediction errors in the second labelled samples to obtain corrected second labelled samples; determining the split ratio between training and test sets; randomly assigning the corrected second labelled samples to a training data set and a test data set according to the split ratio; training the relation extraction model with the training data set; evaluating the training effect of the relation extraction model with the test data set; and saving the relation extraction model once it reaches the desired training effect.
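The random train/test split in the training method above can be sketched as follows (a minimal pure-Python illustration; the function name, the 80/20 default ratio and the fixed seed are assumptions, not specified by the patent):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=42):
    # Shuffle a copy so the original list of corrected labelled samples is
    # untouched, then cut it into training and test sets according to the
    # chosen split ratio.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 100 corrected second labelled samples and the default ratio, this yields 80 training and 20 test samples; the seed only makes the sketch reproducible.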
In one implementation of the first aspect, the relation extraction model identifies the text to be corrected with a pipeline method, and obtaining the model recognition result includes: identifying the subject in the text to be corrected; identifying the object in the text to be corrected; constructing an entity pair from the identified subject and object; classifying the entity pair to determine the relation type between subject and object; and constructing a triple from the subject, the object and the relation type.
In one implementation of the first aspect, the relation extraction model identifies the text to be corrected with a parameter-sharing joint extraction method, and obtaining the model recognition result includes: identifying the subject in the text to be corrected based on a first parameter, and calculating the loss produced by subject extraction; identifying the object in the text to be corrected based on a second parameter, and calculating the loss produced by object extraction; determining the relation type between subject and object based on a third parameter, and calculating the loss produced by relation extraction; calculating a joint loss from the subject-extraction, object-extraction and relation-extraction losses; judging whether the joint loss reaches a preset threshold; if so, constructing a triple from the subject, the object and the relation type; otherwise, updating the first, second and third parameters based on the joint loss and repeating the entity identification, relation determination and loss calculation until the joint loss reaches the preset threshold.
In one implementation of the first aspect, the relation extraction model identifies the text to be corrected with a joint-decoding joint extraction method, and obtaining the model recognition result includes: synchronously identifying the subject, the object and the relation type in the text to be corrected; and constructing a triple from the subject, the object and the relation type.
In one implementation of the first aspect, correcting the model recognition result with the knowledge graph includes: acquiring the triples in the model recognition result; querying the knowledge graph for information related to the triples to obtain a knowledge graph query result; comparing the knowledge graph query result with the model recognition result to judge whether the model recognition result is accurate; and performing entity correction or relation correction on inaccurate model recognition results and outputting the corrected model recognition result.
In one implementation of the first aspect, correcting the text to be corrected based on LangChain, a large language model and a knowledge graph library includes: extracting first text vectors from the knowledge graph library with LangChain, and building a vector store from the first text vectors; extracting a second text vector from the text to be corrected; retrieving from the vector store the first text vectors similar to the second text vector to obtain similar text vectors; splicing the second text vector and the similar text vectors to obtain a prompt; inputting the prompt into the large language model; and processing the prompt with the large language model to obtain the corrected text.
In a second aspect, the present application provides a system for correcting historical figure information, comprising: a recognition module for identifying text to be corrected with a pre-trained relation extraction model to obtain a model recognition result; a judgment module for judging whether the model recognition result includes triple information; a first correction module for inputting the model recognition result into a pre-constructed knowledge graph and correcting it with the knowledge graph when the result includes triple information; and a second correction module for correcting the text to be corrected based on LangChain, a large language model and a knowledge graph library when the result does not include triple information.
In a third aspect, the present application provides an electronic device, comprising: a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method of any one of the above.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, performing any of the methods described above.
As described above, the method, system, electronic device and storage medium for correcting historical figure information described in the present application have the following beneficial effects:
(1) By adopting automation, the time and workload of manual review are reduced and error-correction efficiency is improved.
(2) Error correction based on LangChain, a large language model and a knowledge graph library combines domain knowledge with context information, improving the pertinence, accuracy and reliability of historical-figure error correction.
(3) A few-shot learning model can quickly organise unstructured information into structured information, so that large amounts of data related to historical figures can be collected and collated efficiently, reducing time and labour costs and improving collation accuracy.
(4) The knowledge graph introduces additional context information and associated knowledge, improving the ability to correct historical-figure text that lacks structural information and avoiding large numbers of false positives and false negatives.
Drawings
FIG. 1 is a flow chart illustrating a method for correcting historical personage information according to an embodiment of the present application.
FIG. 2 is a flow chart of a method for correcting historical personage information according to another embodiment of the present application.
FIG. 3 is a training flowchart of a relation extraction model in one embodiment of a method for correcting historical persona information according to the present application.
FIG. 4 is a schematic diagram of an embodiment of a system for correcting historical personage information according to the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Description of element reference numerals
10. Identification module
20. Judgment module
30. First correction module
40. Second correction module
50. Processor
60. Memory
Description of the embodiments
Other advantages and effects of the present application will readily become apparent to those skilled in the art from this disclosure and the following description of the embodiments taken in conjunction with the accompanying drawings. The present application may also be embodied or carried out in other specific embodiments, and the details herein may be modified or changed from various points of view and for various applications without departing from the spirit of the present application. It should be noted that the following embodiments, and the features within them, may be combined with one another provided there is no conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application. The drawings show only the components related to the application, rather than the number, shape and size of the components in an actual implementation; the form, number and proportion of the components in an actual implementation may vary arbitrarily, and the component layout may be more complex.
In addition, descriptions such as "first" and "second" are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features concerned. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. The technical solutions of the embodiments may be combined with each other, but only where the combination can be realised by those skilled in the art; when technical solutions are contradictory or cannot be realised, their combination should be regarded as nonexistent and outside the protection scope of the present application.
The following embodiments of the present application provide a method, a system, an electronic device and a storage medium for correcting historical figure information, applicable to historical academic research, education, online encyclopaedias and similar platforms. For example, in historical research, historians and researchers can use the technique to correct errors in historical figure information, improving the accuracy and reliability of their research. In education, students and teachers can use it to learn and teach correct information about historical figures. On online encyclopaedias and similar platforms, the technique can help administrators edit and manage historical figure information and provide more accurate knowledge content. By adopting automation, the application reduces the time and workload of manual review and improves error-correction efficiency. The technical solutions in the embodiments of the present application are described in detail below with reference to the drawings.
Referring to fig. 1, a flowchart of a method for correcting historical personage information according to an embodiment of the present application is shown.
As shown in fig. 1, the present embodiment provides a method for correcting historical figure information, which includes the following steps S100 to S400.
And step S100, a pre-trained relation extraction model is applied to identify the text to be corrected, and a model identification result is obtained.
In this embodiment, the relation extraction model is used to extract specific event or fact information, generally comprising entities and relations, from the text to be corrected. For example, a relation extraction model may extract times, places and key figures from news, or product names, development dates and performance indicators from technical documents.
Specifically, the relation extraction model adopted in the application is a BERT (Bidirectional Encoder Representations from Transformers) model. BERT is a pre-trained language representation model. Rather than pre-training with a traditional unidirectional language model, or shallowly concatenating two unidirectional language models, it uses a masked language model (Masked Language Model, MLM) to generate deep bidirectional language representations. The model has the following main advantages:
(1) MLM pre-training of a bidirectional Transformer generates deep bidirectional language representations;
(2) After pre-training, adding just one extra output layer and fine-tuning yields state-of-the-art performance on a variety of downstream tasks, without task-specific structural modifications to BERT.
Earlier pre-trained models were limited by unidirectional language models (left-to-right or right-to-left), which restricts their representational capacity: they can capture context in only one direction. BERT pre-trains with MLM and builds the whole model from deep bidirectional Transformer components (a unidirectional Transformer is usually the Transformer decoder, in which each token can attend only to the tokens on its left; a bidirectional Transformer is the Transformer encoder, in which each token can attend to all tokens), and can therefore generate deep bidirectional language representations that fuse left and right context.
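The MLM idea described above can be illustrated with a minimal pure-Python sketch (this is not BERT itself; the function is a hypothetical helper): a fraction of tokens is replaced by a [MASK] symbol, and the model is trained to recover them from context on both sides.

```python
def mask_tokens(tokens, positions, mask="[MASK]"):
    # Replace the tokens at the given positions with a mask symbol, returning
    # the masked sequence and the original targets the model must predict.
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = mask
    return masked, targets

# During MLM pre-training the model sees `masked` and predicts `targets`,
# using the tokens on BOTH sides of each masked position.
```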
In one embodiment of the present application, the relation extraction model identifies the text to be corrected with a pipeline method, and obtaining the model recognition result includes steps S111 to S115.
Step S111, identifying the subject in the text to be corrected.
Step S112, identifying the object in the text to be corrected.
Step S113, constructing an entity pair from the identified subject and object.
Step S114, classifying the relation of the entity pair to determine the relation type between subject and object.
Step S115, constructing an SPO triple from the subject, the object and the relation type.
For example, from the text to be corrected "Zhang San, born in Shanghai on 13 July 1983, is a 110 m hurdler on the Chinese men's track and field team", the relation extraction model may extract the triple instance (Zhang San, birthplace, Shanghai), where "Zhang San" is the subject, "Shanghai" is the object, and "birthplace" is the relation between "Zhang San" and "Shanghai".
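Steps S111 to S115 can be sketched as follows (a hedged pure-Python illustration: `classify` stands in for the relation classifier of step S114 and is an assumption, not part of the patent):

```python
def pipeline_extract(subjects, objects, classify):
    # Pipeline method: subjects and objects are identified first, then every
    # subject-object pair is classified; pairs whose relation type is
    # recognised become SPO triples.
    triples = []
    for s in subjects:
        for o in objects:
            rel = classify(s, o)
            if rel is not None:
                triples.append((s, rel, o))
    return triples
```

For the "Zhang San" example above, a classifier that maps the pair (Zhang San, Shanghai) to "birthplace" yields the single triple (Zhang San, birthplace, Shanghai).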
In another embodiment of the present application, the relation extraction model identifies the text to be corrected with a parameter-sharing joint extraction method, and obtaining the model recognition result includes:
Step S121, identifying the subject in the text to be corrected based on the first parameter, and calculating the loss produced by subject extraction.
Step S122, identifying the object in the text to be corrected based on the second parameter, and calculating the loss produced by object extraction.
Step S123, determining the relation type between subject and object based on the third parameter, and calculating the loss produced by relation extraction.
Step S124, calculating a joint loss from the subject-extraction, object-extraction and relation-extraction losses.
Step S125, judging whether the joint loss reaches a preset threshold.
Step S126, if so, constructing an SPO triple from the subject, the object and the relation type.
Step S127, otherwise, updating the first, second and third parameters based on the joint loss, and repeating the entity identification, relation determination and loss calculation until the joint loss reaches the preset threshold.
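Steps S124, S125 and S127 can be sketched as follows (equal loss weights and the iteration bound are assumptions; the patent states only that a joint loss is computed from the three task losses and compared against a threshold):

```python
def joint_loss(subject_loss, object_loss, relation_loss, weights=(1.0, 1.0, 1.0)):
    # Joint loss of the parameter-sharing method: a weighted sum of the losses
    # produced by subject extraction, object extraction and relation extraction.
    ws, wo, wr = weights
    return ws * subject_loss + wo * object_loss + wr * relation_loss

def train_until_threshold(training_step, threshold, max_iters=1000):
    # Repeat parameter updates (step S127) until the joint loss reaches the
    # preset threshold (step S125); max_iters is a safety bound.
    loss = float("inf")
    for _ in range(max_iters):
        loss = training_step()
        if loss <= threshold:
            break
    return loss
```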
In yet another embodiment, the relation extraction model identifies the text to be corrected with a joint-decoding joint extraction method, and obtaining the model recognition result includes:
Step S131, synchronously identifying the subject, the object and the relation type in the text to be corrected.
Step S132, constructing an SPO triple from the subject, the object and the relation type.
For example, from the text to be corrected "Li A, courtesy name X, a native of Kaifeng (present-day Henan)", the relation extraction model may extract two triple instances, (Li A, courtesy name, X) and (Li A, native place, Kaifeng), where "Li A" is the subject, "X" and "Kaifeng" are objects, and "courtesy name" and "native place" are the relation types between the subject and the objects.
It should be noted that the object-extraction and relation-extraction processes above are optional steps, because not every model recognition result is a complete (subject, relation, object) triple. Sometimes the text to be corrected is itself incomplete — for example, it contains only subject features — in which case the model can ultimately recognise only the subject; the object features and relation type cannot be determined, and a complete triple cannot be constructed.
Step S200, judging whether the model recognition result includes triple information.
Specifically, the triple information includes the subject, the object, and the relation type between them.
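The judgment of step S200 can be sketched as follows (the field names are illustrative, not from the patent):

```python
def has_triple_information(result):
    # Triple information is present only when subject, relation type and
    # object are all non-empty in the model recognition result.
    return all(result.get(key) for key in ("subject", "relation", "object"))
```

A result in which only subject features were recognised routes the text to step S400 instead of step S300.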
Step S300, if the model recognition result includes triple information, inputting the model recognition result into a pre-constructed knowledge graph, and correcting it with the knowledge graph.
A knowledge graph is a structured data model for representing and organising knowledge: by modelling entities, attributes and relations in graph form, it builds a large knowledge network of nodes (points) and edges. Here the knowledge graph describes historical figures' names, life experiences, achievements, family relations and other information, and provides efficient query and reasoning capabilities. Structuring and correlating historical figure information makes the relations and contexts between historical figures easier to understand.
In one embodiment, constructing the knowledge graph includes knowledge extraction, knowledge fusion and quality control. Entity extraction and relation extraction are typical forms of knowledge extraction: in unstructured knowledge extraction, entity extraction identifies the entities of the business target from text, and relation extraction obtains the semantic or logical relations between two entities. Because knowledge is extracted from diverse sources and differs between them, knowledge fusion is required, including entity alignment, attribute fusion and attribute-value normalisation. In addition, quality control is performed on the knowledge graph: missing, erroneous and outdated knowledge is completed, corrected and updated.
In one embodiment, correcting the model recognition result with the knowledge graph includes:
Step S301, acquiring the triples in the model recognition result.
Step S302, querying the knowledge graph for information related to the triples to obtain a knowledge graph query result.
Step S303, comparing the knowledge graph query result with the model recognition result to judge whether the model recognition result is accurate.
Step S304, performing entity correction or relation correction on inaccurate model recognition results, and outputting the corrected model recognition result.
Take the above text to be corrected, "Li A, courtesy name X, a native of Kaifeng (present-day Henan)", as an example. If correction is performed by rules and keywords alone, "Li A" is easily mis-reported as a different name that is similar in sound and form, even though information about Li A actually exists and Li A really is a native of Kaifeng, Henan — producing a false positive.
In this embodiment, the two triples (Li A, courtesy name, X) and (Li A, native place, Kaifeng) extracted from that text are input into the knowledge graph, and the knowledge graph can easily determine, through query, comparison and similar operations, whether the entity "Li A" is accurate or should be another entity, avoiding the false positive above.
Specifically, comparing the knowledge graph query result with the model recognition result includes: checking whether the entity in the model recognition result exists in the knowledge graph; if it exists, the entity recognition can be confirmed as accurate; if not, the entity may be added to the knowledge graph or marked as an unknown entity.
In another embodiment, comparing the knowledge graph query result with the model recognition result includes: checking whether the relation type in the model recognition result exists in the knowledge graph; if it exists, the relation-type recognition can be confirmed as accurate; if not, the relation may be added to the knowledge graph or marked as an unknown relation.
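Steps S301 to S304 and the two comparison variants above can be sketched together as follows (representing the knowledge graph as a nested dict is a deliberate simplification for illustration; a real system would query a graph database):

```python
def check_triple(kg, triple):
    # kg maps subject -> {relation type -> object}. The extracted triple is
    # confirmed, corrected, or flagged as containing an unknown entity or
    # an unknown relation.
    subject, relation, obj = triple
    facts = kg.get(subject)
    if facts is None:
        return ("unknown_entity", triple)
    expected = facts.get(relation)
    if expected is None:
        return ("unknown_relation", triple)
    if expected == obj:
        return ("confirmed", triple)
    return ("corrected", (subject, relation, expected))
```

An inaccurate object is replaced by the value found in the graph (entity/relation correction, step S304), while unknown entities or relations are surfaced for addition to the graph.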
Further, the corrected model recognition result can be used for querying, reasoning, supplementing the knowledge graph, and so on.
It should be noted that constructing and correcting the knowledge graph is a continuous process that must be kept up to date and refined, and the accuracy of correction also depends on the quality and completeness of the knowledge graph. Therefore, before the knowledge graph is used to correct the model recognition result, effective preprocessing such as data cleaning, entity recognition and relation extraction is required to improve the accuracy and reliability of the correction result.
In this implementation, the knowledge graph introduces additional context information and associated knowledge, improving the ability to correct historical-figure text that lacks structural information and avoiding large numbers of false positives and false negatives.
In other embodiments, database-lookup correction may also be performed based on mondab.
And step 400, correcting the text to be corrected based on Langchain, a large language model and a knowledge graph base if the model identification result does not include triplet information.
LangChain is an open-source application development framework that currently supports two programming languages, Python and TypeScript. It gives large language models (LLMs) two core capabilities: data awareness, connecting the language model with other data sources; and agent capability, allowing the language model to interact with its environment. The main application scenarios of LangChain include personal assistants, document-based question answering, chatbots, querying tabular data, code analysis, and the like.
The large language model adopted in this application is ChatGLM-6B. Specifically, ChatGLM-6B is an open-source conversational language model supporting Chinese-English bilingual dialogue, with 6.2 billion parameters, based on the General Language Model (GLM) architecture. Combined with model quantization techniques, users can deploy it locally on consumer-grade graphics cards (at the INT4 quantization level, as little as 6 GB of video memory is required). ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese question answering and dialogue. After training on about 1T tokens of Chinese-English bilingual data, supplemented by techniques such as supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B can generate answers that are well aligned with human preferences.
The knowledge-graph library (Knowledge Graph Repository) is a collection for storing and managing knowledge-graph data. It is a database containing a large number of entities, relationships and attributes, used to support knowledge-graph construction and querying. Knowledge-graph libraries typically provide a set of tools and interfaces that allow users to retrieve, update, and query knowledge-graph data. The knowledge-graph library adopted in this embodiment corresponds to the knowledge graph constructed in step S300.
Referring to fig. 2, a flowchart of a method for correcting historical personage information according to another embodiment of the present application is shown.
As shown in fig. 2, in an embodiment of the present application, correcting the text to be corrected based on LangChain, a large language model and a knowledge-graph library includes:
Step S401, extracting a first text vector from the knowledge-graph library by using LangChain, and establishing a vector storage library based on the first text vector; extracting a second text vector from the text to be corrected; obtaining the first text vector similar to the second text vector from the vector storage library to obtain a similar text vector; and splicing the second text vector and the similar text vector to obtain a prompt word (prompt).
Specifically, extracting a first text vector from the knowledge-graph library and building a vector store based on the first text vector comprises: traversing and reading nodes and edges in the knowledge graph to obtain text information associated with the entity, the attribute and the relation; preprocessing the extracted text information to obtain preprocessed text information; vectorizing the preprocessed text information to obtain the first text vector; the first text vector is stored to build the vector store.
For example, the first text vector extracted from the knowledge-graph library may be the label of an entity, the description of a relationship, the value of an attribute, and the like. The preprocessing steps include punctuation removal, conversion to lowercase, word segmentation, and the like. When segmenting the extracted text, a word segmentation algorithm, part-of-speech tagging and the like can be adopted to split each text into words or phrases. Common text vectorization methods include the bag-of-words model (Bag of Words), the TF-IDF (Term Frequency-Inverse Document Frequency) model, and the Word2Vec model; these methods convert text into vectors, facilitating the capture of semantic and contextual information. When storing the first text vector, a suitable database or data structure may be selected, such as a relational database, a graph database, or a distributed file system.
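The pipeline above (extract texts from the graph, preprocess, vectorize, store) can be sketched with a bag-of-words vectorizer. The sample texts, English tokenization, and the plain-list "store" are all assumptions for illustration; the application itself may use Word2Vec or TF-IDF and a dedicated store.

```python
import re
from collections import Counter

# Illustrative texts as they might be read from a knowledge-graph library
# (entity labels, relation descriptions, attribute values) -- assumed data.
kg_texts = [
    "Li Bai was a Tang Dynasty poet.",
    "Du Fu was a Tang Dynasty poet and friend of Li Bai.",
]

def preprocess(text: str) -> list:
    """Remove punctuation, convert to lowercase, and tokenize."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def to_bow(tokens: list, vocab: list) -> list:
    """Bag-of-words vector of token counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Build the vocabulary, then the vector store (a plain list here; a real
# system might use a relational, graph, or distributed store instead).
vocab = sorted({w for t in kg_texts for w in preprocess(t)})
vector_store = [to_bow(preprocess(t), vocab) for t in kg_texts]
```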
Extracting a second text vector from the text to be corrected comprises: segmenting the text to be corrected by using a word segmentation algorithm to obtain a word segmentation result; performing duplication elimination on the word segmentation result, and constructing a vocabulary based on the duplication elimination word segmentation result; converting the vocabulary into word vectors using a pre-trained word vector model; and carrying out normalization processing on the word vector to obtain the second text vector.
For example, existing Chinese word segmentation tools, such as the jieba library or THULAC, may be used to segment the text to be corrected. The pre-trained word vector model may be a bag-of-words model (Bag of Words) or a TF-IDF model: the bag-of-words model counts the frequency of each word in the text to be corrected to form a vector whose dimension equals the vocabulary size, with each dimension corresponding to one word and its value being that word's frequency in the text; the TF-IDF model computes the TF-IDF value of each word to form a vector of the same shape, with each value being the word's TF-IDF score. Common normalization methods include min-max normalization and Z-score normalization, which scale the vector values to a fixed range such as [0, 1] or [-1, 1].
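A minimal sketch of the second-vector pipeline follows. Whitespace tokenization stands in for a Chinese segmenter such as jieba or THULAC, and bag-of-words counts stand in for a pre-trained word vector model; the sample sentence is an assumption.

```python
from collections import Counter

def extract_second_vector(text: str) -> list:
    """Bag-of-words vector of the text to be corrected, min-max normalized."""
    # Whitespace tokenization stands in for a Chinese segmenter (jieba/THULAC).
    tokens = text.lower().split()
    # Deduplicate the segmentation result to build the vocabulary.
    vocab = sorted(set(tokens))
    # Count each word's frequency in the text (bag-of-words).
    counts = Counter(tokens)
    vec = [float(counts[w]) for w in vocab]
    # Min-max normalization scales the values into [0, 1].
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0] * len(vec)
    return [(v - lo) / (hi - lo) for v in vec]

vec = extract_second_vector("li bai li bai tang poet")
# vocab is ['bai', 'li', 'poet', 'tang']; counts [2, 2, 1, 1] normalize to [1.0, 1.0, 0.0, 0.0]
```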
Obtaining the first text vector similar to the second text vector from the vector storage library to obtain the similar text vector includes: calculating the similarity between the second text vector and each first text vector in the vector storage library; and taking the first text vector with the highest similarity as the similar text vector.
In this embodiment, assuming that the dimension of the second text vector is n and the dimension of the similar text vector is m, the dimension of the spliced prompt text vector is n+m.
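The retrieval and splicing steps can be sketched with cosine similarity, assuming the stored vectors and the query vector share one embedding dimension (the toy vectors below are illustrative values only); the concatenated result has dimension n + m as stated above.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_and_splice(second_vec: list, vector_store: list) -> list:
    """Take the most similar stored vector and splice it onto the query (dim n + m)."""
    similar = max(vector_store, key=lambda v: cosine(second_vec, v))
    return second_vec + similar

# Toy store of first text vectors (illustrative values only).
store = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
prompt_vec = retrieve_and_splice([0.9, 0.1, 0.0], store)
# the query is closest to the first stored vector; the result has dimension 3 + 3
```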
Step S402, inputting the prompt word into the large language model.
And step S403, processing the prompt word by using the large language model to obtain the text after error correction.
In this implementation, error correction is performed based on LangChain, a large language model and a knowledge-graph library; combining domain knowledge with context information improves the pertinence, accuracy and reliability of the error correction.
Referring to fig. 3, a training flowchart of a relation extraction model of the historical character information error correction method according to the present application is shown in an embodiment.
As shown in fig. 3, the training method of the relation extraction model includes the following steps S501 to S509.
Step S501, collecting a first labeling sample related to a historical persona.
Step S502, training a few-shot learning model by using the first labeling sample, so as to learn the relationships between the entities in the first labeling sample.
And S503, predicting a non-labeling sample containing unknown entities and entity relations by using the trained few-sample learning model to obtain a second labeling sample.
And step S504, manually correcting the prediction error of the second labeling sample to obtain a corrected second labeling sample.
Step S505, determining the dividing ratio of the training set and the test set.
And step S506, randomly distributing the corrected second labeling sample to a training data set and a test data set according to the dividing proportion.
And S507, training a relation extraction model by using the training data set.
And step S508, evaluating the training effect of the relation extraction model by using the test data set.
Step S509, saving the relation extraction model reaching the ideal training effect.
Specifically, the samples used in this application mainly cover descriptive text related to historical figures in the government-affairs field. In order to improve the accuracy and generalization capability of the model, this application adopts sample data with rich historical backgrounds, wide regional coverage, and diverse character types and language styles, which helps improve the model's performance on the historical-figure information error correction task and enhances its understanding of and adaptability to multiple cultures.
For example, the samples may include relevant information on politics, economy, culture, science and technology, and so on, from ancient times to the modern era, as well as descriptions of historical figures from different countries and regions. In addition, samples of different language styles, such as ancient texts, modern texts, and poetry, are considered, along with the characteristics and achievements of historical figures in different fields, such as politicians, military leaders, writers, artists, and scientists. These comprehensive samples help to improve the performance and applicability of the model.
In this embodiment, the labeling sample includes text (text) and a triplet tag (triple_list) corresponding to the text.
For example, a particular annotation sample may be represented as:
"text" "adapted from the same name of a famous composer Zhang three to put through a novel",
"triple_list": [
[
"some book name",
"author",
zhang Sanzhi (Zhang Sanzhi) "
]]
In this embodiment, the triplet tag is structured data that includes a relationship between entities or a specific attribute of an entity. For example, a particular triplet tag may be represented as:
triple_list(predicate, object_type, subject_type, object, subject)
wherein the triplet tag is constructed as shown in table 1.
TABLE 1 composition of triplet tags
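The field layout above can be sketched as a typed record; the concrete values are taken from the annotation example and the field semantics are an assumption based on the field names.

```python
from typing import NamedTuple

class TripleTag(NamedTuple):
    """Structured triplet tag; field order follows the text above."""
    predicate: str      # relationship between the entities
    object_type: str    # type of the object entity
    subject_type: str   # type of the subject entity
    object: str         # object entity
    subject: str        # subject entity

# Values taken from the annotation example above (illustrative).
tag = TripleTag("author", "person", "book", "Zhang San", "some book name")
```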
In performing model evaluation, three main indicators are adopted in the application:
(1) Precision, also known as the precision ratio, represents the model's ability to be correct on the samples it predicts as positive. Generally, the higher the precision, the more reliable the model's positive predictions. The calculation formula of the precision is as follows:
Precision = TP / (TP + FP)
wherein TP represents True Positives, i.e., the number of samples that are actually positive and predicted positive by the model; FP represents False Positives, i.e., the number of samples that are actually negative but erroneously predicted positive by the model.
(2) Recall, also known as the recall ratio, represents the model's ability to correctly identify the samples that are actually positive. Generally, the higher the recall, the fewer actually-positive samples the model misses, and the lower the probability of missed detection. The calculation formula of the recall is as follows:
Recall = TP / (TP + FN)
wherein TP represents True Positives, i.e., the number of samples that are actually positive and predicted positive by the model; FN represents False Negatives, i.e., the number of samples that are actually positive but erroneously predicted negative by the model.
(3) The F1 value (F1-Score) is an index integrating the precision and recall and is used for comprehensively considering the performance of the model in terms of precision and recall. The calculation formula of the F1 value is as follows:
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
When both the precision and the recall are high, the F1 value approaches 1, which means the model performs well on both dimensions; if either the precision or the recall is low, the F1 value drops accordingly.
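The three formulas above can be computed directly from the confusion-matrix counts; the TP/FP/FN values below are illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute the three evaluation indicators from the formulas above,
    guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Illustrative counts: 80 true positives, 20 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
# p = 0.8, r = 0.8, f1 = 0.8
```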
In this implementation, both the precision and the recognition capability of the relation extraction model are taken into account, so that the model's performance on the task can be evaluated more comprehensively.
It should be noted that in practical applications, different trade-offs may be considered according to the requirements of a specific task. Sometimes more precision may be required to ensure that only truly desired results are predicted; sometimes more attention may be paid to recall to ensure that as many real results as possible are captured. Thus, when using these assessment indicators, a comprehensive consideration needs to be made in conjunction with specific business scenarios and requirements.
In this implementation, the few-shot learning model can quickly organize unstructured information into structured information, so that a large amount of historical-figure data can be collected and organized efficiently, reducing time and labor costs while improving accuracy.
The protection scope of the error correction method for the historical personage information according to the embodiment of the application is not limited to the execution sequence of the steps listed in the embodiment, and all the schemes implemented by adding or removing steps and replacing steps according to the prior art made according to the principles of the application are included in the protection scope of the application.
Referring to fig. 4, a schematic structural diagram of an embodiment of a system for correcting error of historical personage information according to the present application is shown.
As shown in fig. 4, the present embodiment provides a system for correcting errors of historical personal information, comprising:
the recognition module 10 is used for extracting a model to recognize the text to be corrected by applying the pre-trained relation to obtain a model recognition result.
And the judging module 20 is used for judging whether the model identification result comprises triplet information.
The first correction module 30 is configured to input the model identification result into a pre-constructed knowledge graph when the model identification result includes triplet information, and correct the model identification result by using the knowledge graph.
And a second correction module 40, configured to correct the text to be corrected based on LangChain, a large language model and a knowledge graph base when the model recognition result does not include triplet information.
It should be noted that, the structures and principles of the identification module 10, the judgment module 20, the first correction module 30 and the second correction module 40 are in one-to-one correspondence with the steps in the above-mentioned error correction method for the historical personage information, so that the description thereof will not be repeated here.
The embodiment of the application provides a historical personage information error correction system, which can realize the historical personage information error correction method described in the application, but the implementation device of the historical personage information error correction method described in the application includes but is not limited to the structure of the historical personage information error correction system listed in the embodiment, and all structural variations and substitutions of the prior art according to the principles of the application are included in the protection scope of the application.
As shown in fig. 5, the present application provides an electronic device, including: processor 50 and memory 60.
The memory 60 is used for storing a computer program.
The processor 50 is configured to execute a computer program stored in the memory 60 to cause the electronic device to perform any one of the methods described above.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, or methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a method as described in any of the above. Those of ordinary skill in the art will appreciate that all or part of the steps in the method implementing the above embodiments may be implemented by a program to instruct a processor, where the program may be stored in a computer readable storage medium, where the storage medium is a non-transitory (non-transitory) medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof. The storage media may be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
The computer program product is executed by a computer, which performs the method according to the preceding method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case the aforementioned method is required.
The descriptions of the processes or structures corresponding to the drawings have emphasis, and the descriptions of other processes or structures may be referred to for the parts of a certain process or structure that are not described in detail.
The foregoing embodiments are merely illustrative of the principles of the present application and their effectiveness, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications and variations which may be accomplished by persons skilled in the art without departing from the spirit and technical spirit of the disclosure be covered by the claims of this application.

Claims (9)

1. A method for correcting errors in historical personage information, comprising:
identifying a text to be corrected by using a pre-trained relation extraction model, so as to obtain a model identification result;
judging whether the model identification result comprises triplet information or not;
if yes, inputting the model identification result into a pre-constructed knowledge graph, and correcting the model identification result by using the knowledge graph;
otherwise, correcting the text to be corrected based on LangChain, a large language model and a knowledge-graph base;
based on LangChain, a large language model and a knowledge-graph library, correcting the text to be corrected comprises:
extracting a first text vector from the knowledge-graph library by using the LangChain, and establishing a vector storage library based on the first text vector; extracting a second text vector from the text to be corrected; obtaining the first text vector similar to the second text vector from the vector storage library to obtain a similar text vector; splicing the second text vector and the similar text vector to obtain a prompt word;
Inputting the prompt word into the large language model;
and processing the prompt word by using the large language model to obtain the text after error correction.
2. The method of claim 1, wherein the training method of the relation extraction model comprises:
collecting a first labeling sample related to a historical persona;
training a few-sample learning model by using the first labeling sample to learn the relation among entities in the first labeling sample;
predicting a non-labeling sample containing unknown entities and entity relations by using the trained few-sample learning model to obtain a second labeling sample;
manually correcting the prediction error of the second labeling sample to obtain a corrected second labeling sample;
determining the dividing proportion of the training set and the testing set;
randomly distributing the corrected second labeling sample to a training data set and a test data set according to the dividing proportion;
training a relationship extraction model using the training data set;
evaluating a training effect of the relation extraction model using the test dataset;
and storing the relation extraction model reaching the ideal training effect.
3. The method of claim 1, wherein the relation extraction model identifies the text to be corrected using a pipeline method, and obtaining a model identification result comprises:
Identifying a main body in the text to be corrected;
identifying objects in the text to be corrected;
constructing an entity pair based on the identified subject and object;
classifying the pair of entities to determine a type of relationship between the subject and the object;
based on the subject, the object, and the relationship type, a triplet is constructed.
4. The method of claim 1, wherein the relationship extraction model identifies the text to be corrected using a joint extraction method of parameter sharing, and obtaining a model identification result includes:
identifying a main body in the text to be corrected based on a first parameter, and calculating a loss value generated by main body extraction;
identifying objects in the text to be corrected based on the second parameter, and calculating a loss value generated by object extraction;
determining a relationship type between the subject and the object based on a third parameter, and calculating a loss value generated by relationship extraction;
calculating a joint loss value based on the loss value generated by the subject extraction, the loss value generated by the object extraction, and the loss value generated by the relation extraction;
judging whether the joint loss value reaches a preset threshold value or not;
If yes, constructing a triplet based on the subject, the object and the relationship type;
otherwise, updating the first parameter, the second parameter and the third parameter based on the joint loss value, and repeating the entity identification, relationship determination and loss value calculation processes until the joint loss value reaches a preset threshold.
5. The method of claim 1, wherein the relationship extraction model identifies the text to be corrected using a joint extraction method of joint decoding, and obtaining a model identification result comprises:
synchronously identifying a subject, an object and a relationship type in the text to be corrected;
based on the subject, the object, and the relationship type, a triplet is constructed.
6. The method of claim 1, wherein correcting the model recognition result using the knowledge-graph comprises:
acquiring a triplet in the model identification result;
inquiring information related to the triples in the knowledge graph to obtain an inquiry result of the knowledge graph;
comparing the query result of the knowledge graph with the model identification result to judge whether the model identification result is accurate or not;
And carrying out entity correction or relation correction on the inaccurate model identification result, and outputting the corrected model identification result.
7. A system for correcting errors in historical personage information, comprising:
the recognition module is used for applying the pre-trained relation extraction model to recognize the text to be corrected, so as to obtain a model recognition result;
the judging module is used for judging whether the model identification result comprises triplet information or not;
the first correction module is used for inputting the model identification result into a pre-constructed knowledge graph when the model identification result comprises triplet information, and correcting the model identification result by using the knowledge graph;
the second correction module is used for correcting the text to be corrected based on LangChain, a large language model and a knowledge-graph base when the model identification result does not include triplet information;
based on LangChain, a large language model and a knowledge-graph library, correcting the text to be corrected comprises:
extracting a first text vector from the knowledge-graph library by using the LangChain, and establishing a vector storage library based on the first text vector; extracting a second text vector from the text to be corrected; obtaining the first text vector similar to the second text vector from the vector storage library to obtain a similar text vector; splicing the second text vector and the similar text vector to obtain a prompt word;
Inputting the prompt word into the large language model;
and processing the prompt word by using the large language model to obtain the text after error correction.
8. An electronic device, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, to cause the electronic device to perform the method of any one of claims 1 to 6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method of any of claims 1 to 6.
CN202311760431.3A 2023-12-20 2023-12-20 Method, system, electronic device and storage medium for correcting historical character information Active CN117454884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311760431.3A CN117454884B (en) 2023-12-20 2023-12-20 Method, system, electronic device and storage medium for correcting historical character information


Publications (2)

Publication Number Publication Date
CN117454884A true CN117454884A (en) 2024-01-26
CN117454884B CN117454884B (en) 2024-04-09

Family

ID=89589472


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743357A (en) * 2024-02-19 2024-03-22 上海蜜度科技股份有限公司 Method, system, medium and electronic device for updating historical character information knowledge base

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022022045A1 (en) * 2020-07-27 2022-02-03 平安科技(深圳)有限公司 Knowledge graph-based text comparison method and apparatus, device, and storage medium
WO2023005293A1 (en) * 2021-07-30 2023-02-02 平安科技(深圳)有限公司 Text error correction method, apparatus, and device, and storage medium
CN116226391A (en) * 2022-11-07 2023-06-06 上海蜜度信息技术有限公司 Specific field name error correction method and system, storage medium and terminal
CN116775906A (en) * 2023-06-29 2023-09-19 中科云谷科技有限公司 Knowledge graph construction method, system, computer equipment and storage medium
CN116820429A (en) * 2023-08-28 2023-09-29 腾讯科技(深圳)有限公司 Training method and device of code processing model, electronic equipment and storage medium
CN116956923A (en) * 2023-07-25 2023-10-27 清华四川能源互联网研究院 Industrial data knowledge extraction method, device, computer equipment and storage medium
CN117033571A (en) * 2023-06-27 2023-11-10 山东新一代信息产业技术研究院有限公司 Knowledge question-answering system construction method and system
CN117077791A (en) * 2023-10-12 2023-11-17 北京枫清科技有限公司 Model reasoning method, device, equipment and medium based on graph data structure
CN117112779A (en) * 2023-09-25 2023-11-24 山东新一代信息产业技术研究院有限公司 Local document summarizing method based on large model
CN117114697A (en) * 2023-09-25 2023-11-24 山东新一代信息产业技术研究院有限公司 Intelligent sales response method and system based on ChatGLM2 and Langchain
CN117171314A (en) * 2023-08-28 2023-12-05 山东新一代信息产业技术研究院有限公司 Multi-mode government affair question-answering method based on large model
CN117217233A (en) * 2023-09-27 2023-12-12 清华大学 Text correction and text correction model training method and device
CN117217207A (en) * 2023-08-22 2023-12-12 腾讯科技(深圳)有限公司 Text error correction method, device, equipment and medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743357A (en) * 2024-02-19 2024-03-22 上海蜜度科技股份有限公司 Method, system, medium and electronic device for updating historical character information knowledge base
CN117743357B (en) * 2024-02-19 2024-05-07 上海蜜度科技股份有限公司 Method, system, medium and electronic device for updating historical character information knowledge base

Also Published As

Publication number Publication date
CN117454884B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
US10872204B2 (en) Generating natural language recommendations based on an industrial language model
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
US9373075B2 (en) Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation
US10089581B2 (en) Data driven classification and data quality checking system
CN112528034B (en) Knowledge distillation-based entity relationship extraction method
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN111738016A (en) Multi-intention recognition method and related equipment
US20180181544A1 (en) Systems for Automatically Extracting Job Skills from an Electronic Document
CN117454884B (en) Method, system, electronic device and storage medium for correcting historical character information
WO2023040493A1 (en) Event detection
US11620453B2 (en) System and method for artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
CN111651994B (en) Information extraction method and device, electronic equipment and storage medium
CN113254507A (en) Intelligent construction and inventory method for data asset directory
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
CN112214595A (en) Category determination method, device, equipment and medium
Lamba et al. Text Mining for Information Professionals
CN114969387A (en) Document author information disambiguation method and device and electronic equipment
US11501071B2 (en) Word and image relationships in combined vector space
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN116976321A (en) Text processing method, apparatus, computer device, storage medium, and program product
Calle Gallego et al. QUARE: towards a question-answering model for requirements elicitation
CN114417008A (en) Construction engineering field-oriented knowledge graph construction method and system
CN109933788B (en) Type determining method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant