CN114398492B - Knowledge graph construction method, terminal and medium in digital field - Google Patents

Knowledge graph construction method, terminal and medium in the digital field

Info

Publication number: CN114398492B
Application number: CN202111601561.3A
Authority: CN (China)
Other versions: CN114398492A (Chinese, zh)
Inventors: 聂海姣, 吴高丽, 邱银贵
Assignee: Senzongai Digital Beijing Technology Co., Ltd.
Prior art keywords: data, initial, model, sample, labeling
Legal status: Active (granted)

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools, e.g. ontologies or thesauri: Ontology
    • G06F16/2465 — Query processing support for facilitating data mining operations in structured databases
    • G06F16/284 — Relational databases
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2433 — Single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06F40/247 — Natural language analysis; thesauruses; synonyms
    • G06F40/295 — Named entity recognition
    • G06N20/00 — Machine learning


Abstract

The application relates to a method, a terminal and a medium for constructing a knowledge graph in the digital field, wherein the method comprises the following steps: acquiring unstructured data and preprocessing it to obtain initial data; performing unsupervised pre-training on a preset pre-training model based on the initial data to obtain a discrimination model; performing preliminary labeling on the initial data to obtain sample labeling data; constructing and training a target model based on the discrimination model and the sample labeling data, and performing named-entity-recognition fine-tuning on the initial data based on the target model to obtain labeling data; performing entity disambiguation on the labeling data to obtain final data; and constructing a knowledge graph based on the final data. The method and the device reduce the cost of manual labeling and exploit the rich semantic information of unstructured data.

Description

Knowledge graph construction method, terminal and medium in digital field
Technical Field
The present application relates to the technical field of knowledge graph construction, and in particular, to a method, a terminal, and a medium for constructing a knowledge graph in the digital domain.
Background
A knowledge graph combines theories and methods from disciplines such as mathematics, graphics, information visualization and information science with techniques such as citation analysis and co-occurrence analysis, and uses visual graphs to vividly display the core structure, development history, frontier areas and overall framework of a body of knowledge, thereby achieving multi-disciplinary fusion. Through data mining, information processing, knowledge measurement and graph drawing, it displays complex knowledge fields, reveals their dynamic development patterns, and provides a practical and valuable reference for research. With the development and application of artificial intelligence, the knowledge graph has become one of its key technologies and is widely applied in intelligent search, intelligent question answering, personalized recommendation, content distribution and other fields.
Existing knowledge graph construction methods train high-precision construction models on high-quality labeled datasets, so model quality depends heavily on labeled data. However, most data collected from the Internet is unsupervised, i.e. unstructured and unlabeled, and therefore cannot be used directly to train existing supervised models. Data is typically labeled manually, which is expensive, time-consuming and labor-intensive, and it fails to exploit the rich semantic information of unstructured data on the Internet; improvement is therefore needed.
Disclosure of Invention
In order to reduce the cost of manual labeling, the application provides a method, a terminal and a medium for constructing a knowledge graph in the digital field.
In a first aspect, the present application provides a method for constructing a knowledge graph in the digital domain, which adopts the following technical scheme:
a method for constructing a knowledge graph in the digital field comprises the following steps:
acquiring unstructured data, and preprocessing the unstructured data to obtain initial data;
performing unsupervised pre-training on a preset pre-training model based on the initial data to obtain a discrimination model;
carrying out primary labeling on the initial data to obtain sample labeling data;
constructing and training a target model based on the discrimination model and the sample labeling data, and performing named-entity-recognition fine-tuning on the initial data based on the target model to obtain labeling data;
performing entity disambiguation on the labeled data to obtain final data;
and constructing a knowledge graph based on the final data.
With this scheme, unsupervised pre-training on the initial data reduces the noise that a huge data volume would otherwise introduce into the model. Preliminary labeling yields sample labeling data; a target model is trained from the sample labeling data and the preset model; and the remaining initial data is labeled by the target model, greatly reducing the investment in manual labeling and saving time and material cost.
Optionally, the obtaining of the unstructured data and the preprocessing of the unstructured data to obtain initial data includes the following steps:
extracting text data from multiple types of unstructured data;
and segmenting the text data, filtering out special characters, and carrying out error correction processing on the text data to obtain initial data.
By adopting the technical scheme, the unstructured data is preprocessed, the influence on the model in the subsequent process is reduced, and the smooth implementation of knowledge graph construction work is facilitated.
Optionally, the performing unsupervised pre-training on the preset pre-training model based on the initial data to obtain a discriminant model includes the following steps:
loading a preset pre-training model according to the initial data;
and learning the semantic features of the initial data through the pre-training model to obtain a pre-trained discrimination model.
By adopting the technical scheme, unsupervised pre-training can utilize semantic information rich in unstructured data to obtain a discrimination model.
Optionally, the preliminary labeling of the sample initial data to obtain sample labeled data includes the following steps:
selecting sample initial data and a plurality of groups of non-sample initial data from the initial data;
carrying out preliminary labeling on the initial sample data based on a preset named entity recognition model to obtain preliminary labeling data;
and performing supplementary marking and error correction on the preliminary marking data to obtain sample marking data.
With this scheme, only the sample initial data needs to be labeled, and the model can then be trained on the resulting sample labeling data.
Optionally, the training of a target model based on the discrimination model and the sample labeling data, and the named-entity-recognition fine-tuning of the initial data based on the target model to obtain labeling data, includes the following steps:
adding an optimization layer after the discrimination model to construct an initial target model;
training the initial target model based on the sample labeling data to obtain a target model;
performing primary labeling on the first group of non-sample initial data based on the target model to obtain first initial labeling data;
correcting the first initial labeling data to obtain first labeling data;
training the target model based on first labeling data to obtain a first target model;
performing preliminary labeling on a second group of the non-sample initial data based on the first target model to obtain second initial labeling data;
correcting the second initial labeling data to obtain second labeling data;
and based on an iterative processing method, marking and correcting the non-sample initial data through the target model to obtain marked data.
With this scheme, the target model is trained iteratively on the basis of the sample labeling data, greatly reducing the investment in manual labeling.
Optionally, the performing entity disambiguation on the labeled data to obtain final data includes the following steps:
extracting entities from the annotation data, and selecting sample entities and a plurality of groups of non-sample entities from the entities;
constructing a synonym table, mining the sample entity and synonyms of the sample entity, and recording the synonym table;
constructing an initial synonym mining training set according to the sample entity and the synonym of the sample entity;
iteratively training a preset synonym mining model through the synonym mining training set to obtain a target synonym mining model;
mining the entity and the corresponding synonym based on the target synonym mining model, and recording the entity and the corresponding synonym into the synonym table;
and performing entity disambiguation on the tagged data based on the synonym table to obtain final data.
With this scheme, only the synonyms of the sample entities need to be mined manually; a synonym mining training set is constructed from the sample entities and their synonyms, and a synonym mining model is trained on it, greatly reducing the manual construction of complex entity-disambiguation rules.
Optionally, the constructing a knowledge graph based on the final data includes the following steps:
extracting entities from the final data to obtain the relationship and the attribute of the entities;
and constructing a knowledge graph based on the relationship and the attributes.
In a second aspect, the present application provides a terminal device, which adopts the following technical solution:
a terminal device comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor adopts the above knowledge graph construction method in the digital domain when loading and executing the computer program.
With this scheme, a computer program implementing the above knowledge graph construction method in the digital field is stored in the memory so as to be loaded and executed by the processor; a terminal device built from the memory and the processor is thus convenient to use.
In a third aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium, in which a computer program is stored, which, when loaded and executed by a processor, employs the above-mentioned method for constructing a knowledge-graph in the digital domain.
With this scheme, a computer program implementing the above knowledge graph construction method in the digital field is stored in the computer-readable storage medium and loaded and executed by a processor; the medium makes the program convenient to read and store.
Drawings
Fig. 1 is an overall flowchart of a method for constructing a knowledge graph in the digital domain in the embodiment of the present application.
Fig. 2 is a schematic flowchart of steps S201 to S202 in a knowledge graph construction method in the digital domain according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of steps S301 to S302 in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Fig. 4 is a schematic diagram of an ELECTRA model in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating steps S401 to S403 in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Fig. 6 is a flowchart illustrating steps S501 to S508 in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a target model in a knowledge graph construction method in the digital domain according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating steps S601 to S606 in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Fig. 9 is schematic diagrams of two synonym mining models in a knowledge graph construction method in the digital domain according to an embodiment of the present application.
Fig. 10 is a flowchart illustrating steps S701 to S702 in a method for constructing a knowledge graph in the digital domain according to an embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-10.
The embodiment of the application discloses a method for constructing a knowledge graph in the digital field, which comprises the following steps with reference to fig. 1:
s101, acquiring unstructured data, and preprocessing the unstructured data to obtain initial data;
s102, performing unsupervised pre-training on a preset pre-training model based on initial data to obtain a discrimination model;
s103, carrying out primary labeling on the initial data to obtain sample labeling data;
s104, constructing a target model based on the discrimination model, and carrying out named entity identification fine tuning on initial data based on sample labeling data and the target model to obtain labeling data;
s105, performing entity disambiguation on the marked data to obtain final data;
and S106, constructing a knowledge graph based on the final data.
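The six steps above can be sketched as a simple pipeline. All function names below are illustrative placeholders, not identifiers from the patent; each stage is injected as a callable so the sketch stays independent of any particular model library.

```python
# Illustrative skeleton of steps S101-S106; every stage is a stand-in
# callable supplied by the caller, not the patent's actual implementation.

def build_knowledge_graph(unstructured_docs,
                          preprocess, pretrain, preliminary_label,
                          finetune_ner, disambiguate, build_graph):
    """Chain the six stages of the construction method."""
    initial_data = preprocess(unstructured_docs)             # S101
    discriminator = pretrain(initial_data)                   # S102
    sample_labels = preliminary_label(initial_data)          # S103
    labeled_data = finetune_ner(discriminator, sample_labels,
                                initial_data)                # S104
    final_data = disambiguate(labeled_data)                  # S105
    return build_graph(final_data)                           # S106
```

The point of the skeleton is only the data flow: each stage consumes the previous stage's output, and the discrimination model from S102 feeds the fine-tuning in S104.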
In step S101, unstructured data is acquired and then processed to convert it into initial data that is convenient to handle. Referring to fig. 2, the method specifically includes the following steps:
s201, extracting text data from multiple types of unstructured data;
s202, segmenting the text data, filtering out special characters, and carrying out error correction processing on the text data to obtain initial data.
Specifically, in step S201, generally, the unstructured data includes a plurality of formats such as office documents, data, pictures, XML, HTML, various reports, images, and audio/video information, and these formats need to be converted into text data, and the conversion method is as follows:
PDF document conversion: analyzing the PDF document through a PDF analysis tool and the like to obtain text data;
image conversion: converting the image into characters as text data by OCR (optical character recognition);
audio conversion: recognizing characters in the audio through a voice recognition technology and extracting the characters to serve as text data;
video conversion: extracting video frames and converting the images into text by OCR (optical character recognition); or extracting the audio track from the video and recognizing and extracting the characters in it as text data through speech recognition. For example, through these various channels, the finally obtained text data might read: "Cloud server ECS (Elastic Compute Service) is a fixed and scalable cloud computing service …"
In step S202, the acquired text data is preliminarily processed. In this embodiment, the text data is first segmented into sentences; special characters such as line feeds, spaces and emoticons are filtered out of the segmented text; and the sentences are then preliminarily corrected, fixing the more obvious errors, to obtain the initial data.
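A minimal sketch of this preprocessing pass, assuming a regex-based cleaner; the sentence terminators, emoji range, and the tiny `CORRECTIONS` table are all hypothetical stand-ins for the patent's unspecified segmentation and error-correction steps.

```python
import re

# Hypothetical correction table standing in for the error-correction step;
# real entries would come from a spell-checking or correction model.
CORRECTIONS = {"teh": "the"}

def preprocess(text):
    # Split into sentences on Chinese and Western terminators (S202: segmentation).
    sentences = [s for s in re.split(r"[。！？.!?]\s*", text) if s]
    cleaned = []
    for s in sentences:
        s = re.sub(r"[\r\n\t]", "", s)                 # line feeds, tabs
        s = re.sub(r"[\U0001F300-\U0001FAFF]", "", s)  # emoticon block
        s = re.sub(r"\s{2,}", " ", s)                  # collapse space runs
        # Toy word-level correction pass (only meaningful for spaced text).
        words = [CORRECTIONS.get(w, w) for w in s.split(" ")]
        cleaned.append(" ".join(words).strip())
    return cleaned
```

For Chinese text the word-level pass would operate on tokenizer output rather than whitespace splits; the sketch keeps whitespace splitting for brevity.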
In step S102, generally speaking, the magnitude of the unstructured data is large, and is at least ten million-level data volume, so that the magnitude of the obtained initial data is also large, and before performing other tasks, unsupervised pre-training needs to be performed on the initial data to train a model for the batch of initial data, referring to fig. 3, which specifically includes the following steps:
s301, loading a preset pre-training model according to initial data;
s302, learning semantic features of the initial data through a pre-training model to obtain a pre-trained discrimination model.
Specifically, pre-training generally adopts a model such as BERT, which is trained with the Masked Language Model (MLM) task and the NSP (Next Sentence Prediction) task; with the BERT model, however, the pre-training objective is inconsistent with the downstream task.
In this embodiment, to improve on this, the ELECTRA model is selected to pre-train on the initial data. ELECTRA adopts the RTD (Replaced Token Detection) objective and consists of two parts, a Generator and a Discriminator; the input the Discriminator receives during training contains no [MASK] tokens, so no gap with the downstream task arises.
In addition, ELECTRA's training is a token-level binary classification task, similar to the downstream named entity recognition task, which reduces the possibility of a mismatch between pre-training and the downstream task, and its training cost is lower.
More specifically, referring to fig. 4, after the initial data is input to the ELECTRA model, mask positions are randomly generated over the data; the masked data is input to the Generator, which predicts the tokens at the masked positions; the Discriminator then judges which positions hold Generator-produced tokens, labeling replaced tokens 1 and original tokens 0. In learning this, the Discriminator acquires the semantic features of the initial data, and the resulting discrimination model consists of the Discriminator alone.
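The replaced-token-detection setup can be illustrated at the data level: mask a few positions, let a generator fill them in, and derive the Discriminator's 0/1 labels. In real ELECTRA the generator is a small masked language model; `toy_generator` here is a deterministic stand-in.

```python
# Data-level sketch of ELECTRA's RTD objective; the generator is a toy
# stand-in that always proposes the token "gen".

def toy_generator(tokens, position):
    return "gen"

def make_rtd_example(tokens, mask_positions, generator=toy_generator):
    corrupted = list(tokens)
    for p in mask_positions:
        corrupted[p] = generator(tokens, p)
    # Label 1 where the token differs from the original, 0 elsewhere
    # (if the generator happens to reproduce the original, the label stays 0).
    labels = [1 if corrupted[i] != tokens[i] else 0
              for i in range(len(tokens))]
    return corrupted, labels
```

Note that the Discriminator's input (`corrupted`) contains ordinary tokens only, never a [MASK] symbol, which is the property the text above highlights.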
In step S103, since the initial data comes without labels and no model has yet been trained on it, part of the initial data is first labeled with a publicly available named-entity-recognition model, and the labels are then corrected manually. Referring to fig. 5, the method specifically includes the following steps:
s401, selecting sample initial data and a plurality of groups of non-sample initial data from the initial data;
s402, carrying out preliminary labeling on initial sample data based on a preset named entity recognition model to obtain preliminary labeling data;
and S403, performing supplementary marking and error correction on the preliminary marking data to obtain sample marking data.
Specifically, in step S401, from the acquired initial data, 1000 pieces are selected as the sample initial data, and the remaining non-sample initial data is likewise divided into groups of 1000 pieces.
Specifically, in step S402, in this embodiment a publicly available, already trained BERT named-entity-recognition model is selected, and the sample initial data is preliminarily labeled with it to obtain the preliminary labeling data.
Specifically, in step S403, after the preliminary labeling data is obtained, since the labeling accuracy of the selected BERT named-entity-recognition model is limited, entities it missed are labeled manually as a supplement and wrongly labeled entities are corrected, yielding the sample labeling data.
In step S104, after the discriminant model and the sample label data are obtained, a target model is constructed and trained according to the discriminant model and the sample label data, and a named entity recognition fine-tuning training is performed on the non-sample initial data based on the target model to obtain label data. Referring to fig. 6, the method specifically includes the following steps:
s501, adding an optimization layer after the discrimination model to construct an initial target model;
s502, performing entity recognition training on the initial target model based on the sample labeling data to obtain a target model;
s503, carrying out primary labeling on the first group of non-sample initial data based on the target model to obtain first initial labeling data;
s504, correcting the first initial annotation data to obtain first annotation data;
s505, training the target model based on the first labeling data to obtain a first target model;
s506, carrying out primary labeling on the second group of non-sample initial data based on the first target model to obtain second initial labeling data;
s507, correcting the second initial labeling data to obtain second labeling data;
and S508, labeling the non-sample initial data through the target model based on an iterative processing method to obtain labeled data.
Specifically, referring to fig. 7, in this embodiment, to further optimize named entity recognition on the initial data, a CRF layer is added after the discrimination model to construct the initial target model; based on the Markov property, the CRF can model dependencies between adjacent labels and thereby improve the accuracy of named entity recognition.
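What the CRF layer contributes can be seen in a minimal Viterbi decoder: transition scores between successive tags let the model favour or forbid tag sequences rather than picking each tag independently. The scores below are toy numbers, not trained parameters.

```python
# Minimal Viterbi decoding over per-token emission scores and pairwise
# transition scores, as a CRF layer would perform at inference time.

def viterbi(emissions, transitions, tags):
    # emissions: list of {tag: score} per token
    # transitions: {(prev_tag, tag): score}
    best = {t: emissions[0][t] for t in tags}
    back = []
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, t)])
            ptr[t] = prev
            nxt[t] = best[prev] + transitions[(prev, t)] + em[t]
        back.append(ptr)
        best = nxt
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With a positive score on the B-to-I transition, the decoder prefers well-formed entity spans even when individual emission scores are ambiguous.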
Specifically, after the initial target model is constructed, it is loaded and trained for entity recognition on the sample labeling data to obtain the target model. At this point the target model can already recognize the entities in the sample labeling data, and using it to label the non-sample initial data greatly reduces the manual labeling workload.
Specifically, the target model labeling has a problem of low accuracy due to the small amount of sample labeling data, and therefore, the target model needs to be trained continuously. And labeling the first group of non-sample initial data by using the target model to obtain first initial labeled data. After the first initial labeling data are obtained, supplementary labeling and error correction processing are manually carried out on the first initial labeling data, and accurate first labeling data are obtained. And training the target model according to the accurate first marking data to obtain a first target model. And labeling the second group of non-sample initial data by using the first target model again to obtain second initial labeling data, and manually performing supplementary labeling and error correction processing on the second initial labeling data to obtain accurate second labeling data. And finally obtaining the labeled data of all the entities by using an iterative processing method of target model labeling and manual correction.
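The label-correct-retrain cycle described above can be sketched as a loop; `label`, `correct`, and `retrain` are injected stand-ins for the model's pre-labeling pass, the manual correction step, and the retraining step respectively.

```python
# Sketch of the iterative scheme in S501-S508: for each group of
# non-sample data, the current model pre-labels it, a (here simulated)
# human corrects the labels, and the next model is trained on them.

def iterative_labeling(groups, model, label, correct, retrain):
    all_labeled = []
    for group in groups:
        initial = label(model, group)      # model pre-labels the group
        corrected = correct(initial)       # manual supplement / error fix
        all_labeled.extend(corrected)
        model = retrain(model, corrected)  # yields the next target model
    return model, all_labeled
```

Each pass through the loop corresponds to one "n-th target model" in the text: the model used on group n was trained on the corrected labels of groups 1 through n-1.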
More specifically, for the preliminarily labeled non-sample initial data, whether a further target model needs to be trained is decided by accuracy. In this embodiment, if the accuracy of the n-th initial labeling data reaches 90% or more, the (n-1)-th target model is taken as the required named-entity-recognition fine-tuning model and used directly for subsequent labeling; if a later spot-check on some group of non-sample initial data falls below 90% labeling accuracy, that group is corrected manually and the model is retrained on the corrected data to obtain a new trained model.
In addition, since the labeling data from the early labeling rounds may be scarce, a label smoothing regularization method is added to the initial target model to prevent overfitting, and focal loss is added to mitigate the effect of class imbalance. Once the data volume is large enough, tricks such as focal loss and label smoothing are removed and the target model is trained normally.
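Both training aids are standard; for concreteness, here are toy scalar versions applied to a single binary example (the formulas are the usual ones, not anything patent-specific).

```python
import math

def focal_loss(p, y, gamma=2.0):
    # Focal loss down-weights easy examples: (1 - p_t)^gamma * cross-entropy.
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(pt)

def smooth_label(y, eps=0.1, num_classes=2):
    # Label smoothing replaces the hard 0/1 target with eps-smoothed
    # probabilities: 1 - eps on the true class, eps spread over the rest.
    return [(1.0 - eps) if c == y else eps / (num_classes - 1)
            for c in range(num_classes)]
```

For a confidently correct prediction the focal loss is far smaller than the plain cross-entropy, which is exactly why it shifts training effort toward the rare, hard classes.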
In step S105, after the labeling data is obtained, entity disambiguation must be performed on it, because the entities in the labeling data often go by many differing aliases. Referring to fig. 8, the method specifically includes the following steps:
s601, extracting entities from the labeled data, and selecting sample entities and a plurality of groups of non-sample entities from the entities;
s602, constructing a synonym table, mining a sample entity and synonyms of the sample entity, and recording the synonym table;
s603, constructing an initial synonym mining training set according to the sample entity and the synonyms of the sample entity;
s604, iteratively training a preset synonym mining model through a synonym mining training set to obtain a target synonym mining model;
s605, mining an entity and a corresponding synonym based on the target synonym mining model, and recording the entity and the corresponding synonym into a synonym table;
s606, carrying out entity disambiguation on the labeled data based on the synonym table to obtain final data.
Specifically, entities are extracted from the labeling data, and sample entities and multiple groups of non-sample entities are selected from them; the sample entities are analyzed, their synonyms are mined manually, and the sample entities together with their synonyms are recorded in the constructed synonym table.
Specifically, an initial synonym mining training set is constructed from the sample entities and their synonyms. In this embodiment, each training sample consists of a context and a query, and the context takes two forms. In the first, a single context contains two or more entities, for example "after the price cut of product B under company A, sales surged", and the corresponding query forms the sample as "are company A and product B synonyms?". In the second, two contexts each contain at least one entity, for example "company A and company B made a strategic investment of 1.5 million dollars in company C" and "group A is an investment organization focused on early-stage startups and invested in company C together with company B", and the corresponding query forms the sample as "are company A and group A synonyms?".
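Building these two kinds of (context, query) samples can be sketched as follows; the English query phrasing is a paraphrase of the patent's example, not its exact template.

```python
# Sketch of the two training-sample shapes described above: one context
# containing both entities, or two contexts each containing one entity.

def single_context_sample(context, entity_a, entity_b):
    query = f"Are {entity_a} and {entity_b} synonyms?"
    return {"contexts": [context], "query": query}

def double_context_sample(context_1, context_2, entity_a, entity_b):
    query = f"Are {entity_a} and {entity_b} synonyms?"
    return {"contexts": [context_1, context_2], "query": query}
```

The mining model then reads the context(s) plus the query and emits a binary synonym/not-synonym decision for the entity pair.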
Specifically, referring to fig. 9, in this embodiment a BERT model or one of its variants, such as RoBERTa or BERT-wwm, is selected as the preset synonym mining model; it is trained on the constructed initial synonym mining training set to obtain the first synonym mining model. A training set is then built from the first group of non-sample entities and input to the first synonym mining model, which outputs, for each entity pair, whether the entities are synonyms: "1" if so and "0" if not.
Specifically, after the first synonym mining model produces its mining results, the results are checked manually: wrong synonym pairs are corrected, and the confirmed synonym pairs are recorded in the synonym table. The first synonym mining model is then retrained on the corrected results to obtain a second synonym mining model, which performs synonym mining on the training set constructed from the second group of non-sample entities. The iterative training process continues in the same way and is not repeated here; it finally yields a synonym table that is relatively complete for the initial data, together with a synonym mining model with high mining accuracy. Once the synonym table is built, each unknown entity among the remaining non-sample entities is mapped through the table to its corresponding known entity, and the resulting sentences constitute the final data required for knowledge graph construction.
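The iterate-predict-correct-retrain loop just described can be sketched with a stub classifier standing in for the BERT model. The function names and the shape of the manual-correction and retraining hooks are assumptions for illustration, not the patent's implementation.

```python
# Sketch of the human-in-the-loop mining cycle: for each group of candidate
# entity pairs, predict with the current model, let a human correct the
# predictions, record confirmed synonyms, and retrain for the next round.

def mine_synonyms_iteratively(entity_groups, model, human_correct, retrain):
    """entity_groups: list of groups of candidate (entity_a, entity_b) pairs.
    model(pair) -> 0/1 prediction; human_correct fixes wrong labels;
    retrain(model, corrected) returns the next-round model."""
    synonym_table = {}
    for group in entity_groups:
        predictions = [(pair, model(pair)) for pair in group]
        corrected = human_correct(predictions)        # manual check step
        for (a, b), label in corrected:
            if label == 1:                            # record confirmed synonyms
                synonym_table.setdefault(a, set()).add(b)
        model = retrain(model, corrected)             # next synonym mining model
    return synonym_table, model

# Toy run: a model that always predicts 1 and a corrector that trusts it.
groups = [[("CompanyA", "GroupA")], [("ProductB", "ItemB")]]
table, _ = mine_synonyms_iteratively(
    groups,
    model=lambda pair: 1,
    human_correct=lambda preds: preds,
    retrain=lambda m, c: m)
# table == {"CompanyA": {"GroupA"}, "ProductB": {"ItemB"}}
```

In a real pipeline, `human_correct` would be an annotation interface and `retrain` a fine-tuning pass over the corrected pairs; the loop structure itself is what the text describes.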
Since the training set covers two cases, the synonym mining model likewise handles two input configurations. Referring to fig. 9, in the first case the synonym entities appear in the same sentence and the input is one context sentence plus one question sentence; in the second case the synonym entities appear in different sentences and the input is two context sentences plus one question sentence. Together these two configurations cover most scenarios, allowing many synonym entities to be mined from unsupervised corpora.
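One common way to feed both configurations to a single BERT-style model is to pack the contexts and the question into one token sequence. The `[CLS]`/`[SEP]` layout below is a standard BERT convention and an assumption here; the patent does not specify the exact packing.

```python
# Packing the two input configurations into one BERT-style input string:
# case 1: [CLS] context [SEP] question [SEP]
# case 2: [CLS] context1 [SEP] context2 [SEP] question [SEP]

def pack_inputs(contexts, query):
    """contexts: list of 1 or 2 context sentences; query: the question."""
    return "[CLS] " + " [SEP] ".join(list(contexts) + [query]) + " [SEP]"

one = pack_inputs(["ctx"], "Are A and B synonyms?")
two = pack_inputs(["ctx1", "ctx2"], "Are A and C synonyms?")
```

A real implementation would use the tokenizer's segment handling rather than raw strings, but the sequence layout is the same.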
In step S106, after the final data is obtained, a knowledge graph can be constructed from the final data. Referring to fig. 10, this specifically includes the following steps:
S701, extracting entities from the final data to obtain the relationships and attributes of the entities;
S702, constructing a knowledge graph based on the relationships and attributes.
Specifically, in this embodiment the knowledge graph is built on a combination of neo4j (a graph database) and MySQL (a relational database): neo4j mainly stores the entities and the relationships between them, while MySQL mainly stores the attributes associated with each entity. When the knowledge graph is used, the desired entity is first located through neo4j, and its attributes are then retrieved from MySQL.
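The neo4j/MySQL split above can be sketched by generating the statements each store would receive: Cypher for entities and relationships, SQL for attributes. The labels, table name, and column names below are illustrative assumptions, not the patent's schema.

```python
# Sketch of the storage split: entities and relationships -> Cypher (neo4j),
# entity attributes -> SQL rows (MySQL). Statement text only; no DB driver.
# NOTE: real code should use parameterized queries instead of string
# interpolation, which is vulnerable to injection.

def entity_to_cypher(name):
    return f"MERGE (:Entity {{name: '{name}'}})"

def relation_to_cypher(head, rel, tail):
    return (f"MATCH (h:Entity {{name: '{head}'}}), "
            f"(t:Entity {{name: '{tail}'}}) "
            f"MERGE (h)-[:{rel}]->(t)")

def attribute_to_sql(entity, key, value):
    return ("INSERT INTO entity_attributes (entity, attr_key, attr_value) "
            f"VALUES ('{entity}', '{key}', '{value}')")

cy = relation_to_cypher("Company A", "INVESTS_IN", "Company C")
sq = attribute_to_sql("Company A", "industry", "investment")
```

Lookup at query time follows the same split: a Cypher `MATCH` finds the entity and its neighbors in neo4j, then a SQL `SELECT` on the attribute table fetches its properties.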
The implementation principle of the knowledge graph construction method in the digital field in the embodiment of the application is as follows: unstructured data is acquired and preprocessed to obtain initial data; a discrimination model is obtained by unsupervised pre-training on the initial data, and a target model is constructed from the discrimination model; sample initial data is labeled manually to obtain sample labeling data, and the target model is trained iteratively on the sample labeling data and the non-sample initial data to obtain labeling data, with the labeling data corrected manually during training, so that a high-precision model is obtained from a small amount of manually labeled data at very low cost; entities are extracted from the labeling data, an initial synonym-mining training set is constructed manually from the sample entities, and a synonym mining model is trained on it; synonyms among the non-sample entities are mined by the synonym mining model and corrected manually, the corrected synonym-mining training set is used to retrain the model, and a synonym table is built up through iterative training, from which the known entity corresponding to each unfamiliar entity is selected to obtain the final data, reducing the workload of manual mining. Finally, a knowledge graph is constructed from the final data. In this way, the rich semantic information of unstructured data is exploited, the manual workload is reduced, and labor costs are saved.
The embodiment of the application also discloses a terminal device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the knowledge graph construction method in the digital field of the above embodiment is adopted when the processor executes the computer program.
The terminal device may be a computer device such as a desktop computer, a notebook computer, or a cloud server, and includes but is not limited to a processor and a memory; for example, it may further include input/output devices, network access devices, a bus, and the like.
The processor may be a Central Processing Unit (CPU); of course, depending on the actual use case, other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like may also be used. The general-purpose processor may be a microprocessor or any conventional processor, which is not limited in this application.
The memory may be an internal storage unit of the terminal device, such as a hard disk or memory of the terminal device, or an external storage device of the terminal device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, or a flash memory card equipped on the terminal device; it may also be a combination of the internal storage unit and the external storage device. The memory is used for storing the computer program and other programs and data required by the terminal device, and may also be used for temporarily storing data that has been output or will be output, which is not limited in this application.
The terminal device stores the knowledge graph construction method in the digital domain in the embodiment in a memory of the terminal device, and the knowledge graph construction method is loaded and executed on a processor of the terminal device, so that the terminal device is convenient to use.
The embodiment of the application also discloses a computer readable storage medium, and the computer readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the method for constructing the knowledge graph in the digital domain in the embodiment is adopted.
The computer program may be stored in a computer-readable medium and includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium includes any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like, but is not limited to the above.
Through the computer-readable storage medium, the knowledge graph construction method in the digital domain of the above embodiments is stored and can be loaded and executed on a processor, facilitating the storage and application of the method.
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by them; all equivalent changes made according to the structure, shape, and principle of the present application shall fall within the protection scope of the present application.

Claims (8)

1. A knowledge graph construction method in the digital field is characterized by comprising the following steps:
acquiring unstructured data, and preprocessing the unstructured data to obtain initial data;
performing unsupervised pre-training on a preset pre-training model based on the initial data to obtain a discrimination model;
carrying out primary labeling on the initial data to obtain sample labeling data;
constructing and training a target model based on the discrimination model and the sample labeling data, and carrying out named entity recognition fine adjustment on the initial data based on the target model to obtain labeling data;
extracting entities from the annotation data, and selecting sample entities and a plurality of groups of non-sample entities from the entities;
constructing a synonym table, mining the sample entity and synonyms of the sample entity, and recording the synonym table;
constructing an initial synonym mining training set according to the sample entity and the synonym of the sample entity;
iteratively training a preset synonym mining model through the synonym mining training set to obtain a target synonym mining model;
mining the entity and the corresponding synonym based on the target synonym mining model, and recording the entity and the corresponding synonym into the synonym table;
performing entity disambiguation on the tagged data based on the synonym table to obtain final data;
and constructing a knowledge graph based on the final data.
2. The method for constructing a knowledge graph in the digital domain according to claim 1, wherein the steps of obtaining unstructured data, preprocessing the unstructured data and obtaining initial data comprise:
extracting text data from a plurality of types of the unstructured data;
and segmenting the text data, filtering out special characters, and carrying out error correction processing on the text data to obtain initial data.
3. The method for constructing a knowledge graph in the digital domain according to claim 2, wherein the unsupervised pre-training of the pre-trained model based on the initial data to obtain the discriminant model comprises the following steps:
loading a preset pre-training model according to the initial data;
and learning the semantic features of the initial data through the pre-training model to obtain a pre-trained discrimination model.
4. The method for constructing a knowledge graph in the digital domain according to claim 3, wherein the preliminary labeling is performed on the initial data, and the obtaining of the sample labeled data comprises the following steps:
selecting sample initial data and a plurality of groups of non-sample initial data from the initial data;
carrying out preliminary labeling on the initial sample data based on a preset named entity recognition model to obtain preliminary labeling data;
and performing supplementary marking and error correction on the preliminary marking data to obtain sample marking data.
5. The method for constructing a knowledge graph in the digital domain according to claim 4, wherein the training of a target model based on the discriminant model and the sample annotation data and the fine adjustment of named entity recognition on the initial data based on the target model to obtain annotation data comprises the following steps:
adding an optimization layer after the discrimination model to construct an initial target model;
training the initial target model based on the sample labeling data to obtain a target model;
performing primary labeling on the first group of non-sample initial data based on the target model to obtain first initial labeling data;
correcting the first initial labeling data to obtain first labeling data;
training the target model based on first labeling data to obtain a first target model;
performing preliminary labeling on the second group of non-sample initial data based on the first target model to obtain second initial labeling data;
correcting the second initial labeling data to obtain second labeling data;
and based on an iterative processing method, labeling and correcting the non-sample initial data through the target model to obtain labeled data.
6. The method for constructing a knowledge graph in the digital domain according to claim 1, wherein the constructing a knowledge graph based on the final data comprises the following steps:
extracting entities from the final data to obtain the relationship and the attribute of the entities;
and constructing a knowledge graph based on the relationship and the attributes.
7. A terminal device comprising a memory, a processor and a computer program stored in the memory and being executable on the processor, characterized in that the method of any of claims 1-6 is used when the computer program is loaded and executed by the processor.
8. A computer-readable storage medium, in which a computer program is stored, which, when loaded and executed by a processor, carries out the method of any one of claims 1-6.
CN202111601561.3A 2021-12-24 2021-12-24 Knowledge graph construction method, terminal and medium in digital field Active CN114398492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111601561.3A CN114398492B (en) 2021-12-24 2021-12-24 Knowledge graph construction method, terminal and medium in digital field

Publications (2)

Publication Number Publication Date
CN114398492A CN114398492A (en) 2022-04-26
CN114398492B true CN114398492B (en) 2022-08-30

Family

ID=81226619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111601561.3A Active CN114398492B (en) 2021-12-24 2021-12-24 Knowledge graph construction method, terminal and medium in digital field

Country Status (1)

Country Link
CN (1) CN114398492B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118070886B (en) * 2024-04-19 2024-07-30 南昌工程学院 Reservoir flood control emergency plan knowledge graph construction method and system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021931A (en) * 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 A kind of data sample label processing method and device
CN110020438A (en) * 2019-04-15 2019-07-16 上海冰鉴信息科技有限公司 Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN110334212A (en) * 2019-07-01 2019-10-15 南京审计大学 A kind of territoriality audit knowledge mapping construction method based on machine learning
CN110598000A (en) * 2019-08-01 2019-12-20 达而观信息科技(上海)有限公司 Relationship extraction and knowledge graph construction method based on deep learning model
CN110990590A (en) * 2019-12-20 2020-04-10 北京大学 Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN112200317A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-modal knowledge graph construction method
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
CN113177124A (en) * 2021-05-11 2021-07-27 北京邮电大学 Vertical domain knowledge graph construction method and system
CN113283244A (en) * 2021-07-20 2021-08-20 湖南达德曼宁信息技术有限公司 Pre-training model-based bidding data named entity identification method
CN113449113A (en) * 2020-03-27 2021-09-28 京东数字科技控股有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
CN113672737A (en) * 2020-05-13 2021-11-19 复旦大学 Knowledge graph entity concept description generation system
CN113779272A (en) * 2021-09-15 2021-12-10 上海泓笛数据科技有限公司 Data processing method, device and equipment based on knowledge graph and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241078B (en) * 2018-08-30 2021-07-20 中国地质大学(武汉) Knowledge graph organization query method based on mixed database
US10902203B2 (en) * 2019-04-23 2021-01-26 Oracle International Corporation Named entity disambiguation using entity distance in a knowledge graph
CN112084752B (en) * 2020-09-08 2023-07-21 中国平安财产保险股份有限公司 Sentence marking method, device, equipment and storage medium based on natural language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An Accurate and Efficient Method for Domain Knowledge Graph Construction (一种准确而高效的领域知识图谱构建方法); Yang Yuji et al.; Journal of Software (《软件学报》); 2018-02-08; Vol. 29, No. 10; pp. 2931-2947 *
Research on Entity Disambiguation Technology Based on Association Graphs and Text Similarity (基于关联图和文本相似度的实体消歧技术研究); Wang Zhanghui et al.; Computer & Digital Engineering (《计算机与数字工程》); 2021-12-20; Vol. 49, No. 12; pp. 2469-2475 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant