CN112507125A - Triple information extraction method, device, equipment and computer readable storage medium - Google Patents

Publication number
CN112507125A
Authority
CN
China
Prior art keywords
information
text
triple
training
triplet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011415288.0A
Other languages
Chinese (zh)
Inventor
侯丽
刘翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011415288.0A priority Critical patent/CN112507125A/en
Publication of CN112507125A publication Critical patent/CN112507125A/en
Priority to PCT/CN2021/082660 priority patent/WO2022116417A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Abstract

The invention discloses a method, an apparatus, a device and a computer-readable storage medium for extracting triple information. The method comprises the following steps: crawling massive entry information from internet data with a crawler tool, wherein the entry information comprises data from a plurality of different fields; determining, based on the entry information, the sentences in the entry information that contain triple information, and preliminarily extracting the triple information in those sentences to obtain initial triple information; annotating any ordinary text based on the initial triple information to obtain an annotated ordinary text, and taking the annotated ordinary text as a training text; training a BERT pre-trained language model on the training text, obtaining a triple extraction model when the training is completed, and determining the triple information corresponding to any text based on the triple extraction model. The invention can identify the possible triple information in any text, so that high-quality triple information can finally be extracted.

Description

Triple information extraction method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for extracting triplet information.
Background
At present, only a very small amount of the knowledge on the internet has been manually organized into machine-readable formats, such as encyclopedias and vertical-domain databases. This information is only a drop in the ocean, and in terms of coverage, update frequency and reliability it cannot meet the growing demand for automation and intelligence.
Knowledge graph construction technology has matured to the point where some algorithms can extract a small number of specific entity and relation types with obvious features, and some open-source NLP tools can extract entities of specific types. For example, the Stanford CoreNLP tool from Stanford University supports the extraction of 23 entity types such as person names, places, organizations, numbers, currencies, dates and times, and the open-source LTP tool from Harbin Institute of Technology supports the recognition of three entity types: person names, organizations and places. For relation extraction in triples, the prior art constrains the relations to a number of known categories and then uses a classification model to classify the relation expressed by each sentence containing entities, thereby extracting the relation of the triple.
However, with the development of internet technology, massive amounts of text contain entities and relations of many different types, whereas existing triple information extraction techniques only extract a few entity and relation types agreed on in advance and cannot extract the triple information of different types found in such text.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a method, a device and equipment for extracting triple information and a computer readable storage medium, and aims to solve the technical problem that the existing triple information extraction technology only extracts several entities and relations of specific types and cannot extract triple information of different types in texts.
In order to achieve the above object, the present invention provides a triplet information extraction method, including the following steps:
crawling massive vocabulary entry information in internet data through a crawler tool, wherein the vocabulary entry information comprises data of a plurality of different fields;
determining sentences containing triple information in the entry information based on the entry information, and preliminarily extracting the triple information in the sentences to obtain initial triple information;
carrying out data annotation on any common text based on the initial triple information to obtain an annotated common text, and taking the annotated common text as a training text;
training a BERT pre-trained language model based on the training text, obtaining a triple extraction model when the training of the BERT pre-trained language model is completed, and determining triple information corresponding to any text based on the triple extraction model.
Optionally, the step of determining, based on the entry information, a sentence in which the entry information includes triple information, and preliminarily extracting triple information from the sentence to obtain initial triple information includes:
extracting useful texts in the entry information through a text recognition model to obtain text information, wherein the useful texts in the entry information comprise semi-structured first text information and unstructured second text information;
analyzing the text information to obtain sentences containing triple information in the text information;
and extracting the triple information in the sentence to obtain initial triple information.
Optionally, after the step of determining the triplet information corresponding to any text based on the triplet extraction model, the method further includes:
inputting the triple information corresponding to the arbitrary text into a preset knowledge system frame to construct a knowledge system map containing multi-field data;
and when the question information input by the user is received, matching the knowledge data contained in the knowledge map according to the question information, and determining the answer information corresponding to the question information.
Optionally, the triplet information includes associated information, and the step of inputting the triplet information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph including multi-domain data includes:
inputting the triple information into a preset knowledge system frame, and acquiring associated information of each triple information;
performing association arrangement on each triplet according to the association information of each triplet information to determine a triplet information tree;
and constructing a knowledge system map containing multi-domain data based on the triple information tree.
Optionally, after the step of inputting the triplet information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data, the method further includes:
if request information for processing the newly added data information is received, verifying the newly added data information according to a preset information verification rule;
and if the newly added data information passes the verification, adding the newly added data information into the knowledge system map to obtain an updated knowledge system map.
Optionally, the step of training the BERT pre-trained language model based on the training text and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed includes:
inputting the training text into the BERT pre-trained language model, and determining an entity link relation of an entity in the training text;
determining a model adjustment parameter according to the determined entity link relation and an actual entity link relation corresponding to the ordinary text, wherein the actual entity link relation is determined by the label information of the ordinary text;
and training the BERT pre-trained language model based on the model adjustment parameter, and obtaining a triple extraction model when the training of the BERT pre-trained language model is completed.
Optionally, the BERT pre-trained language model includes a transformer structure, and the step of inputting the training text into the BERT pre-trained language model and determining an entity link relation of an entity in the training text includes:
inputting the training text into the BERT pre-trained language model, and obtaining the vector representation of each character in the training text through the transformer structure;
and taking the vector representation of the entity information in the training text as the entity link relation of the entity in the training text.
In addition, to achieve the above object, the present invention further provides a triplet information extracting apparatus, including:
the crawling module is used for crawling massive entry information in internet data through a crawler tool, wherein the entry information comprises data of a plurality of different fields;
the first extraction module is used for determining sentences containing triple information in the entry information based on the entry information, and preliminarily extracting the triple information in the sentences to obtain initial triple information;
the marking module is used for carrying out data marking on any ordinary text based on the initial triple information to obtain a marked ordinary text, and the marked ordinary text is used as a training text;
and the second extraction module is used for training the BERT pre-trained language model based on the training text, obtaining a triple extraction model when the training of the BERT pre-trained language model is completed, and determining triple information corresponding to any text based on the triple extraction model.
In addition, to achieve the above object, the present invention also provides a triplet information extraction device, including: a memory, a processor and a triplet information extraction program stored on the memory and operable on the processor, wherein the triplet information extraction program, when executed by the processor, implements the steps of the triplet information extraction method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium, on which a triplet information extraction program is stored, which when executed by a processor implements the steps of the triplet information extraction method as described above.
The invention crawls massive cross-domain entry information from an encyclopedia, finds the sentences containing triple information in the entry information, and preliminarily extracts the triple information in those sentences to obtain initial triple information. The initial triple information is then aligned to a pre-obtained ordinary text so that the ordinary text is automatically labeled with triple information, and the labeled ordinary text is used as the training text for the subsequent BERT pre-trained language model. The training text is then used as the input of the deep learning algorithm to train a triple extraction model, so that, once trained, the model can identify the triple information in any text and high-quality triple information can finally be extracted.
Drawings
Fig. 1 is a schematic structural diagram of a triplet information extraction device in a hardware operating environment according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a first embodiment of a triplet information extraction method according to the present invention;
fig. 3 is a flowchart illustrating a second embodiment of the triplet information extraction method according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a triplet information extraction device in a hardware operating environment according to an embodiment of the present invention.
The triplet information extraction device in the embodiment of the present invention may be a PC, or may be a mobile terminal device having a display function, such as a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (MPEG-4 Part 14) player, a portable computer, and the like.
As shown in fig. 1, the triplet information extraction device may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the triplet information extraction device may further include a camera, a Radio Frequency (RF) circuit, sensors, an audio circuit, a WiFi module, and the like. The sensors may include light sensors, motion sensors, and other sensors. Specifically, the light sensors may include an ambient light sensor, which adjusts the brightness of the display screen according to the ambient light, and a proximity sensor, which turns off the display screen and/or the backlight when the triplet information extraction device is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when the device is stationary, and can be used in applications that recognize the posture of the triplet information extraction device (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer and tap detection). The triplet information extraction device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described here.
Those skilled in the art will appreciate that the triplet information extraction device structure shown in fig. 1 does not constitute a limitation of the triplet information extraction device and may include more or less components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a triplet information extraction program.
In the triplet information extraction device shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; the processor 1001 may be configured to call the triplet information extraction program stored in the memory 1005 and execute the triplet information extraction method provided by the embodiment of the present invention.
The invention also provides a triple information extraction method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the triple information extraction method of the invention.
In this embodiment, the triplet information extraction method includes the following steps:
step S10, crawling massive vocabulary entry information in Internet data through a crawler tool, wherein the vocabulary entry information comprises data of a plurality of different fields;
In this embodiment, a crawler tool automatically crawls a large amount of entry information from Baidu Baike (Baidu Encyclopedia) in internet data. The crawled entry information covers all the fields contained in Baidu Baike, such as people, life, culture, science, sports, economy, history, society, geography, nature and art, and comprises at least 20 million entries. The crawler tool is a program or script that automatically captures World Wide Web information according to certain rules and requirements, and is implemented by combining one or more of the following crawler technologies: General Purpose Web Crawler, Focused Web Crawler, Incremental Web Crawler and Deep Web Crawler.
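As a minimal sketch of the general-purpose crawling idea described above, the following Python code performs a breadth-first traversal over linked entry pages. The `crawl_entries` function, the `fetch` callable and the in-memory `FAKE_WEB` pages are assumptions made for illustration; the patent does not specify how the crawler is implemented.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical page representation: (entry text, list of linked entry URLs).
Page = Tuple[str, List[str]]

def crawl_entries(seed: str, fetch: Callable[[str], Page], limit: int) -> Dict[str, str]:
    """Breadth-first crawl starting at `seed`, collecting up to `limit` entries."""
    seen = {seed}
    queue = deque([seed])
    entries: Dict[str, str] = {}
    while queue and len(entries) < limit:
        url = queue.popleft()
        text, links = fetch(url)
        entries[url] = text
        for link in links:
            if link not in seen:  # visit each entry at most once
                seen.add(link)
                queue.append(link)
    return entries

# In-memory stand-in for the network so the sketch runs offline (assumed data).
FAKE_WEB = {
    "a": ("Entry A", ["b", "c"]),
    "b": ("Entry B", ["a"]),
    "c": ("Entry C", []),
}
result = crawl_entries("a", lambda url: FAKE_WEB[url], limit=10)
```

In a real deployment the `fetch` callable would download and parse a page; a focused or incremental crawler would additionally filter or revisit URLs, which this sketch leaves out.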
Step S20, determining sentences containing triple information in the entry information based on the entry information, and preliminarily extracting the triple information in the sentences to obtain initial triple information;
In this embodiment, after the entry information of each field is obtained, the triple information in it is first preliminarily extracted. Specifically, the crawled entry information is preprocessed, where the preprocessing may consist of parsing, segmenting and filtering the entry information in sequence, so that redundant information is preliminarily filtered out. The preprocessed entry information is then analyzed to find the sentences that contain triple information, and the triple information in those sentences is preliminarily extracted to obtain the initial triple information.
Further, step S20 includes:
step S21, extracting useful texts in the entry information through a text recognition model to obtain text information, wherein the useful texts in the entry information comprise semi-structured first text information and unstructured second text information;
step S22, analyzing the text information to obtain sentences containing triple information in the text information;
and step S23, extracting the triple information in the sentence to obtain initial triple information.
In this embodiment, the crawled entry information is first preprocessed to filter out redundant information. The filtered entry information is then input into a pre-trained text recognition model, which extracts the useful text in the entry information to obtain text information. The text information comprises semi-structured first text information and unstructured second text information, both of which are subsequently analyzed. The semi-structured first text information is text with some structure that is neither strict nor fixed; resume information is an example. The unstructured second text information is text without a fixed structure, such as a news article written freely by its author. The text recognition model is used to extract the semi-structured and unstructured text information in the entry information and comprises a text box recognition module, which locates the text in the entry information, and a character recognition module, which recognizes the characters within each text box. The text recognition model may be an OCR model.
After the semi-structured first text information and the unstructured second text information are obtained, they are analyzed according to a preset analysis rule to select the sentences that contain triple information; the triple information in those sentences is then extracted to obtain the initial triple information.
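The preset analysis rule is not spelled out in the text. As one possible illustration, the sketch below assumes that the semi-structured text consists of `key: value` lines attached to an entry title, and that a sentence "contains triple information" when it mentions both the entity and the attribute of a known triple. The function names and the sample data are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (entity, association, attribute)

def infobox_to_triples(title: str, infobox: str) -> List[Triple]:
    """Turn 'key: value' lines of a semi-structured block into triples."""
    triples = []
    for line in infobox.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value:  # skip lines that are not key/value pairs
            triples.append((title, key, value))
    return triples

def sentences_with_triples(sentences: List[str], triples: List[Triple]) -> List[str]:
    """Keep sentences that mention both the entity and the attribute of a triple."""
    return [s for s in sentences
            if any(ent in s and attr in s for ent, _, attr in triples)]

# Assumed sample data for illustration only.
infobox = "birthplace: Shanghai\noccupation: basketball player\nmalformed line"
triples = infobox_to_triples("Yao Ming", infobox)
sentences = ["Yao Ming was born in Shanghai.", "The weather was fine."]
kept = sentences_with_triples(sentences, triples)
```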
Step S30, based on the initial triple information, carrying out data annotation on any ordinary text to obtain an annotated ordinary text, and using the annotated ordinary text as a training text;
In this embodiment, after the initial triple information in the entry information is obtained, an ordinary text is obtained, the extracted initial triple information is aligned to the sentences of the ordinary text, and every character in those sentences is automatically sequence-labeled. The initial character of the entity information in the ordinary text may be labeled SUB-B, the initial character of the association information may be labeled PDC-B, the initial character of the attribute information may be labeled OBJ-B, and all characters that do not belong to triple information are labeled O. After the ordinary text is labeled, it is used as the training text for training the BERT pre-trained language model.
Step S40, training the BERT pre-trained language model based on the training text, obtaining a triple extraction model when the training of the BERT pre-trained language model is completed, and determining triple information corresponding to any text based on the triple extraction model.
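The alignment and labeling step can be sketched as follows. The labels SUB-B, PDC-B, OBJ-B and O come from the description; the continuation labels SUB-I, PDC-I and OBJ-I for the non-initial characters of a span are an assumption, since the text only states how initial characters and non-triple characters are labeled. The function name `label_characters` is hypothetical.

```python
from typing import List, Tuple

def label_characters(text: str, triple: Tuple[str, str, str]) -> List[str]:
    """Tag each character of `text` using an (entity, association, attribute) triple."""
    labels = ["O"] * len(text)
    for span, prefix in zip(triple, ("SUB", "PDC", "OBJ")):
        start = text.find(span)
        if start < 0 or not span:
            continue  # this part of the triple does not occur in the text
        labels[start] = f"{prefix}-B"
        for i in range(start + 1, start + len(span)):
            labels[i] = f"{prefix}-I"  # assumed continuation tag, not in the source
    return labels

labels = label_characters("姚明出生于上海", ("姚明", "出生于", "上海"))
```

Only the first occurrence of each span is labeled here; a fuller alignment would have to handle repeated or overlapping mentions.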
In this embodiment, the labeled training text is input into the BERT pre-trained language model for training. The matrix output by the last hidden layer of the model is passed through a fully connected layer, the probability of each character of the training text being predicted as each label is calculated, and the corresponding cross-entropy loss is computed. The model parameters are updated by backpropagation according to the cross-entropy loss, and training continues until the model converges, after which the model is saved for subsequent triple extraction. The triple information of a sentence comprises the entity information, association information and attribute information of the sentence, where the entity information is an abstraction of an objective object, the attribute information represents a property of the object, and the association information represents the relationship between entities.
After the triple extraction model is trained, it can make predictions on any text. Specifically, a news text (or another text) is randomly selected from the internet and split into sentences, the triple extraction model extracts the triples of each sentence, and the triples extracted from all sentences are combined to obtain the triple information. Finally, an NLP component further verifies and filters the identified results to obtain the final high-quality triples. Specifically, word segmentation, part-of-speech tagging and entity recognition are performed on each sentence from which triple information was extracted. If the entity information of the extracted triple is a recognized entity, or is tagged as a noun or an idiom in the part-of-speech tagging result, the triple is retained as a prediction result, which yields the triple information of the predicted text; otherwise the triple is discarded and the triple extraction model is adjusted again.
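The per-character classification head and loss described above can be illustrated in plain Python. This is only a sketch of the arithmetic: the hidden vectors would come from the last hidden layer of the model, and the backpropagation update is omitted. The function names and the toy numbers are assumptions.

```python
import math
from typing import List

def dense(hidden: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one logit per label for a single character vector."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def softmax(logits: List[float]) -> List[float]:
    """Convert logits into label probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_cross_entropy(logits_per_char: List[List[float]], gold_ids: List[int]) -> float:
    """Mean negative log-probability of the gold label over all characters."""
    losses = [-math.log(softmax(logits)[gold])
              for logits, gold in zip(logits_per_char, gold_ids)]
    return sum(losses) / len(losses)

# Toy example: one character, hidden size 2, four labels, all-zero weights.
logits = dense([0.5, -0.5], [[0.0, 0.0]] * 4, [0.0] * 4)
loss = token_cross_entropy([logits], [2])
```

With all-zero weights every label is equally likely, so the loss equals the natural logarithm of the number of labels.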
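The prediction-time post-processing can be sketched as below. Sentence splitting uses a simple punctuation rule, and the filter keeps a triple when its entity is a recognized entity or carries a noun or idiom tag. The tag codes `n` and `i`, the function names and the inputs (entity set, part-of-speech dictionary) are assumptions standing in for the output of an unspecified NLP component.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

def split_sentences(text: str) -> List[str]:
    """Split after Chinese or English sentence-ending punctuation."""
    pieces = re.findall(r"[^。！？!?.]+[。！？!?.]?", text)
    return [p.strip() for p in pieces if p.strip()]

def keep_triple(triple: Triple, entities: Set[str], pos: Dict[str, str]) -> bool:
    """Retain a triple whose entity is a recognized entity, a noun or an idiom."""
    subject = triple[0]
    return subject in entities or pos.get(subject) in {"n", "i"}  # assumed tag codes

def filter_triples(triples: Iterable[Triple], entities: Set[str],
                   pos: Dict[str, str]) -> List[Triple]:
    """Merge triples from all sentences, dropping duplicates and rejected ones."""
    seen, kept = set(), []
    for triple in triples:
        if triple not in seen and keep_triple(triple, entities, pos):
            seen.add(triple)
            kept.append(triple)
    return kept

candidates = [("Shanghai", "located in", "China"),
              ("quickly", "is", "fast"),
              ("Shanghai", "located in", "China")]
final = filter_triples(candidates, {"Shanghai"}, {"quickly": "d"})
```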
It should be emphasized that, in order to further ensure the privacy and security of the triplet information, the triplet information may also be stored in a node of a block chain.
The triple information extraction method provided by this embodiment crawls massive cross-domain entry information from an encyclopedia, finds the sentences containing triple information in the entry information, and preliminarily extracts the triple information in those sentences to obtain initial triple information. The initial triple information is then aligned to a pre-obtained ordinary text so that the ordinary text is automatically labeled with triple information, and the labeled ordinary text is used as the training text for the subsequent BERT pre-trained language model. The training text is then used as the input of the deep learning algorithm to train a triple extraction model, so that, once trained, the model can identify the possible triple information in any text and high-quality triple information can finally be extracted.
Based on the first embodiment, a second embodiment of the triplet information extraction method of the present invention is proposed, and referring to fig. 3, in this embodiment, after step S40, the method further includes:
step S50, inputting the triple information corresponding to the arbitrary text into a preset knowledge system frame to construct a knowledge system map containing multi-field data;
and step S60, when the question information input by the user is received, matching the knowledge data contained in the knowledge map according to the question information, and determining the answer information corresponding to the question information.
In this embodiment, after the triple information of any text is identified, the obtained triple information, namely the entity information, the association information and the attribute information, may be input into a preset knowledge system framework to construct a knowledge system map with associations. The knowledge system framework is a template framework for constructing the association relationships between pieces of triple information; with it, the triple information can be stored, understood and processed by the computer device. The knowledge system map is a database, constructed through the knowledge system framework, for storing and associating the triple information. After the knowledge system map is constructed from the triple information, the system corresponding to the knowledge system map provides an information retrieval function: question information can be input into the system, which queries the knowledge data in the knowledge system map according to the question information. When related information matching the question information is found, it is assembled according to a preset language order to obtain answer information, and the answer information is output.
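A very small sketch of the retrieval step follows. It assumes that "matching" means the question mentions both the entity and the association of a stored triple, and that the "preset language order" is entity, association, attribute. Both are illustrative assumptions, as are the function name and the sample graph.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (entity, association, attribute)

def answer_question(question: str, graph: List[Triple]) -> Optional[str]:
    """Return an answer assembled from the first triple matching the question."""
    for entity, association, attribute in graph:
        if entity in question and association in question:
            # Assumed assembly order: entity, association, attribute.
            return f"{entity} {association} {attribute}"
    return None  # no matching knowledge data

graph = [("Yao Ming", "birthplace", "Shanghai"),
         ("Shanghai", "country", "China")]
reply = answer_question("What is the birthplace of Yao Ming?", graph)
```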
Further, the triple information includes associated information, and the step of inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph including multi-domain data includes:
step S501, inputting the triple information into a preset knowledge system frame, and acquiring associated information of each triple information;
step S502, performing association arrangement on each triplet according to the association information of each triplet information, and determining a triplet information tree;
and S503, constructing a knowledge system map containing multi-domain data based on the triple information tree.
In this embodiment, after the triple information of any text is identified, the obtained triple information, namely the entity information, the association information and the attribute information, may be input into a preset knowledge system framework to obtain the association information of each piece of triple information. The triples are then associated and sorted according to this association information to derive triple information trees that link the related triples, and the triple information trees are stored according to the template of the preset knowledge system framework to obtain a knowledge system map with associations. The knowledge system map comprises a plurality of triple information trees, each of which stores associated triple information and the relations between the triples.
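One way to picture a triple information tree is to chain triples whose attribute is the entity of another triple. The sketch below builds such trees under that assumption; the actual association rule and storage template are not specified in the text, and `build_trees` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity, association, attribute)

def build_trees(triples: List[Triple]) -> Dict[str, list]:
    """Group triples into trees rooted at entities that never appear as attributes."""
    children = defaultdict(list)
    attributes = set()
    for entity, association, attribute in triples:
        children[entity].append((association, attribute))
        attributes.add(attribute)
    roots = [entity for entity in children if entity not in attributes]

    def expand(node: str, path: frozenset) -> list:
        branches = []
        for association, attribute in children.get(node, []):
            # Stop at nodes already on the current path to avoid cycles.
            subtree = [] if attribute in path else expand(attribute, path | {attribute})
            branches.append((association, attribute, subtree))
        return branches

    return {root: expand(root, frozenset([root])) for root in roots}

trees = build_trees([("A", "r1", "B"), ("B", "r2", "C"), ("D", "r3", "E")])
```

Triples that only form a closed cycle have no root under this rule and are left out, which a fuller implementation would need to handle.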
Further, after the step of inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data, the method further includes:
step S70, if request information for processing newly added data information is received, verifying the newly added data information according to a preset information verification rule;
and step S80, if the newly added data information passes the verification, adding the newly added data information into the knowledge system graph to obtain an updated knowledge system graph.
In this embodiment, if request information for processing newly added data information is received, the data information to be added is verified according to a preset information verification rule. The newly added data information supplements the knowledge data in the knowledge system graph and includes a plurality of pieces of newly added knowledge data; the preset information verification rule is verification information for verifying the knowledge data included in the knowledge system graph.
Specifically, the information verification rule comprises knowledge data classification verification information, unit standardization verification information, de-duplication verification information, and association verification information. The classification verification information is used to classify the newly added knowledge data according to its attribute information. The unit standardization verification information is used to standardize the units of data such as time and monetary amounts in the newly added knowledge data. The de-duplication verification information is used to judge whether the newly added knowledge data duplicates original knowledge data in the knowledge system graph; if it does, the newly added knowledge data is not added. The association verification information is used to sort out the association relationships between the newly added knowledge data and other knowledge data.
Verifying the newly added data information against the preset information verification rule avoids problems such as classification errors and inconsistent units in the updated knowledge system graph, and keeps the newly added data information in the knowledge system graph standardized.
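The four checks can be sketched as a single gatekeeping function. Everything concrete below is an assumption made for illustration: the unit table `UNIT_FACTORS`, the classification table `CATEGORY_BY_RELATION`, the record layout, and the function name `verify_and_add` do not come from the source.

```python
# Hypothetical lookup tables; a real rule set would be far larger.
UNIT_FACTORS = {"yuan": 1, "thousand yuan": 1_000, "million yuan": 1_000_000}
CATEGORY_BY_RELATION = {"revenue": "finance", "founded": "history"}


def verify_and_add(graph, record):
    """Apply the four checks to one new record; return True if it was added."""
    entity, relation, value, unit = record

    # 1. Classification check: derive a category from the relation name.
    category = CATEGORY_BY_RELATION.get(relation, "uncategorized")

    # 2. Unit standardization: express monetary amounts in one base unit.
    if unit in UNIT_FACTORS:
        value, unit = value * UNIT_FACTORS[unit], "yuan"

    # 3. De-duplication check: skip data that the graph already holds.
    key = (entity, relation, value, unit)
    if key in graph["facts"]:
        return False

    # 4. Association check: link the record to others about the same entity.
    graph["facts"][key] = category
    graph["by_entity"].setdefault(entity, []).append(key)
    return True


graph = {"facts": {}, "by_entity": {}}
first = verify_and_add(graph, ("Company A", "revenue", 3, "million yuan"))
second = verify_and_add(graph, ("Company A", "revenue", 3000, "thousand yuan"))
print(first, second)
```

Running the unit standardization before the de-duplication check is a deliberate ordering in this sketch: the second record states the same amount in a different unit and is only recognized as a duplicate once both are expressed in yuan.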
Further, the step of training the BERT pre-trained language model based on the training text, and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed, includes:
step S41, inputting the training text into the BERT pre-trained language model, and determining the entity link relation of the entities in the training text;
step S42, determining a model adjustment parameter according to the entity link relation and the actual entity link relation corresponding to the ordinary text, wherein the actual entity link relation is determined by the annotation information of the ordinary text;
and step S43, training the BERT pre-trained language model based on the model adjustment parameter, and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed.
In this embodiment, the annotated training text is input into the BERT pre-trained language model for training. The matrix output by the last hidden layer of the model is obtained and passed through a fully connected layer to determine the entity link relation of the entities in the training text. From the predicted entity link relation and the actual entity link relation corresponding to the ordinary text, the probability of each character of the training text being predicted as each label is calculated, and the model adjustment parameter corresponding to those probabilities is computed. The model parameters of the BERT pre-trained language model are then updated by backpropagation according to the model adjustment parameter, and training continues until the model converges; the converged model is stored for subsequent triple extraction. The model adjustment parameter may be a cross-entropy loss value.
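The loss computation described above can be illustrated without a deep learning framework. The sketch below assumes the last-hidden-layer output is already available as one vector per character and uses tiny hand-picked weights; the helper names and the numbers are hypothetical, and a real system would compute this inside a framework and backpropagate through the whole model.

```python
import math


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected(hidden, weights, bias):
    """Map one character's hidden vector to one logit per label."""
    return [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(weights, bias)]


def sequence_cross_entropy(hidden_states, weights, bias, gold_labels):
    """Mean cross-entropy over the characters of one training text."""
    loss = 0.0
    for hidden, gold in zip(hidden_states, gold_labels):
        probs = softmax(fully_connected(hidden, weights, bias))
        loss -= math.log(probs[gold])  # penalize low probability on the true label
    return loss / len(gold_labels)


hidden_states = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the last hidden layer output
weights = [[2.0, 0.0], [0.0, 2.0]]        # one row per label
bias = [0.0, 0.0]
good = sequence_cross_entropy(hidden_states, weights, bias, [0, 1])
bad = sequence_cross_entropy(hidden_states, weights, bias, [1, 0])
print(good < bad)
```

The loss is small when the predicted label distribution agrees with the annotation and large when it does not, which is what makes it usable as the model adjustment parameter.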
Further, the BERT pre-trained language model comprises a transformer structure, and the step of inputting the training text into the BERT pre-trained language model and determining the entity link relation of the entities in the training text includes:
step S411, inputting the training text into the BERT pre-trained language model, and obtaining a vector representation of each character in the training text through the transformer structure;
and step S412, taking the vector representation of the entity information in the training text as the entity link relation of the entities in the training text.
In this embodiment, the annotated training text is input into the BERT pre-trained language model for training. The matrix output by the last hidden layer is obtained through the transformer structure in the model and passed through a fully connected layer to obtain a vector representation of each character in the training text. The vector representation of the entity information in the training text is then taken as the entity link relation in the training text.
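The source does not say how the per-character vectors of an entity are combined into a single representation. The sketch below assumes mean pooling over the entity's character span, which is one common choice; the function name `entity_representation` and the example vectors are hypothetical.

```python
def entity_representation(char_vectors, start, end):
    """Mean-pool the vectors of characters start..end-1 (the entity span)."""
    span = char_vectors[start:end]
    if not span:
        raise ValueError("entity span is empty")
    dim = len(span[0])
    return [sum(vec[i] for vec in span) / len(span) for i in range(dim)]


# One vector per character, as produced by the transformer structure (toy values).
char_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
entity_vec = entity_representation(char_vectors, 1, 3)
print(entity_vec)
```

Other pooling choices, such as taking only the first character's vector or concatenating the first and last, would fit the same description equally well.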
According to the triple information extraction method provided by this embodiment, the triple information is input into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data; when question information input by a user is received, the knowledge data contained in the knowledge system graph is matched against the question information to determine the corresponding answer information. In this embodiment, after the triple information of any text in different fields is extracted, it is input into the preset knowledge system framework to construct a knowledge system graph containing data from each field. The resulting knowledge system graph is analyzable, retrievable, and traceable, and, once constructed, is used to output the answer information corresponding to input question information.
In addition, an embodiment of the present invention further provides a triple information extraction device, which includes:
a crawling module, configured to crawl massive entry information from internet data through a crawler tool, wherein the entry information comprises data of a plurality of different fields;
a first extraction module, configured to determine, based on the entry information, sentences containing triple information in the entry information, and to preliminarily extract the triple information in the sentences to obtain initial triple information;
an annotation module, configured to perform data annotation on any ordinary text based on the initial triple information to obtain an annotated ordinary text, and to take the annotated ordinary text as a training text;
and a second extraction module, configured to train a BERT pre-trained language model based on the training text, to obtain a triple extraction model when the training of the BERT pre-trained language model is completed, and to determine triple information corresponding to any text based on the triple extraction model.
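The annotation module's job, labeling ordinary text from the initial triples, can be sketched as character-level tagging. The BIO-style label scheme, the `SUBJ`/`OBJ-<relation>` label names, and the exact-substring matching are assumptions made for this illustration; the source does not specify a tagging scheme.

```python
def annotate(text, triples):
    """Tag each character of `text` using triples whose entity and value both occur in it."""
    tags = ["O"] * len(text)
    for entity, relation, value in triples:
        e_pos, v_pos = text.find(entity), text.find(value)
        if e_pos < 0 or v_pos < 0:
            continue  # this triple is not expressed in the text
        for pos, span, label in ((e_pos, entity, "SUBJ"), (v_pos, value, f"OBJ-{relation}")):
            tags[pos] = f"B-{label}"          # first character of the span
            for i in range(pos + 1, pos + len(span)):
                tags[i] = f"I-{label}"        # remaining characters of the span
    return tags


text = "AB likes CD"
tags = annotate(text, [("AB", "likes", "CD"), ("XY", "owns", "CD")])
print(list(zip(text, tags)))
```

The second triple is skipped because its entity does not appear in the text, so only triples actually expressed in a sentence contribute labels to the training text.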
Further, the first extraction module is further configured to:
extracting useful texts in the entry information through a text recognition model to obtain text information, wherein the useful texts in the entry information comprise semi-structured first text information and unstructured second text information;
analyzing the text information to obtain sentences containing triple information in the text information;
and extracting the triple information in the sentence to obtain initial triple information.
Further, the second extraction module is further configured to:
inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data;
and when question information input by a user is received, matching the knowledge data contained in the knowledge system graph according to the question information, and determining the answer information corresponding to the question information.
Further, the second extraction module is further configured to:
inputting the triple information into the preset knowledge system framework, and acquiring the association information of each piece of triple information;
arranging the triples by association according to the association information of each piece of triple information to determine a triple information tree;
and constructing a knowledge system graph containing multi-domain data based on the triple information tree.
Further, the second extraction module is further configured to:
if request information for processing newly added data information is received, verifying the newly added data information according to a preset information verification rule;
and if the newly added data information passes the verification, adding the newly added data information into the knowledge system graph to obtain an updated knowledge system graph.
Further, the second extraction module is further configured to:
inputting the training text into the BERT pre-trained language model, and determining the entity link relation of the entities in the training text;
determining a model adjustment parameter according to the entity link relation and the actual entity link relation corresponding to the ordinary text, wherein the actual entity link relation is determined by the annotation information of the ordinary text;
and training the BERT pre-trained language model based on the model adjustment parameter, and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed.
Further, the second extraction module is further configured to:
inputting the training text into the BERT pre-trained language model, and obtaining the vector representation of each character in the training text through the transformer structure;
and taking the vector representation of the entity information in the training text as the entity link relation of the entities in the training text.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a triplet information extraction program is stored on the computer-readable storage medium, and when being executed by a processor, the triplet information extraction program implements the steps of the triplet information extraction method according to any one of the above embodiments.
The specific embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the triplet information extraction method, and will not be described in detail herein.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing information on a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium as described above (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A triplet information extraction method is characterized by comprising the following steps:
crawling massive entry information from internet data through a crawler tool, wherein the entry information comprises data of a plurality of different fields;
determining, based on the entry information, sentences containing triple information in the entry information, and preliminarily extracting the triple information in the sentences to obtain initial triple information;
performing data annotation on any ordinary text based on the initial triple information to obtain an annotated ordinary text, and taking the annotated ordinary text as a training text;
training a BERT pre-trained language model based on the training text, obtaining a triple extraction model when the training of the BERT pre-trained language model is completed, and determining triple information corresponding to any text based on the triple extraction model.
2. The method of extracting triplet information as claimed in claim 1, wherein the step of determining a sentence containing triplet information in the entry information based on the entry information, and preliminarily extracting triplet information in the sentence to obtain initial triplet information comprises:
extracting useful texts in the entry information through a text recognition model to obtain text information, wherein the useful texts in the entry information comprise semi-structured first text information and unstructured second text information;
analyzing the text information to obtain sentences containing triple information in the text information;
and extracting the triple information in the sentence to obtain initial triple information.
3. The method for extracting triplet information as claimed in claim 1, wherein after the step of determining triplet information corresponding to any text based on the triplet extraction model, the method further comprises:
inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data;
and when question information input by a user is received, matching the knowledge data contained in the knowledge system graph according to the question information, and determining the answer information corresponding to the question information.
4. The triple information extraction method according to claim 3, wherein the triple information includes association information, and the step of inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data includes:
inputting the triple information into the preset knowledge system framework, and acquiring the association information of each piece of triple information;
arranging the triples by association according to the association information of each piece of triple information to determine a triple information tree;
and constructing a knowledge system graph containing multi-domain data based on the triple information tree.
5. The triple information extraction method according to claim 3, wherein after the step of inputting the triple information corresponding to the arbitrary text into a preset knowledge system framework to construct a knowledge system graph containing multi-domain data, the method further comprises:
if request information for processing newly added data information is received, verifying the newly added data information according to a preset information verification rule;
and if the newly added data information passes the verification, adding the newly added data information into the knowledge system graph to obtain an updated knowledge system graph.
6. The triplet information extraction method of any one of claims 1 to 5, wherein the step of training the BERT pre-trained language model based on the training text, and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed, comprises:
inputting the training text into the BERT pre-trained language model, and determining the entity link relation of the entities in the training text;
determining a model adjustment parameter according to the entity link relation and the actual entity link relation corresponding to the ordinary text, wherein the actual entity link relation is determined by the annotation information of the ordinary text;
and training the BERT pre-trained language model based on the model adjustment parameter, and obtaining the triple extraction model when the training of the BERT pre-trained language model is completed.
7. The triplet information extraction method of claim 6, wherein the BERT pre-trained language model comprises a transformer structure, and the step of inputting the training text into the BERT pre-trained language model and determining the entity link relation of the entities in the training text comprises:
inputting the training text into the BERT pre-trained language model, and obtaining the vector representation of each character in the training text through the transformer structure;
and taking the vector representation of the entity information in the training text as the entity link relation of the entities in the training text.
8. A triplet information extraction apparatus characterized by comprising:
a crawling module, configured to crawl massive entry information from internet data through a crawler tool, wherein the entry information comprises data of a plurality of different fields;
a first extraction module, configured to determine, based on the entry information, sentences containing triple information in the entry information, and to preliminarily extract the triple information in the sentences to obtain initial triple information;
an annotation module, configured to perform data annotation on any ordinary text based on the initial triple information to obtain an annotated ordinary text, and to take the annotated ordinary text as a training text;
and a second extraction module, configured to train a BERT pre-trained language model based on the training text, to obtain a triple extraction model when the training of the BERT pre-trained language model is completed, and to determine triple information corresponding to any text based on the triple extraction model.
9. A triplet information extraction apparatus characterized by comprising: a memory, a processor, and a triplet information extraction program stored on the memory and executable on the processor, the triplet information extraction program when executed by the processor implementing the steps of the triplet information extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a triplet information extraction program, which when executed by a processor implements the steps of the triplet information extraction method of any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011415288.0A CN112507125A (en) 2020-12-03 2020-12-03 Triple information extraction method, device, equipment and computer readable storage medium
PCT/CN2021/082660 WO2022116417A1 (en) 2020-12-03 2021-03-24 Triple information extraction method, apparatus, and device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011415288.0A CN112507125A (en) 2020-12-03 2020-12-03 Triple information extraction method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112507125A true CN112507125A (en) 2021-03-16

Family

ID=74970684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011415288.0A Pending CN112507125A (en) 2020-12-03 2020-12-03 Triple information extraction method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112507125A (en)
WO (1) WO2022116417A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113094469A (en) * 2021-04-02 2021-07-09 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113282762A (en) * 2021-05-27 2021-08-20 深圳数联天下智能科技有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
CN114398943A (en) * 2021-12-09 2022-04-26 北京百度网讯科技有限公司 Sample enhancement method and device thereof
CN114595686A (en) * 2022-03-11 2022-06-07 北京百度网讯科技有限公司 Knowledge extraction method, and training method and device of knowledge extraction model
WO2022116417A1 (en) * 2020-12-03 2022-06-09 平安科技(深圳)有限公司 Triple information extraction method, apparatus, and device, and computer-readable storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168599B (en) * 2022-06-20 2023-06-20 北京百度网讯科技有限公司 Multi-triplet extraction method, device, equipment, medium and product
CN115238688B (en) * 2022-08-15 2023-08-01 广州市刑事科学技术研究所 Method, device, equipment and storage medium for analyzing association relation of electronic information data
CN115309870B (en) * 2022-10-11 2022-12-20 启元世界(北京)信息技术服务有限公司 Knowledge acquisition method and device
CN115909386B (en) * 2023-01-06 2023-05-12 中国石油大学(华东) Method, equipment and storage medium for supplementing and correcting pipeline instrument flow chart
CN116701665A (en) * 2023-08-08 2023-09-05 滨州医学院 Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
CN117033667B (en) * 2023-10-07 2024-01-09 之江实验室 Knowledge graph construction method and device, storage medium and electronic equipment
CN117131208B (en) * 2023-10-24 2024-02-02 北京中企慧云科技有限公司 Industrial science and technology text data pushing method, device, equipment and medium
CN117151659B (en) * 2023-10-31 2024-03-22 浙江万维空间信息技术有限公司 Ecological restoration engineering full life cycle tracing method based on large language model
CN117150050B (en) * 2023-10-31 2024-01-26 卓世科技(海南)有限公司 Knowledge graph construction method and system based on large language model
CN117540035A (en) * 2024-01-09 2024-02-09 安徽思高智能科技有限公司 RPA knowledge graph construction method based on entity type information fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160055243A1 (en) * 2014-08-22 2016-02-25 Ut Battelle, Llc Web crawler for acquiring content
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN108733792B (en) * 2018-05-14 2020-12-01 北京大学深圳研究生院 Entity relation extraction method
CN109472033B (en) * 2018-11-19 2022-12-06 华南师范大学 Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN112507125A (en) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 Triple information extraction method, device, equipment and computer readable storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022116417A1 (en) * 2020-12-03 2022-06-09 平安科技(深圳)有限公司 Triple information extraction method, apparatus, and device, and computer-readable storage medium
CN113094469A (en) * 2021-04-02 2021-07-09 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113094469B (en) * 2021-04-02 2022-07-05 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium
CN113282762A (en) * 2021-05-27 2021-08-20 深圳数联天下智能科技有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
CN113282762B (en) * 2021-05-27 2023-06-02 深圳数联天下智能科技有限公司 Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium
CN114398943A (en) * 2021-12-09 2022-04-26 北京百度网讯科技有限公司 Sample enhancement method and device thereof
CN114595686A (en) * 2022-03-11 2022-06-07 北京百度网讯科技有限公司 Knowledge extraction method, and training method and device of knowledge extraction model
CN114595686B (en) * 2022-03-11 2023-02-03 北京百度网讯科技有限公司 Knowledge extraction method, and training method and device of knowledge extraction model

Also Published As

Publication number Publication date
WO2022116417A1 (en) 2022-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination