WO2022222300A1 - Open relationship extraction method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2022222300A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
relationship
original
open
data set
Prior art date
Application number
PCT/CN2021/109488
Other languages
French (fr)
Chinese (zh)
Inventor
朱昱锦
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022222300A1 publication Critical patent/WO2022222300A1/en


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/28 — Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 — Relational databases
    • G06F 16/285 — Clustering or classification
    • G06F 16/288 — Entity relationship models
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/35 — Clustering; Classification

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an open relationship extraction method, apparatus, electronic device, and computer-readable storage medium.
  • Relation extraction is an important supporting technology in the field of information extraction and knowledge graph construction. There are many practical scenarios, such as building large-scale general/vertical domain graphs, extracting information from application forms for pre-loan review, etc.
  • Traditional relation extraction technology faces two problems that make it difficult to put into practice: 1) it requires large amounts of labeled data to train the relation classification model, resulting in high data and labeling costs; 2) relation types usually must be defined by the business, are limited in number, and cannot be changed, and many requirements have no predefined relation set.
  • Open relation extraction technology takes a piece of text as input and automatically outputs all possible relation triples (head entity, relation, tail entity) and pairs (head entity, tail entity) found in it.
  • the "relationship" field in the triple is the descriptor that comes with the context.
  • Open relation extraction has always been intractable due to type uncertainty.
  • 1. The classic methods include ReVerb, OLLIE, OpenIE, etc., but most of these solutions target English, are difficult to migrate to Chinese text, and use strict matching rules that make processing inflexible; 2. Another scheme uses two network blocks to process the text: it first extracts head entities from the text, then jointly extracts tail entities and determines relation types from the head-entity output and the hidden layer, forming a matrix whose rows are relation classes and whose columns are text positions.
  • Because the number of relation types becomes the text length, the model must compute a tensor of size: number of batch samples × number of head entities × text length × text length.
  • This handles the multi-triple problem in the text and improves accuracy, but it occupies a large amount of computing resources and is extremely inefficient.
  • An open relation extraction method comprising:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • An apparatus for extracting open relationships comprising:
  • a training set construction module, used to obtain an original entity data set and an original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain an original training set;
  • an entity reinforcement module, used to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set;
  • a model building module, used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model;
  • an entity extraction module used for segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model
  • the relationship extraction module is used for predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result.
  • An electronic device, comprising:
  • a memory storing at least one instruction; and a processor that executes the instructions stored in the memory to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • the present application also provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • the present application can solve the problem of low extraction efficiency of open relationships.
  • FIG. 1 is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application
  • Fig. 2 is a detailed implementation flow diagram of one of the steps in Fig. 1;
  • Fig. 3 is the detailed implementation flow schematic diagram of another step in Fig. 1;
  • Fig. 4 is a detailed implementation flow diagram of another step in Fig. 1;
  • Fig. 5 is the detailed implementation flow schematic diagram of another step in Fig. 1;
  • FIG. 6 is a functional block diagram of an apparatus for extracting open relationships provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device implementing the method for extracting an open relationship according to an embodiment of the present application.
  • the embodiment of the present application provides an open relationship extraction method.
  • the execution subject of the open relationship extraction method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the open relationship extraction method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the open relationship extraction method includes:
  • S1: obtain the original entity data set and the original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain the original training set.
  • the obtaining of the original entity data set and the original relational data set includes:
  • the preset data crawling tool can be Hawk data crawling tool
  • the source website can be portal websites and professional websites in different fields, including: finance, law, medical care, education, entertainment, sports, etc.
  • the text data in the source website can be directly crawled.
  • Three sentences may be set as the minimum segmentation unit of the text data, with the length of each sentence not exceeding 256 characters.
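This segmentation rule can be sketched as follows (a minimal illustration only; the patent gives no implementation, so the sentence splitter and function names are assumptions, and the 2-sentence/1-sentence/skip fallback follows the fuller description later in the text):

```python
import re

MAX_CHARS = 256  # length cap from the description

def split_sentences(text):
    # Split on Chinese/Western sentence-ending punctuation, keeping the delimiter.
    parts = re.split(r'(?<=[。！？.!?])', text)
    return [p for p in parts if p.strip()]

def segment_text(text):
    """Group sentences into segments of up to 3 sentences; if a segment
    exceeds the length cap, fall back to 2 or 1 sentences, or skip."""
    sents = split_sentences(text)
    segments, i = [], 0
    while i < len(sents):
        for size in (3, 2, 1):
            chunk = ''.join(sents[i:i + size])
            if chunk and len(chunk) <= MAX_CHARS:
                segments.append(chunk)
                i += size
                break
        else:
            i += 1  # even a single sentence exceeds the cap: skip it
    return segments
```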
  • The open-source entity data sets may include data sets such as the Chinese General Encyclopedia Knowledge Graph (CN-DBpedia). CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Encyclopedia, Interactive Encyclopedia, Chinese Wikipedia, etc.), and after filtering, fusion, inference, and other operations, finally forms a high-quality structured data set.
  • The graph not only contains (head entity, relation, tail entity) triple information, but also contains entity description information (from Baidu Encyclopedia, etc.).
  • Performing deduplication processing on the triple information to obtain deduplicated triples includes:
  • if the target triple is not repeated, reselecting a target triple from the entity data set for calculation;
  • if the target triple is repeated, deleting the target triple to obtain the deduplicated triples.
  • The following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set, where:
  • d is the distance value;
  • w_j is the j-th target triple;
  • w_k is any unselected triple in the entity data set;
  • n is the number of triples in the entity data set.
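The distance formula itself is not reproduced in this text. Purely as an illustration of the deduplication loop described above, a simple set-overlap distance can stand in for d (this stand-in is an assumption, not the patent's formula):

```python
def triple_tokens(triple):
    # Character-level token set over (head entity, relation, tail entity).
    return set(''.join(triple))

def distance(w_j, w_k):
    # Illustrative stand-in for d: 1 - Jaccard similarity of token sets.
    a, b = triple_tokens(w_j), triple_tokens(w_k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def deduplicate(triples, threshold=0.1):
    """Keep a target triple only if its distance to every triple already
    kept exceeds the threshold; otherwise treat it as a repeat and delete it."""
    kept = []
    for t in triples:
        if all(distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept
```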
  • Performing remote supervision on the original entity data set and the original relation data set respectively, and performing entity linking on the supervised original entity data set and original relation data set to obtain the original training set, includes:
  • the original training set is obtained by summarizing the text segments and the triple information.
  • Remote supervision refers to a method that uses ready-made triples in an open-source knowledge graph to perform automatic labeling without manual participation, obtaining a large number of labeled data sets.
  • The triples in the original entity data set are matched against the text segments in the original relation data set, requiring that at least the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the current text segment is labeled (e.g., "text": text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" represents the current text segment and "entity_idx" represents the positions of the entities in the current text segment). The matched triples and text segments are then aggregated to obtain the matching data.
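The matching and position-labeling step can be sketched as follows (a simplified illustration; the dictionary layout mirrors the "text"/"entity_idx" example above, and the function name is an assumption):

```python
def remote_supervise(triples, segments):
    """For each text segment, keep triples whose head AND tail entities both
    occur in the segment, and record each entity's [start, end] position."""
    matched = []
    for seg in segments:
        for head, rel, tail in triples:
            h, t = seg.find(head), seg.find(tail)
            if h != -1 and t != -1:
                matched.append({
                    "text": seg,
                    "triple": (head, rel, tail),
                    "entity_idx": {
                        head: [h, h + len(head)],
                        tail: [t, t + len(tail)],
                    },
                })
    return matched
```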
  • A pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching: with BERT as the backbone, the text in the matching data (including the triple and the text segment where the triple occurs) is spliced with the triple's description in the original entity data set as input, and a matching probability is output. The preset threshold may be 0.5; when the matching probability is greater than 0.5, the entities in the matching data and the original entity data set are of the same type.
  • the strategy labeling and entity reinforcement processing are sequentially performed on the original training set to obtain a standard training set, including:
  • Strategy labeling may be performed based on the MTB (Matching the Blanks) method, wherein the preset labeling symbols may be <tag> and </tag>, and the part enclosed by <tag> and </tag> is the mention of the entity or relation in the sentence.
  • The classification sample may be [CLS] XXX <entity_head> XXX </entity_head> XXX <rel> XXX </rel> XXX <entity_tail> XXX </entity_tail> XXX [SEP], where entity_head, rel, and entity_tail represent the head entity, relation, and tail entity respectively.
  • [CLS] and [SEP] are separator tokens:
  • [CLS] is the classification bit, whose position outputs the binary classification result 0/1, indicating whether a relationship currently exists between the two entities;
  • [SEP] is the termination bit, indicating the end of the sentence.
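Constructing such a tagged classification sample can be sketched as follows (illustrative only; the tag names follow the example above, and the helper name is an assumption):

```python
def build_mtb_sample(text, head, rel, tail):
    """Wrap the head entity, relation mention, and tail entity in tag pairs,
    then add the [CLS] classification bit and [SEP] termination bit."""
    tagged = text
    for span, tag in ((head, "entity_head"), (rel, "rel"), (tail, "entity_tail")):
        tagged = tagged.replace(span, f"<{tag}>{span}</{tag}>", 1)
    return f"[CLS]{tagged}[SEP]"
```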
  • The BIO sequence labeling mode may be used to label the entities in the classification samples: the words of an entity mention are labeled B or I, and non-entity words are labeled O. Since this is open entity recognition, words are only divided into two categories: entity / non-entity.
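A minimal sketch of this two-class BIO labeling (the entity span indices are assumed to be given, e.g. by the remote-supervision step):

```python
def bio_labels(tokens, entity_spans):
    """Open-entity BIO labeling: the first token of an entity mention is B,
    the remaining tokens are I, and every non-entity token is O (no entity
    types, since open entity recognition only decides entity vs. non-entity)."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:  # end is exclusive
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels
```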
  • The preset natural language processing library may be the HanLP natural language processing library, whose dependency syntax parsing tool is used to analyze the prefix of the current entity in order to perform entity enhancement on it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the enhanced entity is "Apple CEO Cook".
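As an illustration of this prefix-based entity enhancement, the sketch below works on a pre-computed dependency parse (in practice HanLP's dependency parser would supply the arcs; the data layout and function name here are assumptions):

```python
def enhance_entity(words, dep_heads, entity_index):
    """Prepend the contiguous run of words directly before the entity that
    depend (directly or transitively) on it. `dep_heads[i]` is the index of
    the word that word i modifies, with -1 for the root (a toy stand-in for
    HanLP's dependency-parse output)."""
    def depends_on(i, target):
        seen = set()
        while i != target and i not in seen and i >= 0:
            seen.add(i)
            i = dep_heads[i]
        return i == target

    j = entity_index - 1
    while j >= 0 and depends_on(j, entity_index):
        j -= 1
    return ''.join(words[j + 1:entity_index + 1])
```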
  • the implementation of the present application can improve the accuracy of model training by performing policy labeling and entity enhancement processing on the original training set.
  • The pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm in the open-source transformer project.
  • The model is written using the PyTorch framework and has been pre-trained on a large-scale open-source Chinese corpus. The training process uses a cloze objective to determine the error: a few words in the input Chinese corpus text are intentionally masked, the model's output is checked on whether it predicts the masked words from the unmasked context, and the difference between the model's predicted value and the true value is computed until the difference falls below a pre-specified threshold.
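The cloze objective's input construction can be sketched as follows (illustration only; real BERT pre-training masks subword tokens and applies additional replacement rules not shown here):

```python
import random

def make_cloze_example(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Cloze-style training pair: randomly cover a few tokens with [MASK];
    the model must recover the originals from the unmasked context, and the
    difference between prediction and ground truth drives the loss."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        targets[i] = masked[i]   # ground truth for the loss
        masked[i] = mask_token
    return masked, targets
```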
  • Obtaining the open entity extraction model and the open relation extraction model includes:
  • the relation span may be represented by a one-hot vector, and the [CLS] bit is used in the binary classification linear layer to determine the prediction result between the predicted entities: 0 or 1, where 0 indicates the relationship does not exist and 1 indicates the relationship exists.
  • the relationship prediction is simplified into a limited two-class problem, which greatly simplifies the training process of the model.
  • the text to be classified is segmented to obtain segmented text
  • the open entity extraction model is used to extract entities in the segmented text, including:
  • entities in the text to be classified can be rapidly extracted through the open entity extraction model, which improves the rate of entity relationship prediction.
  • predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result including:
  • the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
  • the predicted triples are clustered using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relation extraction result.
  • a triple: (head entity, relation, tail entity)
  • a pair: (head entity, None, tail entity)
  • the preset clustering method may be a K-means clustering method.
  • The K-means clustering method vectorizes the relations in the predicted triples using the word2vec algorithm and calculates the distances between the vectors.
  • According to these distances, the predicted triples are gathered around K central points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples.
  • When each cluster is stable (no longer changing), the mean of all relation vectors in the cluster is computed. Each new relation is then compared with the mean of every existing cluster: if its similarity to one or more clusters (which may be measured by Euclidean distance) is higher than a predefined similarity threshold, it is classified into the most similar cluster; if its similarity to all clusters is lower than the predefined similarity threshold, it is placed separately into an "unknown" class. When the relations in the "unknown" class accumulate to a certain amount (usually 70% of the known-class relations), the K-means clustering and manual type definition are repeated for the unknown relations.
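The assignment of a new relation to an existing cluster or to the "unknown" class can be sketched as follows (illustration only; in the scheme described the vectors would come from word2vec, and the distance cap stands in for the similarity threshold):

```python
def assign_relation(vec, centroids, max_dist):
    """Assign a new relation vector to its nearest cluster centroid; if it
    is farther than max_dist from every centroid (i.e. its similarity is
    below the predefined threshold), place it in the 'unknown' class for
    later re-clustering. Vectors here are plain lists of floats."""
    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    best = min(centroids, key=lambda name: euclidean(vec, centroids[name]))
    return best if euclidean(vec, centroids[best]) <= max_dist else "unknown"
```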
  • the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
  • FIG. 6 it is a functional block diagram of an apparatus for extracting an open relationship provided by an embodiment of the present application.
  • the open relationship extraction apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the open relationship extraction apparatus 100 may include a training set construction module 101 , an entity enhancement module 102 , a model construction module 103 , an entity extraction module 104 and a relationship extraction module 105 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • The training set construction module 101 is used to obtain the original entity data set and the original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain the original training set.
  • the training set construction module 101 obtains the original entity data set and the original relationship data set through the following operations:
  • The entity data set includes triple information and description information corresponding to each triple; the triple information is deduplicated to obtain deduplicated triples, and the deduplicated triples and the corresponding description information are summarized to obtain the original entity data set.
  • the preset data crawling tool can be Hawk data crawling tool
  • the source website can be portal websites and professional websites in different fields, including: finance, law, medical care, education, entertainment, sports, etc.
  • the text data in the source website can be directly crawled.
  • Three sentences may be set as the minimum segmentation unit of the text data, with the length of each sentence not exceeding 256 characters; when this is exceeded, the segment is cut down to 2 sentences or even 1 sentence, or skipped directly.
  • The open-source entity data sets may include data sets such as the Chinese General Encyclopedia Knowledge Graph (CN-DBpedia). CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Encyclopedia, Interactive Encyclopedia, Chinese Wikipedia, etc.), and after filtering, fusion, inference, and other operations, finally forms a high-quality structured data set.
  • The graph not only contains (head entity, relation, tail entity) triple information, but also contains entity description information (from Baidu Encyclopedia, etc.).
  • The training set building module 101 obtains deduplicated triples through the following operations:
  • if the target triple is not repeated, reselecting a target triple from the entity data set for calculation;
  • if the target triple is repeated, deleting the target triple to obtain the deduplicated triples.
  • The following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set, where:
  • d is the distance value;
  • w_j is the j-th target triple;
  • w_k is any unselected triple in the entity data set;
  • n is the number of triples in the entity data set.
  • training set construction module 101 obtains the original training set through the following operations:
  • the original training set is obtained by summarizing the text segmentation and the triplet information.
  • the remote supervision refers to a method of using ready-made triples in an open source knowledge graph to perform automatic labeling without manual participation, and obtain a large number of labeled data sets.
  • The triples in the original entity data set are matched against the text segments in the original relation data set, requiring that at least the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the current text segment is labeled (e.g., "text": text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" represents the current text segment and "entity_idx" represents the positions of the entities in the current text segment). The matched triples and text segments are then aggregated to obtain the matching data.
  • A pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching: with BERT as the backbone, the text in the matching data (including the triple and the text segment where the triple occurs) is spliced with the triple's description in the original entity data set as input, and a matching probability is output. The preset threshold may be 0.5; when the matching probability is greater than 0.5, the entities in the matching data and the original entity data set are of the same type.
  • the entity reinforcement module 102 is configured to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set.
  • the entity reinforcement module 102 obtains the standard training set through the following operations:
  • Strategy labeling may be performed based on the MTB (Matching the Blanks) method, wherein the preset labeling symbols may be <tag> and </tag>, and the part enclosed by <tag> and </tag> is the mention of the entity or relation in the sentence.
  • The classification sample may be [CLS] XXX <entity_head> XXX </entity_head> XXX <rel> XXX </rel> XXX <entity_tail> XXX </entity_tail> XXX [SEP], where entity_head, rel, and entity_tail represent the head entity, relation, and tail entity respectively.
  • [CLS] and [SEP] are separator tokens:
  • [CLS] is the classification bit, whose position outputs the binary classification result 0/1, indicating whether a relationship currently exists between the two entities;
  • [SEP] is the termination bit, indicating the end of the sentence.
  • The BIO sequence labeling mode may be used to label the entities in the classification samples: the words of an entity mention are labeled B or I, and non-entity words are labeled O. Since this is open entity recognition, words are only divided into two categories: entity / non-entity.
  • The preset natural language processing library may be the HanLP natural language processing library, whose dependency syntax parsing tool is used to analyze the prefix of the current entity in order to perform entity enhancement on it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the enhanced entity is "Apple CEO Cook".
  • the implementation of the present application can improve the accuracy of model training by performing policy labeling and entity enhancement processing on the original training set.
  • The model building module 103 is used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model.
  • The pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm in the open-source transformer project.
  • The model is written using the PyTorch framework and has been pre-trained on a large-scale open-source Chinese corpus. The training process uses a cloze objective to determine the error: a few words in the input Chinese corpus text are intentionally masked, the model's output is checked on whether it predicts the masked words from the unmasked context, and the difference between the model's predicted value and the true value is computed until the difference falls below a pre-specified threshold.
  • model building module 103 obtains the open entity extraction model and the open relationship extraction model through the following operations:
  • a preset binary classification linear layer is used to output a prediction result between the predicted entities, wherein the prediction result indicates whether a relationship exists;
  • the relation span may be represented by a one-hot vector, and the [CLS] bit is used in the binary classification linear layer to determine the prediction result between the predicted entities: 0 or 1, where 0 indicates the relationship does not exist and 1 indicates the relationship exists.
  • the relationship prediction is simplified into a limited two-class problem, which greatly simplifies the training process of the model.
  • the entity extraction module 104 is configured to segment the text to be classified, obtain segmented text, and extract entities in the segmented text by using the open entity extraction model.
  • the entity extraction module 104 extracts entities in the segmented text through the following operations:
  • All entities in the text to be classified are extracted by using the open entity extraction model to obtain entities to be classified.
  • entities in the text to be classified can be rapidly extracted through the open entity extraction model, which improves the rate of entity relationship prediction.
  • the relationship extraction module 105 is configured to use the open relationship extraction model to predict the entity relationship of the entity, and to cluster the entity and the entity relationship to obtain a relationship extraction result.
  • the relationship extraction module 105 obtains the relationship extraction result through the following operations:
  • the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
  • the predicted triples are clustered using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relation extraction result.
  • a triple: (head entity, relation, tail entity)
  • a pair: (head entity, None, tail entity)
  • the preset clustering method may be a K-means clustering method.
  • The K-means clustering method vectorizes the relations in the predicted triples using the word2vec algorithm and calculates the distances between the vectors.
  • According to these distances, the predicted triples are gathered around K central points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples.
  • When each cluster is stable (no longer changing), the mean of all relation vectors in the cluster is computed. Each new relation is then compared with the mean of every existing cluster: if its similarity to one or more clusters (which may be measured by Euclidean distance) is higher than a predefined similarity threshold, it is classified into the most similar cluster; if its similarity to all clusters is lower than the predefined similarity threshold, it is placed separately into an "unknown" class. When the relations in the "unknown" class accumulate to a certain amount (usually 70% of the known-class relations), the K-means clustering and manual type definition are repeated for the unknown relations.
  • the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
  • FIG. 7 it is a schematic structural diagram of an electronic device for implementing an open relationship extraction method provided by an embodiment of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an open relationship extraction program 12.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), etc. equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the open relationship extraction program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control unit (Control Unit) of the electronic device; it connects the various components of the entire device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the open relationship extraction program) and calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 7 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 7 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the open relationship extraction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run in the processor 10, can realize:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), etc.
  • the present application also provides a computer-readable storage medium, which may be volatile or non-volatile; the readable storage medium stores a computer program which, when executed by the processor of an electronic device, can achieve:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
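The two realization steps above (entity extraction on segmented text, then relation prediction and clustering) can be sketched as a pipeline skeleton. The lambdas standing in for the trained open entity extraction model, open relationship extraction model, and clustering step are illustrative assumptions, not the patent's concrete implementations.

```python
def extract_open_relations(text, split, extract_entities, predict_relation, cluster):
    """Segment the text, extract entities per segment, predict a relation
    for each entity pair, then cluster the resulting triples."""
    triples = []
    for segment in split(text):
        entities = extract_entities(segment)
        for i, head in enumerate(entities):
            for tail in entities[i + 1:]:
                rel = predict_relation(segment, head, tail)
                if rel is not None:  # no relation predicted for this pair
                    triples.append((head, rel, tail))
    return cluster(triples)

# toy stand-ins for the trained models
result = extract_open_relations(
    "张三出生于北京。",
    split=lambda t: [t],
    extract_entities=lambda s: ["张三", "北京"],
    predict_relation=lambda s, h, t: "出生于",
    cluster=lambda triples: {"出生": triples},
)
print(result)  # {'出生': [('张三', '出生于', '北京')]}
```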
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

The present application relates to artificial intelligence technology, and discloses an open relationship extraction method, comprising: obtaining an original training set by using remote supervision and entity linking techniques; performing policy annotation and entity reinforcement processing on the original training set to obtain a standard training set; using the standard training set to perform entity fine adjustment and relationship fine adjustment on a pre-trained language model to obtain an open entity extraction model and an open relationship extraction model; extracting, by using the open entity extraction model, entities in a text to be classified; and predicting an entity relationship between the entities by using the open relationship extraction model, and clustering the entities and the entity relationship to obtain a relationship extraction result. In addition, the present application further relates to blockchain technology, and the relationship extraction result may be stored in a node of a blockchain. The present application further provides an open relationship extraction apparatus, an electronic device, and a computer-readable storage medium. The present application can solve the problem of relatively low extraction efficiency of an open relationship.

Description

Open relationship extraction method, apparatus, electronic device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on April 21, 2021, with application number CN202110428927.5 and titled "Open Relationship Extraction Method, Device, Electronic Device and Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular, to an open relationship extraction method, apparatus, electronic device, and computer-readable storage medium.
Background

Relation extraction is an important supporting technology in the fields of information extraction and knowledge graph construction, with many practical scenarios, such as building large-scale general or vertical-domain graphs and extracting information from application forms for pre-loan review. However, traditional relation extraction is difficult to put into practice because of two problems: 1) it requires a large amount of labeled data to train the relation classification model, resulting in high data and labeling costs; 2) the relation types usually have to be defined by the business, are limited, and cannot be changed, while many real-world requirements have no predefined relation set.
Based on this, open relation extraction has received attention in recent years. It takes a piece of text as input and automatically outputs all possible relation triples (head entity, relation, tail entity) and pairs (head entity, tail entity), where the "relation" field of a triple is a descriptor taken from the context itself. Because the relation types are not fixed, open relation extraction has always been difficult to handle. The inventor realized that the traditional solutions mainly include: 1. Rule-based matching combined with bootstrapping; classic methods include ReVerb, OLLIE, OpenIE, etc., but most of these solutions target English, are difficult to migrate to Chinese text, and use strict, inflexible matching rules. 2. Parsing the surface form with a sequence labeling model, treating the relation as a class of entities and extracting triples directly from the text with a semantic role labeling algorithm, such as SurfaceForm-SRL; however, this method fails when no relation mention is found and cannot process sentences containing multiple triples, resulting in low extraction accuracy. 3. A half-pointer, half-labeling scheme that uses two layers of network blocks to process text: the head entity is extracted first, and then the tail entity is extracted and the relation type is determined jointly from the head entity and the hidden-layer output, forming a sample matrix whose rows are the relation classes and whose columns are the text length. When applied to open relation extraction, however, the number of relation types becomes the text length, so the model must compute a tensor of size batch size × number of head entities × text length × text length; this solves the multi-triple problem and improves accuracy, but consumes a large amount of computing resources and is extremely inefficient.
SUMMARY OF THE INVENTION
An open relationship extraction method, comprising:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
An open relationship extraction apparatus, comprising:

a training set construction module, configured to obtain an original entity data set and an original relationship data set, perform remote supervision on them respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

an entity reinforcement module, configured to perform strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

a model construction module, configured to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

an entity extraction module, configured to segment the text to be classified to obtain segmented text, and extract entities in the segmented text by using the open entity extraction model;

a relationship extraction module, configured to predict the entity relationship of the entities by using the open relationship extraction model, and cluster the entities and the entity relationships to obtain a relationship extraction result.
An electronic device, comprising:

a memory storing at least one instruction; and

a processor executing the instructions stored in the memory to implement the following steps:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
To solve the above problems, the present application also provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following steps:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
The present application can solve the problem of relatively low efficiency of open relationship extraction.
Description of Drawings

FIG. 1 is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of the detailed implementation of one of the steps in FIG. 1;

FIG. 3 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 4 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 5 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 6 is a functional block diagram of an open relationship extraction apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device implementing the open relationship extraction method provided by an embodiment of the present application.
The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description

It should be understood that the specific embodiments described herein are only used to explain the present application, not to limit it.
The embodiments of the present application provide an open relationship extraction method. The execution subject of the method includes, but is not limited to, at least one of the electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server or a terminal. In other words, the method can be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to FIG. 1, it is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application. In this embodiment, the open relationship extraction method includes:

S1. Obtain an original entity data set and an original relationship data set, perform remote supervision on each of them, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set.
Specifically, referring to FIG. 2, obtaining the original entity data set and the original relationship data set includes:

S10. Using a preset data crawling tool to crawl text data from source websites, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;

S11. Obtaining an open-source entity data set, where the entity data set includes triple information and description information corresponding to each triple, performing deduplication on the triple information to obtain deduplicated triples, and aggregating the deduplicated triples and the corresponding description information to obtain the original entity data set.
The preset data crawling tool may be the Hawk data crawling tool, and the source websites may be portal and professional websites in different fields, including finance, law, medical care, education, entertainment, sports, etc. The Hawk data crawling tool can directly crawl the text data from these source websites. In this embodiment, 3 sentences may be set as the minimum segmentation unit of the text data, with each segment no longer than 256 characters; when a segment exceeds this length, it is trimmed to 2 sentences or even 1 sentence, or skipped directly. The open-source entity data set may include data sets such as the Chinese general encyclopedia knowledge graph (CN-DBpedia). CN-DBpedia extracts entity information mainly from the plain-text pages of Chinese encyclopedia websites (such as Baidu Baike, Hudong Baike, Chinese Wikipedia, etc.); after filtering, fusion, inference and other operations, a high-quality structured data set is finally formed. The graph contains not only (head entity, relation, tail entity) triple information but also entity description information (from Baidu Baike and similar sources).
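The segmentation rule above (a unit of at most 3 sentences, each segment capped at 256 characters, trimming to 2 or 1 sentences, or skipping, when over the cap) can be sketched as follows. The sentence-splitting regex and the function names are illustrative assumptions, not part of the disclosure.

```python
import re

def split_chunks(text, max_sents=3, max_len=256):
    """Split raw text into chunks of at most `max_sents` sentences.
    A chunk longer than `max_len` characters is shrunk to 2, then 1
    sentence; a single sentence still over the limit is skipped."""
    # naive sentence splitter on Chinese/Western terminators (assumption)
    sents = [s for s in re.split(r'(?<=[。！？.!?])', text) if s.strip()]
    chunks = []
    i = 0
    while i < len(sents):
        for take in (max_sents, 2, 1):
            cand = ''.join(sents[i:i + take])
            if len(cand) <= max_len:
                chunks.append(cand)
                i += take
                break
        else:
            i += 1  # single sentence too long: skip it directly
    return chunks

print(split_chunks("一句。二句。三句。四句。"))  # ['一句。二句。三句。', '四句。']
```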
Specifically, performing deduplication on the triple information to obtain deduplicated triples includes:

selecting target triples from the entity data set in turn;

calculating the distance value between the target triple and all unselected triples in the entity data set;

when the distance value is greater than a preset distance threshold, determining that the target triple is not a duplicate, and reselecting a target triple from the entity data set for calculation;

when the distance value is less than or equal to the preset distance threshold, determining that the target triple is a duplicate, and deleting it to obtain the deduplicated triples.
In this embodiment, the following distance algorithm is used to calculate the distance value between the target triple and all unselected triples in the entity data set:
[Distance formula: see image PCTCN2021109488-appb-000001 in the original document]
where d is the distance value, w_j is the j-th target triple, w_k is any unselected triple in the entity data set, and n is the number of triples in the entity data set.
By deduplicating the triple information in the entity data set, this embodiment avoids repeatedly processing identical triples and reduces the amount of data to be processed, which helps improve the efficiency of open relationship extraction.
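The deduplication loop above can be sketched as follows. Since the patent's exact distance formula survives only as an image, a toy Jaccard character-set distance stands in for it; the threshold value and all function names are illustrative assumptions.

```python
def dedup_triples(triples, distance, threshold):
    """Keep a triple only if its distance to every already-kept triple
    exceeds `threshold` (i.e. it is not a near-duplicate)."""
    kept = []
    for t in triples:
        if all(distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept

def jaccard_distance(t1, t2):
    """Toy stand-in distance: 1 - Jaccard similarity of character sets."""
    a, b = set(''.join(t1)), set(''.join(t2))
    return 1 - len(a & b) / len(a | b)

triples = [("张三", "出生于", "北京"),
           ("张三", "出生于", "北京"),   # exact duplicate, distance 0
           ("李四", "毕业于", "清华")]
print(dedup_triples(triples, jaccard_distance, 0.1))
```

With this toy distance the exact duplicate is dropped and the two distinct triples survive.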
Further, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between them to obtain the original training set, includes:

matching the triple information in the original entity data set against the text segments in the original relationship data set, and labeling positions according to the matching results to obtain matching data;

using a pre-built disambiguation model to calculate the matching probability between the matching results in the matching data and the description information corresponding to the triples in the original entity data set;

when the matching probability is greater than a preset threshold, aggregating the text segments and the triple information to obtain the original training set.
Remote supervision (distant supervision) refers to using the ready-made triples in an open-source knowledge graph to perform automatic labeling without manual participation, yielding a large labeled data set. In this embodiment, the triples in the original entity data set are matched against the text segments in the original relationship data set, requiring at least that both the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the segment is labeled (for example, "text": the text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" is the current text segment and "entity_idx" gives each entity's position in it); the matched triples and text segments are aggregated to obtain the matching data. Meanwhile, a pre-built disambiguation model can be used for entity linking; this model may be a BERT model trained on open-source short-text matching. With the BERT model as the backbone, entity linking concatenates the text in the matching data (the triple and the text segment where it occurs) with the triple's description in the original entity data set as input, and outputs a matching probability. The preset threshold may be 0.5: a matching probability greater than 0.5 indicates that the entity in the matching data and the entity in the original entity data set are the same entity. Through remote supervision and entity linking, the relationships between entities, entity descriptions and other information can be quickly determined; moreover, since the original training set contains both entity information and relationship information, it can be used directly to train the relation extraction model and the entity extraction model.
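The matching-and-position-labeling step of remote supervision can be sketched as follows; the output mirrors the {"text": ..., "entity_idx": ...} shape from the example above, and the helper name is an illustrative assumption.

```python
def distant_supervision(triples, sentences):
    """For each sentence, keep the triples whose head AND tail entity both
    appear in it, and record each entity's [start, end) character span."""
    samples = []
    for sent in sentences:
        for head, rel, tail in triples:
            h, t = sent.find(head), sent.find(tail)
            if h != -1 and t != -1:  # both entities in context
                samples.append({
                    "text": sent,
                    "triple": (head, rel, tail),
                    "entity_idx": {head: [h, h + len(head)],
                                   tail: [t, t + len(tail)]},
                })
    return samples

samples = distant_supervision([("张三", "出生于", "北京")],
                              ["张三1990年出生于北京。", "无关句子。"])
print(samples[0]["entity_idx"])
```

The disambiguation step would then score each sample's triple against its encyclopedia description with the BERT-based matcher and keep samples whose probability exceeds 0.5.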
In the embodiment of the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained without manual annotation.
S2. Perform strategy labeling and entity reinforcement processing on the original training set in turn, to obtain a standard training set.
Specifically, as shown in FIG. 3, performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain the standard training set includes:
S20. Classify the text segments in the original training set using preset labeling symbols to obtain classified samples, and label the triples in the classified samples to obtain labeled entities;
S21. Perform entity reinforcement processing on the labeled entities using a preset natural language processing library, and aggregate the reinforced classified samples to obtain the standard training set.
In the embodiment of the present application, the strategy labeling may be performed based on the MTB (Matching The Blank) method, where the preset labeling symbols may be <tag> and </tag>; the part enclosed by <tag> and </tag> is the mention of an entity or relationship in the sentence. For example, a classified sample may be [CLS]XXX<entity_head>XXX</entity_head>XXX<rel>XXX</rel>XXX<entity_tail>XXX</entity_tail>XXX[SEP], where entity_head, rel and entity_tail denote the head entity, the relationship and the tail entity respectively. [CLS] and [SEP] are separators: [CLS] is the classification bit, whose position outputs the binary classification result 0/1 indicating whether a relationship exists between the two current entities, and [SEP] is the termination bit, indicating the end of the sentence.
In the embodiment of the present application, the BIO sequence labeling scheme may be used to label the entities in the classified samples: characters within an entity mention are labeled B or I, and non-entity characters are labeled O. Since this is open entity recognition, there are only two classes, entity and non-entity. The preset natural language processing library may be the HanLP natural language processing library; its dependency syntax parsing tool is used to analyze the prefix of the current entity so as to reinforce it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the reinforced entity is "Apple CEO Cook".
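The MTB-style markup and the character-level BIO labels described above can be illustrated with a minimal sketch; the helper functions themselves are assumptions for illustration, not part of the disclosure.

```python
# Sketch of the MTB-style sample markup and BIO sequence labels.
def mtb_markup(text, head, rel, tail):
    # Wrap each mention with its tag pair, then add the [CLS]/[SEP] bits.
    marked = (text.replace(head, f"<entity_head>{head}</entity_head>")
                  .replace(rel, f"<rel>{rel}</rel>")
                  .replace(tail, f"<entity_tail>{tail}</entity_tail>"))
    return f"[CLS]{marked}[SEP]"

def bio_labels(text, entities):
    # Open entity recognition: every character is B, I, or O.
    labels = ["O"] * len(text)
    for ent in entities:
        i = text.find(ent)
        if i != -1:
            labels[i] = "B"
            for j in range(i + 1, i + len(ent)):
                labels[j] = "I"
    return labels

print(mtb_markup("库克任职于苹果公司", "库克", "任职于", "苹果公司"))
print(bio_labels("库克任职于苹果公司", ["库克", "苹果公司"]))
```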
By performing strategy labeling and entity reinforcement processing on the original training set, the embodiment of the present application can improve the accuracy of model training.
S3. Obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model.
In the embodiment of the present application, the pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm from the open-source transformer project. The model is written with the pytorch framework and has been trained in advance on a large-scale open-source Chinese corpus. The training process determines the error in a cloze manner: several characters in the input Chinese corpus text are deliberately masked, the output is checked to see whether the model predicts the masked characters from the unmasked context, and the difference between the model's predicted value and the true value is computed, until that difference falls below a pre-specified threshold.
Specifically, as shown in FIG. 4, performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, includes:
S30. Randomly add blank positions to the classified samples to obtain training samples, and predict the entities in the training samples using the language model to obtain predicted entities;
S31. Calculate the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determine the language model to be the open entity extraction model;
S32. Calculate the relationship span between the predicted entities using a preset relationship span prediction layer;
S33. Based on the relationship span, output the prediction result between the predicted entities using a preset binary classification linear layer, where the prediction result includes whether a relationship exists;
S34. When the ratio of prediction results in which a relationship exists to all prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the binary classification linear layer to obtain the open relationship extraction model.
The relationship span may be represented by a one-hot vector. In the binary classification linear layer, the prediction result between the predicted entities is determined at the [CLS] position as 0 or 1, where 0 indicates that the relationship does not exist and 1 indicates that it exists. Moreover, through the relationship span prediction layer and the binary classification linear layer, relationship prediction is reduced to a finite binary classification problem, which greatly simplifies the training process of the model.
S4. Segment the text to be classified to obtain segmented text, and extract the entities in the segmented text using the open entity extraction model.
Specifically, as shown in FIG. 5, segmenting the text to be classified to obtain the segmented text, and extracting the entities in the segmented text using the open entity extraction model, includes:
S40. Split the text to be classified into sentences according to the punctuation marks in the text, to obtain segmented sentences to be classified;
S41. Extract all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
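The punctuation-based splitting in step S40 can be sketched as follows. The disclosure only says "according to punctuation", so the exact delimiter set used here is an assumption.

```python
import re

# Sentence segmentation sketch: split AFTER each sentence-final punctuation
# mark (Chinese and Latin terminators), keeping the mark with its sentence.
def split_sentences(text):
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p for p in parts if p.strip()]

print(split_sentences("今天天气不错。我们去跑步吧！好的？"))
```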
Through the open entity extraction model, the embodiment of the present application can quickly extract the entities in the text to be classified, which improves the speed of entity relationship prediction.
S5. Predict the entity relationships of the entities using the open relationship extraction model, and cluster the entities and the entity relationships to obtain a relationship extraction result.
Specifically, predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain the relationship extraction result, includes:
Based on the entities to be classified, extracting the relationships in the segmented sentences to be classified using the open relationship extraction model, and filtering out the entities to be classified that have no relationship, to obtain predicted triples;
Clustering the predicted triples using a preset clustering method to obtain a plurality of clusters, where the clusters include the relationship extraction result.
In the embodiment of the present application, the open relationship extraction model can output triples (head entity, relationship, tail entity) and pairs (head entity, None, tail entity), where a pair represents an entity pair with no relationship; by filtering out the pairs, the accuracy of relationship prediction is improved. The preset clustering method may be the K-means clustering method, which vectorizes the relationships in the predicted triples through the word2vec algorithm, calculates the distances between the vectors, and, according to those distances, gathers the predicted triples around K center points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples. Moreover, once every cluster is stable (no longer changing), each cluster computes the mean of all relationship vectors within it, and each new relationship is compared against the mean of every existing cluster. If its similarity (which may be based on Euclidean distance) to several clusters is above a predefined similarity threshold, it is assigned to the most similar cluster; if its similarity to all clusters is below the predefined similarity threshold, it is placed in an independent "unknown" class. When the relationships in the "unknown" class accumulate to a certain amount (generally 70% of the known-class relationships), the K-means clustering and manual type definition are repeated for the unknown relationships.
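The assignment rule above — a new relationship vector joins the most similar stable cluster, otherwise it falls into the "unknown" class — can be sketched as follows. Euclidean distance with an assumed cut-off stands in for the predefined similarity threshold.

```python
import math

# Cluster assignment sketch: compare a new relation vector with each
# existing cluster mean; if it is far from all of them, mark it "未知"
# (unknown). The max_dist value is an illustrative assumption.
def assign(vec, cluster_means, max_dist=1.0):
    best_name, best_d = "未知", float("inf")
    for name, mean in cluster_means.items():
        d = math.dist(vec, mean)  # Euclidean distance
        if d < best_d:
            best_name, best_d = name, d
    return best_name if best_d <= max_dist else "未知"

means = {"任职": [1.0, 0.0], "出生地": [0.0, 1.0]}
print(assign([0.9, 0.1], means))   # near the "任职" cluster mean
print(assign([5.0, 5.0], means))   # far from every cluster mean
```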
In the embodiment of the present application, by clustering the entities and the entity relationships, the extracted open relationships can be classified automatically, which improves the efficiency of open relationship extraction.
In the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained; depending on the original training set used, the method is applicable not only to English but also to Chinese open relationship extraction. In addition, strategy labeling and entity reinforcement processing are performed on the original training set, which improves the accuracy of open relationship extraction. Moreover, an open entity extraction model and an open relationship extraction model are obtained directly by simply performing entity fine-tuning and relationship fine-tuning on the language model with the standard training set, without occupying a large amount of computing resources, which simplifies the model training process and improves the efficiency of open relationship extraction. Therefore, the embodiments of the present application can solve the problem of low efficiency in open relationship extraction.
As shown in FIG. 6, it is a functional block diagram of an open relationship extraction apparatus provided by an embodiment of the present application.
The open relationship extraction apparatus 100 described in the present application may be installed in an electronic device. According to the implemented functions, the open relationship extraction apparatus 100 may include a training set construction module 101, an entity reinforcement module 102, a model construction module 103, an entity extraction module 104 and a relationship extraction module 105. A module described in the present application may also be referred to as a unit, and refers to a series of computer program segments that can be executed by the processor of the electronic device, can perform a fixed function, and are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
The training set construction module 101 is configured to obtain an original entity data set and an original relationship data set, perform remote supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set, to obtain an original training set.
Specifically, the training set construction module 101 obtains the original entity data set and the original relationship data set through the following operations:
Crawling text data from source websites using a preset data crawling tool, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;
Obtaining an open-source entity data set, where the entity data set includes triple information and description information corresponding to each piece of triple information; performing deduplication processing on the triple information to obtain deduplicated triples; and aggregating the deduplicated triples and the description information corresponding to the triple information to obtain the original entity data set.
The preset data crawling tool may be the Hawk data crawling tool, and the source websites may be portal websites and professional websites in different fields, including finance, law, medical care, education, entertainment, sports and so on. The text data on a source website can be crawled directly with the Hawk data crawling tool. In the embodiment of the present application, three sentences may be set as the minimum segmentation unit of the text data, with each segment no longer than 256 characters; when a segment exceeds that length, it is cut down to two sentences or even one sentence, or skipped directly. The open-source entity data set may include data sets such as the Chinese general encyclopedia knowledge graph CN-DBpedia. CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Baike, Hudong Baike and Chinese Wikipedia); after filtering, fusion, inference and other operations, a high-quality structured data set is finally formed. This graph contains not only (head entity, relationship, tail entity) triple information but also entity description information (from Baidu Baike and similar sources).
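The segmentation rule just described — group up to three sentences per segment, capped at 256 characters, shrinking to two or one sentence and otherwise skipping — can be sketched as follows; the sentence delimiters are an assumption.

```python
import re

# Chunking sketch for crawled text: try 3 sentences, then 2, then 1;
# skip a sentence that alone exceeds the length cap.
def chunk_text(text, max_len=256, group=3):
    sents = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    chunks, i = [], 0
    while i < len(sents):
        for size in range(group, 0, -1):
            piece = "".join(sents[i:i + size])
            if len(piece) <= max_len:
                chunks.append(piece)
                i += size
                break
        else:
            i += 1  # single sentence still too long: skip it
    return chunks

print(chunk_text("第一句。第二句。第三句。第四句。"))
```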
In detail, the training set construction module 101 obtains the deduplicated triples through the following operations:
Selecting target triples from the entity data set in turn;
Calculating the distance value between the target triple and all unselected triple information in the entity data set;
When the distance value is greater than a preset distance threshold, determining that the target triple is not a duplicate, and re-selecting a target triple from the entity data set for calculation;
When the distance value is less than or equal to the preset distance threshold, determining that the target triple is a duplicate, and deleting the target triple, to obtain the deduplicated triples.
In the embodiment of the present application, the following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set:
(Formula image: PCTCN2021109488-appb-000002)
where d is the distance value, w_j is the j-th target triple, w_k is any unselected piece of triple information in the entity data set, and n is the number of pieces of triple information in the entity data set.
By performing deduplication processing on the triple information in the entity data set, the embodiment of the present application can avoid subsequent processing of identical triple information and reduce the amount of data to be processed, which helps improve the efficiency of open relationship extraction.
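The deduplication loop above can be sketched as follows. The patent's actual distance formula appears only as an image in this text, so a simple slot-overlap count stands in for it here — an assumption purely for illustration.

```python
# Deduplication sketch: a candidate triple is kept only when its distance
# to every already-kept triple exceeds the threshold.
def triple_distance(a, b):
    # Stand-in metric: number of differing slots (head, relation, tail).
    return sum(x != y for x, y in zip(a, b))

def dedup(triples, threshold=0):
    kept = []
    for t in triples:
        if all(triple_distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept

data = [("库克", "任职于", "苹果公司"),
        ("库克", "任职于", "苹果公司"),   # exact duplicate, removed
        ("乔布斯", "创立", "苹果公司")]
print(dedup(data))
```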
Further, the training set construction module 101 obtains the original training set through the following operations:
Matching the triple information in the original entity data set against the text segments in the original relationship data set, and performing position labeling according to the matching result, to obtain matching data;
Calculating, using a pre-built disambiguation model, the matching probability between the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
When the matching probability is greater than a preset threshold, aggregating the text segments and the triple information to obtain the original training set.
Remote supervision refers to a method of automatically labeling data, without manual participation, by using ready-made triples from an open-source knowledge graph, thereby obtaining a large labeled data set. In the embodiment of the present application, the triples in the original entity data set are matched against the text segments in the original relationship data set: a triple matches only when at least its head entity and tail entity both appear in the context of the current text segment, and the position of each entity in the segment is labeled (for example, "text": the text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" denotes the current text segment and "entity_idx" denotes the positions of the entities within it). The matched triples and text segments are aggregated to obtain the matching data.
At the same time, a pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching. With this BERT model as the backbone, the entity linking concatenates the text in the matching data (the triple together with the text segment in which it occurs) and the triple's description in the original entity data set as input, and outputs a matching probability. The preset threshold may be 0.5: a matching probability greater than 0.5 indicates that the entities in the matching data and in the original entity data set are the same class of entity. Through the remote supervision and entity linking, information such as the relationships between entities and entity descriptions can be determined quickly. Moreover, since the original training set contains both entity information and relationship information, it can be used directly to train the relationship extraction model and the entity extraction model.
In the embodiment of the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained without manual annotation.
The entity reinforcement module 102 is configured to perform strategy labeling and entity reinforcement processing on the original training set in turn, to obtain a standard training set.
Specifically, the entity reinforcement module 102 obtains the standard training set through the following operations:
Classifying the text segments in the original training set using preset labeling symbols to obtain classified samples, and labeling the triples in the classified samples to obtain labeled entities;
Performing entity reinforcement processing on the labeled entities using a preset natural language processing library, and aggregating the reinforced classified samples to obtain the standard training set.
In the embodiment of the present application, the strategy labeling may be performed based on the MTB (Matching The Blank) method, where the preset labeling symbols may be <tag> and </tag>; the part enclosed by <tag> and </tag> is the mention of an entity or relationship in the sentence. For example, a classified sample may be [CLS]XXX<entity_head>XXX</entity_head>XXX<rel>XXX</rel>XXX<entity_tail>XXX</entity_tail>XXX[SEP], where entity_head, rel and entity_tail denote the head entity, the relationship and the tail entity respectively. [CLS] and [SEP] are separators: [CLS] is the classification bit, whose position outputs the binary classification result 0/1 indicating whether a relationship exists between the two current entities, and [SEP] is the termination bit, indicating the end of the sentence.
In the embodiment of the present application, the BIO sequence labeling scheme may be used to label the entities in the classified samples: characters within an entity mention are labeled B or I, and non-entity characters are labeled O. Since this is open entity recognition, there are only two classes, entity and non-entity. The preset natural language processing library may be the HanLP natural language processing library; its dependency syntax parsing tool is used to analyze the prefix of the current entity so as to reinforce it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the reinforced entity is "Apple CEO Cook".
By performing strategy labeling and entity reinforcement processing on the original training set, the embodiment of the present application can improve the accuracy of model training.
The model construction module 103 is configured to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model.
In the embodiment of the present application, the pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm from the open-source transformer project. The model is written with the pytorch framework and has been trained in advance on a large-scale open-source Chinese corpus. The training process determines the error in a cloze manner: several characters in the input Chinese corpus text are deliberately masked, the output is checked to see whether the model predicts the masked characters from the unmasked context, and the difference between the model's predicted value and the true value is computed, until that difference falls below a pre-specified threshold.
Specifically, the model construction module 103 obtains the open entity extraction model and the open relationship extraction model through the following operations:
Randomly adding blank positions to the classified samples to obtain training samples, and predicting the entities in the training samples using the language model to obtain predicted entities;
Calculating the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determining the language model to be the open entity extraction model;
Calculating the relationship span between the predicted entities using a preset relationship span prediction layer;
Based on the relationship span, outputting the prediction result between the predicted entities using a preset binary classification linear layer, where the prediction result includes whether a relationship exists;
When the ratio of prediction results in which a relationship exists to all prediction results is greater than a preset relationship threshold, combining the language model, the relationship span prediction layer and the binary classification linear layer to obtain the open relationship extraction model.
The relationship span may be represented by a one-hot vector. In the binary classification linear layer, the prediction result between the predicted entities is determined at the [CLS] position as 0 or 1, where 0 indicates that the relationship does not exist and 1 indicates that it exists. Moreover, through the relationship span prediction layer and the binary classification linear layer, relationship prediction is reduced to a finite binary classification problem, which greatly simplifies the training process of the model.
The entity extraction module 104 is configured to segment the text to be classified to obtain segmented text, and extract the entities in the segmented text using the open entity extraction model.
In detail, the entity extraction module 104 extracts the entities in the segmented text through the following operations:
Splitting the text to be classified into sentences according to the punctuation marks in the text, to obtain segmented sentences to be classified;
Extracting all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
Through the open entity extraction model, the embodiment of the present application can quickly extract the entities in the text to be classified, which improves the speed of entity relationship prediction.
所述关系抽取模块105,用于利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The relationship extraction module 105 is configured to use the open relationship extraction model to predict the entity relationship of the entity, and to cluster the entity and the entity relationship to obtain a relationship extraction result.
具体地,所述关系抽取模块105通过下述操作得到关系抽取结果:Specifically, the relationship extraction module 105 obtains the relationship extraction result through the following operations:
基于所述待分类实体,利用所述开放关系抽取模型抽取所述待分类断句中的关系,并过滤掉没有关系的所述待分类实体,得到预测三元组;Based on the entity to be classified, the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
利用预设的聚类方法对所述预测三元组进行聚类,得到多个聚类团,其中,所述聚类团中包括所述关系抽取结果。The predicted triples are clustered by using a preset clustering method to obtain a plurality of cluster clusters, wherein the cluster clusters include the relationship extraction result.
本申请实施例中,利用所述开放关系抽取模型可以得到三元组(头实体,关系,尾实体)及二元组(头实体,None,尾实体),其中,二元组表示没有关系的实体对,通过过滤所述二元组,提高了关系预测的准确率。所述预设的聚类方法可以为K均值聚类方法,所述K均值聚类方法通过word2vec算法将所述预测三元组中的关系向量化,并计算向量间的距离,根据所述距离将所述预测三元组聚拢到K个中心点,形成K个聚类团,这时由人工对每个聚类团概括出类型名称,从而对所述预测三元组进行分类。同时,每个聚类团稳定(不在发生变化)时,每个聚类团会求出团内所有关系向量的均值,之后新的关系会分别与每个已有聚类团的均值进行比较,若与多个聚类团的相似度(可以为欧式距离)均高于预先定义的相似阈值,则归到最相似的那个团中,若与所有聚类团的相似度均低于预定义的相似阈值,则独立归到“未知”类,当“未知”类中的关系积累到一定量(一般为已知类关系的70%),则针对未知关系重复K均值聚类方法及人工定义类型。In the embodiment of the present application, using the open relationship extraction model, a triple (head entity, relationship, tail entity) and a double (head entity, None, tail entity) can be obtained, where a double represents an unrelated entity. Entity pairs, the accuracy of relationship prediction is improved by filtering the bigram. The preset clustering method may be a K-means clustering method. The K-means clustering method vectorizes the relationship in the predicted triplet through the word2vec algorithm, and calculates the distance between the vectors, according to the distance. The predicted triples are gathered into K central points to form K clusters. At this time, a type name is manually summarized for each cluster, so as to classify the predicted triples. At the same time, when each cluster is stable (not changing), each cluster will find the mean of all relation vectors in the cluster, and then the new relation will be compared with the mean of each existing cluster. If the similarity with multiple clusters (which can be Euclidean distance) is higher than the predefined similarity threshold, it will be classified into the most similar cluster. If the similarity with all clusters is lower than the predefined similarity threshold The similarity threshold is independently classified into the "unknown" class. When the relationship in the "unknown" class accumulates to a certain amount (usually 70% of the known class relationship), the K-means clustering method and manual definition type are repeated for the unknown relationship. .
本申请实施例中,通过对所述实体及所述实体关系进行聚类,可以自动对抽取到的开放关系进行分类,提高了开放关系抽取的效率。In the embodiment of the present application, by clustering the entities and the entity relationships, the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
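As an illustration of the threshold-based assignment described above, the following is a minimal sketch. It assumes 2-dimensional toy vectors in place of real word2vec relationship embeddings, and the function name, threshold value and data are hypothetical, not part of the application:

```python
# Hedged sketch: assign a new relation vector to the nearest stable
# cluster centroid if it is close enough (Euclidean distance), else to
# an "unknown" bucket. Toy vectors stand in for word2vec embeddings of
# relation phrases; the threshold is illustrative.
import numpy as np

def assign_relation(vec, centroids, sim_threshold):
    """Return the index of the closest centroid whose Euclidean distance
    is within the threshold, or -1 for the 'unknown' class."""
    dists = np.linalg.norm(centroids - vec, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= sim_threshold else -1

# Toy centroids for two already-stable clusters (means of member vectors).
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])

print(assign_relation(np.array([0.9, 0.1]), centroids, sim_threshold=0.5))  # → 0
print(assign_relation(np.array([5.0, 5.0]), centroids, sim_threshold=0.5))  # → -1
```

A new relationship whose distance to every stable cluster mean exceeds the threshold falls into the "unknown" class (index -1 here); once enough such relationships accumulate, K-means clustering and manual type naming would be repeated over them, as described above.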
如图7所示,是本申请一实施例提供的实现开放关系抽取方法的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device implementing the open relationship extraction method provided by an embodiment of the present application.
所述电子设备1可以包括处理器10、存储器11和总线,还可以包括存储在所述存储器11中并可在所述处理器10上运行的计算机程序,如开放关系抽取程序12。The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an open relationship extraction program 12.
其中,所述存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等。所述存储器11在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。所述存储器11在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。进一步地,所述存储器11还可以既包括电子设备1的内部存储单元也包括外部存储设备。所述存储器11不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如开放关系抽取程序12的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, CD etc. The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 . In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital) equipped on the electronic device 1. , SD) card, flash memory card (Flash Card), etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the open relationship extraction program 12, etc., but also can be used to temporarily store data that has been output or will be output.
所述处理器10在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(Central Processing unit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。所述处理器10是所述电子设备的控制核心(Control Unit),利用各种接口和线路连接整个电子设备的各个部件,通过运行或执行存储在所述存储器11内的程序或者模块(例如开放关系抽取程序等),以及调用存储在所述存储器11内的数据,以执行电子设备1的各种功能和处理数据。In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including combinations of one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors and various control chips. The processor 10 is the control unit of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the open relationship extraction program) and calling the data stored in the memory 11.
所述总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。所述总线被设置为实现所述存储器11以及至少一个处理器10等之间的连接通信。The bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (Extended industry standard architecture, EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
图7仅示出了具有部件的电子设备,本领域技术人员可以理解的是,图7示出的结构并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 7 shows only an electronic device with some components. Those skilled in the art will understand that the structure shown in FIG. 7 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
例如,尽管未示出,所述电子设备1还可以包括给各个部件供电的电源(比如电池),优选地,电源可以通过电源管理装置与所述至少一个处理器10逻辑相连,从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等,在此不再赘述。For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management and power-consumption management are implemented through the power management device. The power supply may further include one or more DC or AC power sources, recharging devices, power-failure detection circuits, power converters or inverters, power status indicators and any other components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module and the like, which will not be repeated here.
进一步地,所述电子设备1还可以包括网络接口,可选地,所述网络接口可以包括有线接口和/或无线接口(如WI-FI接口、蓝牙接口等),通常用于在该电子设备1与其他电子设备之间建立通信连接。Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。Optionally, the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。It should be understood that the embodiments are only used for illustration, and are not limited by this structure in the scope of the patent application.
所述电子设备1中的所述存储器11存储的开放关系抽取程序12是多个指令的组合,在所述处理器10中运行时,可以实现:The open relationship extraction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run in the processor 10, can implement:
获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;Obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
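The text-splitting operation above (segmenting the text to be classified at punctuation before entity extraction) can be sketched as follows; the punctuation set and the regular expression are illustrative assumptions rather than the application's actual rule:

```python
# Hedged sketch: split text into clause-level segments at common
# Chinese/Western sentence-ending punctuation. The punctuation set is
# an illustrative assumption, not the application's actual rule.
import re

def split_text(text):
    """Split on 。！？!? and drop empty segments."""
    parts = re.split(r"[。！？!?]", text)
    return [p.strip() for p in parts if p.strip()]

print(split_text("张三出生于北京。他毕业于清华大学！现任职于某公司"))
# → ['张三出生于北京', '他毕业于清华大学', '现任职于某公司']
```

Each resulting segment would then be passed to the open entity extraction model, and the extracted entities to the open relationship extraction model, as described in the steps above.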
具体地,所述处理器10对上述指令的具体实现方法可参考图1至图5对应实施例中相关步骤的描述,在此不赘述。Specifically, for the specific implementation method of the above-mentioned instruction by the processor 10, reference may be made to the description of the relevant steps in the corresponding embodiments of FIG. 1 to FIG. 5 , which will not be repeated here.
进一步地,所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。所述计算机可读存储介质可以是易失性的,也可以是非易失性的。例如,所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)。Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM).
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质可以是易失性的,也可以是非易失性的,所述可读存储介质存储有计算机程序,所述计算机程序在被电子设备的处理器所执行时,可以实现:The present application further provides a computer-readable storage medium, which may be volatile or non-volatile. The readable storage medium stores a computer program which, when executed by a processor of an electronic device, can implement:
获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;Obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
在本申请所提供的几个实施例中,应该理解到,所揭露的设备,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the modules is only a division by logical function, and there may be other division manners in actual implementation.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。It will be apparent to those skilled in the art that the present application is not limited to the details of the above-described exemplary embodiments, but that the present application may be implemented in other specific forms without departing from the spirit or essential characteristics of the present application.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。Therefore, in all respects the embodiments should be regarded as exemplary and non-restrictive, and the scope of the present application is defined by the appended claims rather than by the foregoing description; it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced in the present application. Any reference sign in the claims shall not be construed as limiting the claim involved.
本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。The embodiments of the present application may acquire and process related data based on artificial-intelligence technology. Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。The basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a string of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer and the like.
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Terms such as "second" are used to denote names and do not denote any particular order.
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. 一种开放关系抽取方法,其中,所述方法包括:A method for extracting open relationships, wherein the method comprises:
    获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
    获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
    对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
  2. 如权利要求1所述的开放关系抽取方法,其中,所述获取原始实体数据集及原始关系数据集,包括:The method for extracting open relationships as claimed in claim 1, wherein the acquiring the original entity data set and the original relationship data set comprises:
    利用预设的数据抓取工具从源网站中抓取文本数据,并对所述文本数据进行切分,得到文本断句,汇总所述文本断句得到所述原始关系数据集;Use a preset data grabbing tool to grab text data from the source website, and segment the text data to obtain text segments, and summarize the text segments to obtain the original relational data set;
    获取开源的实体数据集,其中,所述实体数据集中包括三元组信息及每个三元组信息对应的描述信息,并对所述三元组信息进行去重处理,得到去重三元组,汇总所述去重三元组及所述三元组信息对应的描述信息得到所述原始实体数据集。obtaining an open-source entity data set, where the entity data set includes triplet information and description information corresponding to each piece of triplet information; performing deduplication processing on the triplet information to obtain deduplicated triples; and summarizing the deduplicated triples and the description information corresponding to the triplet information to obtain the original entity data set.
  3. 如权利要求2所述的开放关系抽取方法,其中,所述分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集,包括:The open relationship extraction method according to claim 2, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    将所述原始实体数据集中的三元组信息和所述原始关系数据集中的文本断句进行匹配,并根据匹配结果进行位置标注,得到匹配数据;Matching the triplet information in the original entity data set and the text segmentation in the original relationship data set, and performing position labeling according to the matching result to obtain matching data;
    利用预构建的消歧模型,计算所述匹配数据中所述匹配结果及所述原始实体数据集中所述三元组信息对应的描述信息的匹配概率;Using a pre-built disambiguation model, calculate the matching probability of the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
    当所述匹配概率大于预设阈值时,汇总所述文本断句及所述三元组信息得到所述原始训练集。When the matching probability is greater than a preset threshold, the original training set is obtained by summarizing the text segmentation and the triplet information.
  4. 如权利要求2所述的开放关系抽取方法,其中,所述对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集,包括:The method for extracting open relationships as claimed in claim 2, wherein the original training set is sequentially subjected to policy labeling and entity reinforcement processing to obtain a standard training set, comprising:
    利用预设的标注符号对所述原始训练集中的文本断句进行分类,得到分类样本,并对所述分类样本中的三元组进行标注,得到标注实体;Classify the text segments in the original training set by using preset annotation symbols to obtain classified samples, and annotate the triples in the classified samples to obtain labeled entities;
    利用预设的自然语言处理库对所述标注实体进行实体加强处理,汇总加强后的分类样本得到所述标准训练集。Use a preset natural language processing library to perform entity enhancement processing on the labeled entities, and collect the enhanced classification samples to obtain the standard training set.
  5. 如权利要求4中所述的开放关系抽取方法,其中,所述利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型,包括:The open relationship extraction method according to claim 4, wherein performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, comprises:
    在所述分类样本中随机添加空白位,得到训练样本,利用所述语言模型预测所述训练样本中的实体,得到预测实体;Randomly adding blank bits to the classified samples to obtain training samples, and using the language model to predict entities in the training samples to obtain predicted entities;
    计算所述预测实体和所述训练样本中真实实体的差值,当所述差值小于预设的阈值时,确定所述语言模型为所述开放实体抽取模型;Calculate the difference between the predicted entity and the real entity in the training sample, and when the difference is less than a preset threshold, determine that the language model is the open entity extraction model;
    利用预设的关系跨度预测层计算所述预测实体间的关系跨度;Calculate the relationship span between the predicted entities by using a preset relationship span prediction layer;
    基于所述关系跨度,利用预设的二分类线性层输出所述预测实体间的预测结果,其中, 所述预测结果包括关系存在;Based on the relationship span, a preset two-class linear layer is used to output the prediction result between the prediction entities, wherein the prediction result includes the existence of a relationship;
    当所述关系存在的预测结果与所有预测结果的比值大于预设的关系阈值时,组合所述语言模型、所述关系跨度预测层及所述二分类线性层,以得到所述开放关系抽取模型。When the ratio of the prediction result of the existence of the relationship to all the prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the two-class linear layer to obtain the open relationship extraction model .
  6. 如权利要求1所述的开放关系抽取方法,其中,所述对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体,包括:The method for extracting open relationships according to claim 1, wherein, segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model, comprising:
    根据所述待分类文本中的标点符号将所述待分类文本进行断句,得到待分类断句;Segmenting the text to be classified according to the punctuation marks in the text to be classified, to obtain the segmented sentences to be classified;
    利用所述开放实体抽取模型抽取所述待分类文本中的所有实体,得到待分类实体。All entities in the text to be classified are extracted by using the open entity extraction model to obtain entities to be classified.
  7. 如权利要求6所述的开放关系抽取方法,其中,所述利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果,包括:The method for extracting open relationships according to claim 6, wherein the use of the open relationship extraction model to predict the entity relationship of the entity, and clustering the entity and the entity relationship to obtain a relationship extraction result, include:
    基于所述待分类实体,利用所述开放关系抽取模型抽取所述待分类断句中的关系,并过滤掉没有关系的所述待分类实体,得到预测三元组;Based on the entity to be classified, the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
    利用预设的聚类方法对所述预测三元组进行聚类,得到多个聚类团,其中,所述聚类团中包括所述关系抽取结果。The predicted triples are clustered by using a preset clustering method to obtain a plurality of cluster clusters, wherein the cluster clusters include the relationship extraction result.
  8. 一种开放关系抽取装置,其中,所述装置包括:An apparatus for extracting open relationships, wherein the apparatus comprises:
    训练集构建模块,用于获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;a training set construction module, configured to obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    实体加强模块,用于对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;an entity reinforcement module, which is used to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set;
    模型构建模块,用于获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;a model building module for obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and using the standard training set to perform relational fine-tuning on the language model, Get the open relation extraction model;
    实体抽取模块,用于对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;an entity extraction module, used for segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    关系抽取模块,用于利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The relationship extraction module is used for predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result.
  9. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device comprises:
    至少一个处理器;以及,at least one processor; and,
    与所述至少一个处理器通信连接的存储器;其中,a memory communicatively coupled to the at least one processor; wherein,
    所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如下步骤:The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:
    获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
    获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
    对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
  10. 如权利要求9所述的电子设备,其中,所述获取原始实体数据集及原始关系数据集,包括:The electronic device according to claim 9, wherein the acquiring the original entity data set and the original relationship data set comprises:
    利用预设的数据抓取工具从源网站中抓取文本数据,并对所述文本数据进行切分,得到文本断句,汇总所述文本断句得到所述原始关系数据集;Use a preset data grabbing tool to grab text data from the source website, and segment the text data to obtain text segments, and summarize the text segments to obtain the original relational data set;
    获取开源的实体数据集,其中,所述实体数据集中包括三元组信息及每个三元组信息对应的描述信息,并对所述三元组信息进行去重处理,得到去重三元组,汇总所述去重三元组及所述三元组信息对应的描述信息得到所述原始实体数据集。Obtain an open source entity data set, wherein the entity data set includes triplet information and description information corresponding to each triplet information, and deduplicates the triplet information to obtain a deduplication triplet , summarizing the deduplication triplet and the description information corresponding to the triplet information to obtain the original entity data set.
  11. 如权利要求10所述的电子设备,其中,所述分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集,包括:The electronic device according to claim 10, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    将所述原始实体数据集中的三元组信息和所述原始关系数据集中的文本断句进行匹配,并根据匹配结果进行位置标注,得到匹配数据;Matching the triplet information in the original entity data set and the text segmentation in the original relationship data set, and performing position labeling according to the matching result to obtain matching data;
    利用预构建的消歧模型,计算所述匹配数据中所述匹配结果及所述原始实体数据集中所述三元组信息对应的描述信息的匹配概率;Using a pre-built disambiguation model, calculate the matching probability of the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
    当所述匹配概率大于预设阈值时,汇总所述文本断句及所述三元组信息得到所述原始训练集。When the matching probability is greater than a preset threshold, the original training set is obtained by summarizing the text segmentation and the triplet information.
  12. 如权利要求10所述的电子设备,其中,所述对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集,包括:The electronic device according to claim 10 , wherein the original training set is sequentially subjected to strategy labeling and entity enhancement processing to obtain a standard training set, comprising:
    利用预设的标注符号对所述原始训练集中的文本断句进行分类,得到分类样本,并对所述分类样本中的三元组进行标注,得到标注实体;Classify the text segments in the original training set by using preset annotation symbols to obtain classified samples, and annotate the triples in the classified samples to obtain labeled entities;
    利用预设的自然语言处理库对所述标注实体进行实体加强处理,汇总加强后的分类样本得到所述标准训练集。Use a preset natural language processing library to perform entity enhancement processing on the labeled entities, and collect the enhanced classification samples to obtain the standard training set.
  13. 如权利要求12中所述的电子设备,其中,所述利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型,包括:The electronic device according to claim 12, wherein performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, comprises:
    在所述分类样本中随机添加空白位,得到训练样本,利用所述语言模型预测所述训练样本中的实体,得到预测实体;Randomly adding blank bits to the classified samples to obtain training samples, and using the language model to predict entities in the training samples to obtain predicted entities;
    计算所述预测实体和所述训练样本中真实实体的差值,当所述差值小于预设的阈值时,确定所述语言模型为所述开放实体抽取模型;Calculate the difference between the predicted entity and the real entity in the training sample, and when the difference is less than a preset threshold, determine that the language model is the open entity extraction model;
    利用预设的关系跨度预测层计算所述预测实体间的关系跨度;Calculate the relationship span between the predicted entities by using a preset relationship span prediction layer;
    基于所述关系跨度,利用预设的二分类线性层输出所述预测实体间的预测结果,其中,所述预测结果包括关系存在;Based on the relationship span, a preset two-class linear layer is used to output a prediction result between the prediction entities, wherein the prediction result includes the existence of a relationship;
    当所述关系存在的预测结果与所有预测结果的比值大于预设的关系阈值时,组合所述语言模型、所述关系跨度预测层及所述二分类线性层,以得到所述开放关系抽取模型。When the ratio of the prediction result of the existence of the relationship to all the prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the two-class linear layer to obtain the open relationship extraction model .
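The span-prediction and binary-classification stages above can be caricatured in plain Python. The scoring functions here are hypothetical stand-ins for the preset layers, not the patented architecture:

```python
def relation_span(sent_tokens, head_idx, tail_idx):
    """Relationship span: the token window between two predicted entities."""
    lo, hi = sorted((head_idx, tail_idx))
    return sent_tokens[lo + 1:hi]

def relation_exists(span_tokens, relation_vocab):
    """Binary-classification stand-in: predict 'relationship exists'
    when the span contains any known relation-indicating token."""
    return any(tok in relation_vocab for tok in span_tokens)

def accept_model(predictions, relation_threshold=0.5):
    """Combine the components only if the share of 'exists' predictions
    clears the preset relationship threshold, as in the claim."""
    if not predictions:
        return False
    return sum(predictions) / len(predictions) > relation_threshold
```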
  14. The electronic device according to claim 9, wherein segmenting the text to be classified to obtain segmented text, and extracting the entities in the segmented text using the open entity extraction model, comprises:
    splitting the text to be classified into sentence segments according to the punctuation marks in the text to be classified, to obtain sentence segments to be classified;
    extracting all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
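The punctuation-based segmentation step is straightforward; a minimal sketch, assuming the usual Chinese and Western sentence-final punctuation marks:

```python
import re

def split_sentences(text):
    """Split the text to be classified into sentence segments on
    sentence-final punctuation (Chinese and Western variants)."""
    parts = re.split(r"[。！？.!?;；]+", text)
    return [p.strip() for p in parts if p.strip()]
```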
  15. The electronic device according to claim 14, wherein predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain relationship extraction results, comprises:
    based on the entities to be classified, extracting the relationships in the sentence segments to be classified using the open relationship extraction model, and filtering out entities to be classified that have no relationship, to obtain predicted triplets;
    clustering the predicted triplets using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relationship extraction results.
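As a sketch of the final clustering step, grouping predicted triplets by their relation string is a simplistic stand-in for whatever "preset clustering method" an implementation would actually use:

```python
from collections import defaultdict

def cluster_triples(triples):
    """Group predicted (head, relation, tail) triplets into clusters;
    each cluster holds the relationship extraction results for one relation."""
    clusters = defaultdict(list)
    for head, rel, tail in triples:
        clusters[rel].append((head, rel, tail))
    return dict(clusters)
```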
  16. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the following steps are implemented:
    obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    sequentially performing strategy labeling and entity enhancement processing on the original training set to obtain a standard training set;
    obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;
    segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text using the open entity extraction model;
    predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain relationship extraction results.
  17. The computer-readable storage medium according to claim 16, wherein obtaining the original entity data set and the original relationship data set comprises:
    crawling text data from a source website using a preset data crawling tool, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;
    obtaining an open-source entity data set, wherein the entity data set includes triplet information and description information corresponding to each piece of triplet information, de-duplicating the triplet information to obtain de-duplicated triplets, and aggregating the de-duplicated triplets and the description information corresponding to the triplet information to obtain the original entity data set.
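The de-duplication-and-aggregation step of this claim could be sketched as follows; keeping the first description seen per triplet is an assumption about how the aggregation resolves duplicates:

```python
def dedupe_triples(entity_records):
    """De-duplicate (head, relation, tail) triplets while retaining the
    first description seen for each, yielding the original entity data set."""
    seen = {}
    for head, rel, tail, desc in entity_records:
        key = (head, rel, tail)
        if key not in seen:  # later duplicates are discarded
            seen[key] = desc
    return [(h, r, t, d) for (h, r, t), d in seen.items()]
```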
  18. The computer-readable storage medium according to claim 17, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    matching the triplet information in the original entity data set against the text segments in the original relationship data set, and labeling positions according to the matching results to obtain matching data;
    calculating, with a pre-built disambiguation model, the matching probability between the matching results in the matching data and the description information corresponding to the triplet information in the original entity data set;
    when the matching probability is greater than a preset threshold, aggregating the text segments and the triplet information to obtain the original training set.
  19. The computer-readable storage medium according to claim 17, wherein sequentially performing strategy labeling and entity enhancement processing on the original training set to obtain a standard training set comprises:
    classifying the text segments in the original training set using preset annotation symbols to obtain classified samples, and annotating the triplets in the classified samples to obtain annotated entities;
    performing entity enhancement processing on the annotated entities using a preset natural language processing library, and aggregating the enhanced classified samples to obtain the standard training set.
  20. The computer-readable storage medium according to claim 19, wherein performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model, comprises:
    randomly adding blank positions to the classified samples to obtain training samples, and predicting the entities in the training samples using the language model to obtain predicted entities;
    calculating the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determining the language model to be the open entity extraction model;
    calculating the relationship spans between the predicted entities using a preset relationship span prediction layer;
    based on the relationship spans, outputting prediction results between the predicted entities using a preset binary classification linear layer, wherein the prediction results include relationship existence;
    when the ratio of prediction results indicating that a relationship exists to all prediction results is greater than a preset relationship threshold, combining the language model, the relationship span prediction layer, and the binary classification linear layer to obtain the open relationship extraction model.
PCT/CN2021/109488 2021-04-21 2021-07-30 Open relationship extraction method and apparatus, electronic device, and storage medium WO2022222300A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110428927.5 2021-04-21
CN202110428927.5A CN113051356B (en) 2021-04-21 2021-04-21 Open relation extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022222300A1 true WO2022222300A1 (en) 2022-10-27

Family

ID=76519844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109488 WO2022222300A1 (en) 2021-04-21 2021-07-30 Open relationship extraction method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113051356B (en)
WO (1) WO2022222300A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051356B (en) * 2021-04-21 2023-05-30 深圳壹账通智能科技有限公司 Open relation extraction method and device, electronic equipment and storage medium
CN113704429A (en) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 Semi-supervised learning-based intention identification method, device, equipment and medium
CN113553854B (en) * 2021-09-18 2021-12-10 航天宏康智能科技(北京)有限公司 Entity relation joint extraction method and device
CN114528418B (en) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 Text processing method, system and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20150178273A1 (en) * 2013-12-20 2015-06-25 Microsoft Corporation Unsupervised Relation Detection Model Training
CN110209836A (en) * 2019-05-17 2019-09-06 北京邮电大学 Remote supervisory Relation extraction method and device
CN110619053A (en) * 2019-09-18 2019-12-27 北京百度网讯科技有限公司 Training method of entity relation extraction model and method for extracting entity relation
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
US20140032209A1 (en) * 2012-07-27 2014-01-30 University Of Washington Through Its Center For Commercialization Open information extraction
US11693873B2 (en) * 2016-02-03 2023-07-04 Global Software Innovation Pty Ltd Systems and methods for using entity/relationship model data to enhance user interface engine
US11210324B2 (en) * 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN109472033B (en) * 2018-11-19 2022-12-06 华南师范大学 Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN112487203B (en) * 2019-01-25 2024-01-16 中译语通科技股份有限公司 Relation extraction system integrated with dynamic word vector
US10943068B2 (en) * 2019-03-29 2021-03-09 Microsoft Technology Licensing, Llc N-ary relation prediction over text spans
CN111291185B (en) * 2020-01-21 2023-09-22 京东方科技集团股份有限公司 Information extraction method, device, electronic equipment and storage medium
CN111339774B (en) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111324743A (en) * 2020-02-14 2020-06-23 平安科技(深圳)有限公司 Text relation extraction method and device, computer equipment and storage medium
CN111881256B (en) * 2020-07-17 2022-11-08 中国人民解放军战略支援部队信息工程大学 Text entity relation extraction method and device and computer readable storage medium equipment
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium
CN112214610B (en) * 2020-09-25 2023-09-08 中国人民解放军国防科技大学 Entity relationship joint extraction method based on span and knowledge enhancement
CN112507125A (en) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 Triple information extraction method, device, equipment and computer readable storage medium
CN112507061A (en) * 2020-12-15 2021-03-16 康键信息技术(深圳)有限公司 Multi-relation medical knowledge extraction method, device, equipment and storage medium
CN112632975A (en) * 2020-12-29 2021-04-09 北京明略软件系统有限公司 Upstream and downstream relation extraction method and device, electronic equipment and storage medium


Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113051356B (en) 2023-05-30
CN113051356A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
WO2021212682A1 (en) Knowledge extraction method, apparatus, electronic device, and storage medium
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN108875051B (en) Automatic knowledge graph construction method and system for massive unstructured texts
CN108399228B (en) Article classification method and device, computer equipment and storage medium
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
WO2020108063A1 (en) Feature word determining method, apparatus, and server
WO2020252919A1 (en) Resume identification method and apparatus, and computer device and storage medium
US10997369B1 (en) Systems and methods to generate sequential communication action templates by modelling communication chains and optimizing for a quantified objective
JP7164701B2 (en) Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN108804423B (en) Medical text feature extraction and automatic matching method and system
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US11580119B2 (en) System and method for automatic persona generation using small text components
CN110413787B (en) Text clustering method, device, terminal and storage medium
WO2022116435A1 (en) Title generation method and apparatus, electronic device and storage medium
CN111539193A (en) Ontology-based document analysis and annotation generation
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN111177375A (en) Electronic document classification method and device
JP2020173779A (en) Identifying sequence of headings in document
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21937524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE