WO2022222300A1 - Open relationship extraction method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2022222300A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
relationship
original
open
data set
Prior art date
Application number
PCT/CN2021/109488
Other languages
French (fr)
Chinese (zh)
Inventor
朱昱锦
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2022222300A1 publication Critical patent/WO2022222300A1/en


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/28 — Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 — Relational databases
    • G06F 16/285 — Clustering or classification
    • G06F 16/288 — Entity relationship models
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/35 — Clustering; Classification

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an open relationship extraction method, apparatus, electronic device, and computer-readable storage medium.
  • Relation extraction is an important supporting technology in the field of information extraction and knowledge graph construction. There are many practical scenarios, such as building large-scale general/vertical domain graphs, extracting information from application forms for pre-loan review, etc.
  • Traditional relation extraction technology faces two problems that make it difficult to put into practice: 1) it requires large amounts of labeled data to train the relation classification model, resulting in high data and labeling costs; 2) relation types usually must be defined by the business, are limited in number, and cannot be changed, and many requirements have no predefined relation set.
  • Open relation extraction technology takes a piece of text as input and automatically outputs all possible relation triples (head entity, relation, tail entity) and pairs (head entity, tail entity) found in it.
  • the "relationship" field in the triple is the descriptor that comes with the context.
  • Open relation extraction has always been intractable due to type uncertainty.
  • 1. The classic methods include ReVerb, OLLIE, OpenIE, etc., but most of these solutions target English, are difficult to migrate to Chinese text, and use strict matching rules that make processing inflexible; 2. Another scheme uses two network blocks to process the text: it first extracts head entities from the text, then jointly extracts tail entities and determines relation types from the head-entity output and the hidden layer, forming a matrix whose rows are relation classes and whose columns are text positions.
  • Because the number of relation types becomes the text length, the model must compute a tensor of size: number of batch samples × number of head entities × text length × text length.
  • This handles the multi-triple problem in the text and improves accuracy, but it occupies a large amount of computing resources and is extremely inefficient.
  • An open relation extraction method comprising:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • An apparatus for extracting open relationships comprising:
  • a training set construction module, used to obtain an original entity data set and an original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain an original training set;
  • an entity reinforcement module, used to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set;
  • a model building module, used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model;
  • an entity extraction module used for segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model
  • the relationship extraction module is used for predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result.
  • An electronic device, comprising:
  • a memory storing at least one instruction; and a processor that executes the instructions stored in the memory to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • the present application also provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities from the segmented text using the open entity extraction model;
  • predicting the entity relationships of the entities using the open relation extraction model, and clustering the entities and entity relationships to obtain a relation extraction result.
  • the present application can solve the problem of low extraction efficiency of open relationships.
  • FIG. 1 is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application
  • Fig. 2 is a detailed implementation flow diagram of one of the steps in Fig. 1;
  • Fig. 3 is the detailed implementation flow schematic diagram of another step in Fig. 1;
  • Fig. 4 is a detailed implementation flow diagram of another step in Fig. 1;
  • Fig. 5 is the detailed implementation flow schematic diagram of another step in Fig. 1;
  • FIG. 6 is a functional block diagram of an apparatus for extracting open relationships provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device implementing the method for extracting an open relationship according to an embodiment of the present application.
  • the embodiment of the present application provides an open relationship extraction method.
  • the execution subject of the open relationship extraction method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the open relationship extraction method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the open relationship extraction method includes:
  • S1: obtain the original entity data set and the original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain the original training set.
  • the obtaining of the original entity data set and the original relational data set includes:
  • the preset data crawling tool can be Hawk data crawling tool
  • the source website can be portal websites and professional websites in different fields, including: finance, law, medical care, education, entertainment, sports, etc.
  • the text data in the source website can be directly crawled.
  • Three sentences may be set as the minimum segmentation unit of the text data, with the length of each sentence not exceeding 256 characters.
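This segmentation rule can be sketched as follows (a minimal illustration only; the patent gives no implementation, so the sentence splitter and function names are assumptions, and the 2-sentence/1-sentence/skip fallback follows the fuller description later in the text):

```python
import re

MAX_CHARS = 256  # length cap from the description

def split_sentences(text):
    # Split on Chinese/Western sentence-ending punctuation, keeping the delimiter.
    parts = re.split(r'(?<=[。！？.!?])', text)
    return [p for p in parts if p.strip()]

def segment_text(text):
    """Group sentences into segments of up to 3 sentences; if a segment
    exceeds the length cap, fall back to 2 or 1 sentences, or skip."""
    sents = split_sentences(text)
    segments, i = [], 0
    while i < len(sents):
        for size in (3, 2, 1):
            chunk = ''.join(sents[i:i + size])
            if chunk and len(chunk) <= MAX_CHARS:
                segments.append(chunk)
                i += size
                break
        else:
            i += 1  # even a single sentence exceeds the cap: skip it
    return segments
```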
  • The open-source entity data sets may include data sets such as the Chinese General Encyclopedia Knowledge Graph (CN-DBpedia). CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Encyclopedia, Interactive Encyclopedia, Chinese Wikipedia, etc.), and after filtering, fusion, inference, and other operations, finally forms a high-quality structured data set.
  • The graph not only contains (head entity, relation, tail entity) triple information, but also contains entity description information (from Baidu Encyclopedia, etc.).
  • Performing deduplication processing on the triple information to obtain deduplicated triples includes:
  • if the target triple is not repeated, reselecting a target triple from the entity data set for calculation;
  • if the target triple is repeated, deleting the target triple to obtain the deduplicated triples.
  • The following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set, where:
  • d is the distance value;
  • w_j is the j-th target triple;
  • w_k is any unselected triple in the entity data set;
  • n is the number of triples in the entity data set.
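The distance formula itself is not reproduced in this text. Purely as an illustration of the deduplication loop described above, a simple set-overlap distance can stand in for d (this stand-in is an assumption, not the patent's formula):

```python
def triple_tokens(triple):
    # Character-level token set over (head entity, relation, tail entity).
    return set(''.join(triple))

def distance(w_j, w_k):
    # Illustrative stand-in for d: 1 - Jaccard similarity of token sets.
    a, b = triple_tokens(w_j), triple_tokens(w_k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def deduplicate(triples, threshold=0.1):
    """Keep a target triple only if its distance to every triple already
    kept exceeds the threshold; otherwise treat it as a repeat and delete it."""
    kept = []
    for t in triples:
        if all(distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept
```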
  • Performing remote supervision on the original entity data set and the original relation data set respectively, and performing entity linking on the supervised original entity data set and original relation data set to obtain the original training set, includes:
  • the original training set is obtained by summarizing the text segments and the triple information.
  • Remote supervision refers to a method that uses ready-made triples in an open-source knowledge graph to perform automatic labeling without manual participation, obtaining a large number of labeled data sets.
  • The triples in the original entity data set are matched against the text segments in the original relation data set, requiring that at least the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the current text segment is labeled (e.g., "text": text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" represents the current text segment and "entity_idx" represents the positions of the entities in the current text segment). The matched triples and text segments are then aggregated to obtain the matching data.
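The matching and position-labeling step can be sketched as follows (a simplified illustration; the dictionary layout mirrors the "text"/"entity_idx" example above, and the function name is an assumption):

```python
def remote_supervise(triples, segments):
    """For each text segment, keep triples whose head AND tail entities both
    occur in the segment, and record each entity's [start, end] position."""
    matched = []
    for seg in segments:
        for head, rel, tail in triples:
            h, t = seg.find(head), seg.find(tail)
            if h != -1 and t != -1:
                matched.append({
                    "text": seg,
                    "triple": (head, rel, tail),
                    "entity_idx": {
                        head: [h, h + len(head)],
                        tail: [t, t + len(tail)],
                    },
                })
    return matched
```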
  • A pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching: with BERT as the backbone, the text in the matching data (including the triple and the text segment where the triple occurs) is spliced with the triple's description in the original entity data set as input, and a matching probability is output. The preset threshold may be 0.5; when the matching probability is greater than 0.5, the entities in the matching data and the original entity data set are of the same type.
  • the strategy labeling and entity reinforcement processing are sequentially performed on the original training set to obtain a standard training set, including:
  • Strategy labeling may be performed based on the MTB (Matching the Blanks) method, wherein the preset labeling symbols may be <tag> and </tag>, and the part enclosed by <tag> and </tag> is the mention of the entity or relation in the sentence.
  • The classification sample may be [CLS] XXX <entity_head> XXX </entity_head> XXX <rel> XXX </rel> XXX <entity_tail> XXX </entity_tail> XXX [SEP], where entity_head, rel, and entity_tail represent the head entity, relation, and tail entity respectively.
  • [CLS] and [SEP] are separator tokens:
  • [CLS] is the classification bit, whose position outputs the binary classification result 0/1, indicating whether a relationship currently exists between the two entities;
  • [SEP] is the termination bit, indicating the end of the sentence.
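Constructing such a tagged classification sample can be sketched as follows (illustrative only; the tag names follow the example above, and the helper name is an assumption):

```python
def build_mtb_sample(text, head, rel, tail):
    """Wrap the head entity, relation mention, and tail entity in tag pairs,
    then add the [CLS] classification bit and [SEP] termination bit."""
    tagged = text
    for span, tag in ((head, "entity_head"), (rel, "rel"), (tail, "entity_tail")):
        tagged = tagged.replace(span, f"<{tag}>{span}</{tag}>", 1)
    return f"[CLS]{tagged}[SEP]"
```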
  • The BIO sequence labeling mode may be used to label the entities in the classification samples: the words of an entity mention are labeled B or I, and non-entity words are labeled O. Since this is open entity recognition, words are only divided into two categories: entity / non-entity.
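A minimal sketch of this two-class BIO labeling (the entity span indices are assumed to be given, e.g. by the remote-supervision step):

```python
def bio_labels(tokens, entity_spans):
    """Open-entity BIO labeling: the first token of an entity mention is B,
    the remaining tokens are I, and every non-entity token is O (no entity
    types, since open entity recognition only decides entity vs. non-entity)."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:  # end is exclusive
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels
```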
  • The preset natural language processing library may be the HanLP natural language processing library, whose dependency syntax parsing tool is used to analyze the prefix of the current entity in order to perform entity enhancement on it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the enhanced entity is "Apple CEO Cook".
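As an illustration of this prefix-based entity enhancement, the sketch below works on a pre-computed dependency parse (in practice HanLP's dependency parser would supply the arcs; the data layout and function name here are assumptions):

```python
def enhance_entity(words, dep_heads, entity_index):
    """Prepend the contiguous run of words directly before the entity that
    depend (directly or transitively) on it. `dep_heads[i]` is the index of
    the word that word i modifies, with -1 for the root (a toy stand-in for
    HanLP's dependency-parse output)."""
    def depends_on(i, target):
        seen = set()
        while i != target and i not in seen and i >= 0:
            seen.add(i)
            i = dep_heads[i]
        return i == target

    j = entity_index - 1
    while j >= 0 and depends_on(j, entity_index):
        j -= 1
    return ''.join(words[j + 1:entity_index + 1])
```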
  • the implementation of the present application can improve the accuracy of model training by performing policy labeling and entity enhancement processing on the original training set.
  • The pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm in the open-source transformer project.
  • The model is written using the PyTorch framework and has been pre-trained on a large-scale open-source Chinese corpus. The training process uses a cloze objective to determine the error: a few words in the input Chinese corpus text are intentionally masked, the model's output is checked on whether it predicts the masked words from the unmasked context, and the difference between the model's predicted value and the true value is computed until the difference falls below a pre-specified threshold.
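The cloze objective's input construction can be sketched as follows (illustration only; real BERT pre-training masks subword tokens and applies additional replacement rules not shown here):

```python
import random

def make_cloze_example(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Cloze-style training pair: randomly cover a few tokens with [MASK];
    the model must recover the originals from the unmasked context, and the
    difference between prediction and ground truth drives the loss."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    n_mask = max(1, int(len(tokens) * mask_ratio))
    for i in rng.sample(range(len(tokens)), n_mask):
        targets[i] = masked[i]   # ground truth for the loss
        masked[i] = mask_token
    return masked, targets
```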
  • Obtaining the open entity extraction model and the open relation extraction model includes:
  • the relation span may be represented by a one-hot vector, and the [CLS] bit is used in the binary classification linear layer to determine the prediction result between the predicted entities: 0 or 1, where 0 indicates the relationship does not exist and 1 indicates the relationship exists.
  • the relationship prediction is simplified into a limited two-class problem, which greatly simplifies the training process of the model.
  • the text to be classified is segmented to obtain segmented text
  • the open entity extraction model is used to extract entities in the segmented text, including:
  • entities in the text to be classified can be rapidly extracted through the open entity extraction model, which improves the rate of entity relationship prediction.
  • predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result including:
  • the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
  • the predicted triples are clustered using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relation extraction result.
  • a triple: (head entity, relation, tail entity)
  • a pair: (head entity, None, tail entity)
  • the preset clustering method may be a K-means clustering method.
  • The K-means clustering method vectorizes the relations in the predicted triples using the word2vec algorithm and calculates the distances between the vectors.
  • According to these distances, the predicted triples are gathered around K central points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples.
  • When each cluster is stable (no longer changing), the mean of all relation vectors in the cluster is computed. Each new relation is then compared with the mean of every existing cluster: if its similarity to one or more clusters (which may be measured by Euclidean distance) is higher than a predefined similarity threshold, it is classified into the most similar cluster; if its similarity to all clusters is lower than the predefined similarity threshold, it is placed separately into an "unknown" class. When the relations in the "unknown" class accumulate to a certain amount (usually 70% of the known-class relations), the K-means clustering and manual type definition are repeated for the unknown relations.
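The assignment of a new relation to an existing cluster or to the "unknown" class can be sketched as follows (illustration only; in the scheme described the vectors would come from word2vec, and the distance cap stands in for the similarity threshold):

```python
def assign_relation(vec, centroids, max_dist):
    """Assign a new relation vector to its nearest cluster centroid; if it
    is farther than max_dist from every centroid (i.e. its similarity is
    below the predefined threshold), place it in the 'unknown' class for
    later re-clustering. Vectors here are plain lists of floats."""
    def euclidean(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    best = min(centroids, key=lambda name: euclidean(vec, centroids[name]))
    return best if euclidean(vec, centroids[best]) <= max_dist else "unknown"
```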
  • the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
  • FIG. 6 it is a functional block diagram of an apparatus for extracting an open relationship provided by an embodiment of the present application.
  • the open relationship extraction apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the open relationship extraction apparatus 100 may include a training set construction module 101 , an entity enhancement module 102 , a model construction module 103 , an entity extraction module 104 and a relationship extraction module 105 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • The training set construction module 101 is used to obtain the original entity data set and the original relation data set, perform remote supervision on the original entity data set and the original relation data set respectively, and perform entity linking between the supervised original entity data set and the original relation data set to obtain the original training set.
  • the training set construction module 101 obtains the original entity data set and the original relationship data set through the following operations:
  • The entity data set includes triple information and description information corresponding to each triple; the triple information is deduplicated to obtain deduplicated triples, and the deduplicated triples and the corresponding description information are summarized to obtain the original entity data set.
  • the preset data crawling tool can be Hawk data crawling tool
  • the source website can be portal websites and professional websites in different fields, including: finance, law, medical care, education, entertainment, sports, etc.
  • the text data in the source website can be directly crawled.
  • Three sentences may be set as the minimum segmentation unit of the text data, with the length of each sentence not exceeding 256 characters; when this is exceeded, the segment is cut down to 2 sentences or even 1 sentence, or skipped directly.
  • The open-source entity data sets may include data sets such as the Chinese General Encyclopedia Knowledge Graph (CN-DBpedia). CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Encyclopedia, Interactive Encyclopedia, Chinese Wikipedia, etc.), and after filtering, fusion, inference, and other operations, finally forms a high-quality structured data set.
  • The graph not only contains (head entity, relation, tail entity) triple information, but also contains entity description information (from Baidu Encyclopedia, etc.).
  • The training set building module 101 obtains deduplicated triples through the following operations:
  • if the target triple is not repeated, reselecting a target triple from the entity data set for calculation;
  • if the target triple is repeated, deleting the target triple to obtain the deduplicated triples.
  • The following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set, where:
  • d is the distance value;
  • w_j is the j-th target triple;
  • w_k is any unselected triple in the entity data set;
  • n is the number of triples in the entity data set.
  • training set construction module 101 obtains the original training set through the following operations:
  • the original training set is obtained by summarizing the text segmentation and the triplet information.
  • the remote supervision refers to a method of using ready-made triples in an open source knowledge graph to perform automatic labeling without manual participation, and obtain a large number of labeled data sets.
  • The triples in the original entity data set are matched against the text segments in the original relation data set, requiring that at least the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the current text segment is labeled (e.g., "text": text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" represents the current text segment and "entity_idx" represents the positions of the entities in the current text segment). The matched triples and text segments are then aggregated to obtain the matching data.
  • A pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching: with BERT as the backbone, the text in the matching data (including the triple and the text segment where the triple occurs) is spliced with the triple's description in the original entity data set as input, and a matching probability is output. The preset threshold may be 0.5; when the matching probability is greater than 0.5, the entities in the matching data and the original entity data set are of the same type.
  • the entity reinforcement module 102 is configured to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set.
  • the entity reinforcement module 102 obtains the standard training set through the following operations:
  • Strategy labeling may be performed based on the MTB (Matching the Blanks) method, wherein the preset labeling symbols may be <tag> and </tag>, and the part enclosed by <tag> and </tag> is the mention of the entity or relation in the sentence.
  • The classification sample may be [CLS] XXX <entity_head> XXX </entity_head> XXX <rel> XXX </rel> XXX <entity_tail> XXX </entity_tail> XXX [SEP], where entity_head, rel, and entity_tail represent the head entity, relation, and tail entity respectively.
  • [CLS] and [SEP] are separator tokens:
  • [CLS] is the classification bit, whose position outputs the binary classification result 0/1, indicating whether a relationship currently exists between the two entities;
  • [SEP] is the termination bit, indicating the end of the sentence.
  • The BIO sequence labeling mode may be used to label the entities in the classification samples: the words of an entity mention are labeled B or I, and non-entity words are labeled O. Since this is open entity recognition, words are only divided into two categories: entity / non-entity.
  • The preset natural language processing library may be the HanLP natural language processing library, whose dependency syntax parsing tool is used to analyze the prefix of the current entity in order to perform entity enhancement on it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the enhanced entity is "Apple CEO Cook".
  • the implementation of the present application can improve the accuracy of model training by performing policy labeling and entity enhancement processing on the original training set.
  • The model building module 103 is used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model.
  • The pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm in the open-source transformer project.
  • The model is written using the PyTorch framework and has been pre-trained on a large-scale open-source Chinese corpus. The training process uses a cloze objective to determine the error: a few words in the input Chinese corpus text are intentionally masked, the model's output is checked on whether it predicts the masked words from the unmasked context, and the difference between the model's predicted value and the true value is computed until the difference falls below a pre-specified threshold.
  • model building module 103 obtains the open entity extraction model and the open relationship extraction model through the following operations:
  • a preset binary classification linear layer is used to output a prediction result between the predicted entities, wherein the prediction result indicates whether a relationship exists;
  • the relation span may be represented by a one-hot vector, and the [CLS] bit is used in the binary classification linear layer to determine the prediction result between the predicted entities: 0 or 1, where 0 indicates the relationship does not exist and 1 indicates the relationship exists.
  • the relationship prediction is simplified into a limited two-class problem, which greatly simplifies the training process of the model.
  • the entity extraction module 104 is configured to segment the text to be classified, obtain segmented text, and extract entities in the segmented text by using the open entity extraction model.
  • the entity extraction module 104 extracts entities in the segmented text through the following operations:
  • All entities in the text to be classified are extracted by using the open entity extraction model to obtain entities to be classified.
  • entities in the text to be classified can be rapidly extracted through the open entity extraction model, which improves the rate of entity relationship prediction.
  • the relationship extraction module 105 is configured to use the open relationship extraction model to predict the entity relationship of the entity, and to cluster the entity and the entity relationship to obtain a relationship extraction result.
  • the relationship extraction module 105 obtains the relationship extraction result through the following operations:
  • the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
  • the predicted triples are clustered using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relation extraction result.
  • a triple: (head entity, relation, tail entity)
  • a pair: (head entity, None, tail entity)
  • the preset clustering method may be a K-means clustering method.
  • The K-means clustering method vectorizes the relations in the predicted triples using the word2vec algorithm and calculates the distances between the vectors.
  • According to these distances, the predicted triples are gathered around K central points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples.
  • When each cluster is stable (no longer changing), the mean of all relation vectors in the cluster is computed. Each new relation is then compared with the mean of every existing cluster: if its similarity to one or more clusters (which may be measured by Euclidean distance) is higher than a predefined similarity threshold, it is classified into the most similar cluster; if its similarity to all clusters is lower than the predefined similarity threshold, it is placed separately into an "unknown" class. When the relations in the "unknown" class accumulate to a certain amount (usually 70% of the known-class relations), the K-means clustering and manual type definition are repeated for the unknown relations.
  • the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
  • FIG. 7 it is a schematic structural diagram of an electronic device for implementing an open relationship extraction method provided by an embodiment of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an open relationship extraction program 12.
  • The memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), etc. equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the open relationship extraction program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control unit (Control Unit) of the electronic device; it connects the various components of the entire device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the open relationship extraction program) and calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 7 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 7 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the open relationship extraction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run in the processor 10, can realize:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), etc.
  • the present application also provides a computer-readable storage medium, which may be volatile or non-volatile; the readable storage medium stores a computer program which, when executed by the processor of an electronic device, can achieve:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
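The two realization steps above (entity extraction on segmented text, then relation prediction and clustering) can be sketched as a pipeline skeleton. The lambdas standing in for the trained open entity extraction model, open relationship extraction model, and clustering step are illustrative assumptions, not the patent's concrete implementations.

```python
def extract_open_relations(text, split, extract_entities, predict_relation, cluster):
    """Segment the text, extract entities per segment, predict a relation
    for each entity pair, then cluster the resulting triples."""
    triples = []
    for segment in split(text):
        entities = extract_entities(segment)
        for i, head in enumerate(entities):
            for tail in entities[i + 1:]:
                rel = predict_relation(segment, head, tail)
                if rel is not None:  # no relation predicted for this pair
                    triples.append((head, rel, tail))
    return cluster(triples)

# toy stand-ins for the trained models
result = extract_open_relations(
    "张三出生于北京。",
    split=lambda t: [t],
    extract_entities=lambda s: ["张三", "北京"],
    predict_relation=lambda s, h, t: "出生于",
    cluster=lambda triples: {"出生": triples},
)
print(result)  # {'出生': [('张三', '出生于', '北京')]}
```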
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

The present application relates to artificial intelligence technology, and discloses an open relationship extraction method, comprising: obtaining an original training set by using remote supervision and entity linking techniques; performing policy annotation and entity reinforcement processing on the original training set to obtain a standard training set; using the standard training set to perform entity fine adjustment and relationship fine adjustment on a pre-trained language model to obtain an open entity extraction model and an open relationship extraction model; extracting, by using the open entity extraction model, entities in a text to be classified; and predicting an entity relationship between the entities by using the open relationship extraction model, and clustering the entities and the entity relationship to obtain a relationship extraction result. In addition, the present application further relates to blockchain technology, and the relationship extraction result may be stored in a node of a blockchain. The present application further provides an open relationship extraction apparatus, an electronic device, and a computer-readable storage medium. The present application can solve the problem of relatively low extraction efficiency of an open relationship.

Description

Open relationship extraction method, apparatus, electronic device and storage medium

This application claims priority to the Chinese patent application filed with the China Patent Office on April 21, 2021, with application number CN202110428927.5 and titled "Open Relationship Extraction Method, Device, Electronic Device and Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field

The present application relates to the technical field of artificial intelligence, and in particular, to an open relationship extraction method, apparatus, electronic device, and computer-readable storage medium.
Background

Relation extraction is an important supporting technology in the fields of information extraction and knowledge graph construction, with many practical scenarios, such as building large-scale general or vertical-domain graphs and extracting information from application forms for pre-loan review. However, traditional relation extraction is difficult to put into practice because of two problems: 1) it requires a large amount of labeled data to train the relation classification model, resulting in high data and labeling costs; 2) the relation types usually have to be defined by the business, are limited, and cannot be changed, while many real-world requirements have no predefined relation set.
Based on this, open relation extraction has received attention in recent years. It takes a piece of text as input and automatically outputs all possible relation triples (head entity, relation, tail entity) and pairs (head entity, tail entity), where the "relation" field of a triple is a descriptor taken from the context itself. Because the relation types are not fixed, open relation extraction has always been difficult to handle. The inventor realized that the traditional solutions mainly include: 1. Rule-based matching combined with bootstrapping; classic methods include ReVerb, OLLIE, OpenIE, etc., but most of these solutions target English, are difficult to migrate to Chinese text, and use strict, inflexible matching rules. 2. Parsing the surface form with a sequence labeling model, treating the relation as a class of entities and extracting triples directly from the text with a semantic role labeling algorithm, such as SurfaceForm-SRL; however, this method fails when no relation mention is found and cannot process sentences containing multiple triples, resulting in low extraction accuracy. 3. A half-pointer, half-labeling scheme that uses two layers of network blocks to process text: the head entity is extracted first, and then the tail entity is extracted and the relation type is determined jointly from the head entity and the hidden-layer output, forming a sample matrix whose rows are the relation classes and whose columns are the text length. When applied to open relation extraction, however, the number of relation types becomes the text length, so the model must compute a tensor of size batch size × number of head entities × text length × text length; this solves the multi-triple problem and improves accuracy, but consumes a large amount of computing resources and is extremely inefficient.
SUMMARY OF THE INVENTION
An open relationship extraction method, comprising:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
An open relationship extraction apparatus, comprising:

a training set construction module, configured to obtain an original entity data set and an original relationship data set, perform remote supervision on them respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

an entity reinforcement module, configured to perform strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

a model construction module, configured to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

an entity extraction module, configured to segment the text to be classified to obtain segmented text, and extract entities in the segmented text by using the open entity extraction model;

a relationship extraction module, configured to predict the entity relationship of the entities by using the open relationship extraction model, and cluster the entities and the entity relationships to obtain a relationship extraction result.
An electronic device, comprising:

a memory storing at least one instruction; and

a processor executing the instructions stored in the memory to implement the following steps:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
To solve the above problems, the present application also provides a computer-readable storage medium storing at least one instruction, the at least one instruction being executed by a processor in an electronic device to implement the following steps:

obtaining an original entity data set and an original relationship data set, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;

performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain a standard training set;

obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;

segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;

predicting the entity relationship of the entities by using the open relationship extraction model, and clustering the entities and the entity relationships to obtain a relationship extraction result.
The present application can solve the problem of relatively low efficiency of open relationship extraction.
Description of Drawings

FIG. 1 is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of the detailed implementation of one of the steps in FIG. 1;

FIG. 3 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 4 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 5 is a schematic flowchart of the detailed implementation of another step in FIG. 1;

FIG. 6 is a functional block diagram of an open relationship extraction apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device implementing the open relationship extraction method provided by an embodiment of the present application.
The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description

It should be understood that the specific embodiments described herein are only used to explain the present application, not to limit it.
The embodiments of the present application provide an open relationship extraction method. The execution subject of the method includes, but is not limited to, at least one of the electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server or a terminal. In other words, the method can be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to FIG. 1, it is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application. In this embodiment, the open relationship extraction method includes:

S1. Obtain an original entity data set and an original relationship data set, perform remote supervision on each of them, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set.
Specifically, referring to FIG. 2, obtaining the original entity data set and the original relationship data set includes:

S10. Using a preset data crawling tool to crawl text data from source websites, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;

S11. Obtaining an open-source entity data set, where the entity data set includes triple information and description information corresponding to each triple, performing deduplication on the triple information to obtain deduplicated triples, and aggregating the deduplicated triples and the corresponding description information to obtain the original entity data set.
The preset data crawling tool may be the Hawk data crawling tool, and the source websites may be portal and professional websites in different fields, including finance, law, medical care, education, entertainment, sports, etc. The Hawk data crawling tool can directly crawl the text data from these source websites. In this embodiment, 3 sentences may be set as the minimum segmentation unit of the text data, with each segment no longer than 256 characters; when a segment exceeds this length, it is trimmed to 2 sentences or even 1 sentence, or skipped directly. The open-source entity data set may include data sets such as the Chinese general encyclopedia knowledge graph (CN-DBpedia). CN-DBpedia extracts entity information mainly from the plain-text pages of Chinese encyclopedia websites (such as Baidu Baike, Hudong Baike, Chinese Wikipedia, etc.); after filtering, fusion, inference and other operations, a high-quality structured data set is finally formed. The graph contains not only (head entity, relation, tail entity) triple information but also entity description information (from Baidu Baike and similar sources).
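The segmentation rule above (a unit of at most 3 sentences, each segment capped at 256 characters, trimming to 2 or 1 sentences, or skipping, when over the cap) can be sketched as follows. The sentence-splitting regex and the function names are illustrative assumptions, not part of the disclosure.

```python
import re

def split_chunks(text, max_sents=3, max_len=256):
    """Split raw text into chunks of at most `max_sents` sentences.
    A chunk longer than `max_len` characters is shrunk to 2, then 1
    sentence; a single sentence still over the limit is skipped."""
    # naive sentence splitter on Chinese/Western terminators (assumption)
    sents = [s for s in re.split(r'(?<=[。！？.!?])', text) if s.strip()]
    chunks = []
    i = 0
    while i < len(sents):
        for take in (max_sents, 2, 1):
            cand = ''.join(sents[i:i + take])
            if len(cand) <= max_len:
                chunks.append(cand)
                i += take
                break
        else:
            i += 1  # single sentence too long: skip it directly
    return chunks

print(split_chunks("一句。二句。三句。四句。"))  # ['一句。二句。三句。', '四句。']
```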
Specifically, performing deduplication on the triple information to obtain deduplicated triples includes:

selecting target triples from the entity data set in turn;

calculating the distance value between the target triple and all unselected triples in the entity data set;

when the distance value is greater than a preset distance threshold, determining that the target triple is not a duplicate, and reselecting a target triple from the entity data set for calculation;

when the distance value is less than or equal to the preset distance threshold, determining that the target triple is a duplicate, and deleting it to obtain the deduplicated triples.
In this embodiment, the following distance algorithm is used to calculate the distance value between the target triple and all unselected triples in the entity data set:
[Distance formula: see image PCTCN2021109488-appb-000001 in the original document]
where d is the distance value, w_j is the j-th target triple, w_k is any unselected triple in the entity data set, and n is the number of triples in the entity data set.
By deduplicating the triple information in the entity data set, this embodiment avoids repeatedly processing identical triples and reduces the amount of data to be processed, which helps improve the efficiency of open relationship extraction.
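The deduplication loop above can be sketched as follows. Since the patent's exact distance formula survives only as an image, a toy Jaccard character-set distance stands in for it; the threshold value and all function names are illustrative assumptions.

```python
def dedup_triples(triples, distance, threshold):
    """Keep a triple only if its distance to every already-kept triple
    exceeds `threshold` (i.e. it is not a near-duplicate)."""
    kept = []
    for t in triples:
        if all(distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept

def jaccard_distance(t1, t2):
    """Toy stand-in distance: 1 - Jaccard similarity of character sets."""
    a, b = set(''.join(t1)), set(''.join(t2))
    return 1 - len(a & b) / len(a | b)

triples = [("张三", "出生于", "北京"),
           ("张三", "出生于", "北京"),   # exact duplicate, distance 0
           ("李四", "毕业于", "清华")]
print(dedup_triples(triples, jaccard_distance, 0.1))
```

With this toy distance the exact duplicate is dropped and the two distinct triples survive.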
Further, performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between them to obtain the original training set, includes:

matching the triple information in the original entity data set against the text segments in the original relationship data set, and labeling positions according to the matching results to obtain matching data;

using a pre-built disambiguation model to calculate the matching probability between the matching results in the matching data and the description information corresponding to the triples in the original entity data set;

when the matching probability is greater than a preset threshold, aggregating the text segments and the triple information to obtain the original training set.
Remote supervision (distant supervision) refers to using the ready-made triples in an open-source knowledge graph to perform automatic labeling without manual participation, yielding a large labeled data set. In this embodiment, the triples in the original entity data set are matched against the text segments in the original relationship data set, requiring at least that both the head entity and the tail entity of a triple appear in the context of the current text segment, and the position of each entity in the segment is labeled (for example, "text": the text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" is the current text segment and "entity_idx" gives each entity's position in it); the matched triples and text segments are aggregated to obtain the matching data. Meanwhile, a pre-built disambiguation model can be used for entity linking; this model may be a BERT model trained on open-source short-text matching. With the BERT model as the backbone, entity linking concatenates the text in the matching data (the triple and the text segment where it occurs) with the triple's description in the original entity data set as input, and outputs a matching probability. The preset threshold may be 0.5: a matching probability greater than 0.5 indicates that the entity in the matching data and the entity in the original entity data set are the same entity. Through remote supervision and entity linking, the relationships between entities, entity descriptions and other information can be quickly determined; moreover, since the original training set contains both entity information and relationship information, it can be used directly to train the relation extraction model and the entity extraction model.
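The matching-and-position-labeling step of remote supervision can be sketched as follows; the output mirrors the {"text": ..., "entity_idx": ...} shape from the example above, and the helper name is an illustrative assumption.

```python
def distant_supervision(triples, sentences):
    """For each sentence, keep the triples whose head AND tail entity both
    appear in it, and record each entity's [start, end) character span."""
    samples = []
    for sent in sentences:
        for head, rel, tail in triples:
            h, t = sent.find(head), sent.find(tail)
            if h != -1 and t != -1:  # both entities in context
                samples.append({
                    "text": sent,
                    "triple": (head, rel, tail),
                    "entity_idx": {head: [h, h + len(head)],
                                   tail: [t, t + len(tail)]},
                })
    return samples

samples = distant_supervision([("张三", "出生于", "北京")],
                              ["张三1990年出生于北京。", "无关句子。"])
print(samples[0]["entity_idx"])
```

The disambiguation step would then score each sample's triple against its encyclopedia description with the BERT-based matcher and keep samples whose probability exceeds 0.5.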
In the embodiment of the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained without manual annotation.
S2. Perform strategy labeling and entity reinforcement processing on the original training set in turn, to obtain a standard training set.
Specifically, as shown in FIG. 3, performing strategy labeling and entity reinforcement processing on the original training set in turn to obtain the standard training set includes:
S20. Classify the text segments in the original training set using preset labeling symbols to obtain classified samples, and label the triples in the classified samples to obtain labeled entities;
S21. Perform entity reinforcement processing on the labeled entities using a preset natural language processing library, and aggregate the reinforced classified samples to obtain the standard training set.
In the embodiment of the present application, the strategy labeling may be performed based on the MTB (Matching The Blank) method, where the preset labeling symbols may be <tag> and </tag>; the part enclosed by <tag> and </tag> is the mention of an entity or relationship in the sentence. For example, a classified sample may be [CLS]XXX<entity_head>XXX</entity_head>XXX<rel>XXX</rel>XXX<entity_tail>XXX</entity_tail>XXX[SEP], where entity_head, rel and entity_tail denote the head entity, the relationship and the tail entity respectively. [CLS] and [SEP] are separators: [CLS] is the classification bit, whose position outputs the binary classification result 0/1 indicating whether a relationship exists between the two current entities, and [SEP] is the termination bit, indicating the end of the sentence.
In the embodiment of the present application, the BIO sequence labeling scheme may be used to label the entities in the classified samples: characters within an entity mention are labeled B or I, and non-entity characters are labeled O. Since this is open entity recognition, there are only two classes, entity and non-entity. The preset natural language processing library may be the HanLP natural language processing library; its dependency syntax parsing tool is used to analyze the prefix of the current entity so as to reinforce it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the reinforced entity is "Apple CEO Cook".
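The MTB-style markup and the character-level BIO labels described above can be illustrated with a minimal sketch; the helper functions themselves are assumptions for illustration, not part of the disclosure.

```python
# Sketch of the MTB-style sample markup and BIO sequence labels.
def mtb_markup(text, head, rel, tail):
    # Wrap each mention with its tag pair, then add the [CLS]/[SEP] bits.
    marked = (text.replace(head, f"<entity_head>{head}</entity_head>")
                  .replace(rel, f"<rel>{rel}</rel>")
                  .replace(tail, f"<entity_tail>{tail}</entity_tail>"))
    return f"[CLS]{marked}[SEP]"

def bio_labels(text, entities):
    # Open entity recognition: every character is B, I, or O.
    labels = ["O"] * len(text)
    for ent in entities:
        i = text.find(ent)
        if i != -1:
            labels[i] = "B"
            for j in range(i + 1, i + len(ent)):
                labels[j] = "I"
    return labels

print(mtb_markup("库克任职于苹果公司", "库克", "任职于", "苹果公司"))
print(bio_labels("库克任职于苹果公司", ["库克", "苹果公司"]))
```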
By performing strategy labeling and entity reinforcement processing on the original training set, the embodiment of the present application can improve the accuracy of model training.
S3. Obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model.
In the embodiment of the present application, the pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm from the open-source transformer project. The model is written with the pytorch framework and has been trained in advance on a large-scale open-source Chinese corpus. The training process determines the error in a cloze manner: several characters in the input Chinese corpus text are deliberately masked, the output is checked to see whether the model predicts the masked characters from the unmasked context, and the difference between the model's predicted value and the true value is computed, until that difference falls below a pre-specified threshold.
Specifically, as shown in FIG. 4, performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, includes:
S30. Randomly add blank positions to the classified samples to obtain training samples, and predict the entities in the training samples using the language model to obtain predicted entities;
S31. Calculate the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determine the language model to be the open entity extraction model;
S32. Calculate the relationship span between the predicted entities using a preset relationship span prediction layer;
S33. Based on the relationship span, output the prediction result between the predicted entities using a preset binary classification linear layer, where the prediction result includes whether a relationship exists;
S34. When the ratio of prediction results in which a relationship exists to all prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the binary classification linear layer to obtain the open relationship extraction model.
The relationship span may be represented by a one-hot vector. In the binary classification linear layer, the prediction result between the predicted entities is determined at the [CLS] position as 0 or 1, where 0 indicates that the relationship does not exist and 1 indicates that it exists. Moreover, through the relationship span prediction layer and the binary classification linear layer, relationship prediction is reduced to a finite binary classification problem, which greatly simplifies the training process of the model.
S4. Segment the text to be classified to obtain segmented text, and extract the entities in the segmented text using the open entity extraction model.
Specifically, as shown in FIG. 5, segmenting the text to be classified to obtain the segmented text, and extracting the entities in the segmented text using the open entity extraction model, includes:
S40. Split the text to be classified into sentences according to the punctuation marks in the text, to obtain segmented sentences to be classified;
S41. Extract all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
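The punctuation-based splitting in step S40 can be sketched as follows. The disclosure only says "according to punctuation", so the exact delimiter set used here is an assumption.

```python
import re

# Sentence segmentation sketch: split AFTER each sentence-final punctuation
# mark (Chinese and Latin terminators), keeping the mark with its sentence.
def split_sentences(text):
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p for p in parts if p.strip()]

print(split_sentences("今天天气不错。我们去跑步吧！好的？"))
```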
Through the open entity extraction model, the embodiment of the present application can quickly extract the entities in the text to be classified, which improves the speed of entity relationship prediction.
S5. Predict the entity relationships of the entities using the open relationship extraction model, and cluster the entities and the entity relationships to obtain a relationship extraction result.
Specifically, predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain the relationship extraction result, includes:
Based on the entities to be classified, extracting the relationships in the segmented sentences to be classified using the open relationship extraction model, and filtering out the entities to be classified that have no relationship, to obtain predicted triples;
Clustering the predicted triples using a preset clustering method to obtain a plurality of clusters, where the clusters include the relationship extraction result.
In the embodiment of the present application, the open relationship extraction model can output triples (head entity, relationship, tail entity) and pairs (head entity, None, tail entity), where a pair represents an entity pair with no relationship; by filtering out the pairs, the accuracy of relationship prediction is improved. The preset clustering method may be the K-means clustering method, which vectorizes the relationships in the predicted triples through the word2vec algorithm, calculates the distances between the vectors, and, according to those distances, gathers the predicted triples around K center points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples. Moreover, once every cluster is stable (no longer changing), each cluster computes the mean of all relationship vectors within it, and each new relationship is compared against the mean of every existing cluster. If its similarity (which may be based on Euclidean distance) to several clusters is above a predefined similarity threshold, it is assigned to the most similar cluster; if its similarity to all clusters is below the predefined similarity threshold, it is placed in an independent "unknown" class. When the relationships in the "unknown" class accumulate to a certain amount (generally 70% of the known-class relationships), the K-means clustering and manual type definition are repeated for the unknown relationships.
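The assignment rule above — a new relationship vector joins the most similar stable cluster, otherwise it falls into the "unknown" class — can be sketched as follows. Euclidean distance with an assumed cut-off stands in for the predefined similarity threshold.

```python
import math

# Cluster assignment sketch: compare a new relation vector with each
# existing cluster mean; if it is far from all of them, mark it "未知"
# (unknown). The max_dist value is an illustrative assumption.
def assign(vec, cluster_means, max_dist=1.0):
    best_name, best_d = "未知", float("inf")
    for name, mean in cluster_means.items():
        d = math.dist(vec, mean)  # Euclidean distance
        if d < best_d:
            best_name, best_d = name, d
    return best_name if best_d <= max_dist else "未知"

means = {"任职": [1.0, 0.0], "出生地": [0.0, 1.0]}
print(assign([0.9, 0.1], means))   # near the "任职" cluster mean
print(assign([5.0, 5.0], means))   # far from every cluster mean
```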
In the embodiment of the present application, by clustering the entities and the entity relationships, the extracted open relationships can be classified automatically, which improves the efficiency of open relationship extraction.
In the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained; depending on the original training set used, the method is applicable not only to English but also to Chinese open relationship extraction. In addition, strategy labeling and entity reinforcement processing are performed on the original training set, which improves the accuracy of open relationship extraction. Moreover, an open entity extraction model and an open relationship extraction model are obtained directly by simply performing entity fine-tuning and relationship fine-tuning on the language model with the standard training set, without occupying a large amount of computing resources, which simplifies the model training process and improves the efficiency of open relationship extraction. Therefore, the embodiments of the present application can solve the problem of low efficiency in open relationship extraction.
As shown in FIG. 6, it is a functional block diagram of an open relationship extraction apparatus provided by an embodiment of the present application.
The open relationship extraction apparatus 100 described in the present application may be installed in an electronic device. According to the implemented functions, the open relationship extraction apparatus 100 may include a training set construction module 101, an entity reinforcement module 102, a model construction module 103, an entity extraction module 104 and a relationship extraction module 105. A module described in the present application may also be referred to as a unit, and refers to a series of computer program segments that can be executed by the processor of the electronic device, can perform a fixed function, and are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
The training set construction module 101 is configured to obtain an original entity data set and an original relationship data set, perform remote supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set, to obtain an original training set.
Specifically, the training set construction module 101 obtains the original entity data set and the original relationship data set through the following operations:
Crawling text data from source websites using a preset data crawling tool, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;
Obtaining an open-source entity data set, where the entity data set includes triple information and description information corresponding to each piece of triple information; performing deduplication processing on the triple information to obtain deduplicated triples; and aggregating the deduplicated triples and the description information corresponding to the triple information to obtain the original entity data set.
The preset data crawling tool may be the Hawk data crawling tool, and the source websites may be portal websites and professional websites in different fields, including finance, law, medical care, education, entertainment, sports and so on. The text data on a source website can be crawled directly with the Hawk data crawling tool. In the embodiment of the present application, three sentences may be set as the minimum segmentation unit of the text data, with each segment no longer than 256 characters; when a segment exceeds that length, it is cut down to two sentences or even one sentence, or skipped directly. The open-source entity data set may include data sets such as the Chinese general encyclopedia knowledge graph CN-DBpedia. CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Baike, Hudong Baike and Chinese Wikipedia); after filtering, fusion, inference and other operations, a high-quality structured data set is finally formed. This graph contains not only (head entity, relationship, tail entity) triple information but also entity description information (from Baidu Baike and similar sources).
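The segmentation rule just described — group up to three sentences per segment, capped at 256 characters, shrinking to two or one sentence and otherwise skipping — can be sketched as follows; the sentence delimiters are an assumption.

```python
import re

# Chunking sketch for crawled text: try 3 sentences, then 2, then 1;
# skip a sentence that alone exceeds the length cap.
def chunk_text(text, max_len=256, group=3):
    sents = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    chunks, i = [], 0
    while i < len(sents):
        for size in range(group, 0, -1):
            piece = "".join(sents[i:i + size])
            if len(piece) <= max_len:
                chunks.append(piece)
                i += size
                break
        else:
            i += 1  # single sentence still too long: skip it
    return chunks

print(chunk_text("第一句。第二句。第三句。第四句。"))
```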
In detail, the training set construction module 101 obtains the deduplicated triples through the following operations:
Selecting target triples from the entity data set in turn;
Calculating the distance value between the target triple and all unselected triple information in the entity data set;
When the distance value is greater than a preset distance threshold, determining that the target triple is not a duplicate, and re-selecting a target triple from the entity data set for calculation;
When the distance value is less than or equal to the preset distance threshold, determining that the target triple is a duplicate, and deleting the target triple, to obtain the deduplicated triples.
In the embodiment of the present application, the following distance algorithm is used to calculate the distance value between the target triple and all unselected triple information in the entity data set:
(Formula image: PCTCN2021109488-appb-000002)
where d is the distance value, w_j is the j-th target triple, w_k is any unselected piece of triple information in the entity data set, and n is the number of pieces of triple information in the entity data set.
By performing deduplication processing on the triple information in the entity data set, the embodiment of the present application can avoid subsequent processing of identical triple information and reduce the amount of data to be processed, which helps improve the efficiency of open relationship extraction.
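The deduplication loop above can be sketched as follows. The patent's actual distance formula appears only as an image in this text, so a simple slot-overlap count stands in for it here — an assumption purely for illustration.

```python
# Deduplication sketch: a candidate triple is kept only when its distance
# to every already-kept triple exceeds the threshold.
def triple_distance(a, b):
    # Stand-in metric: number of differing slots (head, relation, tail).
    return sum(x != y for x, y in zip(a, b))

def dedup(triples, threshold=0):
    kept = []
    for t in triples:
        if all(triple_distance(t, k) > threshold for k in kept):
            kept.append(t)
    return kept

data = [("库克", "任职于", "苹果公司"),
        ("库克", "任职于", "苹果公司"),   # exact duplicate, removed
        ("乔布斯", "创立", "苹果公司")]
print(dedup(data))
```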
Further, the training set construction module 101 obtains the original training set through the following operations:
Matching the triple information in the original entity data set against the text segments in the original relationship data set, and performing position labeling according to the matching result, to obtain matching data;
Calculating, using a pre-built disambiguation model, the matching probability between the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
When the matching probability is greater than a preset threshold, aggregating the text segments and the triple information to obtain the original training set.
Remote supervision refers to a method of automatically labeling data, without manual participation, by using ready-made triples from an open-source knowledge graph, thereby obtaining a large labeled data set. In the embodiment of the present application, the triples in the original entity data set are matched against the text segments in the original relationship data set: a triple matches only when at least its head entity and tail entity both appear in the context of the current text segment, and the position of each entity in the segment is labeled (for example, "text": the text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" denotes the current text segment and "entity_idx" denotes the positions of the entities within it). The matched triples and text segments are aggregated to obtain the matching data.
At the same time, a pre-built disambiguation model may be used to perform the entity linking. The pre-built disambiguation model may be a BERT model trained on open-source short-text matching. With this BERT model as the backbone, the entity linking concatenates the text in the matching data (the triple together with the text segment in which it occurs) and the triple's description in the original entity data set as input, and outputs a matching probability. The preset threshold may be 0.5: a matching probability greater than 0.5 indicates that the entities in the matching data and in the original entity data set are the same class of entity. Through the remote supervision and entity linking, information such as the relationships between entities and entity descriptions can be determined quickly. Moreover, since the original training set contains both entity information and relationship information, it can be used directly to train the relationship extraction model and the entity extraction model.
In the embodiment of the present application, by performing remote supervision and entity linking on the original entity data set and the original relationship data set, a large, information-rich original training set can be obtained without manual annotation.
The entity reinforcement module 102 is configured to perform strategy labeling and entity reinforcement processing on the original training set in turn, to obtain a standard training set.
Specifically, the entity reinforcement module 102 obtains the standard training set through the following operations:
Classifying the text segments in the original training set using preset labeling symbols to obtain classified samples, and labeling the triples in the classified samples to obtain labeled entities;
Performing entity reinforcement processing on the labeled entities using a preset natural language processing library, and aggregating the reinforced classified samples to obtain the standard training set.
In the embodiment of the present application, the strategy labeling may be performed based on the MTB (Matching The Blank) method, where the preset labeling symbols may be <tag> and </tag>; the part enclosed by <tag> and </tag> is the mention of an entity or relationship in the sentence. For example, a classified sample may be [CLS]XXX<entity_head>XXX</entity_head>XXX<rel>XXX</rel>XXX<entity_tail>XXX</entity_tail>XXX[SEP], where entity_head, rel and entity_tail denote the head entity, the relationship and the tail entity respectively. [CLS] and [SEP] are separators: [CLS] is the classification bit, whose position outputs the binary classification result 0/1 indicating whether a relationship exists between the two current entities, and [SEP] is the termination bit, indicating the end of the sentence.
In the embodiment of the present application, the BIO sequence labeling scheme may be used to label the entities in the classified samples: characters within an entity mention are labeled B or I, and non-entity characters are labeled O. Since this is open entity recognition, there are only two classes, entity and non-entity. The preset natural language processing library may be the HanLP natural language processing library; its dependency syntax parsing tool is used to analyze the prefix of the current entity so as to reinforce it. For example, if the current entity is "Cook" and its prefix is "Apple CEO", the reinforced entity is "Apple CEO Cook".
By performing strategy labeling and entity reinforcement processing on the original training set, the embodiment of the present application can improve the accuracy of model training.
The model construction module 103 is configured to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model.
In the embodiment of the present application, the pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm from the open-source transformer project. The model is written with the pytorch framework and has been trained in advance on a large-scale open-source Chinese corpus. The training process determines the error in a cloze manner: several characters in the input Chinese corpus text are deliberately masked, the output is checked to see whether the model predicts the masked characters from the unmasked context, and the difference between the model's predicted value and the true value is computed, until that difference falls below a pre-specified threshold.
Specifically, the model construction module 103 obtains the open entity extraction model and the open relationship extraction model through the following operations:
Randomly adding blank positions to the classified samples to obtain training samples, and predicting the entities in the training samples using the language model to obtain predicted entities;
Calculating the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determining the language model to be the open entity extraction model;
Calculating the relationship span between the predicted entities using a preset relationship span prediction layer;
Based on the relationship span, outputting the prediction result between the predicted entities using a preset binary classification linear layer, where the prediction result includes whether a relationship exists;
When the ratio of prediction results in which a relationship exists to all prediction results is greater than a preset relationship threshold, combining the language model, the relationship span prediction layer and the binary classification linear layer to obtain the open relationship extraction model.
The relationship span may be represented by a one-hot vector. In the binary classification linear layer, the prediction result between the predicted entities is determined at the [CLS] position as 0 or 1, where 0 indicates that the relationship does not exist and 1 indicates that it exists. Moreover, through the relationship span prediction layer and the binary classification linear layer, relationship prediction is reduced to a finite binary classification problem, which greatly simplifies the training process of the model.
The entity extraction module 104 is configured to segment the text to be classified to obtain segmented text, and extract the entities in the segmented text using the open entity extraction model.
In detail, the entity extraction module 104 extracts the entities in the segmented text through the following operations:
Splitting the text to be classified into sentences according to the punctuation marks in the text, to obtain segmented sentences to be classified;
Extracting all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
Through the open entity extraction model, the embodiment of the present application can quickly extract the entities in the text to be classified, which improves the speed of entity relationship prediction.
所述关系抽取模块105,用于利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The relationship extraction module 105 is configured to use the open relationship extraction model to predict the entity relationship of the entity, and to cluster the entity and the entity relationship to obtain a relationship extraction result.
具体地,所述关系抽取模块105通过下述操作得到关系抽取结果:Specifically, the relationship extraction module 105 obtains the relationship extraction result through the following operations:
基于所述待分类实体,利用所述开放关系抽取模型抽取所述待分类断句中的关系,并过滤掉没有关系的所述待分类实体,得到预测三元组;Based on the entity to be classified, the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
利用预设的聚类方法对所述预测三元组进行聚类,得到多个聚类团,其中,所述聚类团中包括所述关系抽取结果。The predicted triples are clustered by using a preset clustering method to obtain a plurality of cluster clusters, wherein the cluster clusters include the relationship extraction result.
本申请实施例中,利用所述开放关系抽取模型可以得到三元组(头实体,关系,尾实体)及二元组(头实体,None,尾实体),其中,二元组表示没有关系的实体对,通过过滤所述二元组,提高了关系预测的准确率。所述预设的聚类方法可以为K均值聚类方法,所述K均值聚类方法通过word2vec算法将所述预测三元组中的关系向量化,并计算向量间的距离,根据所述距离将所述预测三元组聚拢到K个中心点,形成K个聚类团,这时由人工对每个聚类团概括出类型名称,从而对所述预测三元组进行分类。同时,每个聚类团稳定(不在发生变化)时,每个聚类团会求出团内所有关系向量的均值,之后新的关系会分别与每个已有聚类团的均值进行比较,若与多个聚类团的相似度(可以为欧式距离)均高于预先定义的相似阈值,则归到最相似的那个团中,若与所有聚类团的相似度均低于预定义的相似阈值,则独立归到“未知”类,当“未知”类中的关系积累到一定量(一般为已知类关系的70%),则针对未知关系重复K均值聚类方法及人工定义类型。In the embodiment of the present application, using the open relationship extraction model, a triple (head entity, relationship, tail entity) and a double (head entity, None, tail entity) can be obtained, where a double represents an unrelated entity. Entity pairs, the accuracy of relationship prediction is improved by filtering the bigram. The preset clustering method may be a K-means clustering method. The K-means clustering method vectorizes the relationship in the predicted triplet through the word2vec algorithm, and calculates the distance between the vectors, according to the distance. The predicted triples are gathered into K central points to form K clusters. At this time, a type name is manually summarized for each cluster, so as to classify the predicted triples. At the same time, when each cluster is stable (not changing), each cluster will find the mean of all relation vectors in the cluster, and then the new relation will be compared with the mean of each existing cluster. If the similarity with multiple clusters (which can be Euclidean distance) is higher than the predefined similarity threshold, it will be classified into the most similar cluster. If the similarity with all clusters is lower than the predefined similarity threshold The similarity threshold is independently classified into the "unknown" class. When the relationship in the "unknown" class accumulates to a certain amount (usually 70% of the known class relationship), the K-means clustering method and manual definition type are repeated for the unknown relationship. .
本申请实施例中,通过对所述实体及所述实体关系进行聚类,可以自动对抽取到的开放关系进行分类,提高了开放关系抽取的效率。In the embodiment of the present application, by clustering the entities and the entity relationships, the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
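As an illustration of the threshold-based assignment described above, the following is a minimal sketch. It assumes 2-dimensional toy vectors in place of real word2vec relationship embeddings, and the function name, threshold value and data are hypothetical, not part of the application:

```python
# Hedged sketch: assign a new relation vector to the nearest stable
# cluster centroid if it is close enough (Euclidean distance), else to
# an "unknown" bucket. Toy vectors stand in for word2vec embeddings of
# relation phrases; the threshold is illustrative.
import numpy as np

def assign_relation(vec, centroids, sim_threshold):
    """Return the index of the closest centroid whose Euclidean distance
    is within the threshold, or -1 for the 'unknown' class."""
    dists = np.linalg.norm(centroids - vec, axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= sim_threshold else -1

# Toy centroids for two already-stable clusters (means of member vectors).
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])

print(assign_relation(np.array([0.9, 0.1]), centroids, sim_threshold=0.5))  # → 0
print(assign_relation(np.array([5.0, 5.0]), centroids, sim_threshold=0.5))  # → -1
```

A new relationship whose distance to every stable cluster mean exceeds the threshold falls into the "unknown" class (index -1 here); once enough such relationships accumulate, K-means clustering and manual type naming would be repeated over them, as described above.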
如图7所示,是本申请一实施例提供的实现开放关系抽取方法的电子设备的结构示意图。FIG. 7 is a schematic structural diagram of an electronic device implementing the open relationship extraction method provided by an embodiment of the present application.
所述电子设备1可以包括处理器10、存储器11和总线,还可以包括存储在所述存储器11中并可在所述处理器10上运行的计算机程序,如开放关系抽取程序12。The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an open relationship extraction program 12.
其中,所述存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等。所述存储器11在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。所述存储器11在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(Smart Media Card,SMC)、安全数字(Secure Digital,SD)卡、闪存卡(Flash Card)等。进一步地,所述存储器11还可以既包括电子设备1的内部存储单元也包括外部存储设备。所述存储器11不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如开放关系抽取程序12的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, CD etc. The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 . In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital) equipped on the electronic device 1. , SD) card, flash memory card (Flash Card), etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the open relationship extraction program 12, etc., but also can be used to temporarily store data that has been output or will be output.
所述处理器10在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(Central Processing unit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。所述处理器10是所述电子设备的控制核心(Control Unit),利用各种接口和线路连接整个电子设备的各个部件,通过运行或执行存储在所述存储器11内的程序或者模块(例如开放关系抽取程序等),以及调用存储在所述存储器11内的数据,以执行电子设备1的各种功能和处理数据。In some embodiments, the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including combinations of one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors and various control chips. The processor 10 is the control unit of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the open relationship extraction program) and calling the data stored in the memory 11.
所述总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。所述总线被设置为实现所述存储器11以及至少一个处理器10等之间的连接通信。The bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (Extended industry standard architecture, EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
图7仅示出了具有部件的电子设备,本领域技术人员可以理解的是,图7示出的结构并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 7 shows only an electronic device with some components. Those skilled in the art will understand that the structure shown in FIG. 7 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
例如,尽管未示出,所述电子设备1还可以包括给各个部件供电的电源(比如电池),优选地,电源可以通过电源管理装置与所述至少一个处理器10逻辑相连,从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等,在此不再赘述。For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management and power-consumption management are implemented through the power management device. The power supply may further include one or more DC or AC power sources, recharging devices, power-failure detection circuits, power converters or inverters, power status indicators and any other components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module and the like, which will not be repeated here.
进一步地,所述电子设备1还可以包括网络接口,可选地,所述网络接口可以包括有线接口和/或无线接口(如WI-FI接口、蓝牙接口等),通常用于在该电子设备1与其他电子设备之间建立通信连接。Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。Optionally, the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。It should be understood that the embodiments are only used for illustration, and are not limited by this structure in the scope of the patent application.
所述电子设备1中的所述存储器11存储的开放关系抽取程序12是多个指令的组合,在所述处理器10中运行时,可以实现:The open relationship extraction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions which, when run in the processor 10, can implement:
获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;Obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
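The text-splitting operation above (segmenting the text to be classified at punctuation before entity extraction) can be sketched as follows; the punctuation set and the regular expression are illustrative assumptions rather than the application's actual rule:

```python
# Hedged sketch: split text into clause-level segments at common
# Chinese/Western sentence-ending punctuation. The punctuation set is
# an illustrative assumption, not the application's actual rule.
import re

def split_text(text):
    """Split on 。！？!? and drop empty segments."""
    parts = re.split(r"[。！？!?]", text)
    return [p.strip() for p in parts if p.strip()]

print(split_text("张三出生于北京。他毕业于清华大学！现任职于某公司"))
# → ['张三出生于北京', '他毕业于清华大学', '现任职于某公司']
```

Each resulting segment would then be passed to the open entity extraction model, and the extracted entities to the open relationship extraction model, as described in the steps above.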
具体地,所述处理器10对上述指令的具体实现方法可参考图1至图5对应实施例中相关步骤的描述,在此不赘述。Specifically, for the specific implementation method of the above-mentioned instruction by the processor 10, reference may be made to the description of the relevant steps in the corresponding embodiments of FIG. 1 to FIG. 5 , which will not be repeated here.
进一步地,所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。所述计算机可读存储介质可以是易失性的,也可以是非易失性的。例如,所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)。Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM).
本申请还提供一种计算机可读存储介质,所述计算机可读存储介质可以是易失性的,也可以是非易失性的,所述可读存储介质存储有计算机程序,所述计算机程序在被电子设备的处理器所执行时,可以实现:The present application further provides a computer-readable storage medium, which may be volatile or non-volatile. The readable storage medium stores a computer program which, when executed by a processor of an electronic device, can implement:
获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;Obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
在本申请所提供的几个实施例中,应该理解到,所揭露的设备,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the modules is only a division by logical function, and there may be other division manners in actual implementation.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。It will be apparent to those skilled in the art that the present application is not limited to the details of the above-described exemplary embodiments, but that the present application may be implemented in other specific forms without departing from the spirit or essential characteristics of the present application.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附图标记视为限制所涉及的权利要求。Therefore, in all respects the embodiments should be regarded as exemplary and non-restrictive, and the scope of the present application is defined by the appended claims rather than by the foregoing description; it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced in the present application. Any reference sign in the claims shall not be construed as limiting the claim involved.
本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中,人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。The embodiments of the present application may acquire and process related data based on artificial-intelligence technology. Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。The basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a string of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer and the like.
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Terms such as "second" are used to denote names and do not denote any particular order.
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. 一种开放关系抽取方法,其中,所述方法包括:A method for extracting open relationships, wherein the method comprises:
    获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
    获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
    对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
  2. 如权利要求1所述的开放关系抽取方法,其中,所述获取原始实体数据集及原始关系数据集,包括:The method for extracting open relationships as claimed in claim 1, wherein the acquiring the original entity data set and the original relationship data set comprises:
    利用预设的数据抓取工具从源网站中抓取文本数据,并对所述文本数据进行切分,得到文本断句,汇总所述文本断句得到所述原始关系数据集;Use a preset data grabbing tool to grab text data from the source website, and segment the text data to obtain text segments, and summarize the text segments to obtain the original relational data set;
    获取开源的实体数据集,其中,所述实体数据集中包括三元组信息及每个三元组信息对应的描述信息,并对所述三元组信息进行去重处理,得到去重三元组,汇总所述去重三元组及所述三元组信息对应的描述信息得到所述原始实体数据集。obtaining an open-source entity data set, where the entity data set includes triplet information and description information corresponding to each piece of triplet information; performing deduplication processing on the triplet information to obtain deduplicated triples; and summarizing the deduplicated triples and the description information corresponding to the triplet information to obtain the original entity data set.
  3. 如权利要求2所述的开放关系抽取方法,其中,所述分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集,包括:The open relationship extraction method according to claim 2, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    将所述原始实体数据集中的三元组信息和所述原始关系数据集中的文本断句进行匹配,并根据匹配结果进行位置标注,得到匹配数据;Matching the triplet information in the original entity data set and the text segmentation in the original relationship data set, and performing position labeling according to the matching result to obtain matching data;
    利用预构建的消歧模型,计算所述匹配数据中所述匹配结果及所述原始实体数据集中所述三元组信息对应的描述信息的匹配概率;Using a pre-built disambiguation model, calculate the matching probability of the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
    当所述匹配概率大于预设阈值时,汇总所述文本断句及所述三元组信息得到所述原始训练集。When the matching probability is greater than a preset threshold, the original training set is obtained by summarizing the text segmentation and the triplet information.
  4. 如权利要求2所述的开放关系抽取方法,其中,所述对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集,包括:The method for extracting open relationships as claimed in claim 2, wherein the original training set is sequentially subjected to policy labeling and entity reinforcement processing to obtain a standard training set, comprising:
    利用预设的标注符号对所述原始训练集中的文本断句进行分类,得到分类样本,并对所述分类样本中的三元组进行标注,得到标注实体;Classify the text segments in the original training set by using preset annotation symbols to obtain classified samples, and annotate the triples in the classified samples to obtain labeled entities;
    利用预设的自然语言处理库对所述标注实体进行实体加强处理,汇总加强后的分类样本得到所述标准训练集。Use a preset natural language processing library to perform entity enhancement processing on the labeled entities, and collect the enhanced classification samples to obtain the standard training set.
  5. 如权利要求4中所述的开放关系抽取方法,其中,所述利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型,包括:The open relationship extraction method according to claim 4, wherein performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, comprises:
    在所述分类样本中随机添加空白位,得到训练样本,利用所述语言模型预测所述训练样本中的实体,得到预测实体;Randomly adding blank bits to the classified samples to obtain training samples, and using the language model to predict entities in the training samples to obtain predicted entities;
    计算所述预测实体和所述训练样本中真实实体的差值,当所述差值小于预设的阈值时,确定所述语言模型为所述开放实体抽取模型;Calculate the difference between the predicted entity and the real entity in the training sample, and when the difference is less than a preset threshold, determine that the language model is the open entity extraction model;
    利用预设的关系跨度预测层计算所述预测实体间的关系跨度;Calculate the relationship span between the predicted entities by using a preset relationship span prediction layer;
    基于所述关系跨度,利用预设的二分类线性层输出所述预测实体间的预测结果,其中, 所述预测结果包括关系存在;Based on the relationship span, a preset two-class linear layer is used to output the prediction result between the prediction entities, wherein the prediction result includes the existence of a relationship;
    当所述关系存在的预测结果与所有预测结果的比值大于预设的关系阈值时,组合所述语言模型、所述关系跨度预测层及所述二分类线性层,以得到所述开放关系抽取模型。When the ratio of the prediction result of the existence of the relationship to all the prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the two-class linear layer to obtain the open relationship extraction model .
  6. 如权利要求1所述的开放关系抽取方法,其中,所述对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体,包括:The method for extracting open relationships according to claim 1, wherein, segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model, comprising:
    根据所述待分类文本中的标点符号将所述待分类文本进行断句,得到待分类断句;Segmenting the text to be classified according to the punctuation marks in the text to be classified, to obtain the segmented sentences to be classified;
    利用所述开放实体抽取模型抽取所述待分类文本中的所有实体,得到待分类实体。All entities in the text to be classified are extracted by using the open entity extraction model to obtain entities to be classified.
  7. 如权利要求6所述的开放关系抽取方法,其中,所述利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果,包括:The method for extracting open relationships according to claim 6, wherein the use of the open relationship extraction model to predict the entity relationship of the entity, and clustering the entity and the entity relationship to obtain a relationship extraction result, include:
    基于所述待分类实体,利用所述开放关系抽取模型抽取所述待分类断句中的关系,并过滤掉没有关系的所述待分类实体,得到预测三元组;Based on the entity to be classified, the open relationship extraction model is used to extract the relationship in the segmented sentence to be classified, and the entity to be classified that has no relationship is filtered out to obtain a predicted triplet;
    利用预设的聚类方法对所述预测三元组进行聚类,得到多个聚类团,其中,所述聚类团中包括所述关系抽取结果。The predicted triples are clustered by using a preset clustering method to obtain a plurality of cluster clusters, wherein the cluster clusters include the relationship extraction result.
  8. 一种开放关系抽取装置,其中,所述装置包括:An apparatus for extracting open relationships, wherein the apparatus comprises:
    训练集构建模块,用于获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;a training set construction module, configured to obtain an original entity data set and an original relationship data set, perform distant supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    实体加强模块,用于对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;an entity reinforcement module, which is used to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set;
    模型构建模块,用于获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;a model building module for obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and using the standard training set to perform relational fine-tuning on the language model, Get the open relation extraction model;
    实体抽取模块,用于对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;an entity extraction module, used for segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    关系抽取模块,用于利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The relationship extraction module is used for predicting the entity relationship of the entity by using the open relationship extraction model, and clustering the entity and the entity relationship to obtain a relationship extraction result.
  9. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device comprises:
    至少一个处理器;以及,at least one processor; and,
    与所述至少一个处理器通信连接的存储器;其中,a memory communicatively coupled to the at least one processor; wherein,
    所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如下步骤:The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of:
    获取原始实体数据集及原始关系数据集,分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集;obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集;Performing strategy labeling and entity strengthening processing on the original training set in turn to obtain a standard training set;
    获取预训练的语言模型,利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型;Obtaining a pre-trained language model, using the standard training set to perform entity fine-tuning on the language model to obtain an open entity extraction model, and using the standard training set to perform relationship fine-tuning on the language model to obtain an open relationship extraction model;
    对待分类文本进行切分,得到切分文本,并利用所述开放实体抽取模型提取所述切分文本中的实体;Segmenting the text to be classified, obtaining segmented text, and extracting entities in the segmented text by using the open entity extraction model;
    利用所述开放关系抽取模型预测所述实体的实体关系,并对所述实体及所述实体关系进行聚类,得到关系抽取结果。The entity relationship of the entity is predicted by the open relationship extraction model, and the entity and the entity relationship are clustered to obtain a relationship extraction result.
  10. 如权利要求9所述的电子设备,其中,所述获取原始实体数据集及原始关系数据集,包括:The electronic device according to claim 9, wherein the acquiring the original entity data set and the original relationship data set comprises:
    利用预设的数据抓取工具从源网站中抓取文本数据,并对所述文本数据进行切分,得到文本断句,汇总所述文本断句得到所述原始关系数据集;Use a preset data grabbing tool to grab text data from the source website, and segment the text data to obtain text segments, and summarize the text segments to obtain the original relational data set;
    获取开源的实体数据集,其中,所述实体数据集中包括三元组信息及每个三元组信息对应的描述信息,并对所述三元组信息进行去重处理,得到去重三元组,汇总所述去重三元组及所述三元组信息对应的描述信息得到所述原始实体数据集。Obtain an open source entity data set, wherein the entity data set includes triplet information and description information corresponding to each triplet information, and deduplicates the triplet information to obtain a deduplication triplet , summarizing the deduplication triplet and the description information corresponding to the triplet information to obtain the original entity data set.
  11. 如权利要求10所述的电子设备,其中,所述分别对所述原始实体数据集及所述原始关系数据集进行远程监督,并将监督到的所述原始实体数据集与所述原始关系数据集进行实体链指,得到原始训练集,包括:The electronic device according to claim 10, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    将所述原始实体数据集中的三元组信息和所述原始关系数据集中的文本断句进行匹配,并根据匹配结果进行位置标注,得到匹配数据;Matching the triplet information in the original entity data set and the text segmentation in the original relationship data set, and performing position labeling according to the matching result to obtain matching data;
    利用预构建的消歧模型,计算所述匹配数据中所述匹配结果及所述原始实体数据集中所述三元组信息对应的描述信息的匹配概率;Using a pre-built disambiguation model, calculate the matching probability of the matching result in the matching data and the description information corresponding to the triple information in the original entity data set;
    当所述匹配概率大于预设阈值时,汇总所述文本断句及所述三元组信息得到所述原始训练集。When the matching probability is greater than a preset threshold, the original training set is obtained by summarizing the text segmentation and the triplet information.
  12. 如权利要求10所述的电子设备,其中,所述对所述原始训练集依次进行策略标注及实体加强处理,得到标准训练集,包括:The electronic device according to claim 10 , wherein the original training set is sequentially subjected to strategy labeling and entity enhancement processing to obtain a standard training set, comprising:
    利用预设的标注符号对所述原始训练集中的文本断句进行分类,得到分类样本,并对所述分类样本中的三元组进行标注,得到标注实体;Classify the text segments in the original training set by using preset annotation symbols to obtain classified samples, and annotate the triples in the classified samples to obtain labeled entities;
    利用预设的自然语言处理库对所述标注实体进行实体加强处理,汇总加强后的分类样本得到所述标准训练集。Use a preset natural language processing library to perform entity enhancement processing on the labeled entities, and collect the enhanced classification samples to obtain the standard training set.
  13. 如权利要求12中所述的电子设备,其中,所述利用所述标准训练集对所述语言模型进行实体微调,得到开放实体抽取模型,及利用所述标准训练集对所述语言模型进行关系微调,得到开放关系抽取模型,包括:The electronic device according to claim 12, wherein performing entity fine-tuning on the language model using the standard training set to obtain the open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain the open relationship extraction model, comprises:
    在所述分类样本中随机添加空白位,得到训练样本,利用所述语言模型预测所述训练样本中的实体,得到预测实体;Randomly adding blank bits to the classified samples to obtain training samples, and using the language model to predict entities in the training samples to obtain predicted entities;
    计算所述预测实体和所述训练样本中真实实体的差值,当所述差值小于预设的阈值时,确定所述语言模型为所述开放实体抽取模型;Calculate the difference between the predicted entity and the real entity in the training sample, and when the difference is less than a preset threshold, determine that the language model is the open entity extraction model;
    利用预设的关系跨度预测层计算所述预测实体间的关系跨度;Calculate the relationship span between the predicted entities by using a preset relationship span prediction layer;
    基于所述关系跨度,利用预设的二分类线性层输出所述预测实体间的预测结果,其中,所述预测结果包括关系存在;Based on the relationship span, a preset two-class linear layer is used to output a prediction result between the prediction entities, wherein the prediction result includes the existence of a relationship;
    当所述关系存在的预测结果与所有预测结果的比值大于预设的关系阈值时,组合所述语言模型、所述关系跨度预测层及所述二分类线性层,以得到所述开放关系抽取模型。When the ratio of the prediction result of the existence of the relationship to all the prediction results is greater than a preset relationship threshold, combine the language model, the relationship span prediction layer and the two-class linear layer to obtain the open relationship extraction model .
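The span-prediction and binary-classification stages above can be caricatured in plain Python. The scoring functions here are hypothetical stand-ins for the preset layers, not the patented architecture:

```python
def relation_span(sent_tokens, head_idx, tail_idx):
    """Relationship span: the token window between two predicted entities."""
    lo, hi = sorted((head_idx, tail_idx))
    return sent_tokens[lo + 1:hi]

def relation_exists(span_tokens, relation_vocab):
    """Binary-classification stand-in: predict 'relationship exists'
    when the span contains any known relation-indicating token."""
    return any(tok in relation_vocab for tok in span_tokens)

def accept_model(predictions, relation_threshold=0.5):
    """Combine the components only if the share of 'exists' predictions
    clears the preset relationship threshold, as in the claim."""
    if not predictions:
        return False
    return sum(predictions) / len(predictions) > relation_threshold
```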
  14. The electronic device according to claim 9, wherein segmenting the text to be classified to obtain segmented text, and extracting the entities in the segmented text using the open entity extraction model, comprises:
    splitting the text to be classified into sentence segments according to the punctuation marks in the text to be classified, to obtain sentence segments to be classified;
    extracting all entities in the text to be classified using the open entity extraction model, to obtain entities to be classified.
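The punctuation-based segmentation step is straightforward; a minimal sketch, assuming the usual Chinese and Western sentence-final punctuation marks:

```python
import re

def split_sentences(text):
    """Split the text to be classified into sentence segments on
    sentence-final punctuation (Chinese and Western variants)."""
    parts = re.split(r"[。！？.!?;；]+", text)
    return [p.strip() for p in parts if p.strip()]
```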
  15. The electronic device according to claim 14, wherein predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain relationship extraction results, comprises:
    based on the entities to be classified, extracting the relationships in the sentence segments to be classified using the open relationship extraction model, and filtering out entities to be classified that have no relationship, to obtain predicted triplets;
    clustering the predicted triplets using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relationship extraction results.
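As a sketch of the final clustering step, grouping predicted triplets by their relation string is a simplistic stand-in for whatever "preset clustering method" an implementation would actually use:

```python
from collections import defaultdict

def cluster_triples(triples):
    """Group predicted (head, relation, tail) triplets into clusters;
    each cluster holds the relationship extraction results for one relation."""
    clusters = defaultdict(list)
    for head, rel, tail in triples:
        clusters[rel].append((head, rel, tail))
    return dict(clusters)
```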
  16. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the following steps are implemented:
    obtaining an original entity data set and an original relationship data set, performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
    sequentially performing strategy labeling and entity enhancement processing on the original training set to obtain a standard training set;
    obtaining a pre-trained language model, performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model;
    segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text using the open entity extraction model;
    predicting the entity relationships of the entities using the open relationship extraction model, and clustering the entities and the entity relationships to obtain relationship extraction results.
  17. The computer-readable storage medium according to claim 16, wherein obtaining the original entity data set and the original relationship data set comprises:
    crawling text data from a source website using a preset data crawling tool, segmenting the text data to obtain text segments, and aggregating the text segments to obtain the original relationship data set;
    obtaining an open-source entity data set, wherein the entity data set includes triplet information and description information corresponding to each piece of triplet information, de-duplicating the triplet information to obtain de-duplicated triplets, and aggregating the de-duplicated triplets and the description information corresponding to the triplet information to obtain the original entity data set.
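The de-duplication-and-aggregation step of this claim could be sketched as follows; keeping the first description seen per triplet is an assumption about how the aggregation resolves duplicates:

```python
def dedupe_triples(entity_records):
    """De-duplicate (head, relation, tail) triplets while retaining the
    first description seen for each, yielding the original entity data set."""
    seen = {}
    for head, rel, tail, desc in entity_records:
        key = (head, rel, tail)
        if key not in seen:  # later duplicates are discarded
            seen[key] = desc
    return [(h, r, t, d) for (h, r, t), d in seen.items()]
```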
  18. The computer-readable storage medium according to claim 17, wherein performing distant supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, comprises:
    matching the triplet information in the original entity data set against the text segments in the original relationship data set, and labeling positions according to the matching results to obtain matching data;
    calculating, with a pre-built disambiguation model, the matching probability between the matching results in the matching data and the description information corresponding to the triplet information in the original entity data set;
    when the matching probability is greater than a preset threshold, aggregating the text segments and the triplet information to obtain the original training set.
  19. The computer-readable storage medium according to claim 17, wherein sequentially performing strategy labeling and entity enhancement processing on the original training set to obtain a standard training set comprises:
    classifying the text segments in the original training set using preset annotation symbols to obtain classified samples, and annotating the triplets in the classified samples to obtain annotated entities;
    performing entity enhancement processing on the annotated entities using a preset natural language processing library, and aggregating the enhanced classified samples to obtain the standard training set.
  20. The computer-readable storage medium according to claim 19, wherein performing entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and performing relationship fine-tuning on the language model using the standard training set to obtain an open relationship extraction model, comprises:
    randomly adding blank positions to the classified samples to obtain training samples, and predicting the entities in the training samples using the language model to obtain predicted entities;
    calculating the difference between the predicted entities and the real entities in the training samples, and when the difference is less than a preset threshold, determining the language model to be the open entity extraction model;
    calculating the relationship spans between the predicted entities using a preset relationship span prediction layer;
    based on the relationship spans, outputting prediction results between the predicted entities using a preset binary classification linear layer, wherein the prediction results include relationship existence;
    when the ratio of prediction results indicating that a relationship exists to all prediction results is greater than a preset relationship threshold, combining the language model, the relationship span prediction layer, and the binary classification linear layer to obtain the open relationship extraction model.
PCT/CN2021/109488 2021-04-21 2021-07-30 Open relationship extraction method and apparatus, electronic device, and storage medium WO2022222300A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110428927.5 2021-04-21
CN202110428927.5A CN113051356B (en) 2021-04-21 2021-04-21 Open relation extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022222300A1 true WO2022222300A1 (en) 2022-10-27

Family

ID=76519844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109488 WO2022222300A1 (en) 2021-04-21 2021-07-30 Open relationship extraction method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113051356B (en)
WO (1) WO2022222300A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051356B (en) * 2021-04-21 2023-05-30 深圳壹账通智能科技有限公司 Open relation extraction method and device, electronic equipment and storage medium
CN113704429A (en) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 Semi-supervised learning-based intention identification method, device, equipment and medium
CN113553854B (en) * 2021-09-18 2021-12-10 航天宏康智能科技(北京)有限公司 Entity relation joint extraction method and device
CN114528418B (en) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 Text processing method, system and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
US20150178273A1 (en) * 2013-12-20 2015-06-25 Microsoft Corporation Unsupervised Relation Detection Model Training
CN110209836A (en) * 2019-05-17 2019-09-06 北京邮电大学 Remote supervisory Relation extraction method and device
CN110619053A (en) * 2019-09-18 2019-12-27 北京百度网讯科技有限公司 Training method of entity relation extraction model and method for extracting entity relation
CN113051356A (en) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 Open relationship extraction method and device, electronic equipment and storage medium

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
US20140032209A1 (en) * 2012-07-27 2014-01-30 University Of Washington Through Its Center For Commercialization Open information extraction
US11693873B2 (en) * 2016-02-03 2023-07-04 Global Software Innovation Pty Ltd Systems and methods for using entity/relationship model data to enhance user interface engine
US11210324B2 (en) * 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN109472033B (en) * 2018-11-19 2022-12-06 华南师范大学 Method and system for extracting entity relationship in text, storage medium and electronic equipment
CN112487203B (en) * 2019-01-25 2024-01-16 中译语通科技股份有限公司 Relation extraction system integrated with dynamic word vector
US10943068B2 (en) * 2019-03-29 2021-03-09 Microsoft Technology Licensing, Llc N-ary relation prediction over text spans
CN111291185B (en) * 2020-01-21 2023-09-22 京东方科技集团股份有限公司 Information extraction method, device, electronic equipment and storage medium
CN111339774B (en) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111324743A (en) * 2020-02-14 2020-06-23 平安科技(深圳)有限公司 Text relation extraction method and device, computer equipment and storage medium
CN111881256B (en) * 2020-07-17 2022-11-08 中国人民解放军战略支援部队信息工程大学 Text entity relation extraction method and device and computer readable storage medium equipment
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium
CN112214610B (en) * 2020-09-25 2023-09-08 中国人民解放军国防科技大学 Entity relationship joint extraction method based on span and knowledge enhancement
CN112507125A (en) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 Triple information extraction method, device, equipment and computer readable storage medium
CN112507061A (en) * 2020-12-15 2021-03-16 康键信息技术(深圳)有限公司 Multi-relation medical knowledge extraction method, device, equipment and storage medium
CN112632975A (en) * 2020-12-29 2021-04-09 北京明略软件系统有限公司 Upstream and downstream relation extraction method and device, electronic equipment and storage medium


Cited By (2)

Publication number Priority date Publication date Assignee Title
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113051356B (en) 2023-05-30
CN113051356A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
WO2021212682A1 (en) Knowledge extraction method, apparatus, electronic device, and storage medium
WO2022022045A1 (en) Knowledge graph-based text comparison method and apparatus, device, and storage medium
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN108875051B (en) Automatic knowledge graph construction method and system for massive unstructured texts
CN108399228B (en) Article classification method and device, computer equipment and storage medium
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
WO2020108063A1 (en) Feature word determining method, apparatus, and server
WO2020252919A1 (en) Resume identification method and apparatus, and computer device and storage medium
US10997369B1 (en) Systems and methods to generate sequential communication action templates by modelling communication chains and optimizing for a quantified objective
JP7164701B2 (en) Computer-readable storage medium storing methods, apparatus, and instructions for matching semantic text data with tags
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN108804423B (en) Medical text feature extraction and automatic matching method and system
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US11580119B2 (en) System and method for automatic persona generation using small text components
CN110413787B (en) Text clustering method, device, terminal and storage medium
WO2022116435A1 (en) Title generation method and apparatus, electronic device and storage medium
CN111539193A (en) Ontology-based document analysis and annotation generation
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN111177375A (en) Electronic document classification method and device
JP2020173779A (en) Identifying sequence of headings in document
CN114547301A (en) Document processing method, document processing device, recognition model training equipment and storage medium
CN115858776A (en) Variant text classification recognition method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21937524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE