WO2021121198A1 - Semantic similarity-based entity relation extraction method and apparatus, device and medium - Google Patents


Info

Publication number
WO2021121198A1 (PCT/CN2020/136349)
Authority
WIPO (PCT)
Prior art keywords
corpus
feature
relationship
annotated
fusion
Application number
PCT/CN2020/136349
Other languages
French (fr)
Chinese (zh)
Inventor
陈振东 (Chen Zhendong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021121198A1


Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06F 40/30 Semantic analysis (G06F 40/00 Handling natural language data)
    • G06F 18/22 Matching criteria, e.g. proximity measures (G06F 18/00 Pattern recognition; G06F 18/20 Analysing)
    • G06F 18/24 Classification techniques (G06F 18/00 Pattern recognition; G06F 18/20 Analysing)
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (G06F 40/20 Natural language analysis; G06F 40/205 Parsing)
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates (G06F 40/279 Recognition of textual entities)
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F 40/279 Recognition of textual entities)
    • G06F 40/295 Named entity recognition (G06F 40/279 Recognition of textual entities)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08 Learning methods (G06N 3/02 Neural networks)

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an entity relationship extraction method, device, equipment and medium based on semantic similarity.
  • Entity relationship extraction is an important research topic in the field of information extraction. Its main purpose is to extract the semantic relationship between marked entity pairs in a sentence, that is, to determine the relationship category between entity pairs in unstructured text on the basis of entity recognition, and to form structured data for storage and retrieval.
  • entity relationship extraction technology can provide theoretical support for other natural language processing technologies.
  • The existing method mainly determines the similarity between a new sentence and the original corpus by segmenting the sentence and then calculating the similarity.
  • This kind of similarity, being based on the similarity of text characters, depends heavily on the characterization ability of the word vectors. After multiple cycles, subsequently added corpora suffer from semantic drift, so the accuracy of entity relation extraction over the whole corpus becomes lower and lower.
  • the embodiments of the present application provide a method, device, computer equipment, and storage medium for extracting entity relationships based on semantic similarity, so as to improve the accuracy of extracting the relationship of named entities.
  • an embodiment of the present application provides an entity relationship extraction method based on semantic similarity, including:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • an embodiment of the present application also provides an entity relationship extraction device based on semantic similarity, including:
  • Data collection module used to obtain labeled corpus and unlabeled corpus, and store each of the labeled corpus in the seed set;
  • a feature construction module configured to construct features on the annotated corpus according to a preset feature construction method for each of the annotated corpus in the seed set, to obtain the relationship feature of the annotated corpus;
  • a data input module configured to input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model
  • the relation extraction module is configured to evaluate the unlabeled corpus based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • the embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • The method, device, equipment, and medium for extracting entity relationships based on semantic similarity obtain labeled corpus and unlabeled corpus and store each labeled corpus in a seed set; then, for each labeled corpus in the seed set, features are constructed on the labeled corpus according to a preset feature construction method to obtain the relationship features of the labeled corpus; the unlabeled corpus, the labeled corpus, and the relationship features of the labeled corpus are then input into the preset similarity evaluation model; finally, the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result. This realizes fast extraction of the entity relationships of unlabeled corpus in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
  • Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of the entity relationship extraction method based on semantic similarity of the present application;
  • Fig. 3 is a schematic structural diagram of an embodiment of the entity relationship extraction device based on semantic similarity of the present application;
  • Fig. 4 is a schematic structural diagram of an embodiment of a computer device of the present application.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104 and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smart phones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, etc.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for extracting entity relationship based on semantic similarity provided by the embodiment of the present application is executed by the server, and accordingly, the device for extracting entity relationship based on semantic similarity is set in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
  • the terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.
  • FIG. 2 shows a semantic similarity-based entity relationship extraction method provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description, and the details are as follows:
  • S201 Obtain labeled corpus and unlabeled corpus, and store each labeled corpus in a seed set.
  • NLP stands for Natural Language Processing. Common NLP tasks include but are not limited to: speech recognition, Chinese word segmentation, part-of-speech tagging, text categorization, parsing, automatic summarization, question answering, information extraction, etc.
  • Entity relationship extraction is a classic task in the NLP field. Specifically, given a sentence and the entities appearing in it, the relationship between the entities must be inferred from the semantic information of the sentence. For example, given the sentence "Tsinghua University is located in Beijing" and the entities "Tsinghua University" and "Beijing", the entity relationship extraction model obtains the relationship "located in" and finally extracts the knowledge triple (Tsinghua University, located in, Beijing). Entity relation extraction has been continuously researched over the past 20 years; feature engineering, kernel methods, and graph models have been widely used, and some phased results have been achieved. With the advent of the deep learning era, neural network models have brought new breakthroughs to entity relationship extraction.
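The extracted knowledge can be held in a simple (head, relation, tail) structure; a minimal Python sketch using the example above (the class name `KnowledgeTriple` is illustrative, not from the patent):

```python
from typing import NamedTuple

class KnowledgeTriple(NamedTuple):
    """A (head entity, relation, tail entity) triple extracted from a sentence."""
    head: str
    relation: str
    tail: str

# The example from the text: given the sentence and its two entities,
# the relation extraction model infers the relation "located in".
triple = KnowledgeTriple(head="Tsinghua University", relation="located in", tail="Beijing")
print(triple)
```

Such triples are the structured output that downstream storage and retrieval systems consume.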
  • Annotated corpus refers to corpus obtained by manually selecting part of the corpus according to actual needs and labeling its entity relationships.
  • Only a small amount of corpus needs to be annotated to support the subsequent training; the amount needed (for example, ten items) is far less than the number of corpus items required to train traditional deep models.
  • the source of the corpus selection in this embodiment can be selected according to actual needs, which is not limited here. For example, you can collect policy-related corpus from government sites, or collect sports-related corpus from sports forums or news sites.
  • the seed set in this embodiment can be understood as a corpus that is continuously improved and expanded.
  • a part of the corpus required by the task is obtained through manual annotation, and stored in the seed set as annotated corpus.
  • As more corpora of the same type as those required by the task are added from the unlabeled corpus, the seed set contains more and more corpora and their clustering characteristics become more and more obvious, which is conducive to improving the robustness of the seed set.
  • Synonymous or similar corpus related to the annotated corpus can also be acquired and added to the annotated corpus, so as to improve the subsequent training effect on the model.
  • S202 For each annotated corpus in the seed set, construct features on the annotated corpus according to a preset feature construction method to obtain the relational features of the annotated corpus.
  • each entity is annotated, and the relationship between the entities in the annotated corpus is characterized by a preset feature construction method, and the relationship characteristics of the annotated corpus are obtained.
  • the relationship feature refers to the entity relationship used to characterize the corpus knowledge tuple.
  • The preset feature construction method records, separately, the N words before the head entity, the words between the two entities, and the N words after the tail entity.
  • The three features are denoted w_BEF, w_BET, and w_AFT, respectively.
  • Common third-party word segmentation tools, such as jieba segmentation, can be used.
  • Common word segmentation algorithms include but are not limited to: conditional random field (CRF) algorithm, hidden Markov model (Hidden Markov Model, HMM), and N-gram model.
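As a minimal illustration of dictionary-based segmentation (deliberately simpler than the CRF, HMM, and N-gram approaches named above, and using a toy dictionary), a forward maximum-matching sketch:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens

vocab = {"清华大学", "位于", "北京"}  # toy dictionary, illustrative only
print(forward_max_match("清华大学位于北京", vocab))  # ['清华大学', '位于', '北京']
```

Real tools such as jieba combine a large dictionary with statistical models, but the greedy matching above captures the basic dictionary-lookup idea.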
  • S203 Input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model.
  • The similarity evaluation model for evaluating entity relationships is pre-trained. After the relationship features of the annotated corpus are obtained, the unlabeled corpus, the annotated corpus, and the relationship features of the annotated corpus are taken as input to the preset similarity evaluation model.
  • The preset similarity evaluation model is a neural network model, which specifically includes but is not limited to: the ELMo (Embeddings from Language Models) deep semantic representation algorithm, OpenAI GPT, and the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, etc.
  • An improved BERT model is used as the pre-training model in this embodiment.
  • The goal of the BERT model is to use large-scale unlabeled corpus training to obtain a representation of the text that contains rich semantic information (the semantic representation of the text), then fine-tune that semantic representation for a specific NLP task, and finally apply it to that NLP task.
  • The BERT model is mainly used to perform semantic representation and semantic extraction at the vocabulary level and the syntactic level, realizing the calculation of the similarity of entity relationships across different corpora, which is beneficial to improving the accuracy of entity relationship extraction.
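The similarity computation such a model performs on top of its semantic representations can be illustrated with plain cosine similarity between two embedding vectors; the four-dimensional vectors below are toy values, not actual BERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb_labeled = [0.2, 0.7, 0.1, 0.5]        # toy "embedding" of a labeled corpus
emb_unlabeled = [0.25, 0.65, 0.05, 0.55]  # toy embedding of an unlabeled corpus
print(round(cosine_similarity(emb_labeled, emb_unlabeled), 3))  # 0.994
```

A score near 1 indicates the two corpora point in nearly the same direction in the embedding space, which is the geometric intuition behind semantic similarity.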
  • S204 Evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • the unlabeled corpus is evaluated through a preset similarity evaluation model, annotated corpus, and relationship features, and the evaluation result is obtained, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • The evaluation result is either that a similarity relationship exists between the unlabeled corpus and the labeled corpus, or that no similarity relationship exists between them.
  • For the specific process of evaluating the unlabeled corpus and obtaining the evaluation result, refer to the description of the subsequent embodiments; to avoid repetition, it is not repeated here.
  • In this embodiment, each annotated corpus is stored in a seed set; then, for each annotated corpus in the seed set, features are constructed on the annotated corpus according to the preset feature construction method to obtain its relationship features; the unlabeled corpus, the annotated corpus, and the relationship features of the annotated corpus are then input into the preset similarity evaluation model.
  • The unlabeled corpus is evaluated to obtain the evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result. The entity relationships of the unlabeled corpus are thus quickly extracted in a semi-supervised manner, which improves the accuracy and efficiency of entity relationship extraction.
  • the entity relationship extraction method based on semantic similarity further includes:
  • The candidate corpus is added to the seed set to obtain an updated seed set.
  • Specifically, the evaluation result is compared with a preset condition, the unlabeled corpus that meets the preset condition is determined as candidate corpus, and the candidate corpus is added to the seed set to obtain the updated seed set.
  • The preset condition may specifically be that the evaluation result indicates a similarity relationship between the unlabeled corpus and the labeled corpus, and that the similarity reaches a preset value.
  • The preset value can be set according to actual needs, such as 0.8; it is not specifically limited here.
  • the unlabeled corpus that meets the conditions is added to the seed set to expand the number of samples in the seed set, which is beneficial to improve the recognition accuracy of the subsequent preset similarity recognition model.
  • the seed set is updated in a semi-supervised manner to increase the number of samples in the seed set, which is beneficial to improve the accuracy of subsequent recognition.
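The semi-supervised seed-set update can be sketched as one round of a bootstrapping loop. Here `evaluate_similarity` is a stand-in for the preset similarity evaluation model (a toy word-overlap score is used below), and the 0.8 threshold follows the example in the text:

```python
def update_seed_set(seed_set, unlabeled, evaluate_similarity, threshold=0.8):
    """One bootstrapping round: every unlabeled corpus whose best similarity
    against the seed set reaches the threshold becomes a candidate corpus
    and joins the seed set; the rest await a later round."""
    remaining = []
    for corpus in unlabeled:
        score = max(evaluate_similarity(corpus, labeled) for labeled in seed_set)
        if score >= threshold:
            seed_set.append(corpus)
        else:
            remaining.append(corpus)
    return seed_set, remaining

def toy_sim(a, b):
    """Jaccard word overlap: a crude stand-in for the real evaluation model."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

seeds = ["tsinghua university is in beijing"]
pool = ["tsinghua university is in beijing city", "stock prices fell today"]
seeds, pool = update_seed_set(seeds, pool, toy_sim)
print(len(seeds), len(pool))  # 2 1
```

Each round grows the seed set, which is why the patent stresses that the similarity evaluation must be semantic rather than character-based: a weak score function lets drifted corpora slip in and compound over rounds.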
  • In step S202, for each annotated corpus in the seed set, constructing features on the annotated corpus according to the preset feature construction method to obtain the relationship features of the annotated corpus includes:
  • obtaining the N word segments before the head named entity to form a knowledge tuple, as the first relationship feature;
  • obtaining the word segments between the two named entities to form a knowledge tuple, as the second relationship feature;
  • obtaining the N word segments after the tail named entity to form a knowledge tuple, as the third relationship feature, where N is a positive integer;
  • using the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
  • The named entities of the annotated corpus may be obtained through manual annotation or through a named entity recognition model.
  • The value of N can be set according to actual needs, for example, N = 3.
  • A knowledge tuple refers to a tuple composed of an entity and the word segments before and after it, and is used to characterize the relationship between the entity and those word segments.
  • In this way, the relationship features of the annotated corpus are obtained, which improves the accuracy of subsequent semantic extraction based on the relationship features and is beneficial to improving the accuracy of similarity recognition.
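Assuming the corpus is already tokenized and the positions of the head and tail entities are known (both assumptions; the patent leaves tokenization to a segmentation tool), the three relationship features might be constructed as follows, with N = 3 as in the example:

```python
def build_relation_features(tokens, head_idx, tail_idx, n=3):
    """Return the three relationship features of an annotated corpus:
    the n tokens before the head entity (w_BEF), the tokens between the
    two entities (w_BET), and the n tokens after the tail entity (w_AFT)."""
    return {
        "w_BEF": tokens[max(0, head_idx - n):head_idx],
        "w_BET": tokens[head_idx + 1:tail_idx],
        "w_AFT": tokens[tail_idx + 1:tail_idx + 1 + n],
    }

tokens = ["It", "is", "known", "that", "Tsinghua", "is", "located", "in",
          "Beijing", "near", "the", "center"]
# head entity "Tsinghua" at index 4, tail entity "Beijing" at index 8
features = build_relation_features(tokens, head_idx=4, tail_idx=8)
print(features["w_BET"])  # ['is', 'located', 'in']
```

The slicing clamps at the sentence boundaries, so entities near the start or end of a sentence simply yield shorter w_BEF or w_AFT windows.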
  • the BERT model includes a coding layer, a Concat layer, and a fully connected layer.
  • Evaluating the unlabeled corpus based on the preset similarity evaluation model and relationship features to obtain the evaluation result includes:
  • encoding each unlabeled corpus to obtain a first coding feature, and encoding each labeled corpus to obtain a second coding feature;
  • performing feature extraction and fusion on the first coding features and the second coding features through the Concat layer to obtain first fusion features and second fusion features;
  • for any first fusion feature, calculating the loss value between that first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as the target loss value;
  • if the target loss value is less than the preset loss threshold, determining that the evaluation result is that the unlabeled corpus corresponding to the first fusion feature and the labeled corpus corresponding to the target loss value have a semantic similarity relationship.
  • the semantic similarity between the unlabeled corpus and the labeled corpus is evaluated, and the evaluation result is obtained.
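The target-loss selection and threshold decision in these steps can be sketched as follows. The fusion features and the loss function here are toy stand-ins (mean absolute difference rather than the model's actual fully connected layer), and the 0.05 threshold follows the example given later in the text:

```python
def evaluate_unlabeled(first_fusion, second_fusions, loss_fn, loss_threshold=0.05):
    """Compare one first fusion feature (unlabeled corpus) against every
    second fusion feature (labeled corpora): keep the minimum loss as the
    target loss; below the threshold means a semantic similarity
    relationship with the corresponding labeled corpus."""
    losses = [loss_fn(first_fusion, f) for f in second_fusions]
    target_loss = min(losses)
    best_index = losses.index(target_loss)
    return target_loss < loss_threshold, best_index, target_loss

def toy_loss(a, b):
    """Mean absolute difference: a stand-in for the real loss function."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

similar, idx, loss = evaluate_unlabeled(
    [0.2, 0.8, 0.5],
    [[0.9, 0.1, 0.3], [0.21, 0.79, 0.52]],
    toy_loss,
)
print(similar, idx)  # True 1
```

Taking the minimum loss rather than an average means the unlabeled corpus only needs one close match in the seed set to be accepted.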
  • The essence of BERT is to learn good feature representations for words by running a self-supervised learning method on massive corpora. Self-supervised learning here refers to supervised learning that runs on data that is not manually labeled.
  • BERT provides a model for transfer learning to other tasks: it can be fine-tuned or fixed according to the task and then used as a feature extractor.
  • Each layer of neurons represents a learned intermediate feature (that is, a combination of several weights), and all neurons in the network work together to characterize specific attributes of the input data (for example, in image classification, the category to which the image belongs).
  • The direct effect of adding a Dropout layer after the fully connected layer in this embodiment is to reduce the number of intermediate features and thereby reduce redundancy, that is, to increase the orthogonality between the features of each layer. Specifically, the outputs of some randomly chosen hidden-layer nodes are disabled; those disabled nodes can temporarily be considered not part of the network structure, but their weights are preserved (just temporarily not updated), because they may be active again the next time a sample is input. This effectively prevents overfitting.
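The dropout behavior described here, randomly disabling node outputs during training while preserving the weights themselves, can be sketched as below; the inverted-dropout scaling by 1/(1 - p) is a common convention added in this sketch, not something stated in the text:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1/(1 - p); at inference time,
    pass everything through unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
print(out)  # roughly half the activations are zeroed, survivors are doubled
```

Note that dropout masks outputs only: the underlying weights stay in place, exactly as the passage describes, so a node disabled on one sample can contribute again on the next.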
  • the preset loss threshold can be set according to actual needs, for example, set to 0.05, which is not specifically limited here.
  • The loss function is the two-class cross entropy. For any first fusion feature, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer includes computing:
  • Loss = -[y·log(p) + (1 - y)·log(1 - p)]
  • where Loss is the loss value, y is the sample label of the second fusion feature (the value is 1 for a positive example, otherwise 0), and p is the probability that the first fusion feature is a positive example.
  • The two-class cross entropy predicts two categories, positive examples and negative examples, and the specific positive and negative examples can be set in the model.
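The two-class cross entropy above translates directly into code; the epsilon clamp is a small numerical-safety addition of this sketch, not part of the formula:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Two-class cross entropy: y is the 0/1 sample label and p the
    predicted probability that the sample is a positive example."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A confident correct positive prediction gives a small loss...
print(round(binary_cross_entropy(1, 0.97), 4))  # 0.0305
# ...while a confident wrong prediction gives a large one.
print(round(binary_cross_entropy(0, 0.97), 4))  # 3.5066
```

This is also why a small loss threshold such as 0.05 corresponds to a high predicted similarity probability: the loss only drops below 0.05 when p exceeds about 0.95.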
  • the BERT model is used to evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, which is beneficial to improve the accuracy of the evaluation.
  • Fig. 3 shows a schematic block diagram of an entity relationship extraction device based on semantic similarity in a one-to-one correspondence with the above embodiment of the entity relationship extraction method based on semantic similarity.
  • the entity relationship extraction device based on semantic similarity includes a data collection module 31, a feature construction module 32, a data input module 33 and a relationship extraction module 34.
  • the detailed description of each functional module is as follows:
  • the data collection module 31 is used to obtain labeled corpus and unlabeled corpus, and store each labeled corpus in the seed set;
  • the feature construction module 32 is configured to construct features on the annotated corpus according to a preset feature construction method for each annotated corpus in the seed set, and obtain the relationship features of the annotated corpus;
  • the data input module 33 is used to input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into the preset similarity evaluation model;
  • the relation extraction module 34 is configured to evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • the entity relationship extraction device based on semantic similarity further includes:
  • the candidate corpus determination module is used to compare the evaluation result with the preset conditions, and determine the unlabeled corpus that meets the preset conditions as the candidate corpus;
  • the seed set update module is used to add the candidate corpus to the seed set to obtain the updated seed set.
  • the feature building module 32 includes:
  • the named entity acquisition unit is used to acquire the named entity of the annotated corpus
  • the feature construction unit is used to obtain the N word segments before the head named entity to form a knowledge tuple as the first relationship feature; obtain the word segments between the two named entities to form a knowledge tuple as the second relationship feature; and obtain the N word segments after the tail named entity to form a knowledge tuple as the third relationship feature, where N is a positive integer;
  • the relationship feature determining unit is configured to use the first relationship feature, the second relationship feature, and the third relationship feature as the relationship feature of the annotated corpus.
  • the BERT model includes an encoding layer, a Concat layer, and a fully connected layer
  • the relation extraction module 34 includes:
  • the feature coding unit is used to use the coding layer of the BERT model to encode each unlabeled corpus to obtain the first coding feature, and to encode each labeled corpus to obtain the second coding feature;
  • the feature fusion unit is used to perform feature extraction and fusion on the first coding feature and the second coding feature respectively through the Concat layer of the BERT model to obtain the first fusion feature and the second fusion feature;
  • the loss calculation unit is used to calculate the loss value of the first fusion feature and each second fusion feature based on the loss function of the fully connected layer for any first fusion feature, and use the minimum loss value as the target loss value;
  • the result determining unit is configured to determine that the evaluation result is that the unlabeled corpus corresponding to the first fusion feature and the labeled corpus corresponding to the target loss value have a semantic similarity relationship if the target loss value is less than the preset loss threshold.
  • Each module in the above-mentioned semantic similarity-based entity relationship extraction device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43; it should be understood, however, that not all of the illustrated components are required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded equipment, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4, such as program codes for controlling electronic files.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run program codes or process data stored in the memory 41, for example, run program codes for controlling electronic files.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • The computer-readable storage medium may be non-volatile or volatile, and stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the steps of the entity relationship extraction method based on semantic similarity described above.
  • the method of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of this application.

Abstract

A semantic similarity-based entity relation extraction method and apparatus, a device and a medium, relating to the field of artificial intelligence. The method comprises: obtaining annotated corpora and unannotated corpora, and storing each annotated corpus in a seed set (S201); for each annotated corpus in the seed set, on the basis of a preset feature construction mode, constructing features for each annotated corpus, and obtaining relation features of the annotated corpora (S202); inputting the unannotated corpora, the annotated corpora, and the relation features of the annotated corpora into a preset similarity evaluation model (S203); and on the basis of the preset similarity evaluation model and the relation features, evaluating the unannotated corpora, obtaining an evaluation result, then determining an entity relation of the unannotated corpora on the basis of the evaluation result (S204). By means of a semi-supervised method, rapid entity relation extraction may be performed in respect of the unannotated corpora, improving entity relation extraction accuracy and efficiency.

Description

Semantic Similarity-Based Entity Relationship Extraction Method, Apparatus, Device, and Medium
This application claims priority to Chinese patent application No. 2020109372749, entitled "Semantic Similarity-Based Entity Relationship Extraction Method, Apparatus, Device, and Medium", filed with the Chinese Patent Office on September 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a semantic similarity-based entity relationship extraction method, apparatus, device, and medium.
Background
In the field of natural language processing, tasks such as semantic network annotation, text understanding, and machine translation often require extracting entity relationships from the content of a corpus. Entity relationship extraction is an important research topic in the field of information extraction. Its main purpose is to extract the semantic relationship between the marked entity pairs in a sentence, that is, to determine, on the basis of entity recognition, the relationship category between entity pairs in unstructured text, and to form structured data for storage and retrieval. In both theoretical research and practical application, entity relationship extraction technology can provide support for other natural language processing technologies.
In the process of realizing this application, the inventor recognized that the prior art has at least the following problem: existing approaches mainly determine the similarity between a new sentence and the original corpus by segmenting the sentence into words and then computing a similarity score. The accuracy of this similarity, which is based on the degree of textual character similarity, depends heavily on the representational ability of the word vectors; after multiple iterations, the subsequently added corpora suffer from semantic drift, so the accuracy of entity relationship extraction over the entire corpus becomes lower and lower.
Summary
The embodiments of this application provide a semantic similarity-based entity relationship extraction method, apparatus, computer device, and storage medium, so as to improve the accuracy of relationship extraction for named entities.
To solve the above technical problem, an embodiment of this application provides a semantic similarity-based entity relationship extraction method, including:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a semantic similarity-based entity relationship extraction apparatus, including:
a data collection module, configured to obtain annotated corpora and unannotated corpora, and store each of the annotated corpora in a seed set;
a feature construction module, configured to, for each of the annotated corpora in the seed set, construct features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
a data input module, configured to input the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
a relationship extraction module, configured to evaluate the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the following steps:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
In the semantic similarity-based entity relationship extraction method, apparatus, device, and medium provided by the embodiments of this application, annotated corpora and unannotated corpora are obtained and each annotated corpus is stored in a seed set; for each annotated corpus in the seed set, features are constructed for the annotated corpus according to a preset feature construction mode to obtain the relationship features of the annotated corpus; the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are then input into a preset similarity evaluation model; based on the preset similarity evaluation model and the relationship features, the unannotated corpora are evaluated to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result. This enables rapid entity relationship extraction for unannotated corpora in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
Description of the Drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is an exemplary system architecture diagram to which this application can be applied;
Fig. 2 is a flowchart of an embodiment of the semantic similarity-based entity relationship extraction method of this application;
Fig. 3 is a schematic structural diagram of an embodiment of the semantic similarity-based entity relationship extraction apparatus according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for describing specific embodiments and are not intended to limit this application. The terms "including" and "having" in the specification and claims of this application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the specification and claims of this application or in the above drawings are used to distinguish different objects, rather than to describe a specific sequence.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
Referring to Fig. 1, as shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server that provides various services, for example, a background server that provides support for the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the semantic similarity-based entity relationship extraction method provided by the embodiments of this application is executed by the server; accordingly, the semantic similarity-based entity relationship extraction apparatus is disposed in the server.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs; the terminal devices 101, 102, and 103 in the embodiments of this application may specifically correspond to application systems in actual production.
Referring to Fig. 2, Fig. 2 shows a semantic similarity-based entity relationship extraction method provided by an embodiment of this application. The method is described by taking its application to the server in Fig. 1 as an example, and is detailed as follows:
S201: Obtain annotated corpora and unannotated corpora, and store each annotated corpus in a seed set.
Specifically, in the field of natural language processing, tasks involving semantic network annotation, text understanding, machine translation, and knowledge graph construction often require extracting entity relationships from the content of a corpus, so as to build a corpus for automated processing and improve processing efficiency. Before entity relationship extraction is performed, the types of corpora to be extracted need to be preset; therefore, some corpora are annotated in advance to obtain annotated corpora, the annotated corpora are stored in a seed set, and the remaining corpora serve as the unannotated corpora.
Natural language processing (NLP) reflects the fact that understanding natural language requires extensive knowledge about the external world and the ability to apply and manipulate that knowledge; natural language cognition is therefore also regarded as an AI-complete problem. NLP tasks mainly refer to tasks involving the semantic understanding or parsing of natural language. Common NLP tasks include, but are not limited to: speech recognition, Chinese word segmentation, part-of-speech tagging, text categorization, syntactic parsing, automatic summarization, question answering, and information extraction.
Entity relationship extraction is a classic task in the NLP field. Specifically, given a sentence and the entities appearing in it, the relationship between the entities needs to be inferred from the semantic information of the sentence. For example, given the sentence "Tsinghua University is located near Beijing" and the entities "Tsinghua University" and "Beijing", the entity relationship extraction model obtains the relationship "located in" and finally extracts the knowledge triple (Tsinghua University, located in, Beijing). Entity relationship extraction has been continuously researched for more than 20 years; feature engineering, kernel methods, and graph models have been widely applied to it and have achieved some phased results. With the advent of the deep learning era, neural network models have brought new breakthroughs to entity relationship extraction.
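The knowledge triple in the example above can be represented as a simple data structure. The patent does not prescribe a representation; the sketch below (class and field names hypothetical) just fixes the (head, relation, tail) shape:

```python
from typing import NamedTuple

class KnowledgeTriple(NamedTuple):
    """A (head entity, relation, tail entity) triple, as in the example above."""
    head: str
    relation: str
    tail: str

triple = KnowledgeTriple("Tsinghua University", "located in", "Beijing")
print(triple.head, "--", triple.relation, "->", triple.tail)
```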
An annotated corpus is a corpus obtained by manually selecting part of the corpora according to actual needs and annotating the entity relationships of the corpus. In this embodiment, only a small number of corpora need to be annotated to meet the subsequent training needs, for example, ten, which is far fewer than the number of corpora required for training a traditional deep model.
It should be noted that the source from which the corpora are selected in this embodiment can be chosen according to actual needs, and is not limited here. For example, policy-related corpora can be collected from government sites, or sports-related corpora can be collected from sports forums or news sites.
The seed set in this embodiment can be understood as a continuously improved and expanded corpus. In the initial stage, part of the corpora of the types required by the task is obtained through manual annotation and stored in the seed set as annotated corpora. Subsequently, through semi-supervised annotation training, more corpora of the same types as those required by the task are added from the unannotated corpora, so that the seed set contains more and more corpora and the clustering characteristics of the corpora become more and more obvious, which is conducive to improving the robustness of the seed set.
Further, in this embodiment, synonyms or similar corpora related to the annotated corpora are obtained and added to the annotated corpora, so as to improve the subsequent training effect of the model.
S202: For each annotated corpus in the seed set, construct features for the annotated corpus according to a preset feature construction mode, to obtain the relationship features of the annotated corpus.
Specifically, each entity is annotated in the annotated corpus, and the relationship between the entities in the annotated corpus is characterized through the preset feature construction mode, to obtain the relationship features of the annotated corpus.
A relationship feature refers to the entity relationship used to characterize the knowledge tuples of the corpus.
Preferably, in this embodiment, the preset feature construction method is to separately record the N words before the head entity, between the two entities, and after the tail entity; the three features are denoted w_BEF, w_BET, and w_AFT, respectively. For details, refer to the description of the subsequent embodiments; to avoid repetition, they are not repeated here.
Further, before features are constructed for the annotated corpus, the annotated corpus also needs to be segmented into words using the annotated entities. Specifically, a third-party word segmentation tool or a word segmentation algorithm can be used. A common third-party segmentation tool is, for example, jieba; common word segmentation algorithms include, but are not limited to: the conditional random field (CRF) algorithm, the hidden Markov model (HMM), and the N-gram model.
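In practice a tool such as jieba or a CRF/HMM model would be used for this segmentation step. As a self-contained illustration only (not the patent's method), a dictionary-based forward maximum-matching segmenter, run on the example sentence from above, can be sketched as:

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position, greedily take the longest
    substring that appears in the vocabulary; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in vocab:
                tokens.append(word)
                i += j
                break
    return tokens

# Toy vocabulary covering the example sentence (illustrative only).
vocab = {"清华大学", "坐落", "于", "北京", "近邻"}
print(fmm_segment("清华大学坐落于北京近邻", vocab))
# ['清华大学', '坐落', '于', '北京', '近邻']
```

Real segmenters also handle out-of-vocabulary words statistically, which this greedy sketch does not.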
S203: Input the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model.
Specifically, a similarity evaluation model for evaluating entity relationships is trained in advance. After the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are obtained, they are taken as input and fed into the preset similarity evaluation model.
The preset similarity evaluation model is a neural network model, including but not limited to: Embeddings from Language Models (ELMo), OpenAI GPT, and Bidirectional Encoder Representations from Transformers (BERT).
Preferably, an improved BERT model is used as the pre-training model in this embodiment.
The goal of the BERT model is to use large-scale unannotated corpus training to obtain a representation of text that contains rich semantic information, that is, the semantic representation of the text; the semantic representation is then fine-tuned for a specific NLP task and finally applied to that task. In this embodiment, the BERT model is mainly used to perform semantic representation and semantic extraction at the vocabulary and syntax levels, so as to calculate the degree of similarity between the entity relationships in different corpora, which is conducive to improving the accuracy of the entity relationships.
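The patent does not spell out how a similarity score is computed from the model's semantic representations. A common choice (an assumption here, not the patent's stated formula) is cosine similarity between sentence embedding vectors, shown below on tiny made-up vectors in place of real BERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "sentence embeddings" (fabricated for illustration).
emb_annotated = [0.2, 0.7, 0.1, 0.0]
emb_unannotated = [0.25, 0.65, 0.05, 0.1]
print(round(cosine_similarity(emb_annotated, emb_unannotated), 3))
```

A score near 1.0 would indicate that the unannotated sentence is semantically close to the annotated one.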
S204: Evaluate the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unannotated corpora according to the evaluation result.
Specifically, the unannotated corpora are evaluated through the preset similarity evaluation model, the annotated corpora, and the relationship features to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result.
The evaluation result indicates either that a similarity relationship exists between the unannotated corpus and an annotated corpus, or that no such similarity relationship exists.
It should be understood that when a similarity relationship exists between the unannotated corpus and the annotated corpus, it indicates that the semantics of the unannotated corpus are close to or the same as those of the annotated corpus; in this case, the entity relationship corresponding to the annotated corpus can be used as the entity relationship of the unannotated corpus.
For the specific process of evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result, refer to the description of the subsequent embodiments; to avoid repetition, it is not repeated here.
In this embodiment, annotated corpora and unannotated corpora are obtained, and each annotated corpus is stored in a seed set; then, for each annotated corpus in the seed set, features are constructed for the annotated corpus according to a preset feature construction mode to obtain the relationship features of the annotated corpus; the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are then input into a preset similarity evaluation model; based on the preset similarity evaluation model and the relationship features, the unannotated corpora are evaluated to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result. This enables rapid entity relationship extraction for unannotated corpora in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
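Steps S201–S204 can be sketched as a loop that scores each unannotated sentence against the seed set and, when a sufficiently similar seed is found, transfers that seed's entity relationship. All names here are hypothetical, and `evaluate` is a stand-in for the preset similarity evaluation model:

```python
def extract_relations(seeds, unannotated, evaluate, threshold=0.8):
    """seeds: list of (sentence, relation) pairs from the seed set.
    unannotated: list of sentences. evaluate(a, b) -> similarity score.
    Returns {sentence: relation} for sentences matched above the threshold."""
    results = {}
    for sentence in unannotated:
        # Score the unannotated sentence against every annotated seed (S203/S204).
        best_relation, best_score = None, 0.0
        for seed_sentence, relation in seeds:
            score = evaluate(sentence, seed_sentence)
            if score > best_score:
                best_relation, best_score = relation, score
        if best_score >= threshold:
            # Adopt the most similar seed's entity relationship.
            results[sentence] = best_relation
    return results

# Stub evaluator (exact match) in place of the BERT-based model.
toy_evaluate = lambda a, b: 1.0 if a == b else 0.0
seeds = [("Tsinghua University is located near Beijing", "located in")]
print(extract_relations(
    seeds, ["Tsinghua University is located near Beijing"], toy_evaluate))
# {'Tsinghua University is located near Beijing': 'located in'}
```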
In some optional implementations of this embodiment, after step S204, the semantic similarity-based entity relationship extraction method further includes:
comparing the evaluation result with a preset condition, and determining the unannotated corpora that meet the preset condition as candidate corpora; and
adding the candidate corpora to the seed set to obtain an updated seed set.
Specifically, the evaluation result is compared with the preset condition, the unannotated corpora that meet the preset condition are determined as candidate corpora, and the candidate corpora are added to the seed set to obtain the updated seed set.
In this embodiment, the preset condition may specifically be that the evaluation result indicates a similarity relationship between the unannotated corpus and an annotated corpus and that the similarity reaches a preset value. The preset value can be set according to actual needs, for example, 0.8, and is not specifically limited here.
It should be understood that in this embodiment, the qualifying unannotated corpora are added to the seed set to expand the number of samples in the seed set, which is conducive to improving the recognition accuracy of the subsequent preset similarity recognition model.
In this embodiment, the seed set is updated in a semi-supervised manner to increase the number of samples in the seed set, which is conducive to improving the accuracy of subsequent recognition.
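The seed-set update described above can be sketched as a simple threshold filter (function names hypothetical; the 0.8 threshold matches the example value given in this embodiment):

```python
def update_seed_set(seed_set, evaluations, threshold=0.8):
    """seed_set: list of (corpus, relation) pairs.
    evaluations: list of (unannotated_corpus, relation, score) tuples produced
    by the similarity evaluation model. Corpora scoring at or above the
    threshold become candidate corpora and join the seed set."""
    candidates = [(corpus, relation)
                  for corpus, relation, score in evaluations
                  if score >= threshold]
    return seed_set + candidates

seeds = [("sentence A", "located in")]
evals = [("sentence B", "located in", 0.91),
         ("sentence C", "located in", 0.42)]
updated = update_seed_set(seeds, evals)
print(len(updated))  # 2: only "sentence B" passed the 0.8 threshold
```

Each round of this update enlarges the seed set, so later rounds compare against more samples.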
In some optional implementations of this embodiment, in step S202, constructing features for each annotated corpus in the seed set according to a preset feature construction method to obtain the relation features of the annotated corpus includes:
Obtaining the named entities of the annotated corpus;
For the named entities, obtaining the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtaining the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
Taking the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
Specifically, the named entities of the annotated corpus may be obtained by manual annotation, or by a named entity recognition model.
The value of N can be set according to actual needs; for example, N is set to 3.
A knowledge tuple is a tuple composed of an entity and the word segments before and after it, and is used to characterize the relationship between the entity and those word segments.
In this embodiment, constructing features for the annotated corpus to obtain its relation features improves the accuracy of subsequent semantic extraction based on the relation features, which helps improve the accuracy of similarity recognition.
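The three knowledge-tuple features above can be sketched for a word-segmented sentence as follows. This is an illustrative reading under stated assumptions: the text is ambiguous about whether "before" and "after" apply per entity or to the entity span as a whole, so this sketch takes the N segments before the first entity and after the last, with N=3 as in the example; the entity spans and sample sentence are invented for illustration.

```python
def build_relation_features(tokens, entity_spans, n=3):
    """Build the three knowledge-tuple relation features.

    tokens: the word-segmented sentence.
    entity_spans: (start, end) token-index pairs of named entities, in order.
    Returns (first, second, third) relation features: n segments before the
    entities, segments between each consecutive entity pair, and n segments
    after the entities.
    """
    first_start = entity_spans[0][0]
    last_end = entity_spans[-1][1]
    # first relation feature: up to n word segments before the (first) entity
    before = tuple(tokens[max(0, first_start - n):first_start])
    # second relation feature: segments between consecutive named entities
    between = tuple(
        tuple(tokens[entity_spans[i][1]:entity_spans[i + 1][0]])
        for i in range(len(entity_spans) - 1)
    )
    # third relation feature: up to n word segments after the (last) entity
    after = tuple(tokens[last_end:last_end + n])
    return before, between, after


tokens = ["the", "founder", "of", "Acme", "hired", "Bob", "last", "year", "."]
spans = [(3, 4), (5, 6)]  # hypothetical entities "Acme" and "Bob"
features = build_relation_features(tokens, spans, n=3)
```

For this sentence the tuples are ("the", "founder", "of"), (("hired",),) and ("last", "year", "."), i.e., the context windows that characterize the entity relationship.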
In some optional implementations of this embodiment, the BERT model includes an encoding layer, a Concat layer and a fully connected layer. In step S204, evaluating the unlabeled corpus based on the preset similarity evaluation model and the relation features to obtain an evaluation result includes:
Using the encoding layer of the BERT model to encode each unlabeled corpus to obtain a first encoding feature, and to encode each annotated corpus to obtain a second encoding feature;
Performing feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
For any first fusion feature, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
If the target loss value is less than a preset loss threshold, determining that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
Specifically, the BERT model evaluates the semantic similarity between the unlabeled corpus and the annotated corpus to obtain the evaluation result.
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model based on a multi-layer Transformer structure. In essence, BERT learns a good feature representation for words by running a self-supervised learning method over massive corpora, where self-supervised learning means supervised learning that runs on data without manual labels. In subsequent specific NLP tasks, the feature representation produced by BERT can be used directly as the word embedding features of the task. BERT therefore provides a model for transfer learning to other tasks: it can be fine-tuned for a task, or frozen and used as a feature extractor.
In this embodiment, to prevent overfitting of the BERT model, a Dropout layer is added after the fully connected layer. It should be noted that, when a network is designed, each neuron in a layer represents a learned intermediate feature (i.e., a combination of several weights), and all neurons of the network act together to characterize specific attributes of the input data (in image classification, for example, the category to which the input belongs). When the amount of data is too small relative to the complexity of the network (i.e., its expressive and fitting capacity), overfitting occurs; at that point the features represented by the neurons clearly contain much repetition and redundancy. Adding a Dropout layer after the fully connected layer of this embodiment directly reduces the number of intermediate features and thus the redundancy, increasing the orthogonality between the features of each layer. Concretely, during model training the weights of randomly chosen hidden-layer nodes are disabled; those inactive nodes can temporarily be regarded as not being part of the network structure, but their weights are retained (merely not updated for the moment), because they may take effect the next time a sample is input. This effectively prevents overfitting.
The preset loss threshold can be set according to actual needs, for example 0.05, and is not specifically limited here.
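The minimum-loss decision rule above can be sketched as follows. This is a simplified illustration, not the patented model: `toy_loss` and the vector-valued "fusion features" are hypothetical stand-ins for the BERT encode/Concat/fully-connected pipeline, and the 0.05 threshold follows the example value in the text.

```python
LOSS_THRESHOLD = 0.05  # example preset loss threshold from the text


def assess_unlabeled(first_fused, second_fused_list, loss_fn, threshold=LOSS_THRESHOLD):
    """For one first fusion feature (unlabeled corpus), compute the loss
    against every second fusion feature (annotated corpus), take the minimum
    as the target loss, and report whether a semantic-similarity relationship
    holds (target loss < threshold) plus the index of the matching corpus."""
    losses = [loss_fn(first_fused, second) for second in second_fused_list]
    target = min(losses)
    best_index = losses.index(target)
    similar = target < threshold
    return similar, best_index, target


def toy_loss(a, b):
    # mean squared difference as a placeholder loss; the embodiment uses
    # the binary cross-entropy of the fully connected layer instead
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


labeled = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]   # second fusion features
unlabeled_feat = [0.88, 0.12, 0.01]            # one first fusion feature
similar, idx, loss = assess_unlabeled(unlabeled_feat, labeled, toy_loss)
```

Here the first annotated feature is the closest match and its loss falls below the threshold, so a semantic similarity relationship is reported against annotated corpus 0.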
Further, in some optional implementations of this embodiment, the loss function is the binary cross-entropy, and calculating, for any first fusion feature and based on the loss function of the fully connected layer, the loss value between the first fusion feature and each second fusion feature includes:

Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where Loss is the loss value, y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise, and ŷ is the probability that the first fusion feature is a positive example.
It should be understood that the binary cross-entropy predicts over two classes, positive examples and negative examples; which examples are positive and which are negative can be set in the model.
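The binary cross-entropy described above can be computed directly; the sketch below follows the standard formula Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)], with a small epsilon clamp added here (not mentioned in the text) to guard against log(0).

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss.

    y: sample label of the second fusion feature (1 for a positive
       example, 0 otherwise).
    y_hat: predicted probability that the first fusion feature is a
       positive example.
    """
    y_hat = min(max(y_hat, eps), 1 - eps)  # numerical guard, illustrative
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


loss_pos = binary_cross_entropy(1, 0.9)  # confident, correct: small loss
loss_neg = binary_cross_entropy(0, 0.9)  # confident, wrong: large loss
```

A confident prediction that matches the label yields a small loss, while a confident mismatch is penalized heavily, which is what drives the minimum-loss matching between fusion features.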
In this embodiment, the unlabeled corpus is evaluated through the BERT model, based on the preset similarity evaluation model and the relation features, to obtain the evaluation result, which helps improve the accuracy of the evaluation.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 3 shows a schematic block diagram of an entity relationship extraction apparatus based on semantic similarity, in one-to-one correspondence with the entity relationship extraction method based on semantic similarity of the foregoing embodiments. As shown in Fig. 3, the apparatus includes a data collection module 31, a feature construction module 32, a data input module 33 and a relation extraction module 34. The functional modules are described in detail as follows:
The data collection module 31 is configured to obtain annotated corpora and unlabeled corpora, and store each annotated corpus in a seed set;
The feature construction module 32 is configured to construct, for each annotated corpus in the seed set, features for the annotated corpus according to a preset feature construction method, to obtain the relation features of the annotated corpus;
The data input module 33 is configured to input the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
The relation extraction module 34 is configured to evaluate the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain evaluation results, and determine the entity relationships of the unlabeled corpora according to the evaluation results.
Optionally, the entity relationship extraction apparatus based on semantic similarity further includes:
A candidate corpus determination module, configured to compare the evaluation result with a preset condition, and take the unlabeled corpora that meet the preset condition as candidate corpora;
A seed set update module, configured to add the candidate corpora to the seed set to obtain an updated seed set.
Optionally, the feature construction module 32 includes:
A named entity acquisition unit, configured to obtain the named entities of the annotated corpus;
A feature construction unit, configured to, for the named entities, obtain the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtain the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtain the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
A relation feature determination unit, configured to take the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
Optionally, the BERT model includes an encoding layer, a Concat layer and a fully connected layer, and the relation extraction module 34 includes:
A feature encoding unit, configured to use the encoding layer of the BERT model to encode each unlabeled corpus to obtain a first encoding feature, and to encode each annotated corpus to obtain a second encoding feature;
A feature fusion unit, configured to perform feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
A loss calculation unit, configured to, for any first fusion feature, calculate the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and take the minimum loss value as a target loss value;
A result determination unit, configured to determine, if the target loss value is less than a preset loss threshold, that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
For specific limitations of the entity relationship extraction apparatus based on semantic similarity, reference may be made to the above limitations of the entity relationship extraction method based on semantic similarity, which will not be repeated here. Each module in the above apparatus may be implemented in whole or in part by software, by hardware, or by a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Please refer to Fig. 4 for details; Fig. 4 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that the figure only shows the computer device 4 with the memory 41, the processor 42 and the network interface 43, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch panel or a voice control device.
The memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as program codes for controlling electronic files. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program codes stored in the memory 41 or process data, for example, to run the program codes for controlling electronic files.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
The present application further provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores an interface display program, and the interface display program can be executed by at least one processor, so that the at least one processor performs the steps of the entity relationship extraction method based on semantic similarity described above.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some rather than all of the embodiments of the present application. The drawings show preferred embodiments of the present application, but do not limit its patent scope. The present application may be implemented in many different forms; on the contrary, these embodiments are provided so that the understanding of the disclosure of the present application will be more thorough and comprehensive. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. An entity relationship extraction method based on semantic similarity, comprising:
    obtaining annotated corpora and unlabeled corpora, and storing each of the annotated corpora in a seed set;
    for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction method, to obtain relation features of the annotated corpus;
    inputting the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
    evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, and determining entity relationships of the unlabeled corpora according to the evaluation result.
  2. The entity relationship extraction method based on semantic similarity according to claim 1, wherein, after the evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, the method further comprises:
    comparing the evaluation result with a preset condition, and taking the unlabeled corpora that meet the preset condition as candidate corpora;
    adding the candidate corpora to the seed set to obtain an updated seed set.
  3. The entity relationship extraction method based on semantic similarity according to claim 1, wherein the constructing, for each of the annotated corpora in the seed set, features for the annotated corpus according to a preset feature construction method to obtain relation features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for the named entities, obtaining the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtaining the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
    taking the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
  4. The entity relationship extraction method based on semantic similarity according to claim 1, wherein the preset similarity evaluation model is a BERT model.
  5. The entity relationship extraction method based on semantic similarity according to claim 4, wherein the BERT model comprises an encoding layer, a Concat layer and a fully connected layer, and the evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result comprises:
    using the encoding layer of the BERT model to encode each of the unlabeled corpora to obtain a first encoding feature, and to encode each of the annotated corpora to obtain a second encoding feature;
    performing feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
    for any one of the first fusion features, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
  6. The entity relationship extraction method based on semantic similarity according to claim 5, wherein the loss function is the binary cross-entropy, and the calculating, for any one of the first fusion features and based on the loss function of the fully connected layer, the loss value between the first fusion feature and each second fusion feature comprises:

    Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

    wherein Loss is the loss value, y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise, and ŷ is the probability that the first fusion feature is a positive example.
  7. An entity relationship extraction apparatus based on semantic similarity, comprising:
    a data collection module, configured to obtain annotated corpora and unlabeled corpora, and store each annotated corpus in a seed set;
    a feature construction module, configured to construct, for each annotated corpus in the seed set, features for the annotated corpus according to a preset feature construction method, to obtain relation features of the annotated corpus;
    a data input module, configured to input the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
    a relation extraction module, configured to evaluate the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, and determine entity relationships of the unlabeled corpora according to the evaluation result.
  8. The entity relationship extraction apparatus based on semantic similarity according to claim 7, wherein the feature construction module comprises:
    a named entity acquisition unit, configured to obtain named entities of the annotated corpus;
    a feature construction unit, configured to, for the named entities, obtain the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtain the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtain the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
    a relation feature determination unit, configured to take the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其中,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:
    获取标注语料和未标注语料,将每个所述标注语料存入到种子集合中;Obtain the labeled corpus and the unlabeled corpus, and store each of the labeled corpus in the seed set;
    针对所述种子集合中的每个所述标注语料,根据预设特征构造的方式,对所述标注语料构建特征,得到所述标注语料的关系特征;For each of the annotated corpora in the seed set, construct features on the annotated corpus according to a preset feature construction mode to obtain the relationship feature of the annotated corpus;
    将所述未标注语料、所述标注语料和所述标注语料的关系特征输入到预设的相似度评估模型中;Inputting the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model;
    基于所述预设的相似度评估模型和所述关系特征,对所述未标注语料进行评估,得到评估结果,并根据所述评估结果,确定所述未标注语料的实体关系。Based on the preset similarity evaluation model and the relationship feature, the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  10. 如权利要求9所述的计算机设备,其中,在所述基于所述预设的相似度评估模型和所述关系特征,对所述未标注语料进行评估,得到评估结果之后,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device of claim 9, wherein, after the unlabeled corpus is evaluated based on the preset similarity evaluation model and the relationship feature, and the evaluation result is obtained, the processor executes The computer-readable instructions further implement the following steps:
    将评估结果与预设条件进行比较,确定符合所述预设条件的未标注语料,作为候选语料;Compare the evaluation result with the preset conditions, and determine the unlabeled corpus that meets the preset conditions as the candidate corpus;
    将所述候选语料加入到所述种子集合中,得到更新后的种子集合。The candidate corpus is added to the seed set to obtain an updated seed set.
11. The computer device of claim 9, wherein constructing, for each annotated corpus in the seed set, features for the annotated corpus according to the preset feature-construction method to obtain the relationship features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for each named entity, obtaining the N word segments preceding the named entity to form a knowledge tuple as a first relationship feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relationship feature, and obtaining the N word segments following the named entity to form a knowledge tuple as a third relationship feature, where N is a positive integer;
    taking the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
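The three knowledge tuples of claim 11 can be illustrated with a token-index sketch. Representing each named entity as a `(start, end)` token span is an assumption here; the claim does not specify how entities are located in the segmented text.

```python
def relation_features(tokens, entity_spans, n=2):
    """Build the three claimed knowledge tuples around named entities.

    tokens:       list of word segments for one annotated corpus
    entity_spans: (start, end) token indices of each named entity, in order
    n:            the window size N from the claim (a positive integer)
    """
    first, second, third = [], [], []
    for i, (start, end) in enumerate(entity_spans):
        # First relationship feature: the N word segments before the entity.
        first.append(tuple(tokens[max(0, start - n):start]))
        # Third relationship feature: the N word segments after the entity.
        third.append(tuple(tokens[end:end + n]))
        # Second relationship feature: word segments between this entity
        # and the next consecutive named entity, if any.
        if i + 1 < len(entity_spans):
            next_start = entity_spans[i + 1][0]
            second.append(tuple(tokens[end:next_start]))
    return first, second, third
```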
12. The computer device of claim 9, wherein the preset similarity evaluation model is a BERT model.
13. The computer device of claim 12, wherein the BERT model comprises an encoding layer, a Concat layer, and a fully connected layer, and evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result comprises:
    encoding each unannotated corpus with the encoding layer of the BERT model to obtain a first encoded feature, and encoding each annotated corpus to obtain a second encoded feature;
    performing feature extraction and fusion on the first encoded features and the second encoded features through the Concat layer of the BERT model to obtain first fusion features and second fusion features, respectively;
    for any one of the first fusion features, computing a loss value between the first fusion feature and each second fusion feature based on a loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining as the evaluation result that a semantic similarity relationship exists between the unannotated corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
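A toy rendering of the evaluation in claim 13, with precomputed vectors standing in for the BERT encoding layer and a single sigmoid unit standing in for the fully connected layer. The exact pairing of first and second fusion features is not spelled out in the claim; concatenating the two encodings pairwise is one plausible reading, assumed here for illustration.

```python
import math

def evaluate_pair(unlabeled_vec, labeled_vecs, weights, bias, threshold):
    """For one unannotated encoding, find the annotated encoding with the
    minimum loss and test it against the preset loss threshold."""
    losses = []
    for lv in labeled_vecs:
        # Concat layer: fuse the two encoded features into one vector.
        fused = list(unlabeled_vec) + list(lv)
        # Fully connected layer + sigmoid -> probability of "similar".
        z = sum(w * x for w, x in zip(weights, fused)) + bias
        p = 1.0 / (1.0 + math.exp(-z))
        # Binary cross-entropy against the positive label y = 1.
        losses.append(-math.log(p))
    target = min(range(len(losses)), key=losses.__getitem__)
    # Similarity holds when the minimum (target) loss is below the threshold.
    return target, losses[target] < threshold
```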
14. The computer device of claim 13, wherein the loss function is binary cross-entropy, and computing, for any one of the first fusion features, the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer comprises:

    Loss = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

    where Loss is the loss value; y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise; and ŷ is the probability that the first fusion feature is a positive example.
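The claimed binary cross-entropy can be written directly. The `eps` clamp is an added numerical safeguard to avoid log(0), not part of the claim.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss = -[y*log(y_hat) + (1-y)*log(1-y_hat)].

    y:     sample label of the second fusion feature (1 = positive, 0 = negative)
    y_hat: probability that the first fusion feature is a positive example
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp away from 0 and 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```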
15. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    obtaining annotated corpora and unannotated corpora, and storing each annotated corpus in a seed set;
    for each annotated corpus in the seed set, constructing features for the annotated corpus according to a preset feature-construction method to obtain relationship features of the annotated corpus;
    inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model;
    evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining entity relationships of the unannotated corpora according to the evaluation result.
16. The computer-readable storage medium of claim 15, wherein after evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result, the computer-readable instructions, when executed by the processor, further implement the following steps:
    comparing the evaluation result with a preset condition, and determining unannotated corpora that satisfy the preset condition as candidate corpora;
    adding the candidate corpora to the seed set to obtain an updated seed set.
17. The computer-readable storage medium of claim 15, wherein constructing, for each annotated corpus in the seed set, features for the annotated corpus according to the preset feature-construction method to obtain the relationship features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for each named entity, obtaining the N word segments preceding the named entity to form a knowledge tuple as a first relationship feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relationship feature, and obtaining the N word segments following the named entity to form a knowledge tuple as a third relationship feature, where N is a positive integer;
    taking the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
18. The computer-readable storage medium of claim 15, wherein the preset similarity evaluation model is a BERT model.
19. The computer-readable storage medium of claim 18, wherein the BERT model comprises an encoding layer, a Concat layer, and a fully connected layer, and evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result comprises:
    encoding each unannotated corpus with the encoding layer of the BERT model to obtain a first encoded feature, and encoding each annotated corpus to obtain a second encoded feature;
    performing feature extraction and fusion on the first encoded features and the second encoded features through the Concat layer of the BERT model to obtain first fusion features and second fusion features, respectively;
    for any one of the first fusion features, computing a loss value between the first fusion feature and each second fusion feature based on a loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining as the evaluation result that a semantic similarity relationship exists between the unannotated corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
20. The computer-readable storage medium of claim 19, wherein the loss function is binary cross-entropy, and computing, for any one of the first fusion features, the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer comprises:

    Loss = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

    where Loss is the loss value; y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise; and ŷ is the probability that the first fusion feature is a positive example.
PCT/CN2020/136349 2020-09-08 2020-12-15 Semantic similarity-based entity relation extraction method and apparatus, device and medium WO2021121198A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010937274.9 2020-09-08
CN202010937274.9A CN112101041B (en) 2020-09-08 2020-09-08 Entity relationship extraction method, device, equipment and medium based on semantic similarity

Publications (1)

Publication Number Publication Date
WO2021121198A1 true WO2021121198A1 (en) 2021-06-24

Family

ID=73752238

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136349 WO2021121198A1 (en) 2020-09-08 2020-12-15 Semantic similarity-based entity relation extraction method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN112101041B (en)
WO (1) WO2021121198A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925210B (en) * 2022-03-21 2023-12-08 中国电信股份有限公司 Knowledge graph construction method, device, medium and equipment
CN115470871B (en) * 2022-11-02 2023-02-17 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018218705A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method for recognizing network text named entity based on neural network probability disambiguation
CN109446514A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 Construction method, device and the computer equipment of news property identification model
CN110825827A (en) * 2019-11-13 2020-02-21 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN110969005A (en) * 2018-09-29 2020-04-07 航天信息股份有限公司 Method and device for determining similarity between entity corpora


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886535A (en) * 2021-09-18 2022-01-04 前海飞算云创数据科技(深圳)有限公司 Knowledge graph-based question and answer method and device, storage medium and electronic equipment
CN114372446A (en) * 2021-12-13 2022-04-19 北京五八信息技术有限公司 Vehicle attribute labeling method, device and storage medium
CN114372446B (en) * 2021-12-13 2023-02-17 北京爱上车科技有限公司 Vehicle attribute labeling method, device and storage medium
CN116049347A (en) * 2022-06-24 2023-05-02 荣耀终端有限公司 Sequence labeling method based on word fusion and related equipment
CN116049347B (en) * 2022-06-24 2023-10-31 荣耀终端有限公司 Sequence labeling method based on word fusion and related equipment
CN115033717A (en) * 2022-08-12 2022-09-09 杭州恒生聚源信息技术有限公司 Triple extraction model training method, triple extraction method, device and equipment
CN115033717B (en) * 2022-08-12 2022-11-08 杭州恒生聚源信息技术有限公司 Triple extraction model training method, triple extraction method, device and equipment
CN116486420A (en) * 2023-04-12 2023-07-25 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN116486420B (en) * 2023-04-12 2024-01-12 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN117592562A (en) * 2024-01-18 2024-02-23 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117592562B (en) * 2024-01-18 2024-04-09 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing

Also Published As

Publication number Publication date
CN112101041B (en) 2022-02-15
CN112101041A (en) 2020-12-18


Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20901660; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20901660; Country of ref document: EP; Kind code of ref document: A1)