WO2021121198A1 - Semantic similarity-based entity relation extraction method and apparatus, device and medium - Google Patents


Info

Publication number
WO2021121198A1 (PCT/CN2020/136349)
Authority
WIPO (PCT)
Prior art keywords
corpus
feature
relationship
annotated
fusion
Application number
PCT/CN2020/136349
Other languages
French (fr)
Chinese (zh)
Inventor
陈振东 (Chen Zhendong)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021121198A1


Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING; G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06F 40/30 Semantic analysis (G06F 40/00 Handling natural language data)
    • G06F 18/22 Matching criteria, e.g. proximity measures (G06F 18/00 Pattern recognition; G06F 18/20 Analysing)
    • G06F 18/24 Classification techniques (G06F 18/00 Pattern recognition; G06F 18/20 Analysing)
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (G06F 40/20 Natural language analysis; G06F 40/205 Parsing)
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates (G06F 40/279 Recognition of textual entities)
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking (G06F 40/279 Recognition of textual entities)
    • G06F 40/295 Named entity recognition (G06F 40/279 Recognition of textual entities)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08 Learning methods (G06N 3/02 Neural networks)

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an entity relationship extraction method, device, equipment and medium based on semantic similarity.
  • Entity relationship extraction is an important research topic in the field of information extraction. Its main purpose is to extract the semantic relationship between marked entity pairs in a sentence, that is, to determine the relationship category between entity pairs in unstructured text on the basis of entity recognition, and to form structured data for storage and retrieval.
  • entity relationship extraction technology can provide theoretical support for other natural language processing technologies.
  • The existing method mainly determines the similarity between a new sentence and the original corpus by segmenting the sentence and then calculating the similarity.
  • This kind of similarity, being based on the similarity of text characters, depends heavily on the characterization ability of the word vectors. After multiple cycles, subsequently added corpora suffer from semantic drift, so the accuracy of entity relation extraction over the whole corpus becomes lower and lower.
  • the embodiments of the present application provide a method, device, computer equipment, and storage medium for extracting entity relationships based on semantic similarity, so as to improve the accuracy of extracting the relationship of named entities.
  • an embodiment of the present application provides an entity relationship extraction method based on semantic similarity, including:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • an embodiment of the present application also provides an entity relationship extraction device based on semantic similarity, including:
  • Data collection module used to obtain labeled corpus and unlabeled corpus, and store each of the labeled corpus in the seed set;
  • a feature construction module configured to construct features on the annotated corpus according to a preset feature construction method for each of the annotated corpus in the seed set, to obtain the relationship feature of the annotated corpus;
  • a data input module configured to input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model
  • the relation extraction module is configured to evaluate the unlabeled corpus based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • the embodiments of the present application also provide a computer-readable storage medium, the computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
  • the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • The method, device, equipment, and medium for extracting entity relationships based on semantic similarity obtain labeled corpus and unlabeled corpus and store each labeled corpus in a seed set; then, for each labeled corpus in the seed set, features are constructed on the labeled corpus according to a preset feature construction method to obtain the relationship features of the labeled corpus; the unlabeled corpus, the labeled corpus, and the relationship features of the labeled corpus are then input into the preset similarity evaluation model; finally, the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result. This realizes fast extraction of the entity relationships of unlabeled corpus in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
  • Fig. 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Fig. 2 is a flowchart of an embodiment of the entity relationship extraction method based on semantic similarity of the present application;
  • Fig. 3 is a schematic structural diagram of an embodiment of the entity relationship extraction device based on semantic similarity of the present application;
  • Fig. 4 is a schematic structural diagram of an embodiment of a computer device of the present application.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104 and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen that support web browsing, including but not limited to smart phones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, etc.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for extracting entity relationship based on semantic similarity provided by the embodiment of the present application is executed by the server, and accordingly, the device for extracting entity relationship based on semantic similarity is set in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
  • the terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.
  • FIG. 2 shows a semantic similarity-based entity relationship extraction method provided by an embodiment of the present application. The method is applied to the server in FIG. 1 as an example for description, and the details are as follows:
  • S201 Obtain labeled corpus and unlabeled corpus, and store each labeled corpus in a seed set.
  • NLP stands for Natural Language Processing. Common NLP tasks include but are not limited to: speech recognition, Chinese word segmentation, part-of-speech tagging, text categorization, parsing, automatic summarization, question answering, information extraction, etc.
  • Entity relationship extraction is a classic task in the NLP field. Specifically, given a sentence and the entities appearing in it, the relationship between the entities must be inferred from the semantic information of the sentence. For example, given the sentence "Tsinghua University is located in Beijing" and the entities "Tsinghua University" and "Beijing", the entity relationship extraction model obtains the relationship "located in" and finally extracts the knowledge triple (Tsinghua University, located in, Beijing). Entity relation extraction has been continuously researched over the past 20 years; feature engineering, kernel methods, and graph models have been widely used, and some phased results have been achieved. With the advent of the deep learning era, neural network models have brought new breakthroughs to entity relationship extraction.
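The extracted knowledge can be held in a simple (head, relation, tail) structure; a minimal Python sketch using the example above (the class name `KnowledgeTriple` is illustrative, not from the patent):

```python
from typing import NamedTuple

class KnowledgeTriple(NamedTuple):
    """A (head entity, relation, tail entity) triple extracted from a sentence."""
    head: str
    relation: str
    tail: str

# The example from the text: given the sentence and its two entities,
# the relation extraction model infers the relation "located in".
triple = KnowledgeTriple(head="Tsinghua University", relation="located in", tail="Beijing")
print(triple)
```

Such triples are the structured output that downstream storage and retrieval systems consume.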
  • Annotated corpus refers to corpus obtained by manually selecting part of the corpus according to actual needs and labeling its entity relationships.
  • Only a small amount of corpus needs to be annotated to support the subsequent training; the amount needed (for example, ten items) is far less than the number of corpus items required to train traditional deep models.
  • the source of the corpus selection in this embodiment can be selected according to actual needs, which is not limited here. For example, you can collect policy-related corpus from government sites, or collect sports-related corpus from sports forums or news sites.
  • the seed set in this embodiment can be understood as a corpus that is continuously improved and expanded.
  • a part of the corpus required by the task is obtained through manual annotation, and stored in the seed set as annotated corpus.
  • As more corpora of the same type as those required by the task are added from the unlabeled corpus, the seed set contains more and more corpora and their clustering characteristics become more and more obvious, which is conducive to improving the robustness of the seed set.
  • Synonymous or similar corpus related to the annotated corpus can also be acquired and added to the annotated corpus, so as to improve the subsequent training effect on the model.
  • S202 For each annotated corpus in the seed set, construct features on the annotated corpus according to a preset feature construction method to obtain the relational features of the annotated corpus.
  • each entity is annotated, and the relationship between the entities in the annotated corpus is characterized by a preset feature construction method, and the relationship characteristics of the annotated corpus are obtained.
  • the relationship feature refers to the entity relationship used to characterize the corpus knowledge tuple.
  • The preset feature construction method records, separately, the N words before the head entity, the words between the two entities, and the N words after the tail entity.
  • The three features are denoted w_BEF, w_BET, and w_AFT, respectively.
  • Common third-party word segmentation tools, such as jieba segmentation, can be used.
  • Common word segmentation algorithms include but are not limited to: conditional random field (CRF) algorithm, hidden Markov model (Hidden Markov Model, HMM), and N-gram model.
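As a minimal illustration of dictionary-based segmentation (deliberately simpler than the CRF, HMM, and N-gram approaches named above, and using a toy dictionary), a forward maximum-matching sketch:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                tokens.append(text[i:i + j])
                i += j
                break
    return tokens

vocab = {"清华大学", "位于", "北京"}  # toy dictionary, illustrative only
print(forward_max_match("清华大学位于北京", vocab))  # ['清华大学', '位于', '北京']
```

Real tools such as jieba combine a large dictionary with statistical models, but the greedy matching above captures the basic dictionary-lookup idea.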
  • S203 Input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model.
  • The similarity evaluation model for evaluating entity relationships is pre-trained. After the relationship features of the annotated corpus are obtained, the unlabeled corpus, the annotated corpus, and the relationship features of the annotated corpus are taken as input to the preset similarity evaluation model.
  • The preset similarity evaluation model is a neural network model, which specifically includes but is not limited to: the ELMo (Embeddings from Language Models) deep semantic representation algorithm, OpenAI GPT, and the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, etc.
  • An improved BERT model is used as the pre-training model in this embodiment.
  • The goal of the BERT model is to use large-scale unlabeled corpus training to obtain a representation of the text that contains rich semantic information (the semantic representation of the text), then fine-tune that semantic representation for a specific NLP task, and finally apply it to that NLP task.
  • The BERT model is mainly used to perform semantic representation and semantic extraction at the vocabulary level and the syntactic level, realizing the calculation of the similarity of entity relationships across different corpora, which is beneficial to improving the accuracy of entity relationship extraction.
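The similarity computation such a model performs on top of its semantic representations can be illustrated with plain cosine similarity between two embedding vectors; the four-dimensional vectors below are toy values, not actual BERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb_labeled = [0.2, 0.7, 0.1, 0.5]        # toy "embedding" of a labeled corpus
emb_unlabeled = [0.25, 0.65, 0.05, 0.55]  # toy embedding of an unlabeled corpus
print(round(cosine_similarity(emb_labeled, emb_unlabeled), 3))  # 0.994
```

A score near 1 indicates the two corpora point in nearly the same direction in the embedding space, which is the geometric intuition behind semantic similarity.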
  • S204 Evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • the unlabeled corpus is evaluated through a preset similarity evaluation model, annotated corpus, and relationship features, and the evaluation result is obtained, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  • The evaluation result is either that a similarity relationship exists between the unlabeled corpus and the labeled corpus, or that no similarity relationship exists between them.
  • For the specific process of evaluating the unlabeled corpus and obtaining the evaluation result, refer to the description of the subsequent embodiments; to avoid repetition, it is not repeated here.
  • In this embodiment, each annotated corpus is stored in a seed set; then, for each annotated corpus in the seed set, features are constructed on the annotated corpus according to the preset feature construction method to obtain its relationship features; the unlabeled corpus, the annotated corpus, and the relationship features of the annotated corpus are then input into the preset similarity evaluation model.
  • The unlabeled corpus is evaluated to obtain the evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result. The entity relationships of the unlabeled corpus are thus quickly extracted in a semi-supervised manner, which improves the accuracy and efficiency of entity relationship extraction.
  • the entity relationship extraction method based on semantic similarity further includes:
  • The candidate corpus is added to the seed set to obtain an updated seed set.
  • Specifically, the evaluation result is compared with a preset condition, the unlabeled corpus that meets the preset condition is determined as candidate corpus, and the candidate corpus is added to the seed set to obtain the updated seed set.
  • The preset condition may specifically be that the evaluation result indicates a similarity relationship between the unlabeled corpus and the labeled corpus, and that the similarity reaches a preset value.
  • The preset value can be set according to actual needs, such as 0.8; it is not specifically limited here.
  • the unlabeled corpus that meets the conditions is added to the seed set to expand the number of samples in the seed set, which is beneficial to improve the recognition accuracy of the subsequent preset similarity recognition model.
  • the seed set is updated in a semi-supervised manner to increase the number of samples in the seed set, which is beneficial to improve the accuracy of subsequent recognition.
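The semi-supervised seed-set update can be sketched as one round of a bootstrapping loop. Here `evaluate_similarity` is a stand-in for the preset similarity evaluation model (a toy word-overlap score is used below), and the 0.8 threshold follows the example in the text:

```python
def update_seed_set(seed_set, unlabeled, evaluate_similarity, threshold=0.8):
    """One bootstrapping round: every unlabeled corpus whose best similarity
    against the seed set reaches the threshold becomes a candidate corpus
    and joins the seed set; the rest await a later round."""
    remaining = []
    for corpus in unlabeled:
        score = max(evaluate_similarity(corpus, labeled) for labeled in seed_set)
        if score >= threshold:
            seed_set.append(corpus)
        else:
            remaining.append(corpus)
    return seed_set, remaining

def toy_sim(a, b):
    """Jaccard word overlap: a crude stand-in for the real evaluation model."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

seeds = ["tsinghua university is in beijing"]
pool = ["tsinghua university is in beijing city", "stock prices fell today"]
seeds, pool = update_seed_set(seeds, pool, toy_sim)
print(len(seeds), len(pool))  # 2 1
```

Each round grows the seed set, which is why the patent stresses that the similarity evaluation must be semantic rather than character-based: a weak score function lets drifted corpora slip in and compound over rounds.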
  • In step S202, for each annotated corpus in the seed set, constructing features on the annotated corpus according to the preset feature construction method to obtain the relationship features of the annotated corpus includes:
  • obtaining the N word segments before the head named entity to form a knowledge tuple, as the first relationship feature;
  • obtaining the word segments between the two named entities to form a knowledge tuple, as the second relationship feature;
  • obtaining the N word segments after the tail named entity to form a knowledge tuple, as the third relationship feature, where N is a positive integer;
  • using the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
  • The named entities of the annotated corpus may be obtained through manual annotation or through a named entity recognition model.
  • The value of N can be set according to actual needs, for example, N = 3.
  • A knowledge tuple refers to a tuple composed of an entity and the word segments before and after it, and is used to characterize the relationship between the entity and those word segments.
  • In this way, the relationship features of the annotated corpus are obtained, which improves the accuracy of subsequent semantic extraction based on the relationship features and is beneficial to improving the accuracy of similarity recognition.
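Assuming the corpus is already tokenized and the positions of the head and tail entities are known (both assumptions; the patent leaves tokenization to a segmentation tool), the three relationship features might be constructed as follows, with N = 3 as in the example:

```python
def build_relation_features(tokens, head_idx, tail_idx, n=3):
    """Return the three relationship features of an annotated corpus:
    the n tokens before the head entity (w_BEF), the tokens between the
    two entities (w_BET), and the n tokens after the tail entity (w_AFT)."""
    return {
        "w_BEF": tokens[max(0, head_idx - n):head_idx],
        "w_BET": tokens[head_idx + 1:tail_idx],
        "w_AFT": tokens[tail_idx + 1:tail_idx + 1 + n],
    }

tokens = ["It", "is", "known", "that", "Tsinghua", "is", "located", "in",
          "Beijing", "near", "the", "center"]
# head entity "Tsinghua" at index 4, tail entity "Beijing" at index 8
features = build_relation_features(tokens, head_idx=4, tail_idx=8)
print(features["w_BET"])  # ['is', 'located', 'in']
```

The slicing clamps at the sentence boundaries, so entities near the start or end of a sentence simply yield shorter w_BEF or w_AFT windows.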
  • the BERT model includes a coding layer, a Concat layer, and a fully connected layer.
  • Evaluating the unlabeled corpus based on the preset similarity evaluation model and relationship features to obtain the evaluation result includes:
  • encoding each unlabeled corpus to obtain a first coding feature, and encoding each labeled corpus to obtain a second coding feature;
  • performing feature extraction and fusion on the first coding features and the second coding features through the Concat layer to obtain first fusion features and second fusion features;
  • for any first fusion feature, calculating the loss value between that first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as the target loss value;
  • if the target loss value is less than the preset loss threshold, determining that the evaluation result is that the unlabeled corpus corresponding to the first fusion feature and the labeled corpus corresponding to the target loss value have a semantic similarity relationship.
  • the semantic similarity between the unlabeled corpus and the labeled corpus is evaluated, and the evaluation result is obtained.
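The target-loss selection and threshold decision in these steps can be sketched as follows. The fusion features and the loss function here are toy stand-ins (mean absolute difference rather than the model's actual fully connected layer), and the 0.05 threshold follows the example given later in the text:

```python
def evaluate_unlabeled(first_fusion, second_fusions, loss_fn, loss_threshold=0.05):
    """Compare one first fusion feature (unlabeled corpus) against every
    second fusion feature (labeled corpora): keep the minimum loss as the
    target loss; below the threshold means a semantic similarity
    relationship with the corresponding labeled corpus."""
    losses = [loss_fn(first_fusion, f) for f in second_fusions]
    target_loss = min(losses)
    best_index = losses.index(target_loss)
    return target_loss < loss_threshold, best_index, target_loss

def toy_loss(a, b):
    """Mean absolute difference: a stand-in for the real loss function."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

similar, idx, loss = evaluate_unlabeled(
    [0.2, 0.8, 0.5],
    [[0.9, 0.1, 0.3], [0.21, 0.79, 0.52]],
    toy_loss,
)
print(similar, idx)  # True 1
```

Taking the minimum loss rather than an average means the unlabeled corpus only needs one close match in the seed set to be accepted.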
  • The essence of BERT is to learn good feature representations for words by running a self-supervised learning method on massive corpora. Self-supervised learning here refers to supervised learning that runs on data that is not manually labeled.
  • BERT provides a model for transfer learning to other tasks: it can be fine-tuned or fixed according to the task and then used as a feature extractor.
  • Each layer of neurons represents a learned intermediate feature (that is, a combination of several weights), and all neurons in the network work together to characterize specific attributes of the input data (for example, in image classification, the category to which the image belongs).
  • The direct effect of adding a Dropout layer after the fully connected layer in this embodiment is to reduce the number of intermediate features and thereby reduce redundancy, that is, to increase the orthogonality between the features of each layer. Specifically, the outputs of some randomly chosen hidden-layer nodes are disabled; those disabled nodes can temporarily be considered not part of the network structure, but their weights are preserved (just temporarily not updated), because they may be active again the next time a sample is input. This effectively prevents overfitting.
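The dropout behavior described here, randomly disabling node outputs during training while preserving the weights themselves, can be sketched as below; the inverted-dropout scaling by 1/(1 - p) is a common convention added in this sketch, not something stated in the text:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1/(1 - p); at inference time,
    pass everything through unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded for reproducibility
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
print(out)  # roughly half the activations are zeroed, survivors are doubled
```

Note that dropout masks outputs only: the underlying weights stay in place, exactly as the passage describes, so a node disabled on one sample can contribute again on the next.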
  • the preset loss threshold can be set according to actual needs, for example, set to 0.05, which is not specifically limited here.
  • The loss function is the two-class cross entropy. For any first fusion feature, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer includes computing:
  • Loss = -[y·log(p) + (1 - y)·log(1 - p)]
  • where Loss is the loss value, y is the sample label of the second fusion feature (the value is 1 for a positive example, otherwise 0), and p is the probability that the first fusion feature is a positive example.
  • The two-class cross entropy predicts two categories, positive examples and negative examples, and the specific positive and negative examples can be set in the model.
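The two-class cross entropy above translates directly into code; the epsilon clamp is a small numerical-safety addition of this sketch, not part of the formula:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Two-class cross entropy: y is the 0/1 sample label and p the
    predicted probability that the sample is a positive example."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A confident correct positive prediction gives a small loss...
print(round(binary_cross_entropy(1, 0.97), 4))  # 0.0305
# ...while a confident wrong prediction gives a large one.
print(round(binary_cross_entropy(0, 0.97), 4))  # 3.5066
```

This is also why a small loss threshold such as 0.05 corresponds to a high predicted similarity probability: the loss only drops below 0.05 when p exceeds about 0.95.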
  • the BERT model is used to evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, which is beneficial to improve the accuracy of the evaluation.
  • Fig. 3 shows a schematic block diagram of an entity relationship extraction device based on semantic similarity in a one-to-one correspondence with the above embodiment of the entity relationship extraction method based on semantic similarity.
  • the entity relationship extraction device based on semantic similarity includes a data collection module 31, a feature construction module 32, a data input module 33 and a relationship extraction module 34.
  • the detailed description of each functional module is as follows:
  • the data collection module 31 is used to obtain labeled corpus and unlabeled corpus, and store each labeled corpus in the seed set;
  • the feature construction module 32 is configured to construct features on the annotated corpus according to a preset feature construction method for each annotated corpus in the seed set, and obtain the relationship features of the annotated corpus;
  • the data input module 33 is used to input the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into the preset similarity evaluation model;
  • the relation extraction module 34 is configured to evaluate the unlabeled corpus based on the preset similarity evaluation model and relationship characteristics to obtain the evaluation result, and determine the entity relationship of the unlabeled corpus according to the evaluation result.
  • the entity relationship extraction device based on semantic similarity further includes:
  • the candidate corpus determination module is used to compare the evaluation result with the preset conditions, and determine the unlabeled corpus that meets the preset conditions as the candidate corpus;
  • the seed set update module is used to add the candidate corpus to the seed set to obtain the updated seed set.
  • the feature building module 32 includes:
  • the named entity acquisition unit is used to acquire the named entity of the annotated corpus
  • the feature construction unit is used to obtain the N word segments before the head named entity to form a knowledge tuple as the first relationship feature; obtain the word segments between the two named entities to form a knowledge tuple as the second relationship feature; and obtain the N word segments after the tail named entity to form a knowledge tuple as the third relationship feature, where N is a positive integer;
  • the relationship feature determining unit is configured to use the first relationship feature, the second relationship feature, and the third relationship feature as the relationship feature of the annotated corpus.
  • the BERT model includes an encoding layer, a Concat layer, and a fully connected layer
  • the relation extraction module 34 includes:
  • the feature coding unit is used to use the coding layer of the BERT model to encode each unlabeled corpus to obtain the first coding feature, and to encode each labeled corpus to obtain the second coding feature;
  • the feature fusion unit is used to perform feature extraction and fusion on the first coding feature and the second coding feature respectively through the Concat layer of the BERT model to obtain the first fusion feature and the second fusion feature;
  • the loss calculation unit is used to calculate the loss value of the first fusion feature and each second fusion feature based on the loss function of the fully connected layer for any first fusion feature, and use the minimum loss value as the target loss value;
  • the result determining unit is configured to determine that the evaluation result is that the unlabeled corpus corresponding to the first fusion feature and the labeled corpus corresponding to the target loss value have a semantic similarity relationship if the target loss value is less than the preset loss threshold.
  • Each module in the above-mentioned semantic similarity-based entity relationship extraction device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are connected to each other in communication via a system bus. It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43; it should be understood, however, that not all of the illustrated components are required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded equipment, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • The memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store the operating system and various application software installed in the computer device 4, such as program codes for controlling electronic files.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run program codes or process data stored in the memory 41, for example, run program codes for controlling electronic files.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • The computer-readable storage medium may be non-volatile or volatile, and stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the steps of the entity relationship extraction method based on semantic similarity described above.
  • the method of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform; of course, it can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of this application.

Abstract

A semantic similarity-based entity relation extraction method and apparatus, a device and a medium, relating to the field of artificial intelligence. The method comprises: obtaining annotated corpora and unannotated corpora, and storing each annotated corpus in a seed set (S201); for each annotated corpus in the seed set, on the basis of a preset feature construction mode, constructing features for each annotated corpus, and obtaining relation features of the annotated corpora (S202); inputting the unannotated corpora, the annotated corpora, and the relation features of the annotated corpora into a preset similarity evaluation model (S203); and on the basis of the preset similarity evaluation model and the relation features, evaluating the unannotated corpora, obtaining an evaluation result, then determining an entity relation of the unannotated corpora on the basis of the evaluation result (S204). By means of a semi-supervised method, rapid entity relation extraction may be performed in respect of the unannotated corpora, improving entity relation extraction accuracy and efficiency.

Description

Semantic Similarity-Based Entity Relationship Extraction Method, Apparatus, Device, and Medium
This application claims priority to Chinese patent application No. 2020109372749, entitled "Semantic Similarity-Based Entity Relationship Extraction Method, Apparatus, Device, and Medium", filed with the Chinese Patent Office on September 8, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a semantic similarity-based entity relationship extraction method, apparatus, device, and medium.
Background
In the field of natural language processing, tasks such as semantic network annotation, text understanding, and machine translation often require extracting entity relationships from the content of a corpus. Entity relationship extraction is an important research topic in the field of information extraction. Its main purpose is to extract the semantic relationship between the marked entity pairs in a sentence, that is, to determine, on the basis of entity recognition, the relationship category between entity pairs in unstructured text, and to form structured data for storage and retrieval. In both theoretical research and practical application, entity relationship extraction technology can provide support for other natural language processing technologies.
In the process of realizing this application, the inventor recognized that the prior art has at least the following problem: existing approaches mainly determine the similarity between a new sentence and the original corpus by segmenting the sentence into words and then computing a similarity score. The accuracy of this similarity, which is based on the degree of textual character similarity, depends heavily on the representational ability of the word vectors; after multiple iterations, the subsequently added corpora suffer from semantic drift, so the accuracy of entity relationship extraction over the entire corpus becomes lower and lower.
Summary
The embodiments of this application provide a semantic similarity-based entity relationship extraction method, apparatus, computer device, and storage medium, so as to improve the accuracy of relationship extraction for named entities.
To solve the above technical problem, an embodiment of this application provides a semantic similarity-based entity relationship extraction method, including:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a semantic similarity-based entity relationship extraction apparatus, including:
a data collection module, configured to obtain annotated corpora and unannotated corpora, and store each of the annotated corpora in a seed set;
a feature construction module, configured to, for each of the annotated corpora in the seed set, construct features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
a data input module, configured to input the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
a relationship extraction module, configured to evaluate the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement the following steps:
obtaining annotated corpora and unannotated corpora, and storing each of the annotated corpora in a seed set;
for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction mode, to obtain relationship features of the annotated corpus;
inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model; and
evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining the entity relationship of the unannotated corpora according to the evaluation result.
In the semantic similarity-based entity relationship extraction method, apparatus, device, and medium provided by the embodiments of this application, annotated corpora and unannotated corpora are obtained and each annotated corpus is stored in a seed set; for each annotated corpus in the seed set, features are constructed for the annotated corpus according to a preset feature construction mode to obtain the relationship features of the annotated corpus; the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are then input into a preset similarity evaluation model; based on the preset similarity evaluation model and the relationship features, the unannotated corpora are evaluated to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result. This enables rapid entity relationship extraction for unannotated corpora in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
Description of the Drawings
In order to explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is an exemplary system architecture diagram to which this application can be applied;
Fig. 2 is a flowchart of an embodiment of the semantic similarity-based entity relationship extraction method of this application;
Fig. 3 is a schematic structural diagram of an embodiment of the semantic similarity-based entity relationship extraction apparatus according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for describing specific embodiments and are not intended to limit this application. The terms "including" and "having" in the specification and claims of this application and in the above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second", etc. in the specification and claims of this application or in the above drawings are used to distinguish different objects, rather than to describe a specific sequence.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the accompanying drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.
Referring to Fig. 1, as shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server that provides various services, for example, a background server that provides support for the pages displayed on the terminal devices 101, 102, and 103.
It should be noted that the semantic similarity-based entity relationship extraction method provided by the embodiments of this application is executed by the server; accordingly, the semantic similarity-based entity relationship extraction apparatus is disposed in the server.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs; the terminal devices 101, 102, and 103 in the embodiments of this application may specifically correspond to application systems in actual production.
Referring to Fig. 2, Fig. 2 shows a semantic similarity-based entity relationship extraction method provided by an embodiment of this application. The method is described by taking its application to the server in Fig. 1 as an example, and is detailed as follows:
S201: Obtain annotated corpora and unannotated corpora, and store each annotated corpus in a seed set.
Specifically, in the field of natural language processing, tasks involving semantic network annotation, text understanding, machine translation, and knowledge graph construction often require extracting entity relationships from the content of a corpus, so as to build a corpus for automated processing and improve processing efficiency. Before entity relationship extraction is performed, the types of corpora to be extracted need to be preset; therefore, some corpora are annotated in advance to obtain annotated corpora, the annotated corpora are stored in a seed set, and the remaining corpora serve as the unannotated corpora.
Natural language processing (NLP) reflects the fact that understanding natural language requires extensive knowledge about the external world and the ability to apply and manipulate that knowledge; natural language cognition is therefore also regarded as an AI-complete problem. NLP tasks mainly refer to tasks involving the semantic understanding or parsing of natural language. Common NLP tasks include, but are not limited to: speech recognition, Chinese word segmentation, part-of-speech tagging, text categorization, syntactic parsing, automatic summarization, question answering, and information extraction.
Entity relationship extraction is a classic task in the NLP field. Specifically, given a sentence and the entities appearing in it, the relationship between the entities needs to be inferred from the semantic information of the sentence. For example, given the sentence "Tsinghua University is located near Beijing" and the entities "Tsinghua University" and "Beijing", the entity relationship extraction model obtains the relationship "located in" and finally extracts the knowledge triple (Tsinghua University, located in, Beijing). Entity relationship extraction has been continuously researched for more than 20 years; feature engineering, kernel methods, and graph models have been widely applied to it and have achieved some phased results. With the advent of the deep learning era, neural network models have brought new breakthroughs to entity relationship extraction.
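The knowledge triple in the example above can be represented as a simple data structure. The patent does not prescribe a representation; the sketch below (class and field names hypothetical) just fixes the (head, relation, tail) shape:

```python
from typing import NamedTuple

class KnowledgeTriple(NamedTuple):
    """A (head entity, relation, tail entity) triple, as in the example above."""
    head: str
    relation: str
    tail: str

triple = KnowledgeTriple("Tsinghua University", "located in", "Beijing")
print(triple.head, "--", triple.relation, "->", triple.tail)
```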
An annotated corpus is a corpus obtained by manually selecting part of the corpora according to actual needs and annotating the entity relationships of the corpus. In this embodiment, only a small number of corpora need to be annotated to meet the subsequent training needs, for example, ten, which is far fewer than the number of corpora required for training a traditional deep model.
It should be noted that the source from which the corpora are selected in this embodiment can be chosen according to actual needs, and is not limited here. For example, policy-related corpora can be collected from government sites, or sports-related corpora can be collected from sports forums or news sites.
The seed set in this embodiment can be understood as a continuously improved and expanded corpus. In the initial stage, part of the corpora of the types required by the task is obtained through manual annotation and stored in the seed set as annotated corpora. Subsequently, through semi-supervised annotation training, more corpora of the same types as those required by the task are added from the unannotated corpora, so that the seed set contains more and more corpora and the clustering characteristics of the corpora become more and more obvious, which is conducive to improving the robustness of the seed set.
Further, in this embodiment, synonyms or similar corpora related to the annotated corpora are obtained and added to the annotated corpora, so as to improve the subsequent training effect of the model.
S202: For each annotated corpus in the seed set, construct features for the annotated corpus according to a preset feature construction mode, to obtain the relationship features of the annotated corpus.
Specifically, each entity is annotated in the annotated corpus, and the relationship between the entities in the annotated corpus is characterized through the preset feature construction mode, to obtain the relationship features of the annotated corpus.
A relationship feature refers to the entity relationship used to characterize the knowledge tuples of the corpus.
Preferably, in this embodiment, the preset feature construction method is to separately record the N words before the head entity, between the two entities, and after the tail entity; the three features are denoted w_BEF, w_BET, and w_AFT, respectively. For details, refer to the description of the subsequent embodiments; to avoid repetition, they are not repeated here.
Further, before features are constructed for the annotated corpus, the annotated corpus also needs to be segmented into words using the annotated entities. Specifically, a third-party word segmentation tool or a word segmentation algorithm can be used. A common third-party segmentation tool is, for example, jieba; common word segmentation algorithms include, but are not limited to: the conditional random field (CRF) algorithm, the hidden Markov model (HMM), and the N-gram model.
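In practice a tool such as jieba or a CRF/HMM model would be used for this segmentation step. As a self-contained illustration only (not the patent's method), a dictionary-based forward maximum-matching segmenter, run on the example sentence from above, can be sketched as:

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position, greedily take the longest
    substring that appears in the vocabulary; fall back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + j]
            if j == 1 or word in vocab:
                tokens.append(word)
                i += j
                break
    return tokens

# Toy vocabulary covering the example sentence (illustrative only).
vocab = {"清华大学", "坐落", "于", "北京", "近邻"}
print(fmm_segment("清华大学坐落于北京近邻", vocab))
# ['清华大学', '坐落', '于', '北京', '近邻']
```

Real segmenters also handle out-of-vocabulary words statistically, which this greedy sketch does not.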
S203: Input the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model.
Specifically, a similarity evaluation model for evaluating entity relationships is trained in advance. After the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are obtained, they are taken as input and fed into the preset similarity evaluation model.
The preset similarity evaluation model is a neural network model, including but not limited to: Embeddings from Language Models (ELMo), OpenAI GPT, and Bidirectional Encoder Representations from Transformers (BERT).
Preferably, an improved BERT model is used as the pre-training model in this embodiment.
The goal of the BERT model is to use large-scale unannotated corpus training to obtain a representation of text that contains rich semantic information, that is, the semantic representation of the text; the semantic representation is then fine-tuned for a specific NLP task and finally applied to that task. In this embodiment, the BERT model is mainly used to perform semantic representation and semantic extraction at the vocabulary and syntax levels, so as to calculate the degree of similarity between the entity relationships in different corpora, which is conducive to improving the accuracy of the entity relationships.
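The patent does not spell out how a similarity score is computed from the model's semantic representations. A common choice (an assumption here, not the patent's stated formula) is cosine similarity between sentence embedding vectors, shown below on tiny made-up vectors in place of real BERT outputs:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional "sentence embeddings" (fabricated for illustration).
emb_annotated = [0.2, 0.7, 0.1, 0.0]
emb_unannotated = [0.25, 0.65, 0.05, 0.1]
print(round(cosine_similarity(emb_annotated, emb_unannotated), 3))
```

A score near 1.0 would indicate that the unannotated sentence is semantically close to the annotated one.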
S204: Evaluate the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determine the entity relationship of the unannotated corpora according to the evaluation result.
Specifically, the unannotated corpora are evaluated through the preset similarity evaluation model, the annotated corpora, and the relationship features to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result.
The evaluation result indicates either that a similarity relationship exists between the unannotated corpus and an annotated corpus, or that no such similarity relationship exists.
It should be understood that when a similarity relationship exists between the unannotated corpus and the annotated corpus, it indicates that the semantics of the unannotated corpus are close to or the same as those of the annotated corpus; in this case, the entity relationship corresponding to the annotated corpus can be used as the entity relationship of the unannotated corpus.
For the specific process of evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result, refer to the description of the subsequent embodiments; to avoid repetition, it is not repeated here.
In this embodiment, annotated corpora and unannotated corpora are obtained, and each annotated corpus is stored in a seed set; then, for each annotated corpus in the seed set, features are constructed for the annotated corpus according to a preset feature construction mode to obtain the relationship features of the annotated corpus; the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora are then input into a preset similarity evaluation model; based on the preset similarity evaluation model and the relationship features, the unannotated corpora are evaluated to obtain an evaluation result, and the entity relationship of the unannotated corpora is determined according to the evaluation result. This enables rapid entity relationship extraction for unannotated corpora in a semi-supervised manner, improving the accuracy and efficiency of entity relationship extraction.
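Steps S201–S204 can be sketched as a loop that scores each unannotated sentence against the seed set and, when a sufficiently similar seed is found, transfers that seed's entity relationship. All names here are hypothetical, and `evaluate` is a stand-in for the preset similarity evaluation model:

```python
def extract_relations(seeds, unannotated, evaluate, threshold=0.8):
    """seeds: list of (sentence, relation) pairs from the seed set.
    unannotated: list of sentences. evaluate(a, b) -> similarity score.
    Returns {sentence: relation} for sentences matched above the threshold."""
    results = {}
    for sentence in unannotated:
        # Score the unannotated sentence against every annotated seed (S203/S204).
        best_relation, best_score = None, 0.0
        for seed_sentence, relation in seeds:
            score = evaluate(sentence, seed_sentence)
            if score > best_score:
                best_relation, best_score = relation, score
        if best_score >= threshold:
            # Adopt the most similar seed's entity relationship.
            results[sentence] = best_relation
    return results

# Stub evaluator (exact match) in place of the BERT-based model.
toy_evaluate = lambda a, b: 1.0 if a == b else 0.0
seeds = [("Tsinghua University is located near Beijing", "located in")]
print(extract_relations(
    seeds, ["Tsinghua University is located near Beijing"], toy_evaluate))
# {'Tsinghua University is located near Beijing': 'located in'}
```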
In some optional implementations of this embodiment, after step S204, the semantic similarity-based entity relationship extraction method further includes:
comparing the evaluation result with a preset condition, and determining the unannotated corpora that meet the preset condition as candidate corpora; and
adding the candidate corpora to the seed set to obtain an updated seed set.
Specifically, the evaluation result is compared with the preset condition, the unannotated corpora that meet the preset condition are determined as candidate corpora, and the candidate corpora are added to the seed set to obtain the updated seed set.
In this embodiment, the preset condition may specifically be that the evaluation result indicates a similarity relationship between the unannotated corpus and an annotated corpus and that the similarity reaches a preset value. The preset value can be set according to actual needs, for example, 0.8, and is not specifically limited here.
It should be understood that in this embodiment, the qualifying unannotated corpora are added to the seed set to expand the number of samples in the seed set, which is conducive to improving the recognition accuracy of the subsequent preset similarity recognition model.
In this embodiment, the seed set is updated in a semi-supervised manner to increase the number of samples in the seed set, which is conducive to improving the accuracy of subsequent recognition.
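The seed-set update described above can be sketched as a simple threshold filter (function names hypothetical; the 0.8 threshold matches the example value given in this embodiment):

```python
def update_seed_set(seed_set, evaluations, threshold=0.8):
    """seed_set: list of (corpus, relation) pairs.
    evaluations: list of (unannotated_corpus, relation, score) tuples produced
    by the similarity evaluation model. Corpora scoring at or above the
    threshold become candidate corpora and join the seed set."""
    candidates = [(corpus, relation)
                  for corpus, relation, score in evaluations
                  if score >= threshold]
    return seed_set + candidates

seeds = [("sentence A", "located in")]
evals = [("sentence B", "located in", 0.91),
         ("sentence C", "located in", 0.42)]
updated = update_seed_set(seeds, evals)
print(len(updated))  # 2: only "sentence B" passed the 0.8 threshold
```

Each round of this update enlarges the seed set, so later rounds compare against more samples.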
In some optional implementations of this embodiment, in step S202, constructing features for each annotated corpus in the seed set according to a preset feature construction method to obtain the relation features of the annotated corpus includes:
Obtaining the named entities of the annotated corpus;
For the named entities, obtaining the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtaining the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
Taking the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
Specifically, the named entities of the annotated corpus may be obtained by manual annotation, or by a named entity recognition model.
The value of N can be set according to actual needs; for example, N is set to 3.
A knowledge tuple is a tuple composed of an entity and the word segments before and after it, and is used to characterize the relationship between the entity and those word segments.
In this embodiment, constructing features for the annotated corpus to obtain its relation features improves the accuracy of subsequent semantic extraction based on the relation features, which helps improve the accuracy of similarity recognition.
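The three knowledge-tuple features above can be sketched for a word-segmented sentence as follows. This is an illustrative reading under stated assumptions: the text is ambiguous about whether "before" and "after" apply per entity or to the entity span as a whole, so this sketch takes the N segments before the first entity and after the last, with N=3 as in the example; the entity spans and sample sentence are invented for illustration.

```python
def build_relation_features(tokens, entity_spans, n=3):
    """Build the three knowledge-tuple relation features.

    tokens: the word-segmented sentence.
    entity_spans: (start, end) token-index pairs of named entities, in order.
    Returns (first, second, third) relation features: n segments before the
    entities, segments between each consecutive entity pair, and n segments
    after the entities.
    """
    first_start = entity_spans[0][0]
    last_end = entity_spans[-1][1]
    # first relation feature: up to n word segments before the (first) entity
    before = tuple(tokens[max(0, first_start - n):first_start])
    # second relation feature: segments between consecutive named entities
    between = tuple(
        tuple(tokens[entity_spans[i][1]:entity_spans[i + 1][0]])
        for i in range(len(entity_spans) - 1)
    )
    # third relation feature: up to n word segments after the (last) entity
    after = tuple(tokens[last_end:last_end + n])
    return before, between, after


tokens = ["the", "founder", "of", "Acme", "hired", "Bob", "last", "year", "."]
spans = [(3, 4), (5, 6)]  # hypothetical entities "Acme" and "Bob"
features = build_relation_features(tokens, spans, n=3)
```

For this sentence the tuples are ("the", "founder", "of"), (("hired",),) and ("last", "year", "."), i.e., the context windows that characterize the entity relationship.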
In some optional implementations of this embodiment, the BERT model includes an encoding layer, a Concat layer and a fully connected layer. In step S204, evaluating the unlabeled corpus based on the preset similarity evaluation model and the relation features to obtain an evaluation result includes:
Using the encoding layer of the BERT model to encode each unlabeled corpus to obtain a first encoding feature, and to encode each annotated corpus to obtain a second encoding feature;
Performing feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
For any first fusion feature, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
If the target loss value is less than a preset loss threshold, determining that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
Specifically, the BERT model evaluates the semantic similarity between the unlabeled corpus and the annotated corpus to obtain the evaluation result.
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model based on a multi-layer Transformer structure. In essence, BERT learns a good feature representation for words by running a self-supervised learning method over massive corpora, where self-supervised learning means supervised learning that runs on data without manual labels. In subsequent specific NLP tasks, the feature representation produced by BERT can be used directly as the word embedding features of the task. BERT therefore provides a model for transfer learning to other tasks: it can be fine-tuned for a task, or frozen and used as a feature extractor.
In this embodiment, to prevent overfitting of the BERT model, a Dropout layer is added after the fully connected layer. It should be noted that, when a network is designed, each neuron in a layer represents a learned intermediate feature (i.e., a combination of several weights), and all neurons of the network act together to characterize specific attributes of the input data (in image classification, for example, the category to which the input belongs). When the amount of data is too small relative to the complexity of the network (i.e., its expressive and fitting capacity), overfitting occurs; at that point the features represented by the neurons clearly contain much repetition and redundancy. Adding a Dropout layer after the fully connected layer of this embodiment directly reduces the number of intermediate features and thus the redundancy, increasing the orthogonality between the features of each layer. Concretely, during model training the weights of randomly chosen hidden-layer nodes are disabled; those inactive nodes can temporarily be regarded as not being part of the network structure, but their weights are retained (merely not updated for the moment), because they may take effect the next time a sample is input. This effectively prevents overfitting.
The preset loss threshold can be set according to actual needs, for example 0.05, and is not specifically limited here.
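The minimum-loss decision rule above can be sketched as follows. This is a simplified illustration, not the patented model: `toy_loss` and the vector-valued "fusion features" are hypothetical stand-ins for the BERT encode/Concat/fully-connected pipeline, and the 0.05 threshold follows the example value in the text.

```python
LOSS_THRESHOLD = 0.05  # example preset loss threshold from the text


def assess_unlabeled(first_fused, second_fused_list, loss_fn, threshold=LOSS_THRESHOLD):
    """For one first fusion feature (unlabeled corpus), compute the loss
    against every second fusion feature (annotated corpus), take the minimum
    as the target loss, and report whether a semantic-similarity relationship
    holds (target loss < threshold) plus the index of the matching corpus."""
    losses = [loss_fn(first_fused, second) for second in second_fused_list]
    target = min(losses)
    best_index = losses.index(target)
    similar = target < threshold
    return similar, best_index, target


def toy_loss(a, b):
    # mean squared difference as a placeholder loss; the embodiment uses
    # the binary cross-entropy of the fully connected layer instead
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


labeled = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]   # second fusion features
unlabeled_feat = [0.88, 0.12, 0.01]            # one first fusion feature
similar, idx, loss = assess_unlabeled(unlabeled_feat, labeled, toy_loss)
```

Here the first annotated feature is the closest match and its loss falls below the threshold, so a semantic similarity relationship is reported against annotated corpus 0.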
Further, in some optional implementations of this embodiment, the loss function is the binary cross-entropy, and calculating, for any first fusion feature and based on the loss function of the fully connected layer, the loss value between the first fusion feature and each second fusion feature includes:

Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where Loss is the loss value, y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise, and ŷ is the probability that the first fusion feature is a positive example.
It should be understood that the binary cross-entropy predicts over two classes, positive examples and negative examples; which examples are positive and which are negative can be set in the model.
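The binary cross-entropy described above can be computed directly; the sketch below follows the standard formula Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)], with a small epsilon clamp added here (not mentioned in the text) to guard against log(0).

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy loss.

    y: sample label of the second fusion feature (1 for a positive
       example, 0 otherwise).
    y_hat: predicted probability that the first fusion feature is a
       positive example.
    """
    y_hat = min(max(y_hat, eps), 1 - eps)  # numerical guard, illustrative
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


loss_pos = binary_cross_entropy(1, 0.9)  # confident, correct: small loss
loss_neg = binary_cross_entropy(0, 0.9)  # confident, wrong: large loss
```

A confident prediction that matches the label yields a small loss, while a confident mismatch is penalized heavily, which is what drives the minimum-loss matching between fusion features.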
In this embodiment, the unlabeled corpus is evaluated through the BERT model, based on the preset similarity evaluation model and the relation features, to obtain the evaluation result, which helps improve the accuracy of the evaluation.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 3 shows a schematic block diagram of an entity relationship extraction apparatus based on semantic similarity, in one-to-one correspondence with the entity relationship extraction method based on semantic similarity of the foregoing embodiments. As shown in Fig. 3, the apparatus includes a data collection module 31, a feature construction module 32, a data input module 33 and a relation extraction module 34. The functional modules are described in detail as follows:
The data collection module 31 is configured to obtain annotated corpora and unlabeled corpora, and store each annotated corpus in a seed set;
The feature construction module 32 is configured to construct, for each annotated corpus in the seed set, features for the annotated corpus according to a preset feature construction method, to obtain the relation features of the annotated corpus;
The data input module 33 is configured to input the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
The relation extraction module 34 is configured to evaluate the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain evaluation results, and determine the entity relationships of the unlabeled corpora according to the evaluation results.
Optionally, the entity relationship extraction apparatus based on semantic similarity further includes:
A candidate corpus determination module, configured to compare the evaluation result with a preset condition, and take the unlabeled corpora that meet the preset condition as candidate corpora;
A seed set update module, configured to add the candidate corpora to the seed set to obtain an updated seed set.
Optionally, the feature construction module 32 includes:
A named entity acquisition unit, configured to obtain the named entities of the annotated corpus;
A feature construction unit, configured to, for the named entities, obtain the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtain the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtain the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
A relation feature determination unit, configured to take the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
Optionally, the BERT model includes an encoding layer, a Concat layer and a fully connected layer, and the relation extraction module 34 includes:
A feature encoding unit, configured to use the encoding layer of the BERT model to encode each unlabeled corpus to obtain a first encoding feature, and to encode each annotated corpus to obtain a second encoding feature;
A feature fusion unit, configured to perform feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
A loss calculation unit, configured to, for any first fusion feature, calculate the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and take the minimum loss value as a target loss value;
A result determination unit, configured to determine, if the target loss value is less than a preset loss threshold, that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
For specific limitations of the entity relationship extraction apparatus based on semantic similarity, reference may be made to the above limitations of the entity relationship extraction method based on semantic similarity, which will not be repeated here. Each module in the above apparatus may be implemented in whole or in part by software, by hardware, or by a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Please refer to Fig. 4 for details; Fig. 4 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that the figure only shows the computer device 4 with the memory 41, the processor 42 and the network interface 43, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch panel or a voice control device.
The memory 41 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as program codes for controlling electronic files. In addition, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program codes stored in the memory 41 or process data, for example, to run the program codes for controlling electronic files.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 4 and other electronic devices.
The present application further provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores an interface display program, and the interface display program can be executed by at least one processor, so that the at least one processor performs the steps of the entity relationship extraction method based on semantic similarity described above.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
Obviously, the embodiments described above are only some rather than all of the embodiments of the present application. The drawings show preferred embodiments of the present application, but do not limit its patent scope. The present application may be implemented in many different forms; on the contrary, these embodiments are provided so that the understanding of the disclosure of the present application will be more thorough and comprehensive. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (20)

  1. An entity relationship extraction method based on semantic similarity, comprising:
    obtaining annotated corpora and unlabeled corpora, and storing each of the annotated corpora in a seed set;
    for each of the annotated corpora in the seed set, constructing features for the annotated corpus according to a preset feature construction method, to obtain relation features of the annotated corpus;
    inputting the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
    evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, and determining entity relationships of the unlabeled corpora according to the evaluation result.
  2. The entity relationship extraction method based on semantic similarity according to claim 1, wherein, after the evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, the method further comprises:
    comparing the evaluation result with a preset condition, and taking the unlabeled corpora that meet the preset condition as candidate corpora;
    adding the candidate corpora to the seed set to obtain an updated seed set.
  3. The entity relationship extraction method based on semantic similarity according to claim 1, wherein the constructing, for each of the annotated corpora in the seed set, features for the annotated corpus according to a preset feature construction method to obtain relation features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for the named entities, obtaining the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtaining the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
    taking the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
  4. The entity relationship extraction method based on semantic similarity according to claim 1, wherein the preset similarity evaluation model is a BERT model.
  5. The entity relationship extraction method based on semantic similarity according to claim 4, wherein the BERT model comprises an encoding layer, a Concat layer and a fully connected layer, and the evaluating the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result comprises:
    using the encoding layer of the BERT model to encode each of the unlabeled corpora to obtain a first encoding feature, and to encode each of the annotated corpora to obtain a second encoding feature;
    performing feature extraction and fusion on the first encoding feature and the second encoding feature respectively through the Concat layer of the BERT model, to obtain a first fusion feature and a second fusion feature;
    for any one of the first fusion features, calculating the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining that the evaluation result is that a semantic similarity relationship exists between the unlabeled corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
  6. The entity relationship extraction method based on semantic similarity according to claim 5, wherein the loss function is the binary cross-entropy, and the calculating, for any one of the first fusion features and based on the loss function of the fully connected layer, the loss value between the first fusion feature and each second fusion feature comprises:

    Loss = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

    wherein Loss is the loss value, y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise, and ŷ is the probability that the first fusion feature is a positive example.
  7. An entity relationship extraction apparatus based on semantic similarity, comprising:
    a data collection module, configured to obtain annotated corpora and unlabeled corpora, and store each annotated corpus in a seed set;
    a feature construction module, configured to construct, for each annotated corpus in the seed set, features for the annotated corpus according to a preset feature construction method, to obtain relation features of the annotated corpus;
    a data input module, configured to input the unlabeled corpora, the annotated corpora and the relation features of the annotated corpora into a preset similarity evaluation model;
    a relation extraction module, configured to evaluate the unlabeled corpora based on the preset similarity evaluation model and the relation features to obtain an evaluation result, and determine entity relationships of the unlabeled corpora according to the evaluation result.
  8. The entity relationship extraction apparatus based on semantic similarity according to claim 7, wherein the feature construction module comprises:
    a named entity acquisition unit, configured to obtain named entities of the annotated corpus;
    a feature construction unit, configured to, for the named entities, obtain the N word segments before a named entity to form a knowledge tuple as a first relation feature, obtain the word segments between two consecutive named entities to form a knowledge tuple as a second relation feature, and obtain the N word segments after a named entity to form a knowledge tuple as a third relation feature, where N is a positive integer;
    a relation feature determination unit, configured to take the first relation feature, the second relation feature and the third relation feature as the relation features of the annotated corpus.
  9. 一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,其中,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:
    获取标注语料和未标注语料,将每个所述标注语料存入到种子集合中;Obtain the labeled corpus and the unlabeled corpus, and store each of the labeled corpus in the seed set;
    针对所述种子集合中的每个所述标注语料,根据预设特征构造的方式,对所述标注语料构建特征,得到所述标注语料的关系特征;For each of the annotated corpora in the seed set, construct features on the annotated corpus according to a preset feature construction mode to obtain the relationship feature of the annotated corpus;
    将所述未标注语料、所述标注语料和所述标注语料的关系特征输入到预设的相似度评估模型中;Inputting the relationship features of the unlabeled corpus, the labeled corpus, and the labeled corpus into a preset similarity evaluation model;
    基于所述预设的相似度评估模型和所述关系特征,对所述未标注语料进行评估,得到评估结果,并根据所述评估结果,确定所述未标注语料的实体关系。Based on the preset similarity evaluation model and the relationship feature, the unlabeled corpus is evaluated to obtain an evaluation result, and the entity relationship of the unlabeled corpus is determined according to the evaluation result.
  10. 如权利要求9所述的计算机设备,其中,在所述基于所述预设的相似度评估模型和所述关系特征,对所述未标注语料进行评估,得到评估结果之后,所述处理器执行所述计算机可读指令时还实现如下步骤:The computer device of claim 9, wherein, after the unlabeled corpus is evaluated based on the preset similarity evaluation model and the relationship feature, and the evaluation result is obtained, the processor executes The computer-readable instructions further implement the following steps:
    将评估结果与预设条件进行比较,确定符合所述预设条件的未标注语料,作为候选语料;Compare the evaluation result with the preset conditions, and determine the unlabeled corpus that meets the preset conditions as the candidate corpus;
    将所述候选语料加入到所述种子集合中,得到更新后的种子集合。The candidate corpus is added to the seed set to obtain an updated seed set.
11. The computer device of claim 9, wherein constructing, for each annotated corpus in the seed set, features for the annotated corpus according to the preset feature-construction method to obtain the relationship features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for each named entity, obtaining the N word segments preceding the named entity to form a knowledge tuple as a first relationship feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relationship feature, and obtaining the N word segments following the named entity to form a knowledge tuple as a third relationship feature, where N is a positive integer;
    taking the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
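The three knowledge tuples of claim 11 can be illustrated with a token-index sketch. Representing each named entity as a `(start, end)` token span is an assumption here; the claim does not specify how entities are located in the segmented text.

```python
def relation_features(tokens, entity_spans, n=2):
    """Build the three claimed knowledge tuples around named entities.

    tokens:       list of word segments for one annotated corpus
    entity_spans: (start, end) token indices of each named entity, in order
    n:            the window size N from the claim (a positive integer)
    """
    first, second, third = [], [], []
    for i, (start, end) in enumerate(entity_spans):
        # First relationship feature: the N word segments before the entity.
        first.append(tuple(tokens[max(0, start - n):start]))
        # Third relationship feature: the N word segments after the entity.
        third.append(tuple(tokens[end:end + n]))
        # Second relationship feature: word segments between this entity
        # and the next consecutive named entity, if any.
        if i + 1 < len(entity_spans):
            next_start = entity_spans[i + 1][0]
            second.append(tuple(tokens[end:next_start]))
    return first, second, third
```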
12. The computer device of claim 9, wherein the preset similarity evaluation model is a BERT model.
13. The computer device of claim 12, wherein the BERT model comprises an encoding layer, a Concat layer, and a fully connected layer, and evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result comprises:
    encoding each unannotated corpus with the encoding layer of the BERT model to obtain a first encoded feature, and encoding each annotated corpus to obtain a second encoded feature;
    performing feature extraction and fusion on the first encoded features and the second encoded features through the Concat layer of the BERT model to obtain first fusion features and second fusion features, respectively;
    for any one of the first fusion features, computing a loss value between the first fusion feature and each second fusion feature based on a loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining as the evaluation result that a semantic similarity relationship exists between the unannotated corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
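A toy rendering of the evaluation in claim 13, with precomputed vectors standing in for the BERT encoding layer and a single sigmoid unit standing in for the fully connected layer. The exact pairing of first and second fusion features is not spelled out in the claim; concatenating the two encodings pairwise is one plausible reading, assumed here for illustration.

```python
import math

def evaluate_pair(unlabeled_vec, labeled_vecs, weights, bias, threshold):
    """For one unannotated encoding, find the annotated encoding with the
    minimum loss and test it against the preset loss threshold."""
    losses = []
    for lv in labeled_vecs:
        # Concat layer: fuse the two encoded features into one vector.
        fused = list(unlabeled_vec) + list(lv)
        # Fully connected layer + sigmoid -> probability of "similar".
        z = sum(w * x for w, x in zip(weights, fused)) + bias
        p = 1.0 / (1.0 + math.exp(-z))
        # Binary cross-entropy against the positive label y = 1.
        losses.append(-math.log(p))
    target = min(range(len(losses)), key=losses.__getitem__)
    # Similarity holds when the minimum (target) loss is below the threshold.
    return target, losses[target] < threshold
```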
14. The computer device of claim 13, wherein the loss function is binary cross-entropy, and computing, for any one of the first fusion features, the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer comprises:

    Loss = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

    where Loss is the loss value; y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise; and ŷ is the probability that the first fusion feature is a positive example.
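The claimed binary cross-entropy can be written directly. The `eps` clamp is an added numerical safeguard to avoid log(0), not part of the claim.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss = -[y*log(y_hat) + (1-y)*log(1-y_hat)].

    y:     sample label of the second fusion feature (1 = positive, 0 = negative)
    y_hat: probability that the first fusion feature is a positive example
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp away from 0 and 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```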
15. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    obtaining annotated corpora and unannotated corpora, and storing each annotated corpus in a seed set;
    for each annotated corpus in the seed set, constructing features for the annotated corpus according to a preset feature-construction method to obtain relationship features of the annotated corpus;
    inputting the unannotated corpora, the annotated corpora, and the relationship features of the annotated corpora into a preset similarity evaluation model;
    evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain an evaluation result, and determining entity relationships of the unannotated corpora according to the evaluation result.
16. The computer-readable storage medium of claim 15, wherein after evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result, the computer-readable instructions, when executed by the processor, further implement the following steps:
    comparing the evaluation result with a preset condition, and determining unannotated corpora that satisfy the preset condition as candidate corpora;
    adding the candidate corpora to the seed set to obtain an updated seed set.
17. The computer-readable storage medium of claim 15, wherein constructing, for each annotated corpus in the seed set, features for the annotated corpus according to the preset feature-construction method to obtain the relationship features of the annotated corpus comprises:
    obtaining named entities of the annotated corpus;
    for each named entity, obtaining the N word segments preceding the named entity to form a knowledge tuple as a first relationship feature, obtaining the word segments between two consecutive named entities to form a knowledge tuple as a second relationship feature, and obtaining the N word segments following the named entity to form a knowledge tuple as a third relationship feature, where N is a positive integer;
    taking the first relationship feature, the second relationship feature, and the third relationship feature as the relationship features of the annotated corpus.
18. The computer-readable storage medium of claim 15, wherein the preset similarity evaluation model is a BERT model.
19. The computer-readable storage medium of claim 18, wherein the BERT model comprises an encoding layer, a Concat layer, and a fully connected layer, and evaluating the unannotated corpora based on the preset similarity evaluation model and the relationship features to obtain the evaluation result comprises:
    encoding each unannotated corpus with the encoding layer of the BERT model to obtain a first encoded feature, and encoding each annotated corpus to obtain a second encoded feature;
    performing feature extraction and fusion on the first encoded features and the second encoded features through the Concat layer of the BERT model to obtain first fusion features and second fusion features, respectively;
    for any one of the first fusion features, computing a loss value between the first fusion feature and each second fusion feature based on a loss function of the fully connected layer, and taking the minimum loss value as a target loss value;
    if the target loss value is less than a preset loss threshold, determining as the evaluation result that a semantic similarity relationship exists between the unannotated corpus corresponding to the first fusion feature and the annotated corpus corresponding to the target loss value.
20. The computer-readable storage medium of claim 19, wherein the loss function is binary cross-entropy, and computing, for any one of the first fusion features, the loss value between the first fusion feature and each second fusion feature based on the loss function of the fully connected layer comprises:

    Loss = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]

    where Loss is the loss value; y is the sample label of the second fusion feature, taking the value 1 when the second fusion feature is a positive example and 0 otherwise; and ŷ is the probability that the first fusion feature is a positive example.
PCT/CN2020/136349 2020-09-08 2020-12-15 Semantic similarity-based entity relation extraction method and apparatus, device and medium WO2021121198A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010937274.9 2020-09-08
CN202010937274.9A CN112101041B (en) 2020-09-08 2020-09-08 Entity relationship extraction method, device, equipment and medium based on semantic similarity

Publications (1)

Publication Number Publication Date
WO2021121198A1 true WO2021121198A1 (en) 2021-06-24

Family

ID=73752238

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136349 WO2021121198A1 (en) 2020-09-08 2020-12-15 Semantic similarity-based entity relation extraction method and apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN112101041B (en)
WO (1) WO2021121198A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925210B (en) * 2022-03-21 2023-12-08 中国电信股份有限公司 Knowledge graph construction method, device, medium and equipment
CN115470871B (en) * 2022-11-02 2023-02-17 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018218705A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method for recognizing network text named entity based on neural network probability disambiguation
CN109446514A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 Construction method, device and the computer equipment of news property identification model
CN110825827A (en) * 2019-11-13 2020-02-21 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN110969005A (en) * 2018-09-29 2020-04-07 航天信息股份有限公司 Method and device for determining similarity between entity corpora


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886535A (en) * 2021-09-18 2022-01-04 前海飞算云创数据科技(深圳)有限公司 Knowledge graph-based question and answer method and device, storage medium and electronic equipment
CN114372446A (en) * 2021-12-13 2022-04-19 北京五八信息技术有限公司 Vehicle attribute labeling method, device and storage medium
CN114372446B (en) * 2021-12-13 2023-02-17 北京爱上车科技有限公司 Vehicle attribute labeling method, device and storage medium
CN116049347A (en) * 2022-06-24 2023-05-02 荣耀终端有限公司 Sequence labeling method based on word fusion and related equipment
CN116049347B (en) * 2022-06-24 2023-10-31 荣耀终端有限公司 Sequence labeling method based on word fusion and related equipment
CN115033717A (en) * 2022-08-12 2022-09-09 杭州恒生聚源信息技术有限公司 Triple extraction model training method, triple extraction method, device and equipment
CN115033717B (en) * 2022-08-12 2022-11-08 杭州恒生聚源信息技术有限公司 Triple extraction model training method, triple extraction method, device and equipment
CN116486420A (en) * 2023-04-12 2023-07-25 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN116486420B (en) * 2023-04-12 2024-01-12 北京百度网讯科技有限公司 Entity extraction method, device and storage medium of document image
CN117592562A (en) * 2024-01-18 2024-02-23 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing
CN117592562B (en) * 2024-01-18 2024-04-09 卓世未来(天津)科技有限公司 Knowledge base automatic construction method based on natural language processing

Also Published As

Publication number Publication date
CN112101041B (en) 2022-02-15
CN112101041A (en) 2020-12-18


Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20901660; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20901660; Country of ref document: EP; Kind code of ref document: A1)