WO2022198868A1 - Method, apparatus, device and storage medium for extracting open entity relationships - Google Patents


Info

Publication number
WO2022198868A1
WO2022198868A1 (PCT/CN2021/109168, CN2021109168W)
Authority
WO
WIPO (PCT)
Prior art keywords
data set
target
processed
relation
entity
Prior art date
Application number
PCT/CN2021/109168
Other languages
English (en)
French (fr)
Inventor
朱昱锦
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022198868A1 publication Critical patent/WO2022198868A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/247 Thesauruses; Synonyms
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Definitions

  • the present application relates to the field of artificial intelligence neural networks, and in particular, to a method, apparatus, device and storage medium for extracting open entity relationships.
  • Entity relation extraction takes a piece of context text and two entities as input, and outputs the relation type between the two entities in that context. It is widely used in information extraction, knowledge graph construction and association discovery. However, traditional relation extraction is difficult to put into practical use because the relation types are fixed and the data is hard to label. Open relation extraction has attracted attention because it can automatically output all possible relation triples from an input text.
  • The traditional open relation extraction scheme generally adopts rule templates, but rule templates are open-ended and complex, depend heavily on expert knowledge, are hard to migrate, and match rigidly. To address the problems of the rule-template method, other methods were proposed;
  • however, such methods suffer from few ready-made data sets, high labeling cost, and difficulty handling overlapping relations. To deal with overlapping relations, it was proposed to first extract the head entity from the sentence, and then jointly extract the tail entity and determine the relation type from the head-entity output and the hidden layer of the neural network.
  • This method, however, needs to compute a large matrix whose numbers of rows and columns both equal the length of the input sentence in order to solve the open relation extraction problem. Therefore, it is difficult for existing open relation extraction to deal with relations of indeterminate type.
  • The present application provides an open entity relationship extraction method, apparatus, device and storage medium, which are used to solve the problem that existing open relation extraction is difficult to handle relations of indeterminate type.
  • a first aspect of the present application provides an open entity relationship extraction method, including:
  • Obtaining a relation classification data set to be processed, and preprocessing the entity relations, field lengths and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
  • the initial unsupervised generation model is constructed by using the pre-trained backbone model, and the initial unsupervised generation model is trained and optimized through the to-be-processed data set to obtain the target unsupervised generation model;
  • A second aspect of the present application provides an open entity relationship extraction device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein when the processor executes the computer-readable instructions, the following steps are implemented:
  • Obtaining a relation classification data set to be processed, and preprocessing the entity relations, field lengths and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
  • the initial unsupervised generation model is constructed by using the pre-trained backbone model, and the initial unsupervised generation model is trained and optimized through the to-be-processed data set to obtain the target unsupervised generation model;
  • a third aspect of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed on a computer, the computer is caused to perform the following steps:
  • Obtaining a relation classification data set to be processed, and preprocessing the entity relations, field lengths and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
  • the initial unsupervised generation model is constructed by using the pre-trained backbone model, and the initial unsupervised generation model is trained and optimized through the to-be-processed data set to obtain the target unsupervised generation model;
  • a fourth aspect of the present application provides an apparatus for extracting open entity relationships, including:
  • a first preprocessing module configured to obtain a relationship classification data set to be processed, and preprocess the entity relationship, field length and relationship triplet of the relationship classification data set to be processed to obtain a data set to be processed;
  • a training optimization module is used to construct an initial unsupervised generation model through the pre-trained backbone model, and through the data set to be processed, the initial unsupervised generation model is trained and optimized to obtain a target unsupervised generation model;
  • the second preprocessing module is used to obtain the text to be processed, and perform word segmentation and word pairing processing on the text to be processed to obtain the preprocessed text;
  • the extraction module is configured to perform data format conversion, hidden layer vector conversion, entity relationship prediction and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
  • the technical solution provided by the present application solves the problem that the existing open relationship extraction is difficult to handle the relationship of indeterminate type.
  • FIG. 1 is a schematic diagram of an embodiment of a method for extracting open entity relationships in an embodiment of the present application
  • FIG. 2 is a schematic diagram of another embodiment of the method for extracting open entity relationships in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an embodiment of an apparatus for extracting open entity relationships in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of an apparatus for extracting open entity relationships in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of an apparatus for extracting open entity relationships in an embodiment of the present application.
  • the embodiments of the present application provide an open entity relationship extraction method, apparatus, device, and storage medium, which solve the problem that the existing open relationship extraction is difficult to handle indefinite type relationships.
  • an embodiment of the method for extracting an open entity relationship in the embodiment of the present application includes:
  • the execution body of the present application may be an apparatus for extracting open entity relationships, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • The relation classification data sets to be processed are open source, and there may be one or more of them.
  • The relation classification data sets to be processed include the data set SemEval-2010 Task 8, the data set ACE 2003-2004, the data set TACRED, the data set FewRel, the Baidu information extraction data set DuIE, and the like.
  • The relation classification data set to be processed includes text sentences and relation triples, where the triples consist of entities and the entity relations between those entities.
  • The server extracts relation classification data that has undergone entity labeling and entity-relation labeling from multiple open-source libraries to obtain an initial relation classification data set, and performs data cleaning and data attribute reduction on it to obtain the relation classification data set to be processed.
  • The server extracts the entities and entity relations of the relation classification data set to be processed and applies synonym enhancement to them, obtaining an enhanced data set that contains multiple triples (head entity, relation, tail entity) and multiple augmented triples; an augmented triple shares the same context and relation as the triple it was derived from.
  • The sentence lengths of the text sentences in the enhanced data set are processed according to a preset field length to obtain a processed data set. The triples and augmented triples in the processed data set are divided into N samples to obtain sample data, and a preset amount of data is selected from the sample data to obtain the data set to be processed.
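The dividing-and-sampling step above can be sketched as follows; the round-robin sharding scheme and the `n_groups`/`sample_size` parameters are illustrative assumptions, not the application's exact procedure.

```python
import random

def sample_pending_dataset(triples, n_groups, sample_size, seed=0):
    """Divide triples into n_groups shards (round-robin, an assumed
    scheme) and draw a fixed-size random subset from the pool."""
    shards = [triples[i::n_groups] for i in range(n_groups)]
    pool = [t for shard in shards for t in shard]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```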
  • The pre-trained backbone models include the unified language model (UniLM), the generative pre-training (GPT) model, the large-scale transformer-based language model GPT-2, the pre-trained generative summarization model PEGASUS, and the like. In this embodiment, UniLM is preferred: building on the pre-trained model BERT, UniLM uses three different mask mechanisms, namely a bidirectional language model (BiLM), a left-to-right (one-way) language model (LRLM) and a sequence-to-sequence language model (S2S LM), to pre-train a generative language model.
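The three mask mechanisms can be illustrated with plain attention-mask matrices (1 = may attend, 0 = blocked); this is a schematic of the UniLM idea, not code from the application.

```python
def bidirectional_mask(n):
    """BiLM: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]

def left_to_right_mask(n):
    """LRLM: each token may only attend to itself and earlier tokens."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def seq2seq_mask(src_len, tgt_len):
    """S2S LM: all tokens see the full source; target tokens attend
    causally within the target."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:                 # source is fully visible
                mask[i][j] = 1
            elif i >= src_len and j <= i:   # causal within the target
                mask[i][j] = 1
    return mask
```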
  • the initial unsupervised generative model built from the pre-trained backbone model consists of an encoder and a decoder.
  • The server divides the data set to be processed using a preset random sampling algorithm or stratified sampling algorithm to obtain a training data set, a validation data set and a test data set, where the preset split ratio may be 8:1:1.
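An 8:1:1 random split of the kind described here might look like the following sketch (the shuffle seed and the integer arithmetic are illustrative choices).

```python
import random

def split_dataset(data, ratios=(8, 1, 1), seed=42):
    """Shuffle the data and cut it into train/validation/test
    partitions in the given ratio."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```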
  • The server receives the text to be processed from a preset display interface or a terminal, performs word segmentation on it with the preset open-source library Jieba to obtain a segmentation list, and takes the words out of the segmentation list in order to obtain the preprocessed text.
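The application uses Jieba for segmentation; the sketch below substitutes a whitespace tokenizer so it stays self-contained, and the pairing of adjacent words is an assumed interpretation of the "word pairing" step.

```python
def preprocess_text(text, tokenize=str.split):
    """Segment the text into a word list (whitespace split stands in
    for Jieba here) and pair adjacent words in list order."""
    words = tokenize(text)
    pairs = list(zip(words, words[1:]))
    return words, pairs
```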
  • Based on the input format of the target unsupervised generation model, the server converts the data format of the preprocessed text to obtain the converted text, and converts the converted text into a hidden layer vector. The decoder in the target unsupervised generation model, using a preset greedy algorithm or beam search algorithm, matches the entity relations in the hidden layer vector against a preset dictionary to find the corresponding target words, and generates a text sequence from the target words according to the preset sequence order.
  • The preset dictionary is a dictionary list composed of single Chinese characters, digits and symbols. The list is obtained by computing the term frequency-inverse document frequency (TF-IDF) of each entry over a large corpus and comparing the TF-IDF value with a preset frequency value.
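The dictionary construction by TF-IDF thresholding could be sketched as below; the smoothing inside the IDF term and the exact threshold semantics are assumptions, since the application only names the comparison.

```python
import math
from collections import Counter

def build_dictionary(corpus, threshold):
    """Score each character by TF-IDF over the corpus and keep those
    whose score reaches the preset threshold."""
    n_docs = len(corpus)
    tf, df, total = Counter(), Counter(), 0
    for doc in corpus:
        chars = list(doc)
        tf.update(chars)
        total += len(chars)
        df.update(set(chars))     # document frequency counts each doc once
    scores = {c: (tf[c] / total) * math.log(1 + n_docs / df[c]) for c in tf}
    return sorted(c for c, s in scores.items() if s >= threshold)
```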
  • The text sequence includes an entity-relation field. Since the entity-relation field has a high probability of not existing verbatim in the text to be processed, this solves the problem that existing open relation extraction is difficult to handle relations of indeterminate type.
  • In the embodiment of the present application, the entity relations, field lengths and relation triples of the relation classification data set to be processed are preprocessed; an initial unsupervised generation model is constructed from the pre-trained backbone model and trained and optimized into the target unsupervised generation model; and the target unsupervised generation model performs data format conversion, hidden layer vector conversion, entity relation prediction and text sequence generation on the preprocessed text. This avoids the high labeling cost, the low computational efficiency, the inability to handle overlapping samples, and the need to compute a large matrix whose rows and columns both scale with sentence length, thereby solving the problem that existing open relation extraction is difficult to handle relations of indeterminate type.
  • Referring to FIG. 2, another embodiment of the method for extracting open entity relationships in the embodiment of the present application includes:
  • The server obtains target word data that has undergone deduplication and fusion processing and, according to configured synonym definition information, generates character strings from the target word data to obtain a thesaurus dictionary; obtains the relation classification data set to be processed and the entities and entity relations of that data set; performs part-of-speech tagging on the data set and randomly selects target entities and target entity relations from the entities and entity relations; and traverses the thesaurus dictionary according to the target entities and target entity relations to obtain target synonyms.
  • the configured synonym definition information may be the mapping type and corresponding relationship of the synonyms.
  • The server downloads word data from the web pages or thesauruses of github.com/fighting41lov/funNLP, github.com/liuhuanyong/ChineseSemanticKB and the Harbin Institute of Technology Dacilin by calling a preset download interface or download plug-in.
  • The graph is stored in JSON format to obtain the synonym dictionary, in which words with similar meanings are connected in the graph.
  • The server obtains the part of speech of each synonym in the thesaurus dictionary and extracts the part of speech of the entity-relation fields in the relation classification data set; the part of speech of an entity-relation field includes the part of speech of the entity and the part of speech of the field related to the entity relation. Part-of-speech tagging of the relation classification data set achieves part-of-speech disambiguation.
  • For example, the word "swimming" can be used as a verb in context to represent an action, or as a noun to denote an activity or event (in which case the synonyms are "breaststroke", "freestyle", etc.).
  • The server randomly selects a preset number of entities and entity relations through a preset random selection algorithm to obtain the target entities and target entity relations, and matches them against the synonym dictionary to obtain the corresponding target synonyms; there may be one or more target synonyms.
  • Synonym replacement is performed on the relation classification data set to be processed to obtain an enhanced data set.
  • The server replaces the character strings in the relation classification data set to be processed with the character strings corresponding to the target synonyms, thereby obtaining the enhanced data set.
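A minimal sketch of the replacement step, assuming a simple `{word: synonyms}` mapping; the application's thesaurus is a graph-backed dictionary, which this stand-in does not model.

```python
def augment_with_synonyms(sentences, synonym_map):
    """For each sentence, keep the original and add one variant per
    synonym of each matched word."""
    enhanced = []
    for sent in sentences:
        enhanced.append(sent)
        for word, syns in synonym_map.items():
            if word in sent:
                for syn in syns:
                    enhanced.append(sent.replace(word, syn))
    return enhanced
```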
  • The server classifies the enhanced data set by a preset entity field length to obtain a first data set (entries conforming to the preset entity field length) and a second data set (entries not conforming); it then classifies the first and second data sets by a preset sentence length to obtain a target data set (sentences conforming to the preset sentence length) and a non-target data set (sentences not conforming); the sentences in the non-target data set are filled with vacancy characters and masked to obtain filled data; and the filled data together with the target data set are determined to be the filtered data set.
  • The server obtains the initial entity field length of the enhanced data set and the initial sentence length of each sentence.
  • Through an if-else judgment, the server determines whether the initial entity field length is greater than the preset entity field length. If not, the field corresponding to the initial entity field length is determined to be an entity, yielding the first data set, which conforms to the preset entity field length; if so, the field is not determined to be an entity, yielding the second data set, which does not conform to the preset entity field length.
  • The preset entity field length is chosen according to statistical results.
  • The server determines whether the initial sentence length equals the preset sentence length; the preset sentence length may be a number of characters in the text sentence. For example, if the preset sentence length is 128 characters, a text sentence includes 128 characters.
  • If the initial sentence length conforms, the sentence goes into the target data set; if not, it goes into the non-target data set. Data in the non-target data set whose initial sentence length is greater than the preset sentence length is truncated to obtain truncated data, and data whose initial sentence length is less than the preset sentence length is filled with vacancy characters, which are then masked, to obtain the filled data, thereby obtaining the filtered data set.
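Truncation, vacancy-character filling and masking to a preset length (128 in the example) can be sketched like this; the pad token name is an assumption.

```python
def pad_or_truncate(tokens, max_len=128, pad_token="[PAD]"):
    """Truncate sequences longer than max_len; pad shorter ones and
    return a mask marking real (1) vs. padded (0) positions."""
    if len(tokens) >= max_len:
        return tokens[:max_len], [1] * max_len
    padded = tokens + [pad_token] * (max_len - len(tokens))
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    return padded, mask
```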
  • The server extracts the initial relation triple set from the filtered data set, along with the initial relation phrase set corresponding to it; performs alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain multiple relation triples to be processed (triples identified as duplicates of one another) and multiple target relation triples (triples that are not duplicates); fuses the relation triples to be processed into multiple fused relation triples; and determines the fused relation triples together with the target relation triples as the data set to be processed.
  • The server extracts the initial relation triple set from the filtered data set along with the corresponding initial relation phrase set, and judges via a preset regular expression whether the relation phrases in the initial relation phrase set are consistent; if so, the corresponding relation phrase is determined to be a target relation phrase; if not, the judgment continues.
  • The server extracts the initial relation triple (head entity, relation, tail entity) of each text sentence in the filtered data set to obtain the initial relation triple set, and extracts the three initial relation phrases corresponding to each initial relation triple to obtain the initial relation phrase set.
  • The server judges whether the three initial relation phrases of two initial relation triples are all the same. If they are, it further judges whether the head entities and tail entities of the two triples are the same; if so, the two initial relation triples are determined to be the same triple, yielding the relation triples to be processed; if not, they are determined not to be the same triple, yielding target relation triples. If the three initial relation phrases differ, the corresponding initial relation triples are determined to be target relation triples. The relation triples to be processed are then fused, producing a data set to be processed that contains multiple fused relation triples and multiple target relation triples, where the target relation triples include triples that were not replaced by thesaurus synonyms as well as triples that were.
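The outcome of the alignment-and-fusion steps above, keeping one representative per identical (head, relation, tail), can be sketched as follows; this simplification skips the regular-expression phrase check and treats exact equality as "same triple".

```python
def align_and_fuse(triples):
    """Fuse duplicate (head, relation, tail) triples into one
    representative each, preserving first-seen order."""
    seen, fused = set(), []
    for head, rel, tail in triples:
        key = (head, rel, tail)
        if key not in seen:
            seen.add(key)
            fused.append(key)
    return fused
```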
  • The server constructs an initial unsupervised generation model from the pre-trained backbone model and divides the data set to be processed into a training data set, a validation data set and a test data set. It trains the initial model on the training data set to obtain a candidate unsupervised generation model; performs hidden layer vector conversion, entity relation prediction and text sequence generation on the validation data set with the candidate model to obtain a validation result; computes the validation loss value of the validation result with a preset loss function and optimizes the candidate model according to the validation loss value to obtain an optimized unsupervised generation model; and tests the optimized model on the test data set to obtain a test result, computes the test loss value of the test result, and determines the target unsupervised generation model according to the test loss value.
  • The server converts the data format of the training data set into the input format of the initial unsupervised generation model to obtain the format-converted training data set, inputs it into the initial unsupervised generation model, and encodes and decodes it in turn so that the parameters of the initial model fit the training data, realizing model fine-tuning of the initial unsupervised generation model and yielding the candidate unsupervised generation model.
  • the server converts the verification data set into a hidden layer vector set through the encoder in the candidate unsupervised generation model, and performs entity relationship prediction and text sequence generation on the hidden layer vector set through a preset dictionary to obtain the verification result.
  • Using a preset loss function (including but not limited to the cross-entropy loss function), the server computes the cross-entropy between the validation data set and the validation result, i.e. the validation loss value, and iteratively adjusts the hyperparameters and/or network structure of the candidate unsupervised generation model according to the validation loss value until the loss function converges, obtaining the optimized unsupervised generation model and improving its accuracy.
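The cross-entropy between a reference distribution and a predicted distribution, as used for the validation loss here, reduces to the following one-liner (the epsilon guard is an implementation convenience, not part of the application).

```python
import math

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    return -sum(p * math.log(q + eps) for p, q in zip(true_dist, pred_dist))
```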
  • Through the optimized unsupervised generation model, the server performs hidden layer vector conversion, entity relation prediction and text sequence generation on the test data set to obtain the test result, computes the test loss value of the test result, and determines whether the test loss value is greater than a preset threshold. If so, the optimized model is iteratively optimized further to obtain the target unsupervised generation model; if not, the optimized model is determined to be the target unsupervised generation model.
  • A text sequence is generated, and the text sequence includes an entity-relation field, where the entity-relation field has a high probability of not existing in the input text (that is, the text to be processed).
  • The server receives the text to be processed from a preset display interface or a terminal, performs word segmentation on it with the preset open-source library Jieba to obtain a segmentation list, and takes the words out of the segmentation list in order to obtain the preprocessed text.
  • The server converts the data format of the preprocessed text into the encoder input format of the target unsupervised generation model, and obtains the converted text.
  • The target unsupervised generation model includes an encoder and a decoder. The encoder fits the converted text to obtain the hidden layer vector; the decoder, based on a preset greedy algorithm and the hidden layer vector, obtains the corresponding target words from the preset dictionary; and a text sequence is generated from the target words to obtain the target entity relation information.
  • The server converts the data format of the preprocessed text into the encoder input format of the target unsupervised generation model: [CLS]XXX<entity_head>XXX</entity_head>XXX<entity_tail>XXX</entity_tail>XXX[SEP]YYY[END], where [CLS] is the classification bit, with no practical significance; [SEP] is the separator bit, the content before [SEP] being the input at inference time and the content after it being the generated content; [END] is the termination bit, indicating the end of relation generation; the part enclosed by <tag> and </tag> is the mention of an entity in the sentence; and the content between [SEP] and [END] is the generated entity relation. The embedding layer and the multi-layer neural network of the encoder in the target unsupervised generation model then fit the converted text, that is, convert it into a hidden layer vector, obtaining the hidden layer vector.
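The input/output template above can be assembled and parsed with a couple of helpers; the replace-based entity marking is a simplification that assumes each mention occurs once in the sentence.

```python
def encode_input(sentence, head, tail):
    """Wrap the head/tail entity mentions with the markup described
    above and add the classification and separator bits."""
    marked = sentence.replace(head, f"<entity_head>{head}</entity_head>")
    marked = marked.replace(tail, f"<entity_tail>{tail}</entity_tail>")
    return f"[CLS]{marked}[SEP]"

def decode_output(generated):
    """Extract the generated relation between [SEP] and [END]."""
    return generated.split("[SEP]", 1)[1].split("[END]", 1)[0]
```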
  • The hidden layer vector includes multiple word vectors. Through the decoder of the target unsupervised generation model, the server computes the joint probability between every two word vectors in the hidden layer vector, selects the corresponding target words from the preset dictionary according to the joint probability via the preset greedy algorithm, and generates a text sequence from the target words in the sequence order of the word vectors, obtaining the target entity relation information. That is, the most suitable character, the one at the position of the maximum joint probability predicted by the target unsupervised generation model, is selected from the preset dictionary attached to the backbone model and appended to the text to be processed, realizing extraction, prediction and regeneration of the entity relations of the text to be processed in sequence.
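Greedy selection of the maximum-probability dictionary entry at each step, as described, amounts to the following sketch (per-step probability rows and the dictionary list are assumed inputs).

```python
def greedy_decode(step_probs, dictionary):
    """At each decoding step pick the dictionary entry with the
    highest predicted probability and append it to the output."""
    out = []
    for probs in step_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        out.append(dictionary[best])
    return "".join(out)
```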
  • In the embodiment of the present application, the entity relations, field lengths and relation triples of the relation classification data set to be processed are preprocessed; an initial unsupervised generation model is constructed from the pre-trained backbone model and trained and optimized into the target unsupervised generation model; and the target unsupervised generation model performs data format conversion, hidden layer vector conversion, entity relation prediction and text sequence generation on the preprocessed text. This avoids the high labeling cost, the low computational efficiency, the inability to handle overlapping samples, and the need to compute a large matrix whose rows and columns both scale with sentence length, thereby solving the problem that existing open relation extraction is difficult to handle relations of indeterminate type.
  • Referring to FIG. 3, an embodiment of the apparatus for extracting open entity relationships in the embodiment of the present application includes:
  • the first preprocessing module 301 is configured to obtain the relationship classification data set to be processed, and preprocess the entity relationship, field length and relationship triplet of the relationship classification data set to be processed to obtain the data set to be processed;
  • the training optimization module 302 is used for constructing an initial unsupervised generation model by using a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
  • the second preprocessing module 303 is used to obtain the text to be processed, and perform word segmentation and word pairing processing on the text to be processed to obtain the preprocessed text;
  • the extraction module 304 is configured to perform data format conversion, hidden layer vector conversion, entity relationship prediction and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
  • Each module in the above apparatus for extracting open entity relationships corresponds to a step in the above embodiment of the method for extracting open entity relationships; their functions and implementation processes will not be repeated here.
  • the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed; an initial unsupervised generation model is constructed from a pre-trained backbone model and is trained and optimized into a target unsupervised generation model;
  • data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text, which solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length. It thereby solves the problem that existing open relation extraction is difficult to apply to relations of indefinite type.
  • Referring to FIG. 4, another embodiment of the apparatus for extracting open entity relationships in the embodiments of the present application includes:
  • the first preprocessing module 301 is configured to obtain the relationship classification data set to be processed, and preprocess the entity relationship, field length and relationship triplet of the relationship classification data set to be processed to obtain the data set to be processed;
  • the first preprocessing module 301 specifically includes:
  • a creation and acquisition unit 3011 is used to create a synonym dictionary, and to obtain a relation classification data set to be processed, and a target synonym corresponding to the relation classification data set to be processed in the synonym dictionary;
  • the replacement unit 3012 is used to replace the synonyms in the relation classification data set to be processed by the target synonyms to obtain an enhanced data set;
  • the filtering unit 3013 is configured to filter the enhanced data set according to the preset entity field length and the preset sentence length to obtain a filtered data set;
  • the processing unit 3014 is used to obtain the relational triplet set of the filtering data set, and perform alignment processing and deduplication processing on the relational triplet set through a preset regular expression to obtain the data set to be processed;
  • the training optimization module 302 is used for constructing an initial unsupervised generation model by using a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
  • the second preprocessing module 303 is used to obtain the text to be processed, and perform word segmentation and word pairing processing on the text to be processed to obtain the preprocessed text;
  • the extraction module 304 is configured to perform data format conversion, hidden layer vector conversion, entity relationship prediction and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
  • the creation and acquisition unit 3011 can also be specifically used for:
  • the filtering unit 3013 can also be specifically used for:
  • the enhanced data set is classified based on the preset entity field length to obtain a first data set and a second data set, the first data set indicating conformity with the preset entity field length, and the second data set indicating non-conformity with the preset entity field length;
  • the first data set and the second data set are classified according to the preset sentence length to obtain a target data set and a non-target data set, the target data set indicating conformity with the preset sentence length, and the non-target data set indicating non-conformity with the preset sentence length;
  • processing unit 3014 can also be specifically used for:
  • the initial relation triple set in the filtered data set is extracted, together with the corresponding initial relation phrase set; alignment analysis is performed on the initial relation triple set according to the initial relation phrase set to obtain multiple to-be-processed relation triples and multiple target relation triples, the to-be-processed relation triples indicating triples that are the same triple, and the target relation triples indicating triples that are not the same triple;
  • the multiple to-be-processed relation triples are fused to obtain multiple fused relation triples, and the multiple fused relation triples and the multiple target relation triples are determined as the data set to be processed.
  • the extraction module 304 can also be specifically used for:
  • data fitting is performed on the converted text through the encoder to obtain the hidden-layer vector;
  • through the decoder, based on the preset greedy algorithm and the hidden-layer vector, the corresponding target words are obtained from the preset dictionary;
  • training optimization module 302 can also be specifically used for:
  • the initial unsupervised generative model is trained through the training data set to obtain the candidate unsupervised generative model;
  • hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the validation data set through the candidate unsupervised generative model to obtain the validation result;
  • the verification loss value of the verification result is calculated by the preset loss function, and the candidate unsupervised generation model is optimized according to the verification loss value, and the optimized unsupervised generation model is obtained;
  • the optimized unsupervised generative model is tested to obtain the test result, and the test loss value of the test result is calculated, and the target unsupervised generative model is determined according to the test loss value.
  • each module and each unit in the above-mentioned open entity relationship extraction apparatus corresponds to each step in the above-mentioned open entity relationship extraction method embodiment, and the functions and implementation process thereof will not be repeated here.
  • the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed; an initial unsupervised generation model is constructed from a pre-trained backbone model and is trained and optimized into a target unsupervised generation model;
  • data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text, which solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length. It thereby solves the problem that existing open relation extraction is difficult to apply to relations of indefinite type.
  • FIGS. 3 and 4 above describe in detail the apparatus for extracting open entity relationships in the embodiments of the present application from the perspective of modular functional entities.
  • The following describes in detail the device for extracting open entity relationships in the embodiments of the present application from the perspective of hardware processing.
  • FIG. 5 is a schematic structural diagram of an open entity relationship extraction device provided by an embodiment of the present application.
  • the open entity relationship extraction device 500 may vary greatly in configuration or performance and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the apparatus 500 for extracting open entity relationships.
  • the processor 510 may be configured to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the open entity-relationship extraction device 500 .
  • the open entity relationship extraction device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the extraction device for open entity relationships, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
  • the present application also provides an open entity relationship extraction device, comprising a memory and at least one processor, wherein the memory stores instructions, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory, so that the device for extracting open entity relationships executes the steps in the above method for extracting open entity relationships.
  • the present invention also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium.
  • obtain the relation classification data set to be processed, and preprocess the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain the data set to be processed;
  • construct the initial unsupervised generative model through the pre-trained backbone model, and train and optimize the initial unsupervised generative model through the data set to be processed to obtain the target unsupervised generative model;
  • obtain the text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain the preprocessed text;
  • perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generative model to obtain the target entity relationship information.
  • the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required by at least one function, and the like, and the stored data area may store data created according to the use of blockchain nodes, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, device, and storage medium for extracting open entity relationships, used to solve the problem that existing open relation extraction has difficulty handling relations of indefinite type. The method for extracting open entity relationships includes: preprocessing the entity relationships, field lengths, and relation triples of a relation classification data set to be processed to obtain a data set to be processed; constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model; performing word segmentation and word pairing on a text to be processed to obtain a preprocessed text; and performing hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information. In addition, the relation classification data set to be processed may be stored in a blockchain.

Description

Method, apparatus, device, and storage medium for extracting open entity relationships
This application claims priority to Chinese patent application No. 202110322883.8, filed with the China Patent Office on March 26, 2021 and entitled "Method, apparatus, device, and storage medium for extracting open entity relationships", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of neural networks in artificial intelligence, and in particular to a method, apparatus, device, and storage medium for extracting open entity relationships.
Background
Entity relation extraction technology takes a piece of context text and two entities as input and outputs the relation type between the two entities in that context, and is widely used in fields such as information extraction, knowledge graph construction, and association discovery. However, traditional relation extraction technology is difficult to put into practical use because its relation types are fixed and its data are difficult to label, whereas open relation extraction technology has attracted attention because it can automatically output all possible relation triples from an input piece of text.
The inventor has realized that current open relation extraction schemes generally adopt rule templates, which are complex to develop, depend heavily on expert knowledge, are hard to transfer, and match rigidly. To solve these problems, semantic role labeling was proposed, but it suffers from few off-the-shelf data sets, high labeling cost, and difficulty handling overlapping relations. To solve the problem of overlapping relations, a further approach was proposed that first extracts the head entity from the sentence and then, from the head entity and the output of the hidden layers of a neural network, jointly extracts the tail entity and determines the relation type; however, this approach needs to compute a large matrix whose numbers of rows and columns both equal the length of the input sentence, so existing open relation extraction remains difficult to apply to relations of indefinite type.
Summary
This application provides a method, apparatus, device, and storage medium for extracting open entity relationships, used to solve the problem that existing open relation extraction has difficulty handling relations of indefinite type.
A first aspect of this application provides a method for extracting open entity relationships, including:
obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
A second aspect of this application provides a device for extracting open entity relationships, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
A third aspect of this application provides a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the following steps:
obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
A fourth aspect of this application provides an apparatus for extracting open entity relationships, including:
a first preprocessing module, configured to obtain a relation classification data set to be processed, and preprocess the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
a training and optimization module, configured to construct an initial unsupervised generation model from a pre-trained backbone model, and train and optimize the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
a second preprocessing module, configured to obtain a text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
an extraction module, configured to perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
The technical solutions provided in this application solve the problem that existing open relation extraction has difficulty handling relations of indefinite type.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of one embodiment of the method for extracting open entity relationships in the embodiments of this application;
FIG. 2 is a schematic diagram of another embodiment of the method for extracting open entity relationships in the embodiments of this application;
FIG. 3 is a schematic diagram of one embodiment of the apparatus for extracting open entity relationships in the embodiments of this application;
FIG. 4 is a schematic diagram of another embodiment of the apparatus for extracting open entity relationships in the embodiments of this application;
FIG. 5 is a schematic diagram of one embodiment of the device for extracting open entity relationships in the embodiments of this application.
Detailed Description
The embodiments of this application provide a method, apparatus, device, and storage medium for extracting open entity relationships, which solve the problem that existing open relation extraction has difficulty handling relations of indefinite type.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of this application is described below. Referring to FIG. 1, one embodiment of the method for extracting open entity relationships in the embodiments of this application includes:
101. Obtain a relation classification data set to be processed, and preprocess the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed.
It can be understood that the execution subject of this application may be an apparatus for extracting open entity relationships, or may be a terminal or a server, which is not specifically limited here. The embodiments of this application are described with the server as the execution subject.
The relation classification data set to be processed is open source, and there may be one or more such data sets, for example the data set SemEval-2010 Task 8, the data set ACE 2003-2004, the data set TACRED, the data set FewRel, and the Baidu information extraction data set DuIE. The relation classification data set to be processed includes text sentences and relation triples, as well as entities and the entity relationships between entities.
The server extracts, from multiple open-source repositories, relation classification data that have already undergone entity annotation as well as entity relation extraction and annotation, to obtain an initial relation classification data set; performs data cleaning and data attribute reduction on the initial relation classification data set to obtain the relation classification data set to be processed; extracts the entities and entity relationships of the relation classification data set to be processed; and performs synonym/near-synonym augmentation on those entities and entity relationships to obtain an enhanced data set. The enhanced data set includes multiple triples (head entity, relation, tail entity) and multiple augmented triples, where an augmented triple is a new triple with the same context, the same relation type, and a different combination of concrete entities, obtained by randomly replacing components of a relation triple through a preset synonym dictionary. The server then processes the sentence lengths of the text sentences in the enhanced data set according to the preset field length to obtain a processed data set, divides the multiple triples and multiple augmented triples in the processed data set into N samples to obtain sample data, and selects a preset quantity of data from the sample data to obtain the data set to be processed.
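The augmentation step described above can be sketched as follows. This is a minimal illustration only: the toy `SYNONYMS` mapping and the example sentence are hypothetical stand-ins for the patent's preset synonym dictionary and corpus.

```python
import random

# Hypothetical synonym dictionary: each word maps to interchangeable alternatives.
SYNONYMS = {"创立": ["创办", "建立"], "总部": ["总公司"]}

def augment_triple(sentence, triple, rng=random.Random(0)):
    """Randomly replace one component of a (head, relation, tail) triple
    with a synonym, keeping the context and relation type unchanged."""
    candidates = [w for w in triple if w in SYNONYMS]
    if not candidates:
        return sentence, triple  # nothing to augment
    old = rng.choice(candidates)
    new = rng.choice(SYNONYMS[old])
    new_triple = tuple(new if w == old else w for w in triple)
    return sentence.replace(old, new), new_triple

s, t = augment_triple("张三创立了A公司", ("张三", "创立", "A公司"))
```

The augmented triple keeps the head and tail entities while the relation word is swapped, producing the "same context, same relation type, different concrete components" variant described above.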
102. Construct an initial unsupervised generation model from a pre-trained backbone model, and train and optimize the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model.
The pre-trained backbone model includes a unified language model (UniLM), a generative pre-training (GPT) model, the transformer-based large language model GPT-2, the pre-trained generative summarization model PEGASUS, or the like. In this embodiment, the pre-trained backbone model is preferably the unified language model UniLM, a pre-trained generative language model obtained by training the pre-trained model BERT with three different mask mechanisms: a bidirectional language model (BiLM), a left-to-right language model (LRLM), and a sequence-to-sequence language model (S2S LM). The initial unsupervised generation model constructed from the pre-trained backbone model includes an encoder and a decoder. The server splits the data set to be processed into a training data set, a validation data set, and a test data set according to a preset split ratio, based on a preset random sampling algorithm or stratified sampling algorithm, where the preset split ratio may be 8:1:1.
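The 8:1:1 split by random sampling can be sketched as below; this is a simple shuffle-based sketch under the stated ratio, not the patent's exact sampling or stratification algorithm.

```python
import random

def split_dataset(samples, ratios=(8, 1, 1), seed=42):
    """Shuffle the samples, then cut them into train/validation/test
    partitions according to the given ratios (default 8:1:1)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    total = sum(ratios)
    n_train = len(samples) * ratios[0] // total
    n_val = len(samples) * ratios[1] // total
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
```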
103. Obtain a text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain a preprocessed text.
The server receives the text to be processed through a preset display interface or from a terminal, performs word segmentation on it through the preset open-source library Jieba to obtain a word list, and takes words from the word list two at a time in list order to implement word pairing, obtaining the preprocessed text. Word pairing does not noticeably affect the efficiency of the target unsupervised generation model. For example, with N words, N(N-1)/2 pairings are needed; on average N=5 in a sentence, requiring 10 pairings, and since one model inference takes about 1 s, 10 inferences take 10 s, an order of magnitude that does not noticeably affect model efficiency.
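Taking words two at a time in list order corresponds to unordered combinations, which is where the N(N-1)/2 count comes from. A sketch (using a pre-segmented word list rather than calling Jieba):

```python
from itertools import combinations

def word_pairs(words):
    """Take words from the segmentation list two at a time, in list order:
    N words yield N*(N-1)/2 candidate entity pairs."""
    return list(combinations(words, 2))

# N = 5 words -> 5*4/2 = 10 pairs, matching the estimate in the text.
pairs = word_pairs(["张三", "创立", "了", "A公司", "总部"])
```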
104. Perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
Based on the input format of the target unsupervised generation model, the server converts the data format of the preprocessed text to obtain a converted text; converts the converted text into a hidden-layer vector through the encoder of the target unsupervised generation model; and, through the decoder of the target unsupervised generation model, based on a preset greedy algorithm or beam search algorithm and according to the entity relationships in the hidden-layer vector, matches the corresponding target words in a preset dictionary and generates a new text sequence according to the preset sequence order and the target words, thereby obtaining the target entity relationship information. The preset dictionary is a dictionary list composed of single Chinese characters, digits, or characters, obtained by computing the term frequency-inverse document frequency (TF-IDF) over a large corpus and comparing the TF-IDF values against predicted frequency values. By generating a text sequence directly from the text to be processed and the two entities in it, where the text sequence includes an entity relationship field that with high probability does not appear in the text to be processed, the problem that existing open relation extraction has difficulty handling relations of indefinite type is solved.
In the embodiments of this application, the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed, an initial unsupervised generation model is constructed from a pre-trained backbone model, and data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text through the target unsupervised generation model. This solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length, and thereby solves the problem that existing open relation extraction has difficulty handling relations of indefinite type.
Referring to FIG. 2, another embodiment of the method for extracting open entity relationships in the embodiments of this application includes:
201. Create a synonym dictionary, obtain a relation classification data set to be processed, and obtain the target synonyms in the synonym dictionary corresponding to the relation classification data set to be processed.
Specifically, the server obtains target word data that has undergone deduplication and fusion, and performs string generation on the target word data according to configured synonym definition information to obtain the synonym dictionary; obtains the relation classification data set to be processed, together with the entities and entity relationships of the relation classification data set to be processed; performs part-of-speech tagging on the relation classification data set to be processed, and randomly selects a target entity and a target entity relationship from the entities and entity relationships; and traverses the synonym dictionary according to the target entity and the target entity relationship to obtain the corresponding target synonyms.
The configured synonym definition information may be the mapping types and correspondences of synonyms. The server downloads word data from github.com/fighting41lov/funNLP, github.com/liuhuanyong/ChineseSemanticKB, and the web pages or lexicons of the Harbin Institute of Technology BigCilin (大词林) by calling a preset download interface or download plug-in; performs data preprocessing and deduplication-fusion on the word data to obtain the target word data; constructs a graph from the target word data according to a data structure of a list of 2-tuples (word, word); and stores the graph in JSON format to obtain the synonym dictionary, in which words of similar meaning are connected.
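Building the list of (word, word) 2-tuples into a graph stored as JSON might look like the following sketch; the example pairs here are illustrative and are not taken from the cited lexicons.

```python
import json
from collections import defaultdict

# Hypothetical (word, word) pairs; in the patent these come from the
# downloaded and deduplicated lexicon data.
pairs = [("游动", "泅水"), ("游动", "游水"), ("蛙泳", "自由泳")]

def build_synonym_graph(pairs):
    """Connect words of similar meaning into an adjacency-list graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return {w: sorted(nbrs) for w, nbrs in graph.items()}

graph = build_synonym_graph(pairs)
serialized = json.dumps(graph, ensure_ascii=False)  # stored in JSON format
```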
The server obtains the parts of speech of the synonyms in the synonym dictionary, i.e. the synonym parts of speech, and extracts the parts of speech of the entity relationships in the relation classification data set, which include the parts of speech of the entities and of the fields related to the entity relationships. According to the synonym parts of speech and the entity relationship parts of speech, the server performs part-of-speech tagging on the relation classification data set to disambiguate parts of speech. For example, the word "游泳" ("swim/swimming") can act in context as a verb denoting an action (its synonyms then being "游动" and "泅水") or as a noun denoting an activity or event (its synonyms then being "蛙泳" (breaststroke), "自由泳" (freestyle), etc.).
Through a preset random selection algorithm, the server randomly selects a preset number of entities and entity relationships from the entities and entity relationships to obtain the target entity and target entity relationship, and matches the synonym dictionary according to the target entity and target entity relationship to obtain one or more corresponding target synonyms.
202. Perform synonym replacement on the relation classification data set to be processed through the target synonyms to obtain an enhanced data set.
The server changes the word strings in the relation classification data set to be processed that correspond to the target synonyms into the strings of the target synonyms, thereby obtaining the enhanced data set.
203. Filter the enhanced data set according to a preset entity field length and a preset sentence length to obtain a filtered data set.
Specifically, the server classifies the enhanced data set based on the preset entity field length to obtain a first data set and a second data set, where the first data set indicates conformity with the preset entity field length and the second data set indicates non-conformity with the preset entity field length; classifies the first data set and the second data set according to the preset sentence length to obtain a target data set and a non-target data set, where the target data set indicates conformity with the preset sentence length and the non-target data set indicates non-conformity with the preset sentence length; performs placeholder padding and masking on the sentences in the non-target data set to obtain padded data; and determines the padded data and the target data set as the filtered data set.
The server obtains the initial entity field lengths of the enhanced data set and the initial sentence lengths of its sentences. Through an if-else judgment script, the server judges whether an initial entity field length is greater than the preset entity field length; if not, the field corresponding to the initial entity field length is determined to be an entity, yielding the first data set that conforms to the preset entity field length; if so, the field is not determined to be an entity, yielding the second data set that does not conform to the preset entity field length. The preset entity field length is chosen according to statistical results; for Chinese, k=7 is typical. The server may also filter the fields of each sentence in the enhanced data set based on the preset entity field length and the initial entity field lengths through a preset function (e.g., the filter function in the Python language). For example, if the fields of each sentence in the enhanced data set form a list lst, then lst_new = list(filter(lambda x: len(x) <= 7, lst)) implements the filtering of the fields of each sentence based on the preset entity field length and the initial entity field lengths.
The server judges whether an initial sentence length equals the preset sentence length, which may be a character count of the text sentence, for example 128 characters per text sentence. If so, the target data set conforming to the preset sentence length is obtained; if not, the non-target data set not conforming to the preset sentence length is obtained. The characters of data in the non-target data set whose initial sentence length exceeds the preset sentence length are truncated to obtain truncated data, and data in the non-target data set whose initial sentence length is less than the preset sentence length are padded with placeholders, with the padded placeholders masked, to obtain padded data, thereby obtaining the filtered data set.
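The sentence-length normalization above (truncate to the preset length, pad shorter sentences with placeholders and mask the padding) can be sketched as follows; the `[PAD]` token name is a hypothetical placeholder, not specified by the patent.

```python
PAD = "[PAD]"  # hypothetical placeholder token

def normalize_sentence(chars, max_len=128):
    """Truncate to max_len, or pad with placeholders up to max_len;
    the mask is 1 for real characters and 0 for padded positions."""
    chars = list(chars)[:max_len]            # truncation branch
    mask = [1] * len(chars)
    pad_n = max_len - len(chars)             # padding branch
    return chars + [PAD] * pad_n, mask + [0] * pad_n

tokens, mask = normalize_sentence("深圳是一座城市", max_len=10)
```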
204. Obtain the relation triple set of the filtered data set, and perform alignment and deduplication on the relation triple set through a preset regular expression to obtain the data set to be processed.
Specifically, the server extracts the initial relation triple set in the filtered data set and the initial relation phrase set corresponding to the initial relation triple set; performs alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain multiple to-be-processed relation triples and multiple target relation triples, where the multiple to-be-processed relation triples indicate triples that are the same triple and the multiple target relation triples indicate triples that are not the same triple; and fuses the multiple to-be-processed relation triples to obtain multiple fused relation triples, determining the multiple fused relation triples and the multiple target relation triples as the data set to be processed.
The server extracts the initial relation triple set in the filtered data set, together with the corresponding initial relation phrase set, and judges through a preset regular expression whether the relation phrases in the initial relation phrase set are consistent with one another; if so, the corresponding relation phrases are determined to be target relation phrases; if not, the judgment continues.
Alternatively, the server extracts the initial relation triple (head entity, relation, tail entity) of each text sentence in the filtered data set to obtain the initial relation triple set, and extracts the three initial relation phrases corresponding to each initial relation triple to obtain the initial relation phrase set. The server judges whether the three initial relation phrases of any two initial relation triples are all the same. If they are all the same, it judges whether the head entities and tail entities of those initial relation triples are the same; if so, the two corresponding initial relation triples are determined to be the same triple, yielding multiple to-be-processed relation triples; if not, the two corresponding initial relation triples are determined not to be the same triple, yielding multiple target relation triples. If the three initial relation phrases of the initial relation triples are not all the same, the corresponding initial relation triples are determined to be target relation triples, yielding multiple target relation triples. The multiple to-be-processed relation triples are then fused, yielding a data set to be processed that includes multiple fused relation triples and multiple target relation triples, where the target relation triple set includes relation triples that have not undergone synonym replacement through the synonym dictionary and relation triples that have.
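The alignment and deduplication logic above — grouping triples whose head entity, relation phrase, and tail entity all coincide, and fusing each such group into one — can be sketched as:

```python
def dedupe_triples(triples):
    """Fuse triples that are the same (head, relation, tail); triples that
    differ in any component are kept as distinct target triples."""
    counts = {}
    for t in triples:
        counts[t] = counts.get(t, 0) + 1
    fused = [t for t, n in counts.items() if n > 1]    # fused relation triples
    target = [t for t, n in counts.items() if n == 1]  # target relation triples
    return fused + target

result = dedupe_triples([
    ("张三", "创立", "A公司"),
    ("张三", "创立", "A公司"),   # same triple -> fused into one
    ("A公司", "位于", "深圳"),   # distinct -> kept as a target triple
])
```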
205. Construct an initial unsupervised generation model from a pre-trained backbone model, and train and optimize the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model.
Specifically, the server constructs the initial unsupervised generation model from the pre-trained backbone model and divides the data set to be processed into a training data set, a validation data set, and a test data set; trains the initial unsupervised generation model through the training data set to obtain a candidate unsupervised generation model; performs hidden-layer vector conversion, entity relation prediction, and text sequence generation on the validation data set through the candidate unsupervised generation model to obtain a validation result; calculates the validation loss value of the validation result through a preset loss function, and optimizes the candidate unsupervised generation model according to the validation loss value to obtain an optimized unsupervised generation model; and tests the optimized unsupervised generation model through the test data set to obtain a test result, calculates the test loss value of the test result, and determines the target unsupervised generation model according to the test loss value.
The server converts the data format of the training data set into the input format of the initial unsupervised generation model to obtain a format-converted training data set, inputs the format-converted training data set into the initial unsupervised generation model, and performs encoding and decoding on it in turn through the encoder and decoder of the initial unsupervised generation model, so that the parameters of the initial unsupervised generation model fit the training data set. This achieves model fine-tuning of the initial unsupervised generation model, thereby yielding the candidate unsupervised generation model.
Through the encoder of the candidate unsupervised generation model, the server converts the validation data set into a hidden-layer vector set, and performs entity relation prediction and text sequence generation on the hidden-layer vector set through the preset dictionary to obtain the validation result.
Through a preset loss function, which includes but is not limited to a cross-entropy loss function, the server calculates the cross entropy between the validation data set and the validation result, that is, the validation loss value, and iteratively adjusts the hyperparameters and/or network structure of the candidate unsupervised generation model according to the validation loss value until the loss function converges, obtaining the optimized unsupervised generation model and improving the accuracy of the optimized unsupervised generation model.
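The validation loss — cross entropy between the reference characters and the predicted character distributions — can be sketched numerically as below; the two-step, three-character vocabulary is a toy example, not the model's real dictionary.

```python
import math

def cross_entropy(prob_dists, target_ids):
    """Average negative log-probability the model assigns to each
    reference character; lower means the generation matches better."""
    losses = [-math.log(dist[t]) for dist, t in zip(prob_dists, target_ids)]
    return sum(losses) / len(losses)

# Two decoding steps over a toy 3-character vocabulary.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(dists, [0, 1])
```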
The server performs hidden-layer vector conversion, entity relation prediction, and text sequence generation on the test data set through the optimized unsupervised generation model to obtain a test result, calculates the test loss value of the test result, and judges whether the test loss value is greater than a preset threshold. If so, the optimized unsupervised generation model is iteratively optimized to obtain the target unsupervised generation model; if not, the optimized unsupervised generation model is determined to be the target unsupervised generation model.
By generating a text sequence directly from the text sentence and the two entities in the relation classification data set to be processed, where the text sequence includes an entity relationship field that with high probability does not appear in the input text (that is, the text sentence in the relation classification data set to be processed), the problem that existing open relation extraction has difficulty handling relations of indefinite type is solved.
206. Obtain a text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain a preprocessed text.
The server receives the text to be processed through a preset display interface or from a terminal, performs word segmentation on it through the preset open-source library Jieba to obtain a word list, and takes words from the word list two at a time in list order to implement word pairing, obtaining the preprocessed text. Word pairing does not noticeably affect the efficiency of the target unsupervised generation model. For example, with N words, N(N-1)/2 pairings are needed; on average N=5 in a sentence, requiring 10 pairings, and since one model inference takes about 1 s, 10 inferences take 10 s, an order of magnitude that does not noticeably affect model efficiency.
207. Perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
Specifically, the server converts the data format of the preprocessed text into the encoding input format of the target unsupervised generation model to obtain a converted text, where the target unsupervised generation model includes an encoder and a decoder; performs data fitting on the converted text through the encoder to obtain a hidden-layer vector; obtains, through the decoder, the corresponding target words from the preset dictionary based on a preset greedy algorithm and the hidden-layer vector; and generates a text sequence from the target words to obtain the target entity relationship information.
For example, the server converts the data format of the preprocessed text into the encoding input format of the target unsupervised generation model: [CLS]XXX<entity_head>XXX</entity_head>XXX<entity_tail>XXX</entity_tail>XXX[SEP]YYY[END], where [CLS] is a classification position with no actual meaning; [SEP] is a division position, the content before [SEP] being the input at inference time and the content after [SEP] being the generated content; [END] is a termination position indicating the end of relation generation; the part enclosed by <tag> and </tag> is the mention of the entity in the sentence; and the content enclosed by [SEP] and [END] is the generated entity relationship. Through the embedding layer and multi-layer neural network of the encoder of the target unsupervised generation model, the converted text is fitted, that is, converted into a hidden-layer vector, which includes multiple word vectors. Through the decoder of the target unsupervised generation model, the server calculates the joint probability between every two word vectors in the hidden-layer vector and, through the preset greedy algorithm and according to the joint probability, selects the corresponding target words from the preset dictionary, generating a text sequence from the target words in the sequence order of the word vectors and thereby obtaining the target entity relationship information. That is, the character that best matches (i.e., the position corresponding to the maximum probability value predicted by the target unsupervised generation model, the maximum of the joint probability) is selected from the preset dictionary attached to the backbone model and appended after the text to be processed, so as to extract, predict, and regenerate the sequence of the entity relationship of the text to be processed.
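The greedy generation loop — repeatedly appending the highest-probability dictionary character until the termination token — can be sketched as follows, with a scripted toy scoring function standing in for the UniLM decoder (the probabilities below are invented for illustration).

```python
def greedy_decode(score_next, vocab, max_steps=10, end_token="[END]"):
    """At each step pick the argmax character from the dictionary and
    append it to the generated sequence, stopping at the end token."""
    generated = []
    for _ in range(max_steps):
        probs = score_next(generated)  # one distribution over the vocabulary
        best = max(range(len(vocab)), key=lambda i: probs[i])
        token = vocab[best]
        if token == end_token:
            break
        generated.append(token)
    return "".join(generated)

# Toy stand-in decoder that spells out "创立" and then stops.
vocab = ["创", "立", "[END]"]
script = iter([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
relation = greedy_decode(lambda prev: next(script), vocab)
```

Because the relation string is generated character by character from the dictionary rather than copied from the input, it can express relation types that never appear verbatim in the source sentence.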
In the embodiments of this application, the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed, an initial unsupervised generation model is constructed from a pre-trained backbone model, and data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text through the target unsupervised generation model. This solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length, and thereby solves the problem that existing open relation extraction has difficulty handling relations of indefinite type.
The method for extracting open entity relationships in the embodiments of this application has been described above; the apparatus for extracting open entity relationships in the embodiments of this application is described below. Referring to FIG. 3, one embodiment of the apparatus for extracting open entity relationships in the embodiments of this application includes:
the first preprocessing module 301, configured to obtain a relation classification data set to be processed, and preprocess the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
the training and optimization module 302, configured to construct an initial unsupervised generation model from a pre-trained backbone model, and train and optimize the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
the second preprocessing module 303, configured to obtain a text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
the extraction module 304, configured to perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
The function implementation of each module in the above apparatus for extracting open entity relationships corresponds to each step in the above embodiments of the method for extracting open entity relationships, and the functions and implementation processes thereof are not repeated here.
In the embodiments of this application, the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed, an initial unsupervised generation model is constructed from a pre-trained backbone model, and data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text through the target unsupervised generation model. This solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length, and thereby solves the problem that existing open relation extraction has difficulty handling relations of indefinite type.
Referring to FIG. 4, another embodiment of the apparatus for extracting open entity relationships in the embodiments of this application includes:
the first preprocessing module 301, configured to obtain a relation classification data set to be processed, and preprocess the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
where the first preprocessing module 301 specifically includes:
a creation and acquisition unit 3011, configured to create a synonym dictionary, obtain the relation classification data set to be processed, and obtain the target synonyms in the synonym dictionary corresponding to the relation classification data set to be processed;
a replacement unit 3012, configured to perform synonym replacement on the relation classification data set to be processed through the target synonyms to obtain an enhanced data set;
a filtering unit 3013, configured to filter the enhanced data set according to a preset entity field length and a preset sentence length to obtain a filtered data set;
a processing unit 3014, configured to obtain the relation triple set of the filtered data set, and perform alignment and deduplication on the relation triple set through a preset regular expression to obtain the data set to be processed;
the training and optimization module 302, configured to construct an initial unsupervised generation model from a pre-trained backbone model, and train and optimize the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
the second preprocessing module 303, configured to obtain a text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
the extraction module 304, configured to perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
Optionally, the creation and acquisition unit 3011 may be further specifically configured to:
obtain target word data that has undergone deduplication and fusion, and perform string generation on the target word data according to configured synonym definition information to obtain the synonym dictionary;
obtain the relation classification data set to be processed, and the entities and entity relationships of the relation classification data set to be processed;
perform part-of-speech tagging on the relation classification data set to be processed, and randomly select a target entity and a target entity relationship from the entities and entity relationships;
traverse the synonym dictionary according to the target entity and the target entity relationship to obtain the corresponding target synonyms.
Optionally, the filtering unit 3013 may be further specifically configured to:
classify the enhanced data set based on the preset entity field length to obtain a first data set and a second data set, the first data set indicating conformity with the preset entity field length, and the second data set indicating non-conformity with the preset entity field length;
classify the first data set and the second data set according to the preset sentence length to obtain a target data set and a non-target data set, the target data set indicating conformity with the preset sentence length, and the non-target data set indicating non-conformity with the preset sentence length;
perform placeholder padding and masking on the sentences in the non-target data set to obtain padded data;
determine the padded data and the target data set as the filtered data set.
Optionally, the processing unit 3014 may be further specifically configured to:
extract the initial relation triple set in the filtered data set, and the initial relation phrase set corresponding to the initial relation triple set;
perform alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain multiple to-be-processed relation triples and multiple target relation triples, the multiple to-be-processed relation triples indicating triples that are the same triple, and the multiple target relation triples indicating triples that are not the same triple;
fuse the multiple to-be-processed relation triples to obtain multiple fused relation triples, and determine the multiple fused relation triples and the multiple target relation triples as the data set to be processed.
Optionally, the extraction module 304 may be further specifically configured to:
convert the data format of the preprocessed text into the encoding input format of the target unsupervised generation model to obtain a converted text, the target unsupervised generation model including an encoder and a decoder;
perform data fitting on the converted text through the encoder to obtain a hidden-layer vector;
obtain, through the decoder, the corresponding target words from a preset dictionary based on a preset greedy algorithm and the hidden-layer vector;
generate a text sequence from the target words to obtain the target entity relationship information.
Optionally, the training and optimization module 302 may be further specifically configured to:
construct an initial unsupervised generation model from a pre-trained backbone model, and divide the data set to be processed into a training data set, a validation data set, and a test data set;
train the initial unsupervised generation model through the training data set to obtain a candidate unsupervised generation model;
perform hidden-layer vector conversion, entity relation prediction, and text sequence generation on the validation data set through the candidate unsupervised generation model to obtain a validation result;
calculate the validation loss value of the validation result through a preset loss function, and optimize the candidate unsupervised generation model according to the validation loss value to obtain an optimized unsupervised generation model;
test the optimized unsupervised generation model through the test data set to obtain a test result, calculate the test loss value of the test result, and determine the target unsupervised generation model according to the test loss value.
The function implementation of each module and each unit in the above apparatus for extracting open entity relationships corresponds to each step in the above embodiments of the method for extracting open entity relationships, and the functions and implementation processes thereof are not repeated here.
In the embodiments of this application, the entity relationships, field lengths, and relation triples of the relation classification data set to be processed are preprocessed, an initial unsupervised generation model is constructed from a pre-trained backbone model, and data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation are performed on the preprocessed text through the target unsupervised generation model. This solves the problems of high labeling cost, low computational efficiency, inability to handle overlapping samples, and, when extending to the open setting, the need to compute a large matrix whose numbers of rows and columns both equal the input sentence length, and thereby solves the problem that existing open relation extraction has difficulty handling relations of indefinite type.
FIGS. 3 and 4 above describe in detail the apparatus for extracting open entity relationships in the embodiments of this application from the perspective of modular functional entities; the device for extracting open entity relationships in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a device for extracting open entity relationships provided by an embodiment of this application. The device 500 for extracting open entity relationships may vary greatly in configuration or performance and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be short-term storage or persistent storage. A program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the device 500 for extracting open entity relationships. Further, the processor 510 may be configured to communicate with the storage medium 530 to execute, on the device 500 for extracting open entity relationships, the series of instruction operations in the storage medium 530.
The device 500 for extracting open entity relationships may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art can understand that the device structure shown in FIG. 5 does not constitute a limitation on the device for extracting open entity relationships, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
This application also provides a device for extracting open entity relationships, including a memory and at least one processor, where instructions are stored in the memory and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory so that the device for extracting open entity relationships performs the steps in the above method for extracting open entity relationships.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and which stores instructions that, when run on a computer, cause the computer to perform the following steps:
obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
Further, the computer-readable storage medium may mainly include a stored program area and a stored data area, where the stored program area may store an operating system, an application program required by at least one function, and the like, and the stored data area may store data created according to the use of blockchain nodes, and the like.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (20)

  1. A method for extracting open entity relationships, wherein the method for extracting open entity relationships comprises:
    obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
    constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
    obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
    performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
  2. The method for extracting open entity relationships according to claim 1, wherein the obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed comprises:
    creating a synonym dictionary, obtaining the relation classification data set to be processed, and obtaining target synonyms in the synonym dictionary corresponding to the relation classification data set to be processed;
    performing synonym replacement on the relation classification data set to be processed through the target synonyms to obtain an enhanced data set;
    filtering the enhanced data set according to a preset entity field length and a preset sentence length to obtain a filtered data set;
    obtaining a relation triple set of the filtered data set, and performing alignment and deduplication on the relation triple set through a preset regular expression to obtain the data set to be processed.
  3. The method for extracting open entity relationships according to claim 2, wherein the creating a synonym dictionary, obtaining the relation classification data set to be processed, and obtaining target synonyms in the synonym dictionary corresponding to the relation classification data set to be processed comprises:
    obtaining target word data that has undergone deduplication and fusion, and performing string generation on the target word data according to configured synonym definition information to obtain the synonym dictionary;
    obtaining the relation classification data set to be processed, and the entities and entity relationships of the relation classification data set to be processed;
    performing part-of-speech tagging on the relation classification data set to be processed, and randomly selecting a target entity and a target entity relationship from the entities and the entity relationships;
    traversing the synonym dictionary according to the target entity and the target entity relationship to obtain the corresponding target synonyms.
  4. The method for extracting open entity relationships according to claim 2, wherein the filtering the enhanced data set according to a preset entity field length and a preset sentence length to obtain a filtered data set comprises:
    classifying the enhanced data set based on the preset entity field length to obtain a first data set and a second data set, the first data set indicating conformity with the preset entity field length, and the second data set indicating non-conformity with the preset entity field length;
    classifying the first data set and the second data set according to the preset sentence length to obtain a target data set and a non-target data set, the target data set indicating conformity with the preset sentence length, and the non-target data set indicating non-conformity with the preset sentence length;
    performing placeholder padding and masking on the sentences in the non-target data set to obtain padded data;
    determining the padded data and the target data set as the filtered data set.
  5. The method for extracting open entity relationships according to claim 2, wherein the obtaining a relation triple set of the filtered data set, and performing alignment and deduplication on the relation triple set through a preset regular expression to obtain the data set to be processed comprises:
    extracting an initial relation triple set in the filtered data set, and an initial relation phrase set corresponding to the initial relation triple set;
    performing alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain multiple to-be-processed relation triples and multiple target relation triples, the multiple to-be-processed relation triples indicating triples that are the same triple, and the multiple target relation triples indicating triples that are not the same triple;
    fusing the multiple to-be-processed relation triples to obtain multiple fused relation triples, and determining the multiple fused relation triples and the multiple target relation triples as the data set to be processed.
  6. The method for extracting open entity relationships according to claim 1, wherein the performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information comprises:
    converting the data format of the preprocessed text into an encoding input format of the target unsupervised generation model to obtain a converted text, the target unsupervised generation model comprising an encoder and a decoder;
    performing data fitting on the converted text through the encoder to obtain a hidden-layer vector;
    obtaining, through the decoder, corresponding target words from a preset dictionary based on a preset greedy algorithm and the hidden-layer vector;
    generating a text sequence from the target words to obtain the target entity relationship information.
  7. The method for extracting open entity relationships according to any one of claims 1-6, wherein the constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model comprises:
    constructing the initial unsupervised generation model from the pre-trained backbone model, and dividing the data set to be processed into a training data set, a validation data set, and a test data set;
    training the initial unsupervised generation model through the training data set to obtain a candidate unsupervised generation model;
    performing hidden-layer vector conversion, entity relation prediction, and text sequence generation on the validation data set through the candidate unsupervised generation model to obtain a validation result;
    calculating a validation loss value of the validation result through a preset loss function, and optimizing the candidate unsupervised generation model according to the validation loss value to obtain an optimized unsupervised generation model;
    testing the optimized unsupervised generation model through the test data set to obtain a test result, calculating a test loss value of the test result, and determining the target unsupervised generation model according to the test loss value.
  8. A device for extracting open entity relationships, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    obtaining a relation classification data set to be processed, and preprocessing the entity relationships, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
    constructing an initial unsupervised generation model from a pre-trained backbone model, and training and optimizing the initial unsupervised generation model through the data set to be processed to obtain a target unsupervised generation model;
    obtaining a text to be processed, and performing word segmentation and word pairing on the text to be processed to obtain a preprocessed text;
    performing data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generation model to obtain target entity relationship information.
  9. The device for extracting open entity relationships according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    creating a synonym dictionary, obtaining the relation classification data set to be processed, and obtaining target synonyms in the synonym dictionary corresponding to the relation classification data set to be processed;
    performing synonym replacement on the relation classification data set to be processed through the target synonyms to obtain an enhanced data set;
    filtering the enhanced data set according to a preset entity field length and a preset sentence length to obtain a filtered data set;
    obtaining a relation triple set of the filtered data set, and performing alignment and deduplication on the relation triple set through a preset regular expression to obtain the data set to be processed.
  10. The device for extracting open entity relationships according to claim 9, wherein the processor further implements the following steps when executing the computer program:
    obtaining target word data that has undergone deduplication and fusion, and performing string generation on the target word data according to configured synonym definition information to obtain the synonym dictionary;
    obtaining the relation classification data set to be processed, and the entities and entity relationships of the relation classification data set to be processed;
    performing part-of-speech tagging on the relation classification data set to be processed, and randomly selecting a target entity and a target entity relationship from the entities and the entity relationships;
    traversing the synonym dictionary according to the target entity and the target entity relationship to obtain the corresponding target synonyms.
  11. The device for extracting open entity relationships according to claim 9, wherein the processor further implements the following steps when executing the computer program:
    classifying the enhanced data set based on the preset entity field length to obtain a first data set and a second data set, the first data set indicating conformity with the preset entity field length, and the second data set indicating non-conformity with the preset entity field length;
    classifying the first data set and the second data set according to the preset sentence length to obtain a target data set and a non-target data set, the target data set indicating conformity with the preset sentence length, and the non-target data set indicating non-conformity with the preset sentence length;
    performing placeholder padding and masking on the sentences in the non-target data set to obtain padded data;
    determining the padded data and the target data set as the filtered data set.
  12. The device for extracting open entity relationships according to claim 9, wherein the processor further implements the following steps when executing the computer program:
    extracting an initial relation triple set in the filtered data set, and an initial relation phrase set corresponding to the initial relation triple set;
    performing alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain multiple to-be-processed relation triples and multiple target relation triples, the multiple to-be-processed relation triples indicating triples that are the same triple, and the multiple target relation triples indicating triples that are not the same triple;
    fusing the multiple to-be-processed relation triples to obtain multiple fused relation triples, and determining the multiple fused relation triples and the multiple target relation triples as the data set to be processed.
  13. The device for extracting open entity relationships according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    converting the data format of the preprocessed text into an encoding input format of the target unsupervised generation model to obtain a converted text, the target unsupervised generation model comprising an encoder and a decoder;
    performing data fitting on the converted text through the encoder to obtain a hidden-layer vector;
    obtaining, through the decoder, corresponding target words from a preset dictionary based on a preset greedy algorithm and the hidden-layer vector;
    generating a text sequence from the target words to obtain the target entity relationship information.
  14. The device for extracting open entity relationships according to any one of claims 8-13, wherein the processor further implements the following steps when executing the computer program:
    constructing an initial unsupervised generation model from a pre-trained backbone model, and dividing the data set to be processed into a training data set, a validation data set, and a test data set;
    training the initial unsupervised generation model through the training data set to obtain a candidate unsupervised generation model;
    performing hidden-layer vector conversion, entity relation prediction, and text sequence generation on the validation data set through the candidate unsupervised generation model to obtain a validation result;
    calculating a validation loss value of the validation result through a preset loss function, and optimizing the candidate unsupervised generation model according to the validation loss value to obtain an optimized unsupervised generation model;
    testing the optimized unsupervised generation model through the test data set to obtain a test result, calculating a test loss value of the test result, and determining the target unsupervised generation model according to the test loss value.
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储计算机指令,当所述计算机指令在计算机上运行时,使得计算机执行如下步骤:
    获取待处理的关系分类数据集,对所述待处理的关系分类数据集的实体关系、字段长度和关系三元组进行预处理,得到待处理数据集;
    通过预先训练好的主干模型构建初始无监督生成模型,并通过所述待处理数据集,对所述初始无监督生成模型进行训练和优化,得到目标无监督生成模型;
    获取待处理文本,并对所述待处理文本进行分词和词配对处理,得到预处理文本;
    通过所述目标无监督生成模型,对所述预处理文本进行数据格式转换、隐层向量转换、实体关系预测和文本序列生成,得到目标实体关系信息。
  16. 如权利要求15所述的计算机可读存储介质,当所述计算机指令在计算机上运行时,使得计算机还执行如下步骤:
    创建同义词词典,并获取待处理的关系分类数据集,以及所述同义词词典中所述待处理的关系分类数据集对应的目标同义词;
    通过所述目标同义词,对所述待处理的关系分类数据集进行同义词替换,得到增强数据集;
    按照预设实体字段长度和预设句长度,对所述增强数据集进行过滤,得到过滤数据集;
    获取所述过滤数据集的关系三元组集,通过预置的正则表达式,对所述关系三元组集进行对齐处理和去重处理,得到待处理数据集。
  17. The computer-readable storage medium according to claim 16, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    obtaining target word data that has undergone deduplication and fusion, and performing string generation on the target word data according to configured synonym definition information to obtain a synonym dictionary;
    obtaining a relation classification data set to be processed, as well as the entities and entity relations of the relation classification data set to be processed;
    performing part-of-speech tagging on the relation classification data set to be processed, and randomly selecting a target entity and a target entity relation from the entities and the entity relations;
    traversing the synonym dictionary according to the target entity and the target entity relation to obtain corresponding target synonyms.
  18. The computer-readable storage medium according to claim 16, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    classifying the augmented data set based on a preset entity field length to obtain a first data set and a second data set, the first data set indicating compliance with the preset entity field length and the second data set indicating non-compliance with the preset entity field length;
    classifying the first data set and the second data set according to a preset sentence length to obtain a target data set and a non-target data set, the target data set indicating compliance with the preset sentence length and the non-target data set indicating non-compliance with the preset sentence length;
    performing placeholder padding and masking on the sentences in the non-target data set to obtain padded data;
    determining the padded data and the target data set as the filtered data set.
  19. The computer-readable storage medium according to claim 16, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    extracting an initial relation triple set from the filtered data set, and an initial relation phrase set corresponding to the initial relation triple set;
    performing alignment analysis on the initial relation triple set according to the initial relation phrase set to obtain a plurality of to-be-processed relation triples and a plurality of target relation triples, the plurality of to-be-processed relation triples indicating triples that are the same triple, and the plurality of target relation triples indicating triples that are not the same triple;
    fusing the plurality of to-be-processed relation triples to obtain a plurality of fused relation triples, and determining the plurality of fused relation triples and the plurality of target relation triples as the data set to be processed.
  20. An open entity relation extraction apparatus, wherein the open entity relation extraction apparatus comprises:
    a first preprocessing module, configured to obtain a relation classification data set to be processed, and preprocess the entity relations, field lengths, and relation triples of the relation classification data set to be processed to obtain a data set to be processed;
    a training and optimization module, configured to construct an initial unsupervised generative model from a pre-trained backbone model, and train and optimize the initial unsupervised generative model with the data set to be processed to obtain a target unsupervised generative model;
    a second preprocessing module, configured to obtain text to be processed, and perform word segmentation and word pairing on the text to be processed to obtain preprocessed text;
    an extraction module, configured to perform data format conversion, hidden-layer vector conversion, entity relation prediction, and text sequence generation on the preprocessed text through the target unsupervised generative model to obtain target entity relation information.
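The module decomposition of this apparatus claim can be sketched as a single pipeline object. The whitespace tokenizer, adjacent-word pairing, and the `predict` callable below are naive stand-ins for the claimed segmentation, pairing, and generative-model components:

```python
class OpenRelationExtractor:
    """Sketch of the claimed modules; every component here is a stand-in."""

    def __init__(self, predict):
        self.predict = predict  # plays the role of the target unsupervised model

    def preprocess_text(self, text):
        """Second preprocessing module: word segmentation and word pairing."""
        tokens = text.split()
        pairs = list(zip(tokens, tokens[1:]))  # naive adjacent-word pairing
        return tokens, pairs

    def extract(self, text):
        """Extraction module: run the (stand-in) model on the preprocessed text."""
        tokens, pairs = self.preprocess_text(text)
        return self.predict(tokens, pairs)

# Trivial stand-in model: first token as head, last as tail, middle as relation.
extractor = OpenRelationExtractor(
    lambda tokens, pairs: (tokens[0], " ".join(tokens[1:-1]), tokens[-1])
)
triple = extractor.extract("Apple was founded by Jobs")
```

Swapping the lambda for a trained encoder-decoder model would turn this skeleton into the full pipeline the claims describe.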
PCT/CN2021/109168 2021-03-26 2021-07-29 Open entity relation extraction method, apparatus, device, and storage medium WO2022198868A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110322883.8 2021-03-26
CN202110322883.8A CN113011189A (zh) 2021-03-26 2021-03-26 Open entity relation extraction method, apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022198868A1 true WO2022198868A1 (zh) 2022-09-29

Family

ID=76407421

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109168 WO2022198868A1 (zh) 2021-03-26 2021-07-29 Open entity relation extraction method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN113011189A (zh)
WO (1) WO2022198868A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115629928A (zh) * 2022-12-22 2023-01-20 中国人民解放军国防科技大学 Hardware-software co-verification method and system for brain-inspired processors
CN115840742A (zh) * 2023-02-13 2023-03-24 每日互动股份有限公司 Data cleaning method, apparatus, device, and medium
CN116029294A (zh) * 2023-03-30 2023-04-28 华南师范大学 Word-term pairing method, apparatus, and device
CN116737870A (zh) * 2023-08-09 2023-09-12 北京国电通网络技术有限公司 Reported-information storage method, apparatus, electronic device, and computer-readable medium
CN116775801A (zh) * 2023-06-26 2023-09-19 中山大学 Entity relation extraction method and system for Chinese medical texts
CN117290510A (zh) * 2023-11-27 2023-12-26 浙江太美医疗科技股份有限公司 Document information extraction method, model, electronic device, and readable medium

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
CN113011189A (zh) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 Open entity relation extraction method, apparatus, device, and storage medium
CN113743095A (zh) * 2021-07-19 2021-12-03 西安理工大学 Unified pre-training method for Chinese question generation based on word lattices and relative position embedding
CN113627172A (zh) * 2021-07-26 2021-11-09 重庆邮电大学 Entity recognition method and system based on multi-granularity feature fusion and uncertainty-aware denoising
CN113836316B (zh) * 2021-09-23 2023-01-03 北京百度网讯科技有限公司 Triple data processing method, training method, apparatus, device, and medium
CN114528418B (zh) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 Text processing method, system, and storage medium
CN115150354B (zh) * 2022-06-29 2023-11-10 北京天融信网络安全技术有限公司 Domain name generation method, apparatus, storage medium, and electronic device
CN115048925B (zh) * 2022-08-15 2022-11-04 中科雨辰科技有限公司 Data processing system for identifying anomalous text

Citations (6)

Publication number Priority date Publication date Assignee Title
US20190354883A1 (en) * 2016-09-22 2019-11-21 nference, inc. Systems, methods, and computer readable media for visualization of semantic information and inference of temporal signals indicating salient associations between life science entities
CN111324743A (zh) * 2020-02-14 2020-06-23 平安科技(深圳)有限公司 Text relation extraction method and apparatus, computer device, and storage medium
CN112069319A (zh) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 Text extraction method and apparatus, computer device, and readable storage medium
CN112487206A (zh) * 2020-12-09 2021-03-12 中国电子科技集团公司第三十研究所 Entity relation extraction method with automatic data set construction
CN112527981A (zh) * 2020-11-20 2021-03-19 清华大学 Open information extraction method and apparatus, electronic device, and storage medium
CN113011189A (zh) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 Open entity relation extraction method, apparatus, device, and storage medium



Also Published As

Publication number Publication date
CN113011189A (zh) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2022198868A1 (zh) Open entity relation extraction method, apparatus, device, and storage medium
CN111310438B (zh) Chinese sentence semantic intelligent matching method and apparatus based on a multi-granularity fusion model
CN109840287B (zh) Neural-network-based cross-modal information retrieval method and apparatus
CN113239181B (zh) Deep-learning-based citation recommendation method for scientific and technical literature
WO2022227207A1 (zh) Text classification method and apparatus, computer device, and storage medium
CN111159223B (zh) Interactive code search method and apparatus based on structured embeddings
CN111709243B (zh) Deep-learning-based knowledge extraction method and apparatus
WO2021204014A1 (zh) Model training method and related apparatus
CN111414481A (zh) Chinese semantic matching method based on pinyin and BERT embeddings
KR20220114495A (ko) Interaction layer neural networks for exploration, search, and ranking
CN112800203B (zh) Question-answer matching method and system fusing text and knowledge representations
CN116304066B (zh) Prompt-learning-based node classification method for heterogeneous information networks
WO2023134083A1 (zh) Text-based sentiment classification method and apparatus, computer device, and storage medium
CN116661805B (zh) Code representation generation method and apparatus, storage medium, and electronic device
CN116756303A (zh) Automatic multi-topic text summarization method and system
CN114298055B (zh) Retrieval method and apparatus based on multi-level semantic matching, computer device, and storage medium
CN113343692B (zh) Search intent recognition method, model training method, apparatus, medium, and device
CN116662566A (zh) Heterogeneous information network link prediction method based on a contrastive learning mechanism
CN116680407A (zh) Knowledge graph construction method and apparatus
Ye et al. Going “deeper”: Structured sememe prediction via transformer with tree attention
Nambiar et al. Attention based abstractive summarization of malayalam document
WO2023115770A1 (zh) Translation method and related device
CN115544999A (zh) Domain-oriented parallel large-scale text deduplication method
CN113435212B (zh) Rule-embedding-based text inference method and apparatus
CN115936014A (zh) Medical entity code-matching method, system, computer device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21932480

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 240124)