CN115658921A - Open domain scientific knowledge discovery method and device based on pre-training language model - Google Patents

Open domain scientific knowledge discovery method and device based on pre-training language model

Info

Publication number
CN115658921A
Authority
CN
China
Prior art keywords
prompt
language model
entity
sample data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211392326.4A
Other languages
Chinese (zh)
Inventor
Chen Huajun (陈华钧)
Tian Xi (田玺)
Bi Zhen (毕祯)
Zhang Ningyu (张宁豫)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Zhejiang University ZJU
Original Assignee
Xinhua Fusion Media Technology Development Beijing Co ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Fusion Media Technology Development Beijing Co ltd and Zhejiang University ZJU
Priority to CN202211392326.4A
Publication of CN115658921A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an open domain scientific knowledge discovery method and device based on a pre-trained language model. An input template is constructed that comprises a head entity, a first prompt, a second prompt, and a tail entity mask. For each triple containing a target relation, the head entity, the discrete tokens of the first prompt corresponding to the target relation, and the second prompt tokens (initialized with pre-trained embeddings) are filled into the input template, and the tail entity is masked, forming input sample data. A separate pre-trained language model is constructed for each target relation; the model is trained on a mask task with the input sample data corresponding to that relation, optimizing the embedded representations of the first and second prompts. The optimized embedded representations of the first and second prompts, together with the pre-trained language model, are then used to predict the missing entities in triples, improving the knowledge discovery efficiency and accuracy of the pre-trained language model.

Description

Open domain scientific knowledge discovery method and device based on pre-training language model
Technical Field
The invention belongs to the fields of natural language processing and machine learning, and particularly relates to an open domain scientific knowledge discovery method and device based on a pre-trained language model.
Background
The development of pre-trained language models has advanced natural language processing research to a new stage: universal language representations can be learned from massive unlabeled corpora, markedly benefiting downstream tasks. Studies have shown that pre-trained language models have a certain capacity to store information and answer questions, with different kinds of knowledge implicit in their parameters; acquiring this knowledge is crucial for language models across downstream tasks. However, as in most neural networks, knowledge in pre-trained language models is encoded diffusely, which makes it difficult to interpret and update.
Given this development, pre-trained language models are widely used to supplement the construction of scientific knowledge graphs. A scientific knowledge graph records various scientific knowledge as triples (head entity, relation, tail entity), where the head and tail entities can be entities from scientific fields such as biomedicine and chemistry (diseases, drugs, genes, molecules, and the like), and the relations include inclusion, action, type, and the like.
Knowledge learned during pre-training can be elicited through fine-tuning or prompting, and prompting is an effective way to acquire this knowledge directly, without additional training. Prompts can be divided into manually created prompts and automatically learned prompts. Although manually created prompts are intuitive and do enable triple completion to a degree, designing them takes time and expertise; for complex completion tasks, even an experienced prompt designer may be unable to find the best prompt manually. Automatically learned prompts automate the design of the prompt template but fail to fully capture scientific term information.
Patent document CN114706943A discloses an intention recognition method that performs intention recognition on input text augmented with a prompt, using a pre-trained language model. The added prompts do not capture scientific terminology information, so the knowledge mined during learning is of poor accuracy.
Patent document CN114661913A discloses an entity relationship extraction method based on a pre-trained language model which, by screening prompt templates, addresses the poor knowledge-mining efficiency caused by manual prompt-template labeling. However, the method still does not capture scientific term information, likewise leading to poor accuracy of the knowledge mined during learning.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an open domain scientific knowledge discovery method and apparatus based on a pre-trained language model that fully exploit external scientific knowledge and learnable knowledge representations, better probe the knowledge stored in the pre-trained language model, and improve its knowledge discovery efficiency and accuracy.
To achieve the above object, an embodiment provides an open domain scientific knowledge discovery method based on a pre-trained language model, comprising the following steps:
extracting triples (head entity, relation, tail entity) from a scientific knowledge graph, and constructing a first prompt and a second prompt for each relation using external scientific knowledge;
constructing an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask;
taking all triples containing a target relation as sample data, filling the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and masking the tail entity to form input sample data;
constructing a separate pre-trained language model for each target relation, training the model on a mask task with the input sample data corresponding to that relation, and optimizing the embedded representations of the first and second prompts;
and predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
Preferably, the first and second prompts constructed from external scientific knowledge are related to the relation in the triple; the first prompt serves as discrete tokens in the input template, and the second prompt serves as the initialization tokens of a continuous space vector in the input template.
Preferably, the first prompt corresponding to the target relation is filled into the input template in discrete form, as the discrete tokens of the target relation.
Preferably, the second prompt corresponding to the target relation is filled into the input template in the form of a continuous vector, as the initialization tokens of the continuous vector of the target relation.
Preferably, pre-trained embeddings are used to embed the tokens of the second prompt corresponding to the target relation, and the embedded representation serves as the initialization of the continuous vector of the target relation.
Preferably, the input sample data corresponding to the target relation is used to train the pre-trained language model on a mask task: the parameters of the pre-trained language model are kept fixed, and the negative log-likelihood of the input sample data is minimized with the following loss function and a gradient correction method, updating the embedded representations of the first and second prompts;
$$\mathcal{L} = -\frac{1}{|D_r|} \sum_{(h,t)\in D_r} \log P([\mathrm{MASK}] = t \mid t_r(h))$$
where $\mathcal{L}$ denotes the loss function, $D_r$ denotes the set of head-entity/tail-entity pairs $(h, t)$ of the target relation $r$, $t_r(h)$ denotes the input sample data formed by filling a pair $(h, t)$ containing the target relation $r$ into the input template, and $P([\mathrm{MASK}] = t \mid t_r(h))$ denotes the probability, output by the pre-trained language model on input $t_r(h)$, that the masked position $[\mathrm{MASK}]$ equals the tail entity $t$.
Preferably, predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts and the pre-trained language model comprises:
taking the known entity in the triple as the head entity, and filling the head entity and the optimized embedded representations of the first and second prompts into the input template to form input sample data;
and inputting the input sample data into the pre-trained language model, which computes and outputs the prediction probability of the missing entity in the triple.
To achieve the above object, an embodiment provides an open domain scientific knowledge discovery apparatus based on a pre-trained language model, comprising:
an external knowledge acquisition module for extracting triples (head entity, relation, tail entity) from a scientific knowledge graph and constructing a first prompt and a second prompt for each relation using external scientific knowledge;
an input template construction module for constructing an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask;
an input sample data construction module for taking all triples containing a target relation as sample data, filling the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and masking the tail entity to form input sample data;
a training module for constructing a separate pre-trained language model for each target relation, training the model on a mask task with the input sample data corresponding to that relation, and optimizing the embedded representations of the first and second prompts;
and an application module for predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
Compared with the prior art, the beneficial effects of the invention include at least the following:
when the input sample data is constructed, external scientific knowledge is introduced as the first and second prompts of the target relation, so that the input sample data carries more scientifically relevant semantic information; when the pre-trained language model is trained, semantic information from the continuous space is integrated rather than relying entirely on hand-crafted prompts, so the knowledge in the pre-trained language model is captured more effectively, improving the model's knowledge discovery efficiency and accuracy.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an open domain scientific knowledge discovery method based on a pre-trained language model provided by an embodiment;
FIG. 2 is a schematic diagram of an open domain scientific knowledge discovery apparatus based on a pre-trained language model according to an embodiment.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
To enable the pre-trained language model to learn knowledge better, to better probe the knowledge within the language model, and to improve the knowledge discovery efficiency and accuracy of the pre-trained language model, the embodiments provide an open domain scientific knowledge discovery method and apparatus based on a pre-trained language model.
As shown in FIG. 1, the open domain scientific knowledge discovery method based on a pre-trained language model provided by the embodiment includes the following steps:
Step 1: extract triples from a scientific knowledge graph, classify the triples by relation, and construct a first prompt and a second prompt for each relation using external scientific knowledge.
In the embodiment, the triples (head entity h, relation r, tail entity t) extracted from the scientific knowledge graph record various scientific knowledge: the head entity may be a biomedical or chemical entity such as a disease, drug, gene, or molecule; the relation may be inclusion, action, type, or the like; and the tail entity may likewise be a biomedical or chemical entity such as a disease, drug, gene, or molecule.
In the embodiment, two prompts are constructed for each relation using external scientific knowledge: the first prompt t_r1 and the second prompt t_r2 are both related to the relation r in the triple and may or may not be equal. The first prompt serves as the discrete tokens of the relation in the input template, and the second prompt serves as the initialization tokens of the relation's continuous space vector in the input template.
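For concreteness, a minimal sketch of what the two prompts per relation might look like; the relation names and prompt strings below are invented examples for illustration, not the patent's actual prompts:
```python
# Hypothetical first (t_r1, discrete scientific terms) and second (t_r2,
# initialization text for the continuous vectors) prompts per relation.
# All strings here are illustrative assumptions, not taken from the patent.
relation_prompts = {
    "treats":   {"t_r1": "is a drug that treats",       "t_r2": "used in the treatment of"},
    "contains": {"t_r1": "is a compound that contains", "t_r2": "has as a component"},
}
```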
Step 2: construct an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask.
Objective facts are not independent of one another; data-driven prompts that reference the distribution of knowledge in the training set can be used to extract knowledge from a pre-trained language model, and can even recover objective facts from a randomly initialized one. Some previous methods (such as AutoPrompt) search a discrete vocabulary for the best K candidate prompts and then select and verify them, so the search space is limited to a discrete space. If a prompt token is replaced with a continuous vector, it need not correspond to a real token but is a continuous vector representation, in which the scientific term information carried by a discrete token is missing. For this reason, the embodiment uses scientific terms to construct the first prompt, used as discrete tokens, adds continuous space vectors initialized according to the second prompt as initialization tokens, and then masks the tokens of the triple's tail entity so they can be predicted. The constructed input template is:
t_r = [X] [Term]_1 … [Term]_m [P]_1 … [P]_n [MASK]
where [X] is the head entity of the triple; [Term]_1 … [Term]_m are the discrete tokens of the scientific terms related to the relation r, i.e. t_r1, with m the number of tokens in t_r1; [P]_1 … [P]_n ∈ ℝ^d are continuous vectors in the embedding space, where d denotes the dimension of the embedding vector and n is the number of tokens in t_r2; and [MASK] is the masked tail entity.
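Continuing the sketch above, the template can be realized as a token sequence in which the head entity and the first-prompt terms are ordinary vocabulary tokens, while reserved placeholder tokens hold the positions of the continuous vectors [P]. This assumes a BERT-style masked language model via the Hugging Face transformers library; the model name, the reserved-token scheme, and n are illustrative assumptions, not the patent's exact implementation:
```python
# A minimal sketch of the template t_r = [X][Term]_1..[Term]_m [P]_1..[P]_n [MASK].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

n_soft = 4  # n: number of continuous-prompt positions [P]_1..[P]_n (assumed)
soft_tokens = [f"[P{i}]" for i in range(n_soft)]
tokenizer.add_special_tokens({"additional_special_tokens": soft_tokens})

def build_input(head_entity: str, first_prompt: str):
    """Fill the template: head entity [X], discrete term tokens (t_r1),
    continuous-prompt placeholders (the [P] positions), and a masked tail."""
    text = f"{head_entity} {first_prompt} {' '.join(soft_tokens)} {tokenizer.mask_token}"
    return tokenizer(text, return_tensors="pt")

# e.g. one input for the assumed relation "treats":
enc = build_input("aspirin", relation_prompts["treats"]["t_r1"])
```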
Step 3: taking all triples containing the target relation as sample data, fill the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and mask the tail entity to form the input sample data.
In the embodiment, the relation of interest is taken as the target relation, and all triples containing the target relation serve as the sample data for the pre-trained language model corresponding to that relation. The sample data is converted into input sample data through the input template. Specifically, the first and second prompts corresponding to the target relation fill the input template: the first prompt is filled in discrete form, as the discrete tokens of the target relation, and the second prompt is filled in the form of a continuous vector, as the initialization tokens of the target relation's continuous vector.
In this challenging non-convex optimization problem, a good initialization of the continuous vectors is important. Therefore, the embodiment uses the manually constructed second prompt to determine the number n and the positions of the continuous vectors [P] for each target relation, and initializes [P] with the pre-trained embeddings of the tokens in the second prompt.
Meanwhile, the head entity is filled into the input template and the tail entity is masked, forming the input sample data.
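Continuing the sketch, the initialization described above can be realized by copying the pre-trained embeddings of the second prompt's tokens into the embedding rows reserved for [P]. The model name, and the assumption that the second prompt tokenizes to at least n word pieces, are illustrative:
```python
# Sketch: initialize the continuous vectors [P] from the pre-trained embeddings
# of the tokens of the manually constructed second prompt t_r2.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))   # make room for [P0]..[P3]

second_prompt = relation_prompts["treats"]["t_r2"]        # assumed example string
init_ids = tokenizer(second_prompt, add_special_tokens=False)["input_ids"]

emb = model.get_input_embeddings()                        # the word-embedding matrix
soft_ids = tokenizer.convert_tokens_to_ids(soft_tokens)
with torch.no_grad():                                     # copy rows, no grad tracking
    for soft_id, init_id in zip(soft_ids, init_ids[:n_soft]):
        emb.weight[soft_id] = emb.weight[init_id].clone()
```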
Step 4: construct a separate pre-trained language model for each target relation, train it on a mask task with the input sample data corresponding to that relation, and optimize the embedded representations of the first and second prompts.
In the embodiment, each target relation corresponds to one pre-trained language model. The input sample data corresponding to the target relation is used to train the model on a mask task; the parameters of the pre-trained language model are kept fixed, and the negative log-likelihood of the input sample data is minimized with the following loss function and a gradient correction method, updating the embedded representations of the first and second prompts:
$$\mathcal{L} = -\frac{1}{|D_r|} \sum_{(h,t)\in D_r} \log P([\mathrm{MASK}] = t \mid t_r(h))$$
where $\mathcal{L}$ denotes the loss function, $D_r$ denotes the set of head-entity/tail-entity pairs $(h, t)$ of the target relation $r$, $t_r(h)$ denotes the input sample data formed by filling a pair $(h, t)$ containing the target relation $r$ into the input template, and $P([\mathrm{MASK}] = t \mid t_r(h))$ denotes the probability, output by the pre-trained language model on input $t_r(h)$, that the masked position $[\mathrm{MASK}]$ equals the tail entity $t$.
The knowledge in the pre-trained language model is probed through its knowledge mask task, and the embedded representations of the first and second prompts are optimized so as to better predict the missing entities in triples.
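A compact sketch of this optimization, continuing the code above: the language model's parameters stay frozen, and a gradient mask (one reading of the "gradient correction" step) lets only the prompt embedding rows move while minimizing -log P([MASK] = t | t_r(h)). For brevity the mask here covers only the [P] rows; the first prompt's term-token rows could be added to soft_ids in the same way, since the patent optimizes both prompts. Single-token tail entities, the optimizer, and the learning rate are assumptions:
```python
# Sketch: mask-task training with a frozen LM; only the [P] embedding rows move.
import torch

for p in model.parameters():
    p.requires_grad = False
emb.weight.requires_grad = True                  # only the embedding matrix trains

# weight_decay=0 so rows with zeroed gradients stay exactly untouched
optimizer = torch.optim.AdamW([emb.weight], lr=1e-3, weight_decay=0.0)
row_mask = torch.zeros(emb.weight.shape[0], 1)
row_mask[soft_ids] = 1.0                         # rows allowed to change

def training_step(head_entity, first_prompt, tail_entity):
    enc = build_input(head_entity, first_prompt)
    labels = torch.full_like(enc["input_ids"], -100)      # ignore non-mask positions
    tail_id = tokenizer(tail_entity, add_special_tokens=False)["input_ids"][0]
    labels[enc["input_ids"] == tokenizer.mask_token_id] = tail_id
    loss = model(**enc, labels=labels).loss               # = -log P([MASK]=t | t_r(h))
    optimizer.zero_grad()
    loss.backward()
    emb.weight.grad *= row_mask                           # gradient correction: zero
    optimizer.step()                                      # grads of non-prompt rows
    return loss.item()
```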
Step 5: predict the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
Predicting the missing entities in triples containing the target relation with the optimized embedded representations of the first and second prompts and the pre-trained language model comprises the following steps:
take the known entity in the triple as the head entity, and fill the head entity and the optimized embedded representations of the first and second prompts into the input template to form input sample data; the input sample data is a complete cloze-style statement.
Input the input sample data into the pre-trained language model, compute and output the prediction probabilities of the missing entity in the triple, and screen candidates by prediction probability; the selected entities complete the incomplete triples into full triples, realizing open domain scientific knowledge discovery.
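Finally, a sketch of the prediction step under the same assumptions: the trained template is filled with the known head entity, and the model's probability distribution at the [MASK] position is read off and ranked (single-token candidates only, an assumed simplification):
```python
# Sketch: predict the missing tail entity by ranking the model's probabilities
# at the [MASK] position.
import torch

@torch.no_grad()
def predict_tail(head_entity, first_prompt, top_k=5):
    enc = build_input(head_entity, first_prompt)
    logits = model(**enc).logits                          # (1, seq_len, vocab)
    pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    probs = logits[0, pos].softmax(dim=-1)
    top = probs.topk(top_k)
    return [(tokenizer.decode([int(i)]), float(p))
            for i, p in zip(top.indices, top.values)]

# e.g. predict_tail("aspirin", relation_prompts["treats"]["t_r1"])
# returns the top-k (token, probability) pairs for the masked tail entity.
```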
Based on the same inventive concept, as shown in FIG. 2, an embodiment further provides an open domain scientific knowledge discovery apparatus based on a pre-trained language model, comprising:
an external knowledge acquisition module for extracting triples from a scientific knowledge graph, classifying the triples by relation, and constructing a first prompt and a second prompt for each relation using external scientific knowledge;
an input template construction module for constructing an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask;
an input sample data construction module for taking all triples containing a target relation as sample data, filling the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and masking the tail entity to form input sample data;
a training module for constructing a separate pre-trained language model for each target relation, training the model on a mask task with the input sample data corresponding to that relation, and optimizing the embedded representations of the first and second prompts;
and an application module for predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
It should be noted that the division into functional modules in the apparatus provided by the above embodiment is only an example; in practice, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the terminal or server may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided by the above embodiment shares the same concept as the method embodiment; its specific implementation process is detailed in the method embodiment and is not repeated here.
The above embodiments describe the technical solutions and advantages of the present invention in detail. It should be understood that they are only preferred embodiments of the present invention and are not intended to limit the invention; any modifications, additions, or equivalent substitutions made within the principles of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. An open domain scientific knowledge discovery method based on a pre-trained language model, characterized by comprising the following steps:
extracting triples (head entity, relation, tail entity) from a scientific knowledge graph, and constructing a first prompt and a second prompt for each relation using external scientific knowledge;
constructing an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask;
taking all triples containing a target relation as sample data, filling the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and masking the tail entity to form input sample data;
constructing a separate pre-trained language model for each target relation, training the model on a mask task with the input sample data corresponding to that relation, and optimizing the embedded representations of the first and second prompts;
and predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
2. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 1, wherein the first and second prompts constructed from external scientific knowledge are related to the relation in the triple, the first prompt serving as discrete tokens in the input template and the second prompt serving as the initialization tokens of a continuous space vector in the input template.
3. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 1, wherein the first prompt corresponding to the target relation is filled into the input template in discrete form, as the discrete tokens of the target relation.
4. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 1, wherein the second prompt corresponding to the target relation is filled into the input template in the form of a continuous vector, as the initialization tokens of the continuous vector of the target relation.
5. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 4, wherein pre-trained embeddings are used to embed the tokens of the second prompt corresponding to the target relation, and the embedded representation serves as the initialization of the continuous vector of the target relation.
6. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 1, wherein the pre-trained language model is trained on a mask task with the input sample data corresponding to the target relation, the parameters of the pre-trained language model are kept fixed, and the negative log-likelihood of the input sample data is minimized with the following loss function and a gradient correction method to update the embedded representations of the first and second prompts:
$$\mathcal{L} = -\frac{1}{|D_r|} \sum_{(h,t)\in D_r} \log P([\mathrm{MASK}] = t \mid t_r(h))$$
where $\mathcal{L}$ denotes the loss function, $D_r$ denotes the set of head-entity/tail-entity pairs $(h, t)$ of the target relation $r$, $t_r(h)$ denotes the input sample data formed by filling a pair $(h, t)$ containing the target relation $r$ into the input template, and $P([\mathrm{MASK}] = t \mid t_r(h))$ denotes the probability, output by the pre-trained language model on input $t_r(h)$, that the masked position $[\mathrm{MASK}]$ equals the tail entity $t$.
7. The open domain scientific knowledge discovery method based on a pre-trained language model according to claim 1, wherein predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts and the pre-trained language model comprises:
taking the known entity in the triple as the head entity, and filling the head entity and the optimized embedded representations of the first and second prompts into the input template to form input sample data;
and inputting the input sample data into the pre-trained language model, which computes and outputs the prediction probability of the missing entity in the triple.
8. An open domain scientific knowledge discovery apparatus based on a pre-trained language model, characterized by comprising:
an external knowledge acquisition module for extracting triples (head entity, relation, tail entity) from a scientific knowledge graph and constructing a first prompt and a second prompt for each relation using external scientific knowledge;
an input template construction module for constructing an input template for the pre-trained language model, the input template comprising a head entity, a first prompt, a second prompt, and a tail entity mask;
an input sample data construction module for taking all triples containing a target relation as sample data, filling the head entity of each triple and the first and second prompts corresponding to the target relation into the input template, and masking the tail entity to form input sample data;
a training module for constructing a separate pre-trained language model for each target relation, training the model on a mask task with the input sample data corresponding to that relation, and optimizing the embedded representations of the first and second prompts;
and an application module for predicting the missing entities in triples containing the target relation using the optimized embedded representations of the first and second prompts together with the pre-trained language model, thereby realizing open domain scientific knowledge discovery.
CN202211392326.4A 2022-11-08 2022-11-08 Open domain scientific knowledge discovery method and device based on pre-training language model Pending CN115658921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211392326.4A CN115658921A (en) 2022-11-08 2022-11-08 Open domain scientific knowledge discovery method and device based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211392326.4A CN115658921A (en) 2022-11-08 2022-11-08 Open domain scientific knowledge discovery method and device based on pre-training language model

Publications (1)

Publication Number Publication Date
CN115658921A true CN115658921A (en) 2023-01-31

Family

ID=85016304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211392326.4A Pending CN115658921A (en) 2022-11-08 2022-11-08 Open domain scientific knowledge discovery method and device based on pre-training language model

Country Status (1)

Country Link
CN (1) CN115658921A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725223A (en) * 2023-11-20 2024-03-19 中国科学院成都文献情报中心 Knowledge discovery-oriented scientific experiment knowledge graph construction method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination