CN115658898A - Chinese and English book entity relation extraction method, system and equipment

Info

Publication number: CN115658898A
Authority: CN (China)
Prior art keywords: english, chinese, layer, sentence, input
Legal status: Pending
Application number: CN202211316489.4A
Other languages: Chinese (zh)
Inventors: 胡文斌, 许珈铭
Assignee: Wuhan University (WHU)
Application filed by Wuhan University; priority to CN202211316489.4A; publication of CN115658898A; legal status pending.

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a Chinese-English text entity relationship extraction method, system and device, in which entity extraction results are combined with a prompt fine-tuning template and input into a Chinese-English text entity relationship extraction network to extract entity relationships. The invention uses a large-scale language model with supervised training and is therefore fully applicable to large-scale datasets. The prompt templates designed by the invention differ in design concept from existing prompt templates because the invention puts more emphasis on the model's attention to entity information, whereas most existing prompt templates focus on a single task only; in other words, the proposed prompt templates remain feasible for monolingual information extraction. Meanwhile, no bilingual word embedding mapping is needed during data preprocessing for model training, which greatly reduces the mapping errors caused by grammatical differences between languages and alleviates the problem of low training-data quality. The method is highly general, low in cost and easy to update.

Description

Chinese and English book entity relation extraction method, system and equipment
Technical Field
The invention belongs to the technical fields of computer science, natural language processing and machine learning, and relates to a Chinese-English text entity relationship extraction method, system and device, in particular to a supervised Chinese-English text entity relationship extraction method, system and device based on deep learning and prompt fine-tuning.
Background
Relationship extraction is one of the basic tasks of information extraction. It aims to identify the target relationships of entities in text, is important for knowledge base construction and text understanding, directly benefits natural language processing tasks such as question answering and text comprehension, and, as a core step of knowledge graph construction, is indispensable. Relationship extraction models originated from feature-based statistical models, and research in recent years has focused on monolingual entity relationship extraction. With the rapid development of information technology, however, people obtain resource information through more and more channels, and the languages of the obtained information differ greatly. To help people obtain entity information from texts in different languages more quickly and clearly, the information in a source-language text is extracted and converted into a target language the reader can understand. Cross-language relationship extraction is therefore a very important task in information extraction: it helps people quickly acquire the relationships between entities in texts of different languages and thus speeds up the understanding of text content, and many web applications, such as information retrieval and web question answering and mining, also benefit from it.
However, as models are extended to more languages, it becomes difficult to obtain relationship annotations for entities, so most current cross-language entity relationship extraction methods concentrate on zero-shot learning, which simply trains a model on one language and then transfers it to another language for testing. For example, Wasi Uddin Ahmad, Nanyun Peng et al. added dependency information to graph convolutional networks using universal dependency parses of sentences and achieved good performance on the cross-language entity relationship extraction task on the ACE05 dataset, which demonstrates the feasibility of the task; the approach, however, is cumbersome in data processing and faces high overhead during model training.
In addition to the above method based on universal sentence dependency information and graph convolutional networks, Jian Ni and Radu Florian proposed a bilingual word embedding mapping method for cross-language entity relationship extraction, which maps word embeddings from the source language to the target language so that a model trained on the source language can conveniently be applied to the target language; this method, however, cannot be applied to high-resource scenarios and is limited to low-resource ones. Since both grammar and vocabulary differ greatly between languages, aligning and accurately mapping the vocabularies of different languages is also very difficult.
At present, more and more complex algorithms are designed for cross-language relationship extraction, from combinations of different neural network models (CNN, LSTM and Bi-LSTM) to combinations of different language models. Meanwhile, zero-shot learning is unsupervised, whereas relationship extraction is a classification task for which supervised learning is better suited; the difficulty is that there is no good supervised training dataset for cross-language relationship extraction on which to train and test models, and unsupervised learning, while it may work noticeably well on small-scale datasets, does not carry over well to large-scale ones. In summary, the main problems of current cross-language relationship extraction are: 1. the concentration on zero-shot learning prevents supervised training, so models cannot be trained and tested through learning and cannot adapt to large-scale training data; 2. there is no good supervised dataset on which to train models; 3. the preprocessing of sentences and the feature extraction for each language are too complicated and increase the time cost. New algorithms and techniques are therefore needed to solve these problems.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a supervised Chinese-English text entity relationship extraction method, system and device based on deep learning and prompt fine-tuning.
The technical scheme adopted by the method of the invention is as follows: a Chinese-English text entity relationship extraction method comprising the following steps:
Step 1: preprocess the sentence input by a user;
Step 2: perform entity extraction on the sentence preprocessed in step 1;
Step 3: combine the result of step 2 with a prompt fine-tuning template, input it into the Chinese-English text entity relationship extraction network, and extract the Chinese-English text entity relationships;
the Chinese-English text entity relation extraction network comprises a SentecPiece processing layer, an encoding layer, a decoding layer and a linear classification layer; generating a digital vector through a SentenPice processing layer, inputting the digital vector generated in English into an encoding layer to generate an encoding vector, and inputting the digital vector generated in Chinese and the encoding vector generated by the encoding layer into a decoding layer to obtain a final relation output type; or inputting the digit vector generated by English into a decoding layer to generate a coding vector, and inputting the digit vector generated by Chinese and the coding vector generated by the decoding layer into the coding layer to obtain a final relation output type; the Linear classification layer consists of a Linear layer and a Softmax layer and is used for classifying entity relations in sentences;
the prompt fine-tuning templates comprise the following (a construction sketch follows the template list):
Template 1: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English) and entity2 (English)</s>;
Decoding layer input: <s>What is the relationship between entity1 (Chinese) and entity2 (Chinese) in this sentence "sentence instance (Chinese)"?</s>;
Template 2: encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
Decoding layer input: <s>This sentence "sentence instance (Chinese)" includes the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>;
Template 3: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 4: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 5: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 6: encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
where [Seg_ment_ation] is a segmentation identifier used to separate sentence instances and entities; <s> is the sentence start identifier and </s> the sentence end identifier; entity1 (English) and entity2 (English) are the input English entities 1 and 2, and entity1 (Chinese) and entity2 (Chinese) the input Chinese entities 1 and 2; sentence instance (Chinese) is the whole sentence containing the Chinese entities, and sentence (English) is the whole sentence containing the English entities. The underlined parts of a template are filled with the input data and the non-underlined parts form the prompt; (Chinese) indicates that the input is Chinese and (English) indicates that the input is English.
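For concreteness, the following is a minimal sketch of how the encoder and decoder inputs of templates 1 and 4 could be assembled as strings; the function names are hypothetical and the Chinese wording of the template-1 decoder prompt is a reconstruction of the translated template.

```python
SEG = "[Seg_ment_ation]"

def build_template1(en_sentence, en_e1, en_e2, zh_sentence, zh_e1, zh_e2):
    """Template 1: hard prompt, human-readable wording in both languages."""
    enc = f'<s>The sentence: "{en_sentence}" includes {en_e1} and {en_e2}</s>'
    dec = f'<s>这句话"{zh_sentence}"中{zh_e1}和{zh_e2}之间是什么关系?</s>'
    return enc, dec

def build_template4(en_sentence, en_e1, en_e2, zh_e1, zh_e2):
    """Template 4: soft prompt, entities separated by the segmentation marker."""
    enc = f"<s>{en_sentence}{SEG}{en_e1}{SEG}{en_e2}</s>"
    dec = f"<s>{zh_e1}{SEG}{zh_e2}</s>"
    return enc, dec

# Example with the patent's "vessel manufactured by Xiaoming" illustration:
enc, dec = build_template4("The vessel was manufactured by Xiaoming",
                           "Xiaoming", "a vessel", "小明", "一艘船")
```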
The technical scheme adopted by the system of the invention is as follows: a Chinese-English text entity relationship extraction system comprising the following modules:
Module 1, used to preprocess the sentence input by a user;
Module 2, used to perform entity extraction on the sentence preprocessed by module 1;
Module 3, used to combine the result of module 2 with a prompt fine-tuning template, input it into the Chinese-English text entity relationship extraction network, and extract the Chinese-English text entity relationships;
The Chinese-English text entity relationship extraction network comprises a SentencePiece processing layer, an encoding layer, a decoding layer and a linear classification layer. The SentencePiece processing layer generates numeric vectors; the numeric vector generated from the English input is fed into the encoding layer to produce an encoding vector, and the numeric vector generated from the Chinese input, together with the encoding vector produced by the encoding layer, is fed into the decoding layer to obtain the final relationship output type; alternatively, the numeric vector generated from the English input is fed into the decoding layer and the numeric vector generated from the Chinese input, together with the vector produced by the decoding layer, is fed into the encoding layer to obtain the final relationship output type. The linear classification layer consists of a Linear layer and a Softmax layer and is used to classify the entity relationships in sentences;
the prompt fine-tuning templates comprise:
Template 1: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English) and entity2 (English)</s>;
Decoding layer input: <s>What is the relationship between entity1 (Chinese) and entity2 (Chinese) in this sentence "sentence instance (Chinese)"?</s>;
Template 2: encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
Decoding layer input: <s>This sentence "sentence instance (Chinese)" includes the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>;
Template 3: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 4: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 5: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 6: encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
where [Seg_ment_ation] is a segmentation identifier used to separate sentence instances and entities; <s> is the sentence start identifier and </s> the sentence end identifier; entity1 (English) and entity2 (English) are the input English entities 1 and 2, and entity1 (Chinese) and entity2 (Chinese) the input Chinese entities 1 and 2; sentence instance (Chinese) is the whole sentence containing the Chinese entities, and sentence (English) is the whole sentence containing the English entities. The underlined parts of a template are filled with the input data and the non-underlined parts form the prompt; (Chinese) indicates that the input is Chinese and (English) indicates that the input is English.
The technical scheme adopted by the device of the invention is as follows: a Chinese-English text entity relationship extraction device comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the Chinese-English text entity relationship extraction method described above.
The core innovation points of the invention comprise:
(1) Design of a Chinese-English parallel-corpus entity relationship extraction dataset and a data preprocessing technique.
Subordinate innovation point 1.1: screening of the parallel corpus dataset.
Because no relationship extraction dataset based on parallel corpora exists in the world at present, and the existing entity relationship extraction datasets are limited to a single language and cannot be applied well to cross-language entity relationship extraction, part of the data is extracted from the WMT17 English-Chinese parallel corpus for processing. To avoid the loss of data-processing accuracy caused by over-long sentences, this embodiment screens the sentences and only keeps those whose Chinese word count does not exceed 70. To further improve accuracy, if a full stop appears inside a selected sentence, the sentence is split at it into two sentences. The sentences screened in this way are collected for the next processing step.
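A minimal sketch of this screening step (the length threshold of 70 and the full-stop split are as stated above; measuring Chinese length in characters and pairing the split Chinese and English clauses one-to-one are assumptions made for illustration):

```python
import re

MAX_LEN = 70  # sentences longer than this are discarded

def screen_pair(zh: str, en: str) -> list[tuple[str, str]]:
    """Drop over-long pairs, then split at full stops followed by more text."""
    if len(zh) > MAX_LEN:  # character count for Chinese (assumption)
        return []
    zh_parts = [p for p in re.split(r"(?<=。)", zh) if p.strip()]
    en_parts = [p for p in re.split(r"(?<=\.)\s+", en) if p.strip()]
    # keep the split only when both sides break into the same number of clauses
    if len(zh_parts) == len(en_parts):
        return list(zip(zh_parts, en_parts))
    return [(zh, en)]
```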
Subordinate innovation point 1.2: creation of a Chinese-English text entity relationship extraction dataset from the parallel corpus.
The dataset screened in subordinate innovation point 1.1 only contains synonymous Chinese and English sentences; the entities in the sentences and the relationships between them are not labeled. This embodiment therefore uses the PredPatt predicate-argument extraction model to extract the entities in each sentence and the predicate connecting them. For example, for "The vessel was manufactured by Xiaoming", the extracted entities are "Xiaoming" and "a vessel" and the connecting predicate is "manufactured"; the predicate "manufactured" represents well that the relationship between "Xiaoming" and "a vessel" is "creator and creation". By extracting predicates to classify the sentences, this embodiment keeps the predicate types with more than 50000 sentences each, induces 12 relationship labels from the classified predicates, and selects the sentences carrying these predicates, obtaining 941000 parallel-corpus samples labeled with entities and entity relationships. This is the first Chinese-English text entity relationship extraction dataset built on the WMT17 English-Chinese parallel corpus and, notably, also the first large-scale dataset for cross-language relationship extraction.
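A sketch of the frequency-based predicate filtering described here (the Counter-based filtering is an illustration; inducing the 12 labels from the surviving predicates is a manual step in the patent):

```python
from collections import Counter

MIN_SENTENCES = 50000  # keep predicate types with more than 50000 sentences

def filter_frequent_predicates(samples):
    """samples: iterable of (zh_sentence, en_sentence, entity1, entity2,
    predicate) tuples obtained with PredPatt. Returns the frequent predicates
    and the samples that carry them."""
    counts = Counter(s[4] for s in samples)
    frequent = {p for p, c in counts.items() if c > MIN_SENTENCES}
    kept = [s for s in samples if s[4] in frequent]
    return frequent, kept
```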
Subordinate innovation point 1.3: the data preprocessing technique. When the model is to be used to predict a relationship type, the data is first segmented into sentences with the method of subordinate innovation point 1.1, entities are then extracted with the method of subordinate innovation point 1.2, and the preprocessed data is input into the model for relationship prediction.
(2): and creating a supervised Chinese-English book entity relationship extraction model.
Including the dependent innovation points 2.1: and designing a Chinese and English text entity relationship extraction model.
The embodiment designs a Chinese and English entity relation extraction model based on a cross-language large model, and is implemented by an end-to-end structure model, wherein the structure comprises an encoding layer (Encoder) and a decoding layer (Decoder). The method comprises the steps of firstly processing all preprocessed Chinese and English data through a Sentence Picture model to generate a digital vector, then inputting the digital vector generated by English into a model coding layer, and inputting the digital vector generated by Chinese into a model decoding layer for training. When the input of the coding layer is Chinese, the input of the decoding layer is English at the moment, and thus, a relation classification model is obtained in the embodiment.
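A minimal sketch of the SentencePiece step (the model file path is illustrative; in practice the special markers such as <s>, </s> and [Seg_ment_ation] would have to be present in the vocabulary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")

# English prompt string -> numeric vector for the encoding layer
en_ids = sp.encode("<s>The vessel was manufactured by Xiaoming"
                   "[Seg_ment_ation]Xiaoming[Seg_ment_ation]a vessel</s>")
# Chinese prompt string -> numeric vector for the decoding layer
zh_ids = sp.encode("<s>小明[Seg_ment_ation]一艘船</s>")
```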
Subordinate innovation point 2.2: supervised training of the Chinese-English entity relationship extraction model.
At present, cross-language entity relationship extraction cannot be trained in a supervised way because no parallel-corpus relationship extraction dataset exists, so most work relies on unsupervised fine-tuning; the Chinese-English entity relationship extraction model designed in this embodiment, however, can be trained in a supervised way on the dataset constructed in core innovation point 1. The relationship labels of that parallel-corpus entity relationship extraction dataset are used as the prediction targets, the model computes the loss function on every training batch so as to improve the accuracy of the prediction in the next iteration, and the iterations are repeated until model training is complete. The trained model is then used for testing.
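A minimal sketch of this supervised loop under stated assumptions: `model` is assumed to map the encoder and decoder token IDs to logits over the 12 relationship labels, and `loader` to yield batches of the dataset from core innovation point 1; both names are hypothetical.

```python
import torch
import torch.nn.functional as F

def train_supervised(model, loader, epochs: int = 10, lr: float = 5e-6):
    """Compute the loss on every batch and iterate, per innovation point 2.2."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for enc_ids, dec_ids, labels in loader:
            logits = model(enc_ids, dec_ids)        # (batch, 12) relation scores
            loss = F.cross_entropy(logits, labels)  # supervised objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```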
(3): and prompting the design of the fine tuning template.
Dependent innovation point 3.1: and (4) designing three prompting templates.
The Hard Prompt template (Hard-Prompt) is designed, and is a Prompt consisting of specific Chinese or English words and is a human readable Prompt, so that the Hard Prompt template is a template which needs to be designed and created by people through experience. The source data and the prompt template are combined and taken as training data to be input into the model for training, and the model is prompted to improve the training effect of the model through the comprehension capacity of the language model to sentences
The following hard prompt templates are designed in this embodiment:
Template 1. Encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English) and entity2 (English)</s>;
Decoding layer input: <s>What is the relationship between entity1 (Chinese) and entity2 (Chinese) in this sentence "sentence instance (Chinese)"?</s>;
Template 2. Encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
Decoding layer input: <s>This sentence "sentence instance (Chinese)" includes the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>
Design of the soft prompt template (Soft-Prompt). A soft prompt is optimized automatically in the model's vector space; such prompts act directly in the model's embedding space, so the template vocabulary is no longer limited to human-readable language and may consist of marker symbols or task-specific vectors. In this embodiment, specific symbols are designed and inserted into the original data to mark the different entities, forming soft prompts that are input into the model; the parameters are then optimized automatically according to these prompts during training, which improves the training effect. The design of a soft prompt also has to be created from human experience.
The following soft prompt templates are designed in this embodiment:
Template 3. Encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 4. Encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>
where [Seg_ment_ation] is a segmentation identifier whose purpose is to separate sentence instances and entities.
The soft-hard mixed prompt fine-tuning template inserts special markers between the different entities of the input data on top of the combination of the original data and the hard prompt template; it improves the language model's sentence understanding while also letting the language model optimize its parameters automatically in the vector space, improving model training.
The following soft-hard mixed prompt fine-tuning templates are designed in this embodiment (a sketch of registering the [Seg_ment_ation] marker with the tokenizer follows the template list):
Template 5. Encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 6. Encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>.
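For the [Seg_ment_ation] marker to survive tokenization as a single unit, it would have to be registered as a special token. A sketch using the Hugging Face mBART tokenizer (the checkpoint name is an assumption; the patent does not name one):

```python
from transformers import MBartTokenizer, MBartForConditionalGeneration

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
tokenizer.add_special_tokens({"additional_special_tokens": ["[Seg_ment_ation]"]})

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
model.resize_token_embeddings(len(tokenizer))  # grow embeddings for the new marker
```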
Subordinate innovation point 3.2: prompt fine-tuning of the whole model.
At present, prompt fine-tuning is basically applied only after model training is complete, when the model is put to downstream tasks; here, however, prompt fine-tuning is desired during model training as well, so the prompt templates of this embodiment are not only applied to the downstream task but are also input into the model as training data, fine-tuning the model while it trains.
Compared with the prior art: the invention uses a large-scale language model with supervised training and is fully applicable to large-scale datasets. The prompt templates designed by the invention differ in design concept from existing prompt templates because the invention puts more emphasis on the model's attention to entity information, whereas most existing prompt templates focus on a single task only; in other words, the proposed prompt templates remain feasible for monolingual information extraction. Meanwhile, no bilingual word embedding mapping is needed during data preprocessing for model training, which greatly reduces the mapping errors caused by grammatical differences between languages and alleviates the problem of low training-data quality. The method is highly general, low in cost and easy to update.
Drawings
FIG. 1 is a functional block diagram of the method according to an embodiment of the present invention;
FIG. 2 is a structure diagram of the Chinese-English text entity relationship extraction network according to an embodiment of the present invention;
FIG. 3 is a diagram of the encoding layer structure according to an embodiment of the present invention;
FIG. 4 is a diagram of the decoding layer structure according to an embodiment of the present invention;
FIG. 5 is a functional block diagram of data preprocessing and dataset construction according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of the training of the Chinese-English text entity relationship extraction network according to an embodiment of the present invention;
FIG. 7 is a diagram comparing the F1 values of the method of the embodiment of the present invention with those of different models.
Detailed Description
To facilitate the understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is described in further detail below with reference to the drawings and embodiments; it should be understood that the implementation examples described here serve only to illustrate and explain the present invention and are not to be construed as limiting it.
At present information technology develops rapidly: people obtain resource information through more and more channels, and the languages of the obtained information differ greatly. To help people obtain entity information from texts in different languages more quickly and clearly, the information in a source-language text is extracted and converted into a target language the reader can understand. Cross-language relationship extraction is a very important task in information extraction: it helps people quickly acquire the relationships between entities in texts of different languages and thus speeds up the understanding of text content, and many web applications, such as information retrieval and web question answering and mining, also benefit from it.
This embodiment proposes to train a large-scale pre-trained language model in combination with a set of designed, general prompt fine-tuning templates to improve the effect of cross-language relationship extraction. Since no parallel-corpus-based cross-language relationship dataset exists in the world at present, this embodiment also designs a method based on universal language dependency analysis and derives a parallel-corpus-based cross-language relationship dataset from an existing parallel corpus.
Referring to FIG. 1, the Chinese-English text entity relationship extraction method provided by the present invention comprises the following steps:
Step 1: preprocess the sentence input by a user;
This embodiment first discards input sentences whose word count exceeds a preset value; it then segments the sentences: when a full stop appears in a sentence and text still follows it, the sentence is split into several short sentences with the full stop as the cutting point.
Step 2: perform entity extraction on the sentence preprocessed in step 1;
This embodiment uses the PredPatt predicate-argument extraction model to identify the predicate-argument structure in a sentence.
The data extracted this way cannot be used directly, because the extracted entities and predicates are still quite ambiguous, so a linear form of the extraction has to be defined. This embodiment traverses the entity and predicate nodes extracted by PredPatt to obtain a new linear form and then marks the words of each node: p denotes a predicate token, a denotes an entity token, ph denotes a predicate head, and ah denotes an entity head. Parentheses are inserted at the beginning and end of each entity and square brackets at the beginning and end of each predicate. An example for reference: [(Chris: ah) wants: ph [(Chris: ah) build: ph (a: a boat: ah)]]. To recover the predicate-argument structure, it only needs to be rebuilt recursively from the outermost brackets; at each level of this linear structure the parentheses help recover the argument nodes, and the labels ah and ph identify the head tokens of entities and predicates respectively. This embodiment defines that if an automatically generated linearized PredPatt has mismatched parentheses or square brackets, or a predicate (or argument) has zero or multiple head labels, it is in an error format and cannot be used.
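A sketch of the stated well-formedness check (bracket matching plus exactly one head label per bracketed unit); the function is an illustration of the rule, not the authors' code:

```python
def is_valid_linearization(s: str) -> bool:
    """Brackets must match and every bracketed predicate or entity must
    carry exactly one head label (': ph' or ': ah')."""
    pairs = {")": "(", "]": "["}
    stack = []          # open-bracket characters
    head_counts = []    # head labels seen inside each open bracket
    for i, ch in enumerate(s):
        if ch in "([":
            stack.append(ch)
            head_counts.append(0)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False            # mismatched brackets
            if head_counts.pop() != 1:
                return False            # zero or multiple head labels
        if s.startswith((": ph", ": ah"), i) and head_counts:
            head_counts[-1] += 1        # head label in the innermost unit
    return not stack

# The example above parses cleanly:
assert is_valid_linearization(
    "[(Chris: ah) wants: ph [(Chris: ah) build: ph (a: a boat: ah)]]")
```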
Step 3: combine the result of step 2 with a prompt fine-tuning template, input it into the Chinese-English text entity relationship extraction network, and extract the Chinese-English text entity relationships;
Referring to FIG. 2, the Chinese-English entity relationship extraction network of this embodiment comprises a SentencePiece processing layer, an encoding layer, a decoding layer and a linear classification layer. The SentencePiece processing layer generates numeric vectors; the numeric vector generated from the English input is fed into the encoding layer to produce an encoding vector, and the numeric vector generated from the Chinese input, together with the encoding vector produced by the encoding layer, is fed into the decoding layer to obtain the final relationship output type; alternatively, the numeric vector generated from the English input is fed into the decoding layer and the numeric vector generated from the Chinese input, together with the vector produced by the decoding layer, is fed into the encoding layer to obtain the final relationship output type. The linear classification layer consists of a Linear layer and a Softmax layer and is used to classify the entity relationships in sentences.
The encoding layer of this embodiment consists of 12 identical transformer encoders stacked in series; the decoding layer likewise consists of 12 identical transformer decoders stacked in series, with a Cross-Attention module between each of them. A single transformer encoder contains a self-attention module, a first residual block, a feed-forward layer and a second residual block; a single transformer decoder contains a Masked Self-Attention module, a third residual block, a second self-attention module, a fourth residual block, a second feed-forward layer and a fifth residual block; the linear classification layer contains a Linear layer and a Softmax layer.
Referring to FIG. 3, in a single transformer encoder of the encoding layer of this embodiment, both the output of the self-attention module and the input data processed by the SentencePiece processing layer are connected to the first residual block, and the input data is also connected to the self-attention module; the feed-forward layer is connected to the first and second residual blocks, and the first residual block is connected to the second residual block. This forms a single transformer encoder, and the stacked, connected transformer encoders form the encoding layer of the network.
Referring to FIG. 4, in a single transformer decoder of the decoding layer of this embodiment, both the output of the Masked Self-Attention module and the input data processed by the SentencePiece processing layer are connected to the third residual block, and the input data processed by the SentencePiece processing layer is also connected to the Masked Self-Attention module; the final output of the encoding layer and the third residual block are connected to the second self-attention module, and the third residual block and the second self-attention module are connected to the fourth residual block; the fourth residual block is connected to the second feed-forward layer, and the fourth residual block and the second feed-forward layer are both connected to the fifth residual block. This forms a single transformer decoder; a Cross-Attention module is connected between the transformer decoders and is also connected to the encoder, which yields the decoding layer of the network.
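A minimal PyTorch sketch of one of the 12 encoder layers just described (self-attention, first residual block, feed-forward layer, second residual block); the dimensions and the LayerNorm placement are assumptions, since the patent does not fix them:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.attn(x, x, x)[0])  # first residual block
        return self.norm2(x + self.ffn(x))         # second residual block

encoder = nn.Sequential(*[EncoderLayer() for _ in range(12)])  # 12 stacked layers
```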
The prompt fine-tuning templates of this embodiment comprise:
Template 1: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English) and entity2 (English)</s>;
Decoding layer input: <s>What is the relationship between entity1 (Chinese) and entity2 (Chinese) in this sentence "sentence instance (Chinese)"?</s>;
Template 2: encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
Decoding layer input: <s>This sentence "sentence instance (Chinese)" includes the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>;
Template 3: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 4: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 5: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
Template 6: encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
where [Seg_ment_ation] is a segmentation identifier used to separate sentence instances and entities; <s> is the sentence start identifier and </s> the sentence end identifier; entity1 (English) and entity2 (English) are the input English entities 1 and 2, and entity1 (Chinese) and entity2 (Chinese) the input Chinese entities 1 and 2; sentence instance (Chinese) is the whole sentence containing the Chinese entities, and sentence (English) is the whole sentence containing the English entities. The underlined parts of a template are filled with the input data and the non-underlined parts form the prompt; (Chinese) indicates that the input is Chinese and (English) indicates that the input is English.
In this embodiment, the coded output is input to a decoding layer, and finally the final relationship type is output;
for the cross-language extraction task, the task is defined as D = [X, Y], where X is the set of input language pairs to which the prompt template has been added and Y is the set of output relationship types. Each input instance is x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n}, where e_i is an entity and 1 <= i <= n. The two entities used to predict the entity relationship are denoted e_s and e_o, where e_s is the subject entity and e_o the object entity; the cross-language relationship extraction task is to predict the relationship y ∈ Y between the subject entity e_s and the object entity e_o. Denote the Chinese-English text entity relationship extraction network by M. Adding "<s>" and "</s>" before and after x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n} gives "<s>x_prompt</s>", which is input into the encoding layer. The self-attention module of the encoding layer first computes the query vector Q_encoder (Query), the looked-up vector K_encoder (Key) and the weighting vector V_encoder (Value) of each word:

Q_encoder(Query) = x_prompt · W_Q
K_encoder(Key) = x_prompt · W_K
V_encoder(Value) = x_prompt · W_V

where W_Q, W_K and W_V are random matrices. The output of the self-attention module is then obtained as

x_attention = Softmax(Q_encoder · K_encoder^T / sqrt(d_k)) · V_encoder

where d_k is the dimension of the matrix and Softmax() denotes the normalized exponential function, whose values lie in (0, 1).

A residual connection is applied to x_attention to obtain

x_hidden = x_attention + x_prompt

which is activated through two layers of linear mapping with the GeLU activation function:

x_ffn = Linear(GeLU(Linear(x_hidden)))

where the two Linear() denote the two linear mappings; the formula can also be written as

x_ffn = W_2 · GeLU(W_1 · x_hidden + b_1) + b_2

where W_1 and W_2 are randomly generated matrices for x_hidden and b_1 and b_2 are randomly initialized bias values. A residual connection is applied to x_hidden again, obtaining

v = x_ffn + x_hidden.

This procedure is repeated through 12 layers to obtain the final v_encoder; then "<s>x_prompt</s>" and v_encoder are input into the decoding layer to obtain v_decoder.

A Masked Self-Attention module sits before the self-attention module of each decoding layer, and every decoding layer performs a Cross-Attention operation on the final-layer output of the encoding layer. The Masked Self-Attention module takes "<s>x_prompt</s>" as input and regenerates the data:

Q_decoder(Query) = x_prompt · W_Q
K_decoder(Key) = x_prompt · W_K

Part of the values of the matrix obtained by the dot product of Q_decoder(Query) and K_decoder(Key) are set to 0 or minus infinity so that the matrix becomes lower triangular; Softmax normalization is then applied to each row of this matrix to obtain the V_decoder required by the decoder self-attention module. In the self-attention module of the decoding layer, the Q and K matrices are generated from v_encoder, the final output of the encoding layer, while the V matrix is the output of the Masked Self-Attention module; the operation is then the same as in the encoding layer, except that the output of every layer in the decoding layer again performs a Cross-Attention operation with the encoding layer output v_encoder. If a layer in the decoding layer outputs o_i, where i denotes the i-th layer inside the decoding layer, then:

v_Cross-Attention = Softmax((W_Q · o_i)(W_K · v_encoder)^T) · W_V · v_encoder

where W_Q is a random matrix for o_i, and W_K and W_V are random matrices for v_encoder, used to generate v_Cross-Attention. v_Cross-Attention then becomes the input of the next layer in the decoding layer, and the operation is repeated until the final v_decoder is obtained after 12 layers. v_decoder then passes through the Linear layer and the Softmax layer to obtain the probability of the predicted relationship type y:

M(y): p(y | x_prompt) = Softmax(W · v_decoder + b);

where W denotes the randomly initialized matrix to be optimized and b is a random bias value.
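A sketch of the masked-attention step just described, with the future positions of Q·K^T forced to minus infinity so that the attention matrix is effectively lower triangular (tensor shapes are illustrative):

```python
import math
import torch

def masked_self_attention(x, W_Q, W_K, W_V):
    """x: (seq_len, d_model); W_*: (d_model, d_k) random matrices as in the text."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / math.sqrt(K.shape[-1])
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # lower-triangular scores
    return torch.softmax(scores, dim=-1) @ V          # per-row normalization
```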
Referring to FIG. 5, the Chinese-English text entity relationship extraction network of this embodiment is a trained network; its training process comprises:
(1) Construction of the Chinese-English parallel-corpus entity relationship extraction dataset;
In this embodiment, the WMT17 English-Chinese parallel corpus is obtained through the network for preprocessing. The long sentences of the corpus are segmented first: when a full stop appears in a sentence and text still follows it, the sentence is split into several short sentences with the full stop as the cutting point. If the word count of a sentence exceeds 70, the sentence is dropped, and the remaining sentences are processed further. Entities and predicates in the kept sentences are obtained with the PredPatt tool; the predicate types with more than 50000 sentences each are selected by extracting predicates to classify the sentences, and 12 relationship labels are then induced from the classified predicates, which covers most relationships in present-day society. In this way a relationship-type label is obtained for each parallel-corpus sentence. The sentences, the entities in the sentences and the relationship labels are packaged, and 941000 samples over the 12 relationship types are selected from the parallel corpus as the entity relationship extraction dataset of this embodiment, in which the sizes of the training/test/validation sets are 888000/21200/31800. Notably, this is already a large-scale dataset.
(2) The data in the training set is combined with a prompt fine-tuning template, passed through the SentencePiece processing layer to generate numeric vectors, and input into the Chinese-English text entity relationship extraction network for training, turning a machine translation model into a relationship classification model;
Referring to FIG. 6, this embodiment designs the Chinese-English text entity relationship extraction model on top of a cross-language large model, taking the mBART machine translation model with an end-to-end structure as the example. Since this embodiment regards translation as a cross-language task, a machine translation model can perform well on a parallel-corpus-based dataset and therefore fits the dataset designed here well. The end-to-end structure comprises an encoding layer (Encoder) and a decoding layer (Decoder). All preprocessed Chinese and English data is first processed by a SentencePiece model to generate numeric vectors; the numeric vectors generated from English are then input into the model's encoding layer and the numeric vectors generated from Chinese into the model's decoding layer for training. When the encoding layer input is Chinese, the decoding layer input is English; in this way this embodiment turns a machine translation model into a relationship classification model. For an instance X_l of a language l, the model acquires monolingual knowledge in the pre-training stage by optimizing the following objective:

F_PT = Σ_{l∈L} −log P(X_l | N(X_l));

where L denotes the language set of the model's built-in CC-25 (CC-25 is the language set that comes with the mBART model; it contains 25 different languages and is used in the pre-training stage), and N(·) is a noise function comprising sentence masking and sentence permutation. When this model is applied to the cross-language task and a sentence pair X is input, a language identifier has to be added to each input sentence: for example, when Chinese is input into the encoding layer, the language identifier "[ZH]" has to be added at the end of the sentence, and when English is input into the decoding layer, the language identifier "[EN]" has to be added at the beginning of the sentence. For the cross-language extraction task, this embodiment defines the task as D = [X, Y], where X is the set of input language pairs to which the prompt template has been added and Y is the set of output relationship types. Each input instance is x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n}; since an instance contains many entities there are many entity tags, but mainly two of its entities are used to predict the entity relationship, so the tags e_s and e_o mark them, where e_s is the subject entity and e_o the object entity; the remaining entity tags are irrelevant to the predicted relationship, because the cross-language relationship extraction task is to predict the relationship y ∈ Y between the subject entity e_s and the object entity e_o. Let the cross-language relationship extraction model be M. Adding "<s>" and "</s>" before and after x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n} gives "<s>x_prompt</s>", which is input into the encoding layer; the hidden vector sequence output by the encoding layer is v_encoder = {v_<s>, v_1, v_2, v_3, v_s, ..., v_o, ..., v_n, v_</s>}. Then "<s>x_prompt</s>" and v_encoder are input into the decoding layer to obtain v_decoder, and the following formula gives the conditional probability of the predicted relationship type y:

M(y): p(y | x_prompt) = Softmax(W · v_decoder + b);

where W is a random matrix and b a bias value. The subsequent fine-tuning stage also uses the prompt template to improve the task effect; denote the prompt template by T(·). T(·) specifies the extra input words and their number and positions, mapping an input x to a prompted input x_prompt = T(x). Since the model M is trained, it can predict the relationship type, so this embodiment obtains the following formula for the predicted relationship type:

p(y | x) = p(M(y) | x_prompt);
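A small sketch of the language-identifier convention just described ("[ZH]" appended to Chinese encoder input, "[EN]" prepended to English decoder input); how these markers are tokenized is left open by the patent:

```python
def tag_zh_for_encoder(zh_prompt: str) -> str:
    """Chinese into the encoding layer: identifier at the end of the sentence."""
    return f"<s>{zh_prompt}[ZH]</s>"

def tag_en_for_decoder(en_prompt: str) -> str:
    """English into the decoding layer: identifier at the beginning."""
    return f"<s>[EN]{en_prompt}</s>"
```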
The training set of the dataset used in this embodiment comprises 888000 Chinese-English parallel corpus pairs, the validation set 31800 pairs and the test set 21200 pairs, with 12 relationship labels.
This embodiment compares the following models in its experiments:
an LSTM model;
Bi-LSTM model;
an mBART classification model;
mBART classification model + hard prompt template 1;
mBART classification model + hard prompt template 2;
mBART classification model + soft prompt template 3;
mBART classification model + soft prompt template 4;
mBART classification model + soft-hard mixed prompt template 5;
mBART classification model + soft-hard mixed prompt template 6;
For ease of reading, "mBART classification model + ... template x" is abbreviated below, in the text and figures, as "mBART + template x"; for example, "mBART classification model + hard prompt template 1" is written "mBART + template 1".
Parameter settings: the activation function is the GeLU function, the parameters are initialized from N(0, 0.2), the number of encoding and decoding layers is 12, the learning rate is 5e-6, the number of epochs is 10 and the batch_size is 128.
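These settings in PyTorch form; applying the N(0, 0.2) initialization per linear module is a sketch (which modules the patent initializes this way is not specified), and the remaining values plug into the training loop sketched under innovation point 2.2:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Draw weights from N(0, 0.2) as stated above."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.2)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)  # then train with lr=5e-6, epochs=10, batch_size=128
```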
To verify the proposed method, this embodiment compares the F1 values of the different algorithms.
As can be seen from FIG. 7, the combinations of the mBART classification model with the prompt templates of this embodiment are much better than the mBART model alone: the best, mBART + template 4, achieves an F1 value of 65.4, and mBART + template 3 achieves an F1 value of 64.8. The soft prompts thus work best, while the other templates also improve the model's relationship extraction. Comparing LSTM and Bi-LSTM with the plain mBART model, the classification model rebuilt on mBART also achieves a clear effect on relationship extraction and is superior to these two widely used language models. A large-scale dataset is used in this embodiment, and the figure shows that the model still performs well when extracting relationships on a large-scale dataset.
As for the result in FIG. 6, it can be considered that the Chinese-English text entity relationship extraction dataset constructed with the data preprocessing method designed in this embodiment can serve as a training dataset, because the experiments on the different models show that it behaves consistently.
In this embodiment, the source language is English and the target language is Chinese. Aimed at the problems of current cross-language relationship extraction, namely: 1. the concentration on zero-shot learning prevents supervised training, so models cannot be trained and tested through learning and cannot adapt to large-scale training data; 2. there is no good supervised dataset on which to train models; 3. the preprocessing of sentences and the feature extraction for each language are too complicated and increase the time cost; this embodiment designs a Chinese-English text entity relationship extraction technique based on a cross-language large model, provides a large-scale supervised dataset for training, fills the gap that cross-language relationship extraction could not be trained at large scale and lacked a supervised dataset, and also provides a data preprocessing method. The embodiment further verifies the usability of the dataset through experiments, as well as the effectiveness of the proposed Chinese-English text relationship extraction technique.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A Chinese-English text entity relationship extraction method, characterized by comprising the following steps:
step 1: preprocessing a sentence input by a user;
step 2: performing entity extraction on the sentence preprocessed in step 1;
step 3: combining the result of step 2 with a prompt fine-tuning template, inputting it into the Chinese-English text entity relationship extraction network, and extracting the Chinese-English text entity relationships;
the Chinese-English text entity relationship extraction network comprises a SentencePiece processing layer, an encoding layer, a decoding layer and a linear classification layer; the SentencePiece processing layer generates numeric vectors; the numeric vector generated from the English input is fed into the encoding layer to produce an encoding vector, and the numeric vector generated from the Chinese input, together with the encoding vector produced by the encoding layer, is fed into the decoding layer to obtain the final relationship output type; or the numeric vector generated from the English input is fed into the decoding layer and the numeric vector generated from the Chinese input, together with the vector produced by the decoding layer, is fed into the encoding layer to obtain the final relationship output type; the linear classification layer consists of a Linear layer and a Softmax layer and is used to classify the entity relationships in sentences;
the prompt fine-tuning templates comprise:
template 1: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English) and entity2 (English)</s>;
decoding layer input: <s>What is the relationship between entity1 (Chinese) and entity2 (Chinese) in this sentence "sentence instance (Chinese)"?</s>;
template 2: encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
decoding layer input: <s>This sentence "sentence instance (Chinese)" includes the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>;
template 3: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
template 4: encoding layer input: <s>sentence (English)[Seg_ment_ation]entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
template 5: encoding layer input: <s>The sentence: "sentence (English)" includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
decoding layer input: <s>sentence instance (Chinese)[Seg_ment_ation]entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
template 6: encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Seg_ment_ation]entity2 (English)</s>;
decoding layer input: <s>entity1 (Chinese)[Seg_ment_ation]entity2 (Chinese)</s>;
wherein [Seg_ment_ation] is a segmentation identifier used to separate sentence instances and entities; <s> is the sentence start identifier and </s> the sentence end identifier; entity1 (English) and entity2 (English) are the input English entities 1 and 2, and entity1 (Chinese) and entity2 (Chinese) the input Chinese entities 1 and 2; sentence instance (Chinese) is the whole sentence containing the Chinese entities, and sentence (English) is the whole sentence containing the English entities; the underlined parts of a template are filled with the input data and the non-underlined parts form the prompt; (Chinese) indicates that the input is Chinese and (English) indicates that the input is English.
2. The Chinese-English text entity relationship extraction method according to claim 1, characterized in that: in step 1, the preprocessing discards sentences whose word count exceeds a preset value and then segments the sentences; when a full stop appears in a sentence and text still follows it, the sentence is split into several short sentences with the full stop as the cutting point.
3. The Chinese-English text entity relationship extraction method according to claim 1, characterized in that: in step 2, the PredPatt predicate-argument extraction model is used to identify the predicate-argument structure in the sentence;
first a linear form of the extraction is defined: the entity and predicate nodes extracted by PredPatt are traversed to obtain a new linear form, and the words of each node are then marked, where p denotes a predicate token, a denotes an entity token, ph denotes a predicate head, and ah denotes an entity head; parentheses are inserted at the beginning and end of each entity and square brackets at the beginning and end of each predicate; if an automatically generated linearized PredPatt has mismatched parentheses or square brackets, or a predicate has zero or multiple head labels, it is in an unusable error format.
4. The Chinese-English text entity relation extraction method according to claim 1, wherein: the data preprocessed in step 3 is input into the trained model, which finally outputs the final relation type;
For the cross-language extraction task, the task is defined as d = [X, Y], where X is the input language pair to which the prompt template has been added and Y is the output relation type. Each input instance is x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n}, where e_i is an entity and 1 ≤ i ≤ n; the two entities used to predict the entity relation are denoted e_s and e_o, where e_s is the subject entity and e_o is the object entity, and the cross-language relation extraction task is to predict the relation y ∈ Y between the subject entity e_s and the object entity e_o. Given the Chinese-English text entity relation extraction network M, "<s>" and "</s>" are added before and after x_prompt = {e_1, e_2, e_3, ..., e_s, ..., e_o, ..., e_n} to obtain "<s>x_prompt</s>", which is input to the encoding layer; the self-attention module of the encoding layer computes the query vector Q_encoder (Query), the key vector K_encoder (Key), and the value vector V_encoder (Value) of the words:
$$Q_{encoder}(\mathrm{Query}) = x_{prompt} W_Q$$
$$K_{encoder}(\mathrm{Key}) = x_{prompt} W_K$$
$$V_{encoder}(\mathrm{Value}) = x_{prompt} W_V$$
where W_Q, W_K, and W_V are randomly initialized matrices; the output of the self-attention module is then
$$v_{self} = \mathrm{Softmax}\left(\frac{Q_{encoder} K_{encoder}^{T}}{\sqrt{d_k}}\right) V_{encoder}$$
where d_k is the dimension of the matrix and Softmax(·) denotes the normalized exponential function, whose values lie in (0, 1).
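The projections and scaled dot-product above are the standard attention computation; a minimal PyTorch sketch (the tensor shapes are assumptions, not the patented implementation):

```python
import math
import torch

def encoder_self_attention(x_prompt: torch.Tensor,
                           W_Q: torch.Tensor,
                           W_K: torch.Tensor,
                           W_V: torch.Tensor) -> torch.Tensor:
    # x_prompt: (seq_len, d_model); W_Q/W_K/W_V: (d_model, d_k) random matrices
    Q = x_prompt @ W_Q
    K = x_prompt @ W_K
    V = x_prompt @ W_V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V           # Softmax(.) V
```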
The self-attention output v_self is then combined with the input through a residual connection to obtain
$$x_{hidden} = v_{self} + x_{prompt}$$
Activation is performed using two layers of linear mapping with the GeLU activation function:
$$x_{ffn} = \mathrm{Linear}_2\big(\mathrm{GeLU}(\mathrm{Linear}_1(x_{hidden}))\big)$$
where the two Linear(·) denote two linear maps; the formula above is further written as
$$x_{ffn} = W_2\,\mathrm{GeLU}(W_1 x_{hidden} + b_1) + b_2$$
where W_1 and W_2 are randomly generated matrices and b_1 and b_2 are different randomly initialized bias values. x_hidden is then combined with x_ffn through a second residual connection to obtain
$$v = x_{ffn} + x_{hidden}$$
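The residual wiring and the two-layer GeLU feed-forward step can be sketched as a PyTorch module (the hidden sizes are assumptions):

```python
import torch
import torch.nn as nn

class EncoderBlockTail(nn.Module):
    """Feed-forward part of one encoder layer: two linear maps with a GeLU
    in between, wrapped by the residual connections described above."""
    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2
        self.gelu = nn.GELU()

    def forward(self, v_self: torch.Tensor, x_prompt: torch.Tensor) -> torch.Tensor:
        x_hidden = v_self + x_prompt                              # first residual
        x_ffn = self.linear2(self.gelu(self.linear1(x_hidden)))  # W2·GeLU(W1·x+b1)+b2
        return x_ffn + x_hidden                                   # second residual
```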
The above procedure is repeated through 12 layers to obtain the final v_encoder; then "<s>x_prompt</s>" and v_encoder are input to the decoding layer to obtain v_decoder.
A Masked Self-Attention module is placed before the self-attention module of each decoding layer, and each decoding layer performs a Cross-Attention operation on the final output of the encoding layer. The Masked Self-Attention module takes <s>x_prompt</s> as input and regenerates
$$Q_{decoder}(\mathrm{Query}) = x_{prompt} W_Q$$
$$K_{decoder}(\mathrm{Key}) = x_{prompt} W_K$$
The values above the diagonal of the matrix obtained by the dot product of Q_decoder and K_decoder are set to 0 or negative infinity so that the matrix becomes a lower triangular matrix; each row of the matrix is then normalized with the Softmax function to obtain the V_decoder required by the self-attention module of the decoding layer. In the self-attention module inside the decoding layer, the Q and K matrices are generated from v_encoder, the final output of the encoding layer, while the V matrix is the output of the Masked Self-Attention module; the subsequent operations are the same as in the encoding layer, except that the output of each layer inside the decoding layer again performs a Cross-Attention operation with the encoder output v_encoder. If a layer inside the decoding layer outputs v_i, where i indexes the i-th layer inside the decoding layer, then:
$$v_{Cross\text{-}Attention} = \mathrm{Softmax}\big((W_Q v_i)(W_K v_{encoder})^{T}\big)\, W_V\, v_{encoder}$$
where W_Q is the random matrix applied to v_i, and W_K and W_V are the random matrices applied to v_encoder, used to generate v_Cross-Attention. v_Cross-Attention then becomes the input of the next layer in the decoding layer, and the operation is repeated through 12 layers until the final v_decoder is obtained; v_decoder is then passed through a Linear layer and a Softmax layer to obtain the probability of the predicted relation type y:
$$M(y):\ p(y \mid x_{prompt}) = \mathrm{Softmax}(W v_{decoder} + b)$$
where W denotes the randomly initialized matrix to be optimized and b is a random bias value.
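The decoder-side steps above can be sketched in the same style: the causal mask, a right-multiplication rendering of the cross-attention between the decoder state v_i and the encoder output, and the final Linear+Softmax head. The sqrt(d_k) scaling in the cross-attention and the number of relation types are assumptions carried over from the encoder convention:

```python
import math
import torch
import torch.nn as nn

def masked_self_attention(x, W_Q, W_K, W_V):
    # Upper-triangular positions are filled with -inf so that, after Softmax,
    # the attention matrix is effectively lower triangular (causal).
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

def cross_attention(v_i, v_encoder, W_Q, W_K, W_V):
    # Queries from the i-th decoder layer output; keys/values from v_encoder,
    # mirroring Softmax((W_Q v_i)(W_K v_encoder)^T) W_V v_encoder.
    Q, K, V = v_i @ W_Q, v_encoder @ W_K, v_encoder @ W_V
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    return torch.softmax(scores, dim=-1) @ V

class RelationHead(nn.Module):
    """Linear layer + Softmax producing p(y | x_prompt) from v_decoder."""
    def __init__(self, d_model: int = 1024, num_relations: int = 20):
        super().__init__()
        self.proj = nn.Linear(d_model, num_relations)  # W, b in the claim

    def forward(self, v_decoder: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.proj(v_decoder), dim=-1)
```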
5. The Chinese-English text entity relation extraction method according to claim 1, wherein: in step 1, the encoding layer is formed by connecting and stacking 12 identical Transformer encoders; each Transformer encoder comprises a self-attention module, a first residual block, a feed-forward layer, and a second residual block;
Both the self-attention module and the input data processed by the SentencePiece processing layer are connected to the first residual block, and the input data is also connected to the self-attention module; the feed-forward layer is connected to the first residual block and the second residual block, and the first residual block is connected to the second residual block.
6. The Chinese-English text entity relation extraction method according to claim 1, wherein: in step 1, the decoding layer is formed by connecting and stacking 12 identical Transformer decoders, with a Cross-Attention module between every two Transformer decoders; each Transformer decoder comprises a Masked Self-Attention module, a third residual block, a second self-attention module, a fourth residual block, a second feed-forward layer, and a fifth residual block;
Both the Masked Self-Attention module and the input data processed by the SentencePiece processing layer are connected to the third residual block, and the input data processed by the SentencePiece processing layer is also connected to the Masked Self-Attention module; the final output data of the encoding layer is connected to the third residual block and the second self-attention module, and the third residual block and the second self-attention module are both connected to the fourth residual block; the fourth residual block is connected to the second feed-forward layer, and both are connected to the fifth residual block.
7. The Chinese-English text entity relation extraction method according to any one of claims 1-6, wherein: the Chinese-English text entity relation extraction network is a trained Chinese-English text entity relation extraction network; the training process comprises the following steps:
(1) Constructing a Chinese and English parallel corpus entity relation extraction data set;
Extracting a plurality of data items from the WMT17 English-Chinese parallel corpus data set and processing them; the processing first discards sentences whose word count exceeds a preset value, and then splits the sentences: when a period appears inside a sentence and text still follows the period, the sentence is split at the period into several shorter sentences;
Entities and their connecting predicates are extracted from the sentences with the PredPatt predicate-argument extraction model; the sentences are classified by their extracted predicates so as to select the predicate types covering more than M sentences each; N relation labels are then induced from the classified predicates, and the sentences whose predicates carry the N relation labels are selected, yielding a parallel corpus data set with entities and entity relation labels, where M and N are preset values; the data in the parallel corpus data set are divided into a training set, a test set, and a validation set;
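A sketch of the predicate-based filtering step, with the threshold M as an assumed preset value:

```python
from collections import defaultdict

def filter_by_predicate(examples, M: int = 100):
    """Group (sentence, predicate) pairs by predicate type and keep only
    the types that cover more than M sentences, as in training step (1)."""
    by_pred = defaultdict(list)
    for sentence, predicate in examples:
        by_pred[predicate].append(sentence)
    return {pred: sents for pred, sents in by_pred.items() if len(sents) > M}
```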
(2) The data in the training set are combined with the prompt fine-tuning template, digital vectors are generated through the SentencePiece processing layer and input into the English text entity relation extraction network for training, turning a machine translation model into a relation classification model;
For an instance X_l of language l, the Chinese-English entity relation extraction network acquires monolingual knowledge in the pre-training stage by optimizing the following objective:
$$\mathcal{L}_{\theta} = \sum_{l \in L} \mathbb{E}_{x \sim X_l}\left[-\log P\big(x \mid N(x); \theta\big)\right]$$
where L denotes the language set contained in the CC-25 corpus used by the English text entity relation extraction network, and N(·) is a noise function comprising sentence masking and sentence permutation.
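One training step of this denoising objective might look like the following sketch, assuming `model` returns per-token logits and `noise` implements the sentence-masking/permutation function N(·) (both assumptions):

```python
import torch.nn.functional as F

def denoising_step(model, noise, x_ids):
    """Reconstruct the original token ids x from their noised version N(x),
    i.e. minimise -log P(x | N(x); theta) for one monolingual instance."""
    noisy_ids = noise(x_ids)              # N(x): mask + permute sentences
    logits = model(noisy_ids)             # (seq_len, vocab_size)
    return F.cross_entropy(logits, x_ids) # negative log-likelihood
```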
8. A Chinese-English book entity relation extraction system, characterized by comprising the following modules:
Module 1, used to preprocess the sentences input by a user;
Module 2, used to extract the entities from the sentences preprocessed by module 1;
Module 3, used to combine the result of module 2 with the prompt fine-tuning template and input it into the English text entity relation extraction network to perform English text entity relation extraction;
The Chinese-English text entity relation extraction network comprises a SentencePiece processing layer, an encoding layer, a decoding layer, and a linear classification layer; digital vectors are generated through the SentencePiece processing layer; the digital vector generated from the English input is fed into the encoding layer to produce an encoding vector, and the digital vector generated from the Chinese input, together with the encoding vector produced by the encoding layer, is fed into the decoding layer to obtain the final relation output type; alternatively, the digital vector generated from the English input is fed into the decoding layer to produce an encoding vector, and the digital vector generated from the Chinese input, together with the encoding vector produced by the decoding layer, is fed into the encoding layer to obtain the final relation output type; the Linear classification layer consists of a Linear layer and a Softmax layer and is used to classify the entity relations in the sentences;
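The SentencePiece processing layer described here can be driven with the open-source sentencepiece package; the model file name below is an assumption:

```python
import sentencepiece as spm

# Load a trained SentencePiece model (the file name is an assumption).
sp = spm.SentencePieceProcessor(model_file="zh_en_bilingual.model")

def to_digital_vectors(encoder_text: str, decoder_text: str):
    """Turn the English encoder prompt and the Chinese decoder prompt into
    the id vectors that feed the encoding and decoding layers."""
    return sp.encode(encoder_text), sp.encode(decoder_text)
```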
the prompt fine-tuning template comprises:
Template 1: Encoding layer input: <s>The sentence:"sentence (English)" includes entity1 (English) and entity2 (English)</s>;
Decoding layer input: <s>In this sentence "sentence instance (Chinese)", what is the relationship between entity1 (Chinese) and entity2 (Chinese)?</s>;
Template 2: Encoding layer input: <s>What is the type of relationship between entity1 (English) and entity2 (English)?</s>;
Decoding layer input: <s>This sentence "sentence instance (Chinese)" contains the two entities "entity1 (Chinese)" and "entity2 (Chinese)"</s>;
Template 3: Encoding layer input: <s>sentence (English)[Segmentation]entity1 (English)[Segmentation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Segmentation]entity1 (Chinese)[Segmentation]entity2 (Chinese)</s>;
Template 4: Encoding layer input: <s>sentence (English)[Segmentation]entity1 (English)[Segmentation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Segmentation]entity2 (Chinese)</s>;
Template 5: Encoding layer input: <s>The sentence:"sentence (English)" includes entity1 (English)[Segmentation]entity2 (English)</s>;
Decoding layer input: <s>sentence instance (Chinese)[Segmentation]entity1 (Chinese)[Segmentation]entity2 (Chinese)</s>;
Template 6: Encoding layer input: <s>The sentence of sentence (English) includes entity1 (English)[Segmentation]entity2 (English)</s>;
Decoding layer input: <s>entity1 (Chinese)[Segmentation]entity2 (Chinese)</s>;
Wherein [Segmentation] is a segmentation identifier used to separate the sentence instance from the entities; <s> is the sentence start mark and </s> is the sentence end mark; entity1 (English) and entity2 (English) are the input English entities; entity1 (Chinese) and entity2 (Chinese) are the input Chinese entities; sentence instance (Chinese) is the whole Chinese sentence containing the Chinese entities, and sentence (English) is the whole English sentence containing the English entities; the underlined parts of the template carry the input data, the non-underlined parts are the fixed prompt text, (Chinese) marks Chinese input, and (English) marks English input.
9. An apparatus for extracting a Chinese-English book entity relationship, comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the Chinese-English entity relation extraction method according to any one of claims 1 to 7.
CN202211316489.4A 2022-10-26 2022-10-26 Chinese and English book entity relation extraction method, system and equipment Pending CN115658898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211316489.4A CN115658898A (en) 2022-10-26 2022-10-26 Chinese and English book entity relation extraction method, system and equipment

Publications (1)

Publication Number Publication Date
CN115658898A true CN115658898A (en) 2023-01-31

Family

ID=84992386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211316489.4A Pending CN115658898A (en) 2022-10-26 2022-10-26 Chinese and English book entity relation extraction method, system and equipment

Country Status (1)

Country Link
CN (1) CN115658898A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562275A (en) * 2023-06-09 2023-08-08 创意信息技术股份有限公司 Automatic text summarization method combined with entity attribute diagram
CN116562275B (en) * 2023-06-09 2023-09-15 创意信息技术股份有限公司 Automatic text summarization method combined with entity attribute diagram
CN117114910A (en) * 2023-09-22 2023-11-24 浙江河马管家网络科技有限公司 Automatic ticket business accounting system and method based on machine learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination