CN115169326A - Chinese relation extraction method, device, terminal and storage medium - Google Patents

Chinese relation extraction method, device, terminal and storage medium

Info

Publication number
CN115169326A
CN115169326A (application CN202210392477.3A)
Authority
CN
China
Prior art keywords
character
feature
sentence
model
splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210392477.3A
Other languages
Chinese (zh)
Inventor
李龙
张煇
梁力伟
王恩慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Changhe Technology Co ltd
Original Assignee
Shanxi Changhe Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Changhe Technology Co ltd filed Critical Shanxi Changhe Technology Co ltd
Priority to CN202210392477.3A
Publication of CN115169326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a Chinese relation extraction method, device, terminal and storage medium, wherein the method comprises the following steps: acquiring the character representation and all potential words of a sentence to be processed; obtaining a first feature from the character representation and all the potential words through a multi-granularity lattice model; extracting a second feature from the sentence through a Bert model; splicing the first feature and the second feature to obtain a spliced feature; and inputting the spliced feature into a softmax classifier to predict the entity relations in the sentence. The method combines Bert with the multi-granularity lattice model: it not only uses Bert to generate the character vectors of the sentence, but also fuses the multiple senses of each word into the character representation for encoding, which better resolves the ambiguity of Chinese polysemous words. Experimental results show that the model of the invention achieves a better effect on the Chinese relation extraction task.

Description

Chinese relation extraction method, device, terminal and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese relation extraction method, a Chinese relation extraction device, a Chinese relation extraction terminal and a storage medium.
Background
Relation extraction is one of the subtasks of information extraction and occupies a very important position: it aims to extract the relations between entity pairs from redundant, multi-source and scattered texts so as to form structured entity-relation triples. Relation extraction has wide application value in many downstream tasks, such as knowledge-graph construction and relational question-answering systems. For example, extracting the relations among persons and combining them with person entities yields a person knowledge graph, which can support large-scale knowledge mining and reasoning services across family names. Likewise, extracting the relations between tourist attractions and culture yields a cultural knowledge graph, providing a foundation for a cultural-tourism question-answering system.
Because relation extraction plays an important role in the field of natural language processing, it has attracted extensive attention from scholars. Liu et al. first proposed a CNN (convolutional neural network) method to automatically extract sentence features, avoiding the error-propagation problem caused by feature engineering, with an F1 value of 59.42. Zeng et al. blended an embedded representation of position information into the CNN network and obtained the most important features in the sentence through max pooling. On this basis, Zeng et al. further proposed the piecewise convolutional neural network (PCNN) method, which divides the convolution result into three segments according to the positions of the two given entities and designs a piecewise max-pooling layer to replace the single max-pooling layer, thereby capturing structural and other latent information. However, the PCNN model faces a sentence-selection problem, so Lin et al. applied an attention mechanism to all instances in a bag, raising the F1 value to 60.55. Because a CNN network cannot capture long-distance sentence features, Zhang et al. first tried an RNN (recurrent neural network) method, whose memory advantage shows when modeling long texts, with an F1 value of 61.04; Zhou et al. in turn introduced an attention mechanism into the RNN model, with an F1 value of 59.48.
Although the above studies improve the accuracy of Chinese relation extraction to some extent, some problems remain, including:
(1) Word-based relation extraction models depend heavily on the word segmentation result: the more accurate the segmentation, the better the effect; otherwise an error-propagation problem arises. For example, for the sentence "达尔文研究所有杜鹃" ("Darwin studied all cuckoos"), if it is correctly segmented as "达尔文/研究/所有/杜鹃" (Darwin/studied/all/cuckoos), the entities "Darwin" and "cuckoo" obtain the correct relation "study"; but if it is segmented as "达尔文研究所/有/杜鹃" ("Darwin Institute/has/cuckoos"), the entities are wrongly labeled with the "belonging" relation.
(2) Although character-based relation extraction methods are not affected by the segmentation result, they cannot capture word-level information and cannot resolve the ambiguity of polysemous words. For example, in the sentence above, "杜鹃" has two senses, the azalea flower and the cuckoo bird; for character-based relation extraction, its real sense is difficult to distinguish without extra knowledge.
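The segmentation ambiguity described above is why the method feeds all dictionary-matched potential words into the model rather than committing to one segmentation. A minimal sketch of such matching (the toy lexicon and the helper name `match_potential_words` are illustrative assumptions, not part of the filing):

```python
# Enumerate every substring of a sentence that matches a dictionary entry.
# Keeping ALL matches (the lattice idea) sidesteps choosing between
# segmentations such as 达尔文/研究/所有/杜鹃 and 达尔文研究所/有/杜鹃.
def match_potential_words(sentence, lexicon):
    matches = []
    n = len(sentence)
    for b in range(n):
        for e in range(b + 1, n):  # potential words have >= 2 characters
            if sentence[b:e + 1] in lexicon:
                matches.append((b, e, sentence[b:e + 1]))
    return matches

lexicon = {"达尔文", "研究", "研究所", "所有", "杜鹃"}  # toy dictionary
sentence = "达尔文研究所有杜鹃"
words = match_potential_words(sentence, lexicon)
# Both 研究 and 研究所 (and 所有) survive as candidates for the model.
```

Note that the conflicting words 研究, 研究所 and 所有 all remain in the output; disambiguation is deferred to the lattice encoder.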
Thus, there is a need for a better solution to the problems of the prior art.
Disclosure of Invention
In view of this, the present invention provides a method, an apparatus, a terminal and a storage medium for extracting a chinese relation, so as to overcome the problems in the prior art.
Specifically, the present invention proposes the following specific examples:
the embodiment of the invention provides a Chinese relation extraction method, which comprises the following steps:
acquiring character representation and all potential words of a sentence to be processed;
obtaining a first feature based on the character representation and all of the potential words through a multi-granularity lattice model;
extracting a second feature in the sentence through a Bert model;
splicing the first feature and the second feature to obtain a spliced feature;
inputting the splicing features into a softmax classifier, and predicting entity relations in the sentences.
In a specific embodiment, the character representation is obtained by splicing character embedding and position embedding;
the character embedding is obtained by representing the sentence as a plurality of characters and mapping each character through a Skip-gram model;
the position embedding is the relative distance from each character to a preset head-tail entity.
In a specific embodiment, the potential word is obtained by converting a character string in the sentence through word2vec integrated with an external Chinese database.
In a particular embodiment, the multi-granularity lattice model includes an LSTM model.
In a specific embodiment, the splicing is performed based on the following formulas:

H = h ⊕ Vec;

H* = tanh(H);

α = softmax(w^T H*);

h* = Hα^T;

wherein h is the first feature and Vec is the second feature; ⊕ denotes concatenation; w is a trainable parameter; T denotes vector transposition; and h* is the spliced feature.
The embodiment of the present invention further provides a chinese relation extracting apparatus, including:
the acquisition module is used for acquiring character representation and all potential words of the sentence to be processed;
a first feature module for deriving a first feature based on the character representation and all of the potential words through a multi-granularity lattice model;
the second characteristic module is used for extracting a second characteristic in the sentence through a Bert model;
the splicing module is used for splicing the first characteristic and the second characteristic to obtain a splicing characteristic;
and the prediction module is used for inputting the spliced features into a softmax classifier and predicting the entity relations in the sentence.
In a specific embodiment, the character representation is obtained by splicing character embedding and position embedding;
the character embedding is obtained by representing the sentence as a plurality of characters and mapping each character through a Skip-gram model;
the position embedding is the relative distance from each character to a preset head-tail entity.
The embodiment of the present invention further provides a terminal, which includes a processor and a memory, where the memory stores a computer program, and the processor implements the method when executing the computer program.
An embodiment of the present invention further provides a storage medium, where a computer program is stored, and when the computer program is executed, the method described above is implemented.
Therefore, the embodiment of the invention provides a Chinese relation extraction method, device, terminal and storage medium, wherein the method comprises the following steps: acquiring the character representation and all potential words of a sentence to be processed; obtaining a first feature from the character representation and all the potential words through a multi-granularity lattice model; extracting a second feature from the sentence through a Bert model; splicing the first feature and the second feature to obtain a spliced feature; and inputting the spliced feature into a softmax classifier to predict the entity relations in the sentence. The scheme combines Bert with the multi-granularity lattice model: it not only uses Bert to generate the character vectors of the sentence, but also fuses the multiple senses of each word into the character representation for encoding, which better resolves the ambiguity of Chinese polysemous words; and experimental results show that the model of the invention achieves a better effect on the Chinese relation extraction task.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart of a Chinese relationship extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a Chinese relation extraction model combining Bert and a multi-granularity lattice network in a Chinese relation extraction method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a chinese relation extracting apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having", and their derivatives, which may be used in various embodiments of the present invention, are only intended to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the existence of, or adding to, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as terms defined in a commonly used dictionary) will be construed to have the same meaning as the contextual meaning in the related art and will not be construed to have an idealized or overly formal meaning unless expressly so defined in various embodiments of the present invention.
Example 1
The embodiment 1 of the invention discloses a Chinese relation extraction method, as shown in figure 1, comprising the following steps:
step 101, acquiring character representation and all potential words of a sentence to be processed;
specifically, the character representation is obtained by splicing character embedding and position embedding; the character embedding is obtained by representing the statement into a plurality of characters and mapping each character through a Skip gram model; the position embedding is the relative distance from each character to a preset head-tail entity.
In particular, first, the sentence s is represented as a sequence of M characters, s = {c_1, c_2, …, c_M}. Each character c_i is mapped by the Skip-gram model to a d_c-dimensional character embedding, denoted x_i^c.

In addition, position embeddings are used to specify the entity pair: the position embedding of each character is expressed as the relative distance of the current character to the head entity and to the tail entity, denoted p_i^1 and p_i^2 respectively.

The vector representation x_i of character c_i is obtained by splicing its character embedding and position embeddings:

x_i = [x_i^c ; p_i^1 ; p_i^2]

The character representation of the final sentence is then obtained as X = {x_1, x_2, …, x_M}.
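The construction of the character representation can be sketched as follows. The embedding tables here are random stand-ins for the trained Skip-gram and position embeddings, and all dimensions and entity offsets are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d_c, d_p = 9, 100, 5   # sentence length, char-embedding dim, position-embedding dim
head, tail = 0, 7          # character offsets of the head and tail entities

char_emb = rng.normal(size=(M, d_c))   # stand-in for Skip-gram lookups x_i^c
max_dist = 2 * M
pos_table = rng.normal(size=(2 * max_dist + 1, d_p))  # one row per relative distance

def position_embedding(i, entity_pos):
    # relative distance of character i to an entity, shifted to a valid row index
    return pos_table[(i - entity_pos) + max_dist]

# x_i = [x_i^c ; p_i^1 ; p_i^2]: splice character and position embeddings
X = np.stack([
    np.concatenate([char_emb[i],
                    position_embedding(i, head),
                    position_embedding(i, tail)])
    for i in range(M)
])
# X has shape (M, d_c + 2*d_p): the character representation of the sentence
```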
In a specific embodiment, the potential word is obtained by converting a character string in the sentence through word2vec integrated with an external Chinese database.
In particular, to capture word-level features, the information of all potential words in the sentence also needs to be input. Here, a potential word is any character subsequence that matches a word in a dictionary built over large segmented raw text, with w_{b,e} denoting the subsequence starting at the b-th character and ending at the e-th character. If word2vec is used to convert the word w_{b,e} directly into a real-valued vector x^w_{b,e}, it can only be mapped to one embedding, ignoring the fact that many words have multiple word senses. An external knowledge base is therefore integrated into the model (as shown in fig. 2) to represent word senses. The sense set of w_{b,e} is denoted Sense(w_{b,e}), and each word sense sen^k_{b,e} therein is converted into a real-valued vector x^{sen_k}_{b,e}. Finally, the word w_{b,e} is expressed as the set of sense vectors {x^{sen_1}_{b,e}, …, x^{sen_K}_{b,e}}.
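A sketch of the sense lookup described above. The toy sense inventory stands in for the external knowledge base, and the random vectors stand in for trained sense embeddings; all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_w = 200  # sense embedding dimension

# Stand-in for Sense(w_{b,e}): each word maps to one vector PER sense,
# instead of a single word2vec embedding that conflates all senses.
sense_inventory = {
    "杜鹃": ["cuckoo_bird", "azalea_flower"],  # polysemous word
    "研究": ["to_study"],
}
sense_vectors = {word: [rng.normal(size=d_w) for _ in senses]
                 for word, senses in sense_inventory.items()}

def word_representation(word):
    # a potential word w_{b,e} is represented by its SET of sense vectors
    return sense_vectors[word]

reps = word_representation("杜鹃")
# the ambiguous 杜鹃 keeps two distinct vectors, one per sense
```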
Step 102, obtaining a first feature based on the character representation and all the potential words through a multi-granularity lattice model;
Specifically, the multi-granularity lattice model includes an LSTM model.
In particular, the encoder combines external knowledge with word-sense disambiguation, using a multi-granularity lattice LSTM network to construct a distributed representation of each input instance. The direct inputs to the encoder are the character sequence and all potential words in the lexicon. After training, the output of the encoder is the sequence of hidden-state vectors of the input sentence.
First, the LSTM unit consists of basic gate structures: the input gate i_j controls which information enters the unit; the output gate o_j controls which information is output from the unit; and the forget gate f_j controls which information in the unit is deleted. The gates and the candidate cell have weight matrices W (including W_i, W_o, W_f, W_c) and U (including U_i, U_o, U_f, U_c), with b_i, b_o, b_f and b_c representing their bias vectors. σ(·) represents the sigmoid function, and the current cell state c_j records all historical information flows up to the current time. Thus, the character-based LSTM functions include:

i_j = σ(W_i x_j^c + U_i h_{j-1}^c + b_i)
o_j = σ(W_o x_j^c + U_o h_{j-1}^c + b_o)
f_j = σ(W_f x_j^c + U_f h_{j-1}^c + b_f)
c̃_j = tanh(W_c x_j^c + U_c h_{j-1}^c + b_c)    (1)

c_j^c = f_j ⊙ c_{j-1}^c + i_j ⊙ c̃_j    (2)

h_j^c = o_j ⊙ tanh(c_j^c)    (3)

wherein x_j^c represents the j-th character vector in the sentence, h_{j-1}^c represents the hidden-state vector at the previous time step, and ⊙ denotes element-wise multiplication.
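One character-level LSTM step, equations (1) to (3), can be sketched in numpy as below; the weights are random placeholders and the dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 110, 200  # character-vector and hidden dimensions

# One W/U/b set per gate plus the candidate cell, as in the text:
# W_i, W_o, W_f, W_c act on the input; U_i, U_o, U_f, U_c on the previous hidden state.
W = {g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "iofc"}
U = {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "iofc"}
b = {g: np.zeros(d_h) for g in "iofc"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    c_tilde = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])
    c = f * c_prev + i * c_tilde   # equation (2)
    h = o * np.tanh(c)             # equation (3)
    return h, c

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h))
```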
For each word w_{b,e} matching the lexicon, the k-th sense of the word is represented by the sense embedding x^{sen_k}_{b,e}. All sense representations are computed; the memory unit of the k-th sense of the word w_{b,e} is calculated as:

i^{sen_k}_{b,e} = σ(W_i x^{sen_k}_{b,e} + U_i h_b^c + b_i)
f^{sen_k}_{b,e} = σ(W_f x^{sen_k}_{b,e} + U_f h_b^c + b_f)
c̃^{sen_k}_{b,e} = tanh(W_c x^{sen_k}_{b,e} + U_c h_b^c + b_c)
c^{sen_k}_{b,e} = f^{sen_k}_{b,e} ⊙ c_b^c + i^{sen_k}_{b,e} ⊙ c̃^{sen_k}_{b,e}

wherein i^{sen_k}_{b,e} and f^{sen_k}_{b,e} are the word-level input and forget gates, and c^{sen_k}_{b,e} represents the memory unit of the k-th sense of the word w_{b,e}. All senses are then merged into a comprehensive representation c^w_{b,e} to calculate the memory cell of w_{b,e}:

c^w_{b,e} = (1/K) Σ_{k=1}^{K} c^{sen_k}_{b,e}

Since the cell states of all word senses are incorporated into the word representation c^w_{b,e}, ambiguous words can be better represented. Let D denote the matched lexicon, and let W_e = {b : w_{b,e} ∈ D} denote all words matching D that end with the e-th character. All recurrent paths then flow into character e to obtain the current cell c_e^c:

g_{b,e} = σ(W_l [x_e^c ; c^w_{b,e}] + b_l)
α_{b,e} = exp(g_{b,e}) / (exp(i_e) + Σ_{b'∈W_e} exp(g_{b',e}))
α_e = exp(i_e) / (exp(i_e) + Σ_{b'∈W_e} exp(g_{b',e}))
c_e^c = Σ_{b∈W_e} α_{b,e} ⊙ c^w_{b,e} + α_e ⊙ c̃_e

wherein g_{b,e} is an additional gate controlling the contribution of word w_{b,e}, and the α terms are the normalization (regularization) terms. Finally, the final hidden-state vector of each character in the sequence is calculated using equation (3):

h_e^c = o_e ⊙ tanh(c_e^c)
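The flow of word cells into a character cell can be sketched as below: gate each incoming word cell against the character input, normalize the gates together with the character's own input gate, and take the weighted sum. The exact gating in the filing is obscured by the figure placeholders, so the helper `merge_into_character_cell` and all weights are a reconstruction under standard lattice-LSTM assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, d_in = 4, 4  # tiny illustrative dimensions

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in linear gate on [x_e^c ; c^w_{b,e}], one per incoming word cell.
W_l = rng.normal(scale=0.1, size=(d_h, d_in + d_h))
b_l = np.zeros(d_h)

def merge_into_character_cell(x_e, c_tilde_e, i_e, word_cells):
    # word_cells: merged sense cells c^w_{b,e} of every lexicon word ending at e
    gates = [sigmoid(W_l @ np.concatenate([x_e, c_w]) + b_l) for c_w in word_cells]
    # normalize the word gates together with the character's own input gate i_e
    exp_all = np.exp(np.stack(gates + [i_e]))
    alphas = exp_all / exp_all.sum(axis=0)
    c_e = sum(a * c_w for a, c_w in zip(alphas[:-1], word_cells))
    c_e = c_e + alphas[-1] * c_tilde_e   # character path keeps its own share
    return c_e

c_e = merge_into_character_cell(
    rng.normal(size=d_in),               # character input x_e^c
    rng.normal(size=d_h),                # candidate cell c̃_e
    sigmoid(rng.normal(size=d_h)),       # character input gate i_e
    [rng.normal(size=d_h), rng.normal(size=d_h)],  # two words end at e
)
```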
Step 103, extracting a second feature in the sentence through a Bert model;
in order to better solve the problem of Chinese word ambiguity, a Bert Model is introduced to extract features from the whole input Sentence, firstly, the Bert Model uses a Masked Language Model (MLM) and a Next Sequence Prediction (NSP) as new training tasks; second, a large amount of data and computational power is used to meet the training strength of Bert. Therefore, the characteristics of the input Chinese sentence s can be better extracted by using Bert: vec = Bert(s).
Step 104, splicing the first feature and the second feature to obtain a spliced feature;
Specifically, after learning the hidden states of an instance, h and Vec are combined by concatenation:

H = h ⊕ Vec    (11)

After the final hidden state is obtained, a character-level attention mechanism is used to merge the features:

H* = tanh(H)    (12)

α = softmax(w^T H*)    (13)

h* = Hα^T    (14)

wherein w is a trainable parameter, T represents vector transposition, and h* is the spliced feature.
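Equations (11) to (14) can be sketched in numpy as follows. The assumption here is that h (lattice hidden states) and Vec (Bert features) are per-character matrices concatenated along the feature axis; the shapes and the random w are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
M, d1, d2 = 9, 200, 768  # characters, lattice hidden dim, Bert hidden dim

h = rng.normal(size=(d1, M))    # first feature: lattice-LSTM hidden states
Vec = rng.normal(size=(d2, M))  # second feature: Bert character features

H = np.concatenate([h, Vec], axis=0)   # (11) H = h ⊕ Vec
H_star = np.tanh(H)                    # (12)
w = rng.normal(size=d1 + d2)           # trainable in the real model
scores = w @ H_star                    # w^T H*
alpha = np.exp(scores) / np.exp(scores).sum()  # (13) softmax over characters
h_star = H @ alpha                     # (14) h* = H α^T, the spliced feature
```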
Step 105, inputting the spliced feature into a softmax classifier and predicting the entity relations in the sentence.
Specifically, h* is input into the softmax classifier to predict the relation:

o = Wh* + b    (15)

p(y|s) = softmax(o)    (16)

For all training examples (S_i, y_i), cross entropy is used to define the objective function:

J(θ) = Σ_{i=1}^{N} log p(y_i | S_i, θ)    (17)
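Equations (15) to (17) as a numpy sketch. The 10 classes follow the SanWen label set; the weights, the batch, and the helper names are illustrative placeholders (maximizing J here is equivalent to minimizing the cross-entropy loss):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_rel = 968, 10  # spliced-feature dim, relation classes

W = rng.normal(scale=0.01, size=(n_rel, d))
b = np.zeros(n_rel)

def predict(h_star):
    o = W @ h_star + b        # (15) linear scores
    p = np.exp(o - o.max())   # numerically stable softmax
    return p / p.sum()        # (16) distribution over relations

def objective(batch):
    # (17) log-likelihood over training examples (S_i, y_i)
    return sum(np.log(predict(h)[y]) for h, y in batch)

batch = [(rng.normal(size=d), int(rng.integers(0, n_rel))) for _ in range(3)]
J = objective(batch)
```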
the invention was tested using the SanWen data set, the sentences of which were derived from 837 Chinese documents, the data set contained 9 types, respectively Unknow, create, use, near, social, localized, ownership, general-specific, family, part-white. The data set details are shown in table 1.
TABLE 1 SanWen data set

Data set    Training set    Validation set    Test set
SanWen      695             58                84
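As a quick consistency check on Table 1, the split sizes sum exactly to the 837 source documents cited above:

```python
# Table 1 split sizes (documents per split)
split = {"train": 695, "validation": 58, "test": 84}
total = sum(split.values())
# total == 837, matching the number of Chinese documents in SanWen
```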
The values of the parameters used in the model of the invention are shown in Table 2. Experiments show that the values in Table 2 are the best hyper-parameter values for the invention.
TABLE 2 parameter values

Hyper-parameter        Value
Learning rate          0.0005
Dropout                0.5
Character embedding    100
Lattice embedding      200
Position embedding     5
LSTM hidden layer      200
Regularization         1e-8
Table 3 compares the F1 and AUC values of the model of the present invention with those of each baseline model. Zeng et al. proposed the CNN model; Zeng et al. then added position embedding on this basis and proposed the piecewise CNN model; Lin et al. added a selective attention mechanism on the basis of the PCNN model; and Li et al. proposed the multi-granularity lattice network. From the results it can be observed that the model of the present invention performs best among all the models: by accounting for word-segmentation accuracy and polysemous-word information, sense-level information improves the ability to obtain deep semantic information from text.
TABLE 3 comparison of F1 values and AUC for each model
(The F1 and AUC comparison values of Table 3 are provided as images in the original publication.)
Example 2
For further explanation of the present invention, embodiment 2 of the present invention further discloses a chinese relation extracting apparatus, as shown in fig. 3, including:
an obtaining module 201, configured to obtain character representations and all potential words of a sentence to be processed;
a first feature module 202, configured to obtain a first feature based on the character representation and all the potential words through a multi-granularity lattice model;
a second feature module 203, configured to extract a second feature in the sentence through a Bert model;
a splicing module 204, configured to splice the first feature and the second feature to obtain a spliced feature;
and the prediction module 205 is used for inputting the spliced features into a softmax classifier and predicting the entity relations in the sentence.
Further, the character representation is obtained by splicing character embedding and position embedding;
the character embedding is obtained by representing the sentence as a plurality of characters and mapping each character through a Skip-gram model;
the position embedding is the relative distance from each character to a preset head-tail entity.
Further, the potential word is obtained by converting the character string in the sentence through word2vec integrated with an external Chinese database.
Further, the multi-granularity lattice model includes an LSTM model.
Further, the splicing is performed based on the following formulas:

H = h ⊕ Vec;

H* = tanh(H);

α = softmax(w^T H*);

h* = Hα^T;

wherein h is the first feature and Vec is the second feature; ⊕ denotes concatenation; w is a trainable parameter; T denotes vector transposition; and h* is the spliced feature.
Example 3
Embodiment 3 of the present invention further discloses a terminal, which includes a processor and a memory, where the memory stores a computer program, and the processor implements the method described in embodiment 1 when executing the computer program.
Example 4
Embodiment 4 of the present invention also discloses a storage medium, in which a computer program is stored, and when the computer program is executed, the method described in embodiment 1 is implemented.
Therefore, the embodiment of the invention provides a Chinese relation extraction method, device, terminal and storage medium, wherein the method comprises the following steps: acquiring the character representation and all potential words of a sentence to be processed; obtaining a first feature from the character representation and all the potential words through a multi-granularity lattice model; extracting a second feature from the sentence through a Bert model; splicing the first feature and the second feature to obtain a spliced feature; and inputting the spliced feature into a softmax classifier to predict the entity relations in the sentence. The method combines Bert with the multi-granularity lattice model: it not only uses Bert to generate the character vectors of the sentence, but also fuses the multiple senses of each word into the character representation for encoding, which better resolves the ambiguity of Chinese polysemous words; and experimental results show that the model of the invention achieves a better effect on the Chinese relation extraction task.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part thereof which contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.

Claims (9)

1. A Chinese relationship extraction method is characterized by comprising the following steps:
acquiring character representation and all potential words of a sentence to be processed;
obtaining a first feature based on the character representation and all the potential words through a multi-granularity lattice model;
extracting a second feature in the sentence through a Bert model;
splicing the first feature and the second feature to obtain a spliced feature;
inputting the splicing features into a softmax classifier, and predicting entity relations in the sentences.
2. The method of claim 1, wherein the character representation is obtained by concatenating character embedding and position embedding;
the character embedding is obtained by representing the sentence as a plurality of characters and mapping each character through a Skip-gram model;
the position embedding is the relative distance from each character to a preset head-tail entity.
3. The method of claim 1, wherein the potential word is obtained by converting a string in the sentence by word2vec integrated with an external chinese database.
4. The method of claim 1, wherein the multi-granularity lattice model comprises an LSTM model.
5. The method of claim 1, wherein said splicing is performed based on the following formulas:

H = h ⊕ Vec;

H* = tanh(H);

α = softmax(w^T H*);

h* = Hα^T;

wherein h is the first feature and Vec is the second feature; ⊕ denotes concatenation; w is a trainable parameter; T denotes vector transposition; and h* is the spliced feature.
6. A Chinese relation extraction apparatus, characterized by comprising:
an acquisition module for acquiring the character representation and all potential words of a sentence to be processed;
a first feature module for obtaining a first feature from the character representation and all the potential words through a multi-granularity lattice model;
a second feature module for extracting a second feature of the sentence through a BERT model;
a splicing module for splicing the first feature and the second feature to obtain a spliced feature;
and a prediction module for inputting the spliced feature into a softmax classifier and predicting the entity relation in the sentence.
7. The apparatus of claim 6, wherein the character representation is obtained by concatenating a character embedding and a position embedding;
the character embedding is obtained by representing the sentence as a sequence of characters and mapping each character through a Skip-gram model;
the position embedding is the relative distance from each character to the preset head and tail entities.
8. A terminal, characterized by comprising a processor and a memory in which a computer program is stored, wherein the processor, when executing the computer program, implements the method according to any one of claims 1-5.
9. A storage medium, characterized in that a computer program is stored in the storage medium, and the computer program, when executed, implements the method according to any one of claims 1-5.
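Taken together, claims 1-5 describe a concrete pipeline: compute per-character position features relative to the preset head and tail entities (claim 2), collect all potential words by matching the sentence against an external lexicon (claim 3), attention-pool the concatenated features (claim 5), and classify with softmax (claim 1). The sketch below illustrates only those mechanical steps under loudly stated assumptions: random vectors stand in for the trained lattice-model and BERT features, the tiny `lexicon` set stands in for the external Chinese database, and the parameter names `W`, `b`, and `w` are hypothetical; it does not reproduce the patented models themselves.

```python
import numpy as np

def position_features(sentence, head, tail):
    """Claim 2 (sketch): relative distance from each character to the
    head and tail entities, measured here to their start positions."""
    h, t = sentence.find(head), sentence.find(tail)
    return [(i - h, i - t) for i in range(len(sentence))]

def potential_words(sentence, lexicon):
    """Claim 3 (sketch): all lexicon words of length >= 2 that occur as
    substrings of the sentence -- the 'potential words'."""
    n = len(sentence)
    return [(i, j, sentence[i:j])
            for i in range(n)
            for j in range(i + 2, n + 1)
            if sentence[i:j] in lexicon]

def softmax(z):
    # numerically stable softmax over a 1-D array
    e = np.exp(z - z.max())
    return e / e.sum()

def splice(H, w):
    """Claim 5: H* = tanh(H); alpha = softmax(w^T H*); h* = H alpha^T.
    H holds one feature column per position; w is a trainable vector."""
    H_star = np.tanh(H)
    alpha = softmax(w @ H_star)   # attention weight per column of H
    return H @ alpha              # h*: one fixed-length spliced vector

def predict_relation(h_spliced, W, b):
    """Claim 1, last step: softmax classifier over relation labels."""
    probs = softmax(W @ h_spliced + b)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
sentence = "乔布斯创立了苹果"
pos = position_features(sentence, "乔布斯", "苹果")
words = potential_words(sentence, {"创立", "苹果"})

d, labels = 8, 3
H = rng.standard_normal((d, len(sentence)))  # stand-in for [h; Vec] columns
w = rng.standard_normal(d)                   # trainable parameter of claim 5
h_spliced = splice(H, w)
W, b = rng.standard_normal((labels, d)), np.zeros(labels)
label, probs = predict_relation(h_spliced, W, b)
```

Note that α weights the columns of H, so h* = Hα^T collapses the per-position features into a single fixed-length vector before the softmax classifier is applied.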
CN202210392477.3A 2022-04-15 2022-04-15 Chinese relation extraction method, device, terminal and storage medium Pending CN115169326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210392477.3A CN115169326A (en) 2022-04-15 2022-04-15 Chinese relation extraction method, device, terminal and storage medium


Publications (1)

Publication Number Publication Date
CN115169326A true CN115169326A (en) 2022-10-11

Family

ID=83484233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210392477.3A Pending CN115169326A (en) 2022-04-15 2022-04-15 Chinese relation extraction method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN115169326A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334354A (en) * 2019-07-11 2019-10-15 清华大学深圳研究生院 A Chinese relation extraction method
CN111274394A (en) * 2020-01-16 2020-06-12 重庆邮电大学 Method, device and equipment for extracting entity relationship and storage medium
CN112270196A (en) * 2020-12-14 2021-01-26 完美世界(北京)软件科技发展有限公司 Entity relationship identification method and device and electronic equipment
KR20210040319A (en) * 2020-04-23 2021-04-13 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method, apparatus, device, storage medium and computer program for entity linking
CN112836062A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Relation extraction method of text corpus
WO2021139247A1 (en) * 2020-08-06 2021-07-15 平安科技(深圳)有限公司 Construction method, apparatus and device for medical domain knowledge map, and storage medium
CN113128229A (en) * 2021-04-14 2021-07-16 河海大学 Chinese entity relation joint extraction method
CN113221567A (en) * 2021-05-10 2021-08-06 北京航天情报与信息研究所 Judicial domain named entity and relationship combined extraction method
CN113239663A (en) * 2021-03-23 2021-08-10 国家计算机网络与信息安全管理中心 Multi-meaning word Chinese entity relation identification method based on Hopkinson
WO2021190236A1 (en) * 2020-03-23 2021-09-30 浙江大学 Entity relation mining method based on biomedical literature
CN113553850A (en) * 2021-03-30 2021-10-26 电子科技大学 Entity relation extraction method based on ordered structure encoding pointer network decoding
CN113626576A (en) * 2021-05-26 2021-11-09 中国平安人寿保险股份有限公司 Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN113657105A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN114064852A (en) * 2021-10-21 2022-02-18 杭州远传新业科技有限公司 Method and device for extracting relation of natural language, electronic equipment and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DENGTAO LIU et al.: "Chinese Character Relationship Extraction Method Based on BERT", 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), 2 August 2021 (2021-08-02), pages 883-887 *
LUO Xin et al.: "Text entity relation extraction method based on deep reinforcement learning", Journal of University of Electronic Science and Technology of China, vol. 51, no. 1, 31 January 2022 (2022-01-31), pages 91-99 *
HU Hongwei et al.: "Relation classification model fusing multiple entity information", Journal of Information Engineering University, vol. 23, no. 1, 28 February 2022 (2022-02-28), pages 51-57 *
XIE Teng et al.: "Chinese entity relation extraction based on a multi-feature BERT model", Computer Systems & Applications, vol. 30, no. 5, 31 May 2021 (2021-05-31), pages 253-261 *

Similar Documents

Publication Publication Date Title
Yao et al. An improved LSTM structure for natural language processing
CN107506414B (en) Code recommendation method based on long-term and short-term memory network
Ghosh et al. Fracking sarcasm using neural network
Zhou et al. A C-LSTM neural network for text classification
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN108108354B (en) Microblog user gender prediction method based on deep learning
CN106909537B (en) One-word polysemous analysis method based on topic model and vector space
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN112348911B (en) Semantic constraint-based method and system for generating fine-grained image by stacking texts
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN111144410B (en) Cross-modal image semantic extraction method, system, equipment and medium
CN114547298A (en) Biomedical relation extraction method, device and medium based on combination of multi-head attention and graph convolution network and R-Drop mechanism
CN111597815A (en) Multi-embedded named entity identification method, device, equipment and storage medium
CN114254645A (en) Artificial intelligence auxiliary writing system
Fenghour et al. An effective conversion of visemes to words for high-performance automatic lipreading
Simske et al. Functional Applications of Text Analytics Systems
Göker et al. Neural text normalization for turkish social media
CN109977372B (en) Method for constructing Chinese chapter tree
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN112131879A (en) Relationship extraction system, method and device
CN110610006A (en) Morphological double-channel Chinese word embedding method based on strokes and glyphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 030006 room 707, block a, Gaoxin Guozhi building, No. 3, East Lane 2, Taiyuan Xuefu Park, Shanxi comprehensive reform demonstration zone, Taiyuan, Shanxi Province

Applicant after: Changhe Information Co.,Ltd.

Address before: 030006 room 707, block a, Gaoxin Guozhi building, No. 3, East Lane 2, Taiyuan Xuefu Park, Shanxi comprehensive reform demonstration zone, Taiyuan, Shanxi Province

Applicant before: Shanxi Changhe Technology Co.,Ltd.
