CN111339407A

CN111339407A - Implementation method of information extraction cloud platform

Info

Publication number: CN111339407A
Application number: CN202010100115.3A
Authority: CN
Inventors: 张日崇; 刘德志; 曾方锐
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2020-06-26
Anticipated expiration: 2040-02-18
Also published as: CN111339407B

Abstract

The invention provides an implementation method for an information extraction cloud platform, using techniques from the field of natural language processing. Through three steps (data acquisition; design of the relation extraction method; and establishment of the relation extraction model with calculation and output of results), the method connects every link of information extraction based on remote supervision. It adopts a hybrid model in which template-based extraction and machine-learning-based extraction complement each other, and it schedules network and computing resources to acquire training text and to train the information extraction model, thereby making up for the shortcomings of traditional information extraction methods.

Description

Implementation method of information extraction cloud platform

Technical Field

The invention relates to the field of natural language processing, in particular to an implementation method of an information extraction cloud platform.

Background

With the development of the internet and the continuous growth of online information, more and more information can be retrieved through search engines. Search results are now massive in volume, diverse in form, and comprehensive in coverage. On the one hand, this raises the likelihood that a user's search returns results; on the other hand, it makes it hard for the user to locate the required information quickly and accurately. Obtaining useful information quickly and accurately from massive data is an urgent need of the information age, and this need has made information extraction a research hotspot in natural language processing. Information extraction techniques extract structured information from semi-structured or unstructured information. This patent focuses on extracting relation triples, each consisting of entity 1, a relation, and entity 2, from plain-text sentences. Information extraction can semi-automatically extract key information from massive data to construct a knowledge graph and help people make better use of big data to solve problems. Its efficiency and convenience have made it a widely followed research topic in urgent need of development.

Currently, there are two main methods for relationship and entity extraction:

1. The template-based method relies mainly on rules that can identify structured information, such as grammar rules written by engineers for information extraction. This approach is difficult to implement, is hard to make comprehensive, and generalizes poorly, so it often yields only suboptimal results. When extracting within a specific sub-domain, however, it can effectively complement the results of a machine-learning-based method.

2. The machine-learning-based method trains a deep learning model on a large amount of existing labeled data and extracts entities and relations with the resulting model. For example, entities are extracted with a model such as a dilated convolutional neural network, and a relation extraction model is obtained by training a bidirectional GRU with an attention network on the extracted entities.

Although information extraction has made significant breakthroughs in recent years thanks to the rapid progress of deep learning, the following problems in this field still urgently need to be solved. First, implementing a Chinese information extraction model is more complicated and difficult than extracting information from English text. Compared with English, Chinese sentences have complex syntactic structure, word-level and character-level ambiguity that is hard to resolve, and flexible and diverse semantic expression, all of which add uncertainty to training. In addition, labeled Chinese training text is scarce, which increases the probability that a deep learning model is under-fitted. When training data are lacking, automatically obtaining a large amount of labeled Chinese text is therefore of great importance to information extraction. Second, although many researchers have proposed improved methods for Chinese entity extraction and relation extraction, few have merged these processes into an automated tool that others can conveniently call. In the fast-paced information age, quickly reading data, automatically scheduling computing resources for training, and obtaining an information extraction model is a pressing pain point. Finally, conventional information extraction methods rely mainly on a specific data set and relation list, and the methods they train apply only to that data set. Once the relation list needs to be extended, it is difficult to train a method suited to the new relation list from the existing data set. For a specific field, neither a purely template-based extraction method nor a purely machine-learning-based extraction method can solve the problem in real scenarios. A practical hybrid extraction model is therefore also a sound engineering strategy in the field of information extraction.

Therefore, there remains a need for a cloud platform implementation method that makes full use of network resources, automatically extracts labeled Chinese text from the network, trains a model on that text to ensure both the accuracy and the breadth of information extraction, and integrates the whole information extraction process.

In recent years, the wave of deep learning has swept the world, and, supported by massive resources and strong computing power, deep learning has also influenced every area of natural language processing. Knowledge graphs organized as entities and relations are widely used in search engines and question-answering systems. Because knowledge is large in scale and manual annotation is expensive, new knowledge cannot be added by manual annotation alone. To add richer world knowledge to knowledge graphs as promptly and accurately as possible, researchers have worked to find methods for acquiring world knowledge efficiently and automatically, namely information extraction techniques. Information extraction has become a research focus in recent years, but traditional information extraction methods and tools share the problems pointed out above.

Disclosure of Invention

To achieve the above object, the present invention designs a complete model construction process for information extraction in a vertical domain, comprising remote-supervision training data acquisition, entity recognition, data annotation, a remote-supervision relation extraction algorithm, and a rule-based relation extraction algorithm, and implements the information extraction cloud platform on the basis of these algorithms.

The method is divided into three steps.

Step one: data acquisition. First, the user inputs a selected field and an initial relation set, from which a knowledge base is obtained; the knowledge base contains the entities and relations in the data. Then, a text library is acquired through remote supervision, using a trained remote-supervision acquisition method. Finally, a named entity recognition method is applied, and the knowledge base is used to annotate the data.

Step two: design of the relation extraction method. The method is a sentence-level attention relation extraction method in which the input sequence is represented by a bidirectional gated recurrent unit rather than by using a word vector directly as the representation of each word in the sentence. The method can be divided into five parts: an input layer, which receives the plain-text sentence; a low-dimensional embedding layer, which maps each word of the input sentence to a low-dimensional vector, for example a pre-trained low-dimensional word vector; a high-dimensional embedding layer, which obtains high-dimensional embeddings from the low-dimensional embeddings through a bidirectional gated recurrent unit; a sentence-level attention layer, which produces a weight vector and combines the word-level high-dimensional embeddings at each time step into a sentence-level feature vector by multiplying them by the weight vector; and an output layer, which performs softmax relation classification on the resulting sentence-level feature vector.

Step three: establishment of the relation extraction model and calculation of the output result. The relation extraction model here is a rule-based information extraction method that relies on dependency syntax analysis: triples are extracted from the result of dependency parsing. It serves as a second relation extraction method that compensates for the limitation that the deep-learning-based method of step two can only extract relations from a limited relation list. Using the two extraction methods together allows relation triples to be extracted more accurately and comprehensively.

The training process of the trained remote-supervision acquisition method is as follows:

First, according to the field selected by the user and the given initial relation set, further relations of the vertical domain and the corresponding entity pairs under each relation are acquired from a Chinese knowledge graph by a data crawling script. The crawling script is implemented on the basis of the open-source Newspaper3k library on GitHub and the open-source Beautiful Soup library. The Beautiful Soup library is used to traverse the triple information corresponding to each relation in a loop, and a search URL is constructed for each entity pair to obtain the corresponding news. Relevant sentences are then extracted with the web page text extraction function provided by the Newspaper3k library and used as data for the given relation, which achieves the goal of acquiring from the internet sentences that mention both entities of a triple as data for the corresponding relation.

The named entity recognition method is as follows:

The features of the sequence are learned with a dilated convolutional neural network, and sequence labeling is completed with a conditional random field. The structure mainly comprises a dilated convolutional neural network module, which takes the word vector of each character as input and learns the features of the input sequence, and a conditional random field module, which uses the sequence features output by the dilated convolutional neural network module. The conditional random field module finally completes the sequence labeling, that is, the named entity recognition.

The conditional random field module is a linear-chain conditional random field model that uses the Viterbi algorithm to compute state transition probabilities. The sequence labeling objective is:

$$P(y \mid x) = \frac{1}{Z_x} \prod_{t=1}^{T} \psi_t(y_t \mid F(x)) \, \psi_p(y_t, y_{t-1})$$

where $\psi_t$ is a local factor, $\psi_p$ is a pairwise factor that scores consecutive labels, $Z_x$ is the partition function, and $F(x)$ is the feature representation of the input sequence $x$. To avoid overfitting, the pairwise factor $\psi_p$ in the conditional random field is independent of both the time step $t$ and the input sequence $x$. Prediction with the conditional random field is performed by a global search using the Viterbi algorithm, where $x$ is the input sequence and $y$ is the output sequence.

A dilated convolutional neural network produces a feature representation $c_t$ for each element $x_t$ of the input sequence $x$:

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\delta}$$

where $\bigoplus$ denotes vector concatenation and $W_c$ is the convolution kernel, that is, the convolution matrix applied over a sliding window of width $r$ in the input sequence. The characteristic feature of a dilated convolutional neural network is the dilation width $\delta$ applied to the sliding window.

The relation extraction method is implemented as follows:

A bidirectional gated recurrent unit and a word-level attention mechanism are used to obtain a representation of the input sequence. The input sequence $x$ first passes through gated recurrent units in the forward and backward directions to obtain a sequence representation; that is, for position $t$ in the input sequence, the sequence vector given by the bidirectional gated recurrent unit is obtained with the following formulas:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

$$u_t = \sigma(W_u x_t + U_u h_{t-1})$$

where $\sigma$ denotes the logistic sigmoid function, $x_t$ is the input, and $h_{t-1}$ is the previous hidden state. $h_t$, $r_t$, and $u_t$ denote the hidden state, the reset gate state, and the forget gate state at position $t$, respectively, and $W_r$, $W_u$, $W_c$, $U_r$, $U_u$, $U_c$ are the parameters of the gated recurrent unit. The vector at position $t$ obtained from the bidirectional gated recurrent unit can be expressed as the vector $h_t$ formed by concatenating the forward hidden state vector, obtained by running the gated recurrent unit from front to back, with the backward hidden state vector, obtained by running it from back to front. After the combined vectors $H = [h_1, \ldots, h_T]$ are obtained, sentence-level attention is used for relation classification. The sentence-level attention mechanism is:

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(w^{T} M)$$

$$r = H \alpha^{T}$$

where $w$ is a weight vector, $\alpha$ is the attention vector, and $r$ is the sentence feature vector used in the final classification training of the relation extraction method.

The attention mechanism was originally proposed by DeepMind for image classification; it allows the neural network to focus more on the relevant parts of the input and less on the irrelevant parts when performing a prediction task. In natural language processing, attention mechanisms were first used in machine translation. A schematic of a general attention mechanism is shown in FIG. 1. The core of the attention mechanism is to map a sequence $K$, that is, the keys, to a $d_k$-dimensional vector of attention weights $a$, where $K$ may be the character vectors or the word vectors of the text. In most cases, another input element $q$, called the query, is used as a reference when computing the attention distribution. If a query is defined, the attention mechanism attends to the input elements that are relevant to the task according to the query; if no query is defined, the attention mechanism attends to the input elements that are inherently relevant to the task.
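The gated recurrent unit formulas in this section can be illustrated with a minimal pure-Python sketch. The reset gate and forget gate follow the two formulas given in the text; the candidate state and the final interpolation are not shown in the text, so the standard GRU form is assumed here and marked as such in the code. The function names (gru_step, bigru) and the parameter dictionary are hypothetical and used only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps the names W_r, U_r, W_u, U_u, W_c, U_c to matrices."""
    # Reset gate r_t and forget gate u_t, as in the two formulas in the text.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    u = [sigmoid(a + b) for a, b in zip(matvec(p["W_u"], x_t), matvec(p["U_u"], h_prev))]
    # ASSUMPTION: standard candidate state and interpolation (not given in the text).
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(a + b) for a, b in zip(matvec(p["W_c"], x_t), matvec(p["U_c"], gated))]
    return [(1.0 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h_prev, c)]

def bigru(xs, p_fwd, p_bwd, hidden):
    """Concatenate the forward and backward hidden states at every position."""
    h, forward = [0.0] * hidden, []
    for x_t in xs:
        h = gru_step(x_t, h, p_fwd)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for x_t in reversed(xs):
        h = gru_step(x_t, h, p_bwd)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]
```

Each returned vector has twice the hidden size, which corresponds to concatenating the front-to-back and back-to-front hidden states into h_t.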

In the relation extraction method, triples are extracted from the result of dependency syntax analysis as follows: triple extraction rules are set based on the attributive-head relation, the postposed-attributive relation, and the subject-predicate-object relation.

The technical solution of this application addresses problems that urgently need to be solved in natural language processing, relation extraction, and entity recognition. For the problem of insufficient labeled data, remote supervision is adopted: a large amount of training data containing entities is obtained from big data in cyberspace, and the training data are further screened with remote-supervision noise reduction to ensure their quality. For the common information extraction needs of different fields, an information extraction cloud platform method is proposed and implemented. It connects every link of information extraction based on remote supervision, adopts a hybrid model in which template-based and machine-learning-based extraction complement each other, and schedules network and computing resources to acquire training text and train the information extraction model. A user can thus conveniently complete information extraction for a specific field and enrich the knowledge graph of that field to support the various applications built on it.

Drawings

FIG. 1 is a schematic diagram of the attention mechanism;

FIG. 2 is the system flow diagram;

FIG. 3 is the data acquisition flow diagram;

FIG. 4 is the request address for Sina News;

FIG. 5 is the request address for Baidu News;

FIG. 6 is the overall architecture of the named entity recognition method;

FIG. 7 is the overall structure of the relation extraction method;

FIG. 8 is an example of a dependency syntax analysis result;

FIG. 9 shows the dependency syntax relation types.

Detailed Description

The following is a preferred embodiment of the present invention and is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.

To achieve the above object, the present invention designs a complete model construction process for information extraction in a vertical domain, comprising remote-supervision training data acquisition, entity recognition, data annotation, a remote-supervision relation extraction algorithm, and a rule-based relation extraction algorithm. The flow of the information extraction system is shown in FIG. 2.

Data acquisition and data annotation use remote supervision, which to some extent alleviates both the lack of entity labels in the named entity recognition task and the shortage of relation data sets in the relation extraction task. Relation triples are then extracted with a deep-learning-based relation extraction algorithm, so that triples matching the existing relation list are obtained more accurately. The rule-based extraction method mitigates, to some extent, the error propagation between the entity recognition and relation extraction models and thereby improves extraction accuracy.

Remote-supervision training data acquisition:

First, according to the domain selected by the user and the given initial relation set, richer relations of the vertical domain and the corresponding entity pairs under each relation are acquired from a Chinese knowledge graph. The data acquisition flow is shown in FIG. 3.

The domain of relations between people is selected as the example domain. It contains various entities, such as people, places, schools, and film and television works, as well as various relations, such as nationality, height, and date of birth.

To crawl data of sufficient quantity and good quality, a detailed data crawling script is designed to ensure that the crawler's output is reliable. For the implementation, the system uses the open-source Newspaper3k library on GitHub and the open-source Beautiful Soup library to build a complete crawler, from constructing the news search request to crawling the corresponding news sentences.

The first task is to construct access requests for Baidu News and Sina News. Inspecting the search bar of the news website shows that the source address of Sina News is "http:// search. site. com. cn", and an access request that uses the two entities as keywords can be constructed as the addresses shown in FIG. 4 and FIG. 5.

The entity pair of each triple can then be used as keywords to search for news.

The Beautiful Soup library is used to traverse the triple information corresponding to each relation in a loop, constructing a search URL for each entity pair to obtain the corresponding news. Relevant sentences are then extracted with the web page text extraction function provided by the Newspaper3k library and used as data for the given relation, which achieves the goal of acquiring from the internet sentences that mention both entities of a triple as data for the corresponding relation.
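The acquisition step can be illustrated with a minimal sketch that uses only the Python standard library. It does not call Newspaper3k or Beautiful Soup; it assumes the article text has already been downloaded and shows only how a search URL might be built from an entity pair and how sentences mentioning both entities could be kept as data for the relation. The endpoint SEARCH_ENDPOINT, the query parameter name q, and the function names are hypothetical; the real request addresses are those of FIG. 4 and FIG. 5.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; not the address used by the platform.
SEARCH_ENDPOINT = "https://news.example.com/search"

def build_search_url(head, tail):
    """Build a news-search URL that uses both entities as keywords."""
    return SEARCH_ENDPOINT + "?" + urlencode({"q": f"{head} {tail}"})

def split_sentences(text):
    """Split text on Chinese and Western sentence-ending punctuation."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in "。！？!?.":
            sentences.append("".join(current).strip())
            current = []
    rest = "".join(current).strip()
    if rest:
        sentences.append(rest)
    return [s for s in sentences if s]

def label_sentences(text, triple):
    """Keep sentences that mention both entities and tag them with the relation."""
    head, relation, tail = triple
    return [
        {"sentence": s, "head": head, "tail": tail, "relation": relation}
        for s in split_sentences(text)
        if head in s and tail in s
    ]
```

Sentences kept this way are only weakly labeled: a sentence can mention both entities without expressing the relation, which is why the training data are screened further for noise.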

To extract as many entities as possible from plain-text data, the named entity recognition method is an information extraction method that labels sequences with a conditional random field on top of a dilated convolutional neural network.

The core idea of the method is similar to that of traditional named entity recognition with a long short-term memory network. The traditional approach uses a Long Short-Term Memory network (LSTM) or a Gated Recurrent Unit (GRU) to turn text into feature information, that is, each character or word is represented by a vector of the same dimension. The state transition probabilities of each character or word are then computed with a model such as a hidden Markov model or a conditional random field in order to label each character or word. This approach is common, but LSTMs and GRUs are sequential models, which makes them inferior to a simple convolutional neural network in resource utilization and multi-GPU performance; their advantage is that they generally give better results than a convolutional neural network.

Here a dilated convolutional neural network replaces the LSTM or GRU for feature extraction. The advantage is that it achieves results similar to, or even better than, those of an LSTM while improving resource utilization and speed. The overall architecture of the named entity recognition method is shown in FIG. 6.

As shown in FIG. 6, the method uses a dilated convolutional neural network to learn the features of the sequence and then uses a conditional random field to complete the sequence labeling. The structure mainly comprises a dilated convolutional neural network module, which takes the word vector of each character as input and learns the features of the input sequence, and a conditional random field module, which uses the sequence features output by the dilated convolutional neural network module. The conditional random field module finally completes the sequence labeling, that is, the named entity recognition.

The key part of the whole method is the dilated convolutional neural network at the front. The latter part is simply a linear-chain conditional random field model that uses the Viterbi algorithm to compute state transition probabilities.

For the conditional random field part, the input sequence is $x = [x_1, \ldots, x_T]$ and the output label sequence, with one label per element, is $y = [y_1, \ldots, y_T]$. Given the features $F(x)$ of the input sequence produced by the dilated convolutional neural network, the sequence labeling objective can be expressed by formula (1):

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid F(x)) \tag{1}$$

On this basis, sequence labeling with the conditional random field can be expressed by formula (2):

$$P(y \mid x) = \frac{1}{Z_x} \prod_{t=1}^{T} \psi_t(y_t \mid F(x)) \, \psi_p(y_t, y_{t-1}) \tag{2}$$

where $\psi_t$ is a local factor, $\psi_p$ is a pairwise factor that scores consecutive labels, $Z_x$ is the partition function, and $F(x)$ is the feature representation of the input sequence $x$. To avoid overfitting, the pairwise factor $\psi_p$ in the conditional random field is independent of both the time step $t$ and the input sequence $x$, and prediction with the conditional random field is performed by a global search using the Viterbi algorithm.
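The global search performed by the Viterbi algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: scores are treated as log-domain values, local_scores[t][j] stands in for the local factor of label j at position t, and a single transition table shared by all positions stands in for the pairwise factor, reflecting its independence from t and x. The function name viterbi and both argument names are hypothetical.

```python
def viterbi(local_scores, transition):
    """Return the highest-scoring label path and its score.

    local_scores[t][j]: score of label j at position t (local factor).
    transition[i][j]:   score of label i followed by label j (pairwise factor,
                        shared by every position).
    """
    n_labels = len(transition)
    best = list(local_scores[0])
    backptr = []
    for t in range(1, len(local_scores)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at position t.
            prev = max(range(n_labels), key=lambda i: best[i] + transition[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transition[prev][j] + local_scores[t][j])
        best = new_best
        backptr.append(pointers)
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

With a strong penalty on one label following itself, the decoder can prefer a path that differs from the per-position best labels, which is the effect the pairwise factor is meant to have.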

The present invention uses a dilated convolutional neural network to represent sequence features. The dilated convolutional neural network produces a feature representation $c_t$ for each element $x_t$ of the input sequence $x = [x_1, \ldots, x_T]$, which can be expressed by formula (3):

$$c_t = W_c \bigoplus_{k=0}^{r} x_{t \pm k\delta} \tag{3}$$

Here, for convenience, $\bigoplus$ denotes vector concatenation. $W_c$ is a convolution kernel in the conventional sense, that is, the convolution matrix applied over a sliding window of width $r$ in the input sequence. The characteristic feature of a dilated convolutional neural network is the dilation width $\delta$ applied to the sliding window. The dilation width lets the convolutional network cover a wider span of the input sequence, so that its input features approach those available to a recurrent neural network.

The value of the dilation width $\delta$ clearly has a significant effect on the performance of the convolutional neural network. When $\delta = 1$, the network is an ordinary convolutional neural network. As the width grows, more input features are captured and, in theory, the recognition result improves.
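The effect of the dilation width can be seen in a small sketch. It simplifies the description in two ways, both assumptions made for illustration: the inputs are scalars rather than vectors, and the kernel is a list of per-offset weights rather than a matrix applied to concatenated vectors. The function names dilated_window and dilated_conv1d are hypothetical.

```python
def dilated_window(t, length, radius, dilation):
    """Indices t + k * dilation for k in [-radius, radius] that lie inside the sequence."""
    idx = [t + k * dilation for k in range(-radius, radius + 1)]
    return [i for i in idx if 0 <= i < length]

def dilated_conv1d(xs, kernel, dilation):
    """Scalar 1-D dilated convolution with zero padding; kernel must have odd length."""
    radius = len(kernel) // 2
    out = []
    for t in range(len(xs)):
        total = 0.0
        for k in range(-radius, radius + 1):
            i = t + k * dilation
            if 0 <= i < len(xs):  # positions outside the sequence contribute zero
                total += kernel[k + radius] * xs[i]
        out.append(total)
    return out
```

With a dilation of 1 the window around position 4 is [3, 4, 5], as in an ordinary convolution; with a dilation of 2 it becomes [2, 4, 6], so the same number of weights spans a wider stretch of the input.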

Relation extraction method:

A sentence-level attention relation extraction method is adopted. Instead of using a word vector directly as the representation of each word in the sentence, the input sequence is represented by a bidirectional gated recurrent unit. The structure of the method is shown in FIG. 7.

The relation extraction method can be divided into five parts:

1. The input layer, which receives the plain-text sentence.

2. The low-dimensional embedding layer, which maps each word of the input sentence to a low-dimensional vector, for example a pre-trained low-dimensional word vector.

3. The high-dimensional embedding layer, which obtains high-dimensional embeddings from the low-dimensional embeddings through a bidirectional gated recurrent unit.

4. The sentence-level attention layer, which produces a weight vector and combines the word-level high-dimensional embeddings at each time step into a sentence-level feature vector by multiplying them by the weight vector.

5. The output layer, which performs softmax relation classification on the resulting sentence-level feature vector.

Specifically, in the part that uses a bidirectional gated recurrent unit and a word-level attention mechanism to obtain a representation of the input sequence, the input sequence $x = [x_1, \ldots, x_T]$ first passes through gated recurrent units in the forward and backward directions to obtain a sequence representation. That is, for position $t$ in the input sequence, the sequence vector given by the bidirectional gated recurrent unit is obtained through equations (4), (5), (6) and (7).

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{4}$$

$$u_t = \sigma(W_u x_t + U_u h_{t-1}) \tag{5}$$

where $\sigma$ denotes the logistic sigmoid function, $x_t$ is the input, and $h_{t-1}$ is the previous hidden state. $h_t$, $r_t$, and $u_t$ denote the hidden state, the reset gate state, and the forget gate state at position $t$, respectively, and $W_r$, $W_u$, $W_c$, $U_r$, $U_u$, $U_c$ are the parameters of the gated recurrent unit. The vector at position $t$ obtained from the bidirectional gated recurrent unit can be expressed as the vector $h_t$ formed by concatenating the forward hidden state vector, obtained by running the gated recurrent unit from front to back, with the backward hidden state vector, obtained by running it from back to front. After the vectors $H = [h_1, \ldots, h_T]$ given by the bidirectional gated recurrent unit are obtained, the method uses sentence-level attention for relation classification. The sentence-level attention mechanism is shown in equations (8), (9) and (10).

$$M = \tanh(H) \tag{8}$$

$$\alpha = \mathrm{softmax}(w^{T} M) \tag{9}$$

$$r = H \alpha^{T} \tag{10}$$

The final softmax layer of the method uses the sentence vector given by sentence-level attention and the one-hot encoded relation list to train the final relation classifier.
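The sentence-level attention of equations (8) to (10) can be sketched in pure Python as follows. As an assumption for illustration, H is stored as a list of T hidden vectors (one per position) rather than as a matrix with one column per position, so the products are written out element by element. The function names softmax and sentence_attention are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_attention(H, w):
    """Compute the attention weights alpha and the sentence vector r.

    H: list of T hidden vectors, each of dimension d.
    w: weight vector of dimension d.
    """
    # M = tanh(H); the score of position t is w^T applied to column t of M.
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h_t)) for h_t in H]
    alpha = softmax(scores)                      # alpha = softmax(w^T M)
    d = len(w)
    # r = H alpha^T: weighted sum of the hidden vectors.
    r = [sum(alpha[t] * H[t][k] for t in range(len(H))) for k in range(d)]
    return alpha, r
```

When w is all zeros every position receives the same weight and r is the plain average of the hidden vectors; a trained w shifts the weight toward the positions that matter for the relation.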

Rule-based relation extraction algorithm:

Rule-based information extraction relies mainly on dependency syntax analysis. Dependency parsing (DP) analyzes the grammatical structures that may be present in a sentence and can be extracted according to grammatical rules, such as subject-predicate-object and adverbial-complement structures, in order to obtain the connections and relations between the components of the sentence. For the example sentence "'Delusional' is a Cantonese song by the Hong Kong male singer Zhangoriang", the analysis result is shown in FIG. 8.

The method can obtain 15 types of dependency syntax relation in total, as shown in FIG. 9. For this sentence, the triples that can be extracted from the dependency parsing result include the triple information "Hong Kong singer Zhangori".

Finally, 10 triple extraction rules, covering the attributive-head relation, the postposed-attributive relation, the subject-predicate-object relation, and others, are set in the rule-based information extraction method. The rule templates were set after strict testing, so the rule-based method can provide satisfactory triple extraction results and is applicable to text containing most relations. Its main disadvantages are that the rules are written manually, that it is difficult to cover all cases, and that the relation in an extracted triple may not be accurate enough; it is therefore used only as a supplement to the deep learning information extraction method.
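One of these rules, the subject-predicate-object rule, can be sketched as follows. The sketch assumes the dependency parse is already available as a list of tokens, each with a 1-based head index and a relation label. The label names SBV (subject) and VOB (object) and the token layout are assumptions made for illustration; the relation types actually used by the method are those of FIG. 9, and the function name extract_svo is hypothetical.

```python
def extract_svo(tokens):
    """Extract (subject, predicate, object) triples from a dependency parse.

    tokens: list of dicts with keys 'word', 'head' (1-based index of the
    governing token, 0 for the root) and 'rel' (dependency label).
    """
    triples = []
    for i, tok in enumerate(tokens, start=1):
        # Dependents of token i that act as its subject or object.
        subjects = [t["word"] for t in tokens if t["head"] == i and t["rel"] == "SBV"]
        objects = [t["word"] for t in tokens if t["head"] == i and t["rel"] == "VOB"]
        for s in subjects:
            for o in objects:
                triples.append((s, tok["word"], o))
    return triples
```

A rule of this kind fires only when a predicate has both a subject and an object dependent, which illustrates why hand-written rules cover some sentence patterns well and miss others.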

Claims

1. An implementation method of an information extraction cloud platform is characterized in that:

step three: establishing a relation extraction model and calculating the output result, wherein the relation extraction model is a rule-based information extraction method that relies on dependency syntax analysis and extracts triples from the result of the dependency syntax analysis, supplementing the deep-learning-based method of step two so that the two extraction methods are used together.

2. The method for implementing the information extraction cloud platform according to claim 1, wherein: the training process of the trained remote-supervision acquisition method is as follows: first, according to the field selected by the user and the given initial relation set, further relations of the vertical domain and the corresponding entity pairs under each relation are acquired from a Chinese knowledge graph by a data crawling script, wherein the crawling script is implemented on the basis of the open-source Newspaper3k library on GitHub and the open-source Beautiful Soup library; the Beautiful Soup library is used to traverse the triple information corresponding to each relation in a loop, and a search URL is constructed for each entity pair to obtain the corresponding news; relevant sentences are extracted with the web page text extraction function provided by the Newspaper3k library and used as data for the given relation, thereby achieving the goal of acquiring from the internet sentences that mention both entities of a triple as data for the corresponding relation.

3. The method for implementing the information extraction cloud platform according to claim 2, wherein: the named entity identification method comprises the following steps:

wherein $\psi_t$ is a local factor, $\psi_p$ is a pairwise factor that scores consecutive labels, $Z_x$ is the partition function, and $F(x)$ is the feature representation of the input sequence $x$; to avoid overfitting, in the conditional random field method the pairwise factor $\psi_p$ is independent of both the time step $t$ and the input sequence $x$.

4. The method for implementing the information extraction cloud platform according to claim 3, wherein the method comprises the following steps: the specific implementation manner of the relationship extraction method is as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

$$u_t = \sigma(W_u x_t + U_u h_{t-1})$$

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(w^{T} M)$$

$$r = H \alpha^{T}$$

5. The method for implementing the information extraction cloud platform according to claim 4, wherein the method comprises the following steps: the specific implementation manner of extracting the triple by using the result of the dependency syntax analysis in the relationship extraction method is as follows:

triple extraction rules are set based on the attributive-head relation, the postposed-attributive relation, and the subject-predicate-object relation.