CN111339407B - Implementation method of information extraction cloud platform - Google Patents


Info

Publication number
CN111339407B
CN111339407B
Authority
CN
China
Prior art keywords
relation
extraction
sequence
vector
sentence
Prior art date
Legal status
Active
Application number
CN202010100115.3A
Other languages
Chinese (zh)
Other versions
CN111339407A (en)
Inventor
张日崇
刘德志
曾方锐
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010100115.3A
Publication of CN111339407A
Application granted
Publication of CN111339407B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides an implementation method for an information extraction cloud platform in the field of natural language processing. Through three steps, namely data acquisition, design of the relation extraction method, and establishment of the relation extraction model with computation and output of results, it covers every link of information extraction based on remote supervision. A hybrid of template-based and machine-learning-based models performs complementary extraction, and network and computing resources are reasonably scheduled to acquire training text and train the information extraction model, thereby overcoming the defects of traditional information extraction methods.

Description

Implementation method of information extraction cloud platform
Technical Field
The application relates to the field of natural language processing, in particular to an implementation method of an information extraction cloud platform.
Background
With the development of the internet and the continuous growth of online information, more and more information can be retrieved through a search engine. Search results are massive in volume, diverse in form, and comprehensive in coverage; on the one hand this raises the chance that users find results, while on the other hand it makes it difficult for them to locate the required information quickly and accurately. Rapidly and accurately obtaining useful information from massive data is an urgent need of the information age, and this need has made information extraction a research hotspot in natural language processing. Information extraction is the technique of extracting structured information from semi-structured or unstructured text. The focus of this patent is extracting relation triples, consisting of entity 1, the relation, and entity 2, from plain-text sentences. Information extraction can semi-automatically distill key information from massive data to construct a knowledge graph and help people solve problems with big data; its efficiency and convenience have made it a research hotspot that researchers are eager to advance.
There are two main methods for current relationship and entity extraction:
1. Template-based methods extract information mainly according to rules set by engineers that can identify structured information, such as grammar rules. This approach is labor-intensive to implement, cannot account for the full range of expressions, has poor generality, and usually produces only sub-optimal results. However, when extracting within a particular sub-domain, it can effectively complement the extraction results of machine-learning-based methods.
2. Machine-learning-based methods use a large amount of existing labeled data for deep learning training and perform relation and entity extraction with the resulting model. For example, entities can be extracted with a dilated convolutional neural network model, and the relation extraction model can be trained with a bidirectional GRU plus an attention network.
Although information extraction has made important breakthroughs in recent years with the rapid development of deep learning, the following problems remain. First, unlike English text information extraction, implementing a Chinese information extraction model is more complex and difficult. Compared with English, Chinese sentences have complex syntactic structure, complicated word-sense disambiguation, and flexible, varied semantic expression, which add many uncertain factors to training. Labeled Chinese training data is also scarce, which raises the probability that a deep learning model underfits. In the absence of training data, automatically obtaining large amounts of labeled Chinese text is of great importance to the field of information extraction. Secondly, although many researchers have separately proposed methods to improve Chinese entity extraction and relation extraction, few have combined these processes into an automatic tool that is convenient for others to call. In the fast-paced information age, automatically scheduling computing resources to quickly train and obtain an information extraction model is a pain point to be solved urgently. Finally, conventional information extraction methods rely primarily on a specific data set and relation list, and the trained method applies only to that data set. Once the relation list needs to be expanded, it is difficult to train a method applicable to the new list from the existing data. For a specific domain, neither a purely template-based method nor a purely machine-learning-based method can actually solve the problem in real scenarios; a more practical hybrid extraction model is also an engineering strategy in the field of information extraction.
Therefore, a cloud platform implementation method is still needed that fully utilizes network resources, automatically extracts labeled Chinese text from the network, and uses it to train a model, so as to guarantee both the precision and breadth of information extraction and to integrate the whole information extraction process macroscopically.
The wave of deep learning in recent years is worldwide, and with the support of massive resources and strong computing power it has affected every field of natural language processing. Knowledge graphs organized as entities and relations are widely used in search engines and question-answering systems. Because of the huge scale of such knowledge and the high cost of manual labeling, adding new knowledge by hand is almost impossible. To add richer world knowledge to the knowledge graph as timely and accurately as possible, researchers strive to explore methods for efficiently and automatically acquiring world knowledge, namely information extraction technology. The task has become a research hotspot in recent years, but conventional information extraction methods and their tools all share the problems indicated above.
Disclosure of Invention
In order to achieve the above object, the present application designs a complete model construction flow for information extraction in the vertical domain, comprising remote supervision training data acquisition, entity recognition, data annotation, a remote supervision relation extraction algorithm, and a rule-based relation extraction algorithm, and on this basis realizes the implementation method of an information extraction cloud platform.
The method is divided into three steps.
Step one: data acquisition. The specific process is as follows: first, the user inputs a selected domain and an initial relation set, from which a knowledge base comprising the entities and relations in the data is acquired; then, a trained remote supervision acquisition method obtains a text library through remote supervision; finally, a named entity recognition method performs data annotation using the knowledge base;
step two: the relation extraction method is designed, the relation extraction method of the sentence level attention is characterized in that the sentence level attention relation extraction method uses word vectors as the representation of each word in a sentence to represent an input sequence by a method based on a bidirectional gating circulation unit, and the method can be divided into five parts: an input layer, namely an input plain text sentence; a low-dimensional embedding layer that maps each word in the input sentence to a low-dimensional vector, for example using a pre-trained low-dimensional word vector; the high-dimensional embedding layer obtains high-dimensional embedding from low-dimensional embedding through a two-way gating circulating unit; the weight vectors generated by the attention layer of the sentence level at each time step are combined into feature vectors of the sentence level in a mode of multiplying the weight vectors by the high-dimensional embedding of the word level; the finally obtained sentence-level feature vector is utilized to complete softmax relation classification;
step three: establish the relation extraction model and compute and output the results. The relation extraction model here is a rule-based information extraction method that depends on dependency syntactic analysis; it extracts triples from the results of dependency parsing and thereby supplements the deep-learning-based method of step two, which can only extract a limited relation list. Using the two extraction methods together, relation triples can be extracted more accurately and comprehensively.
The training process of the trained remote supervision acquisition method comprises the following steps:
First, according to the domain selected by the user and the given initial relation set, other relations of the vertical domain and the corresponding entity pairs under each relation are obtained from an existing Chinese knowledge graph through a data crawling script. The crawling script is implemented with the open-source newspaper3k library on GitHub and the open-source BeautifulSoup library: the BeautifulSoup library cyclically traverses the triple information corresponding to each relation and continuously constructs search URLs for the corresponding entity pairs to obtain the corresponding news, and then the web-page text extraction function of the newspaper3k library extracts the related sentences as data for the given relation. In this way, sentences from the internet that involve both entities of a triple are obtained as data for the corresponding relation.
The named entity identification method comprises the following steps:
Sequence features are trained with a dilated convolutional neural network, and sequence labeling is completed with a conditional random field method. The structure mainly comprises a dilated convolutional neural network module that trains the input sequence features using the character vector of each character, and a conditional random field module that uses the sequence features output by the dilated convolutional neural network module; the conditional random field module finally completes the sequence labeling, i.e. named entity recognition;
the conditional random field module is a linear-chain conditional random field model that uses the Viterbi algorithm to compute the state transition scores. With input sequence x = [x_1, ..., x_T] and output label sequence y = [y_1, ..., y_T], the sequence labeling objective is:
P(y | x) = (1 / Z_x) * Π_t ψ(y_t, F(x)_t) * φ(y_{t-1}, y_t)
where ψ(y_t, F(x)_t) is a local factor, φ(y_{t-1}, y_t) is a pairwise factor scoring consecutive labels, Z_x is the partition function, and F(x) denotes the features of the input sequence x. To avoid overfitting, the pairwise factor φ is independent of both the time step t and the input sequence x, and the prediction of the conditional random field is completed by a global search using the Viterbi algorithm;
the dilated convolutional neural network performs the feature representation for each element x_t of the input sequence x and outputs c_t:
c_t = W_c [x_{t-rδ} ⊕ ... ⊕ x_t ⊕ ... ⊕ x_{t+rδ}]
where ⊕ denotes the full concatenation of vectors, W_c is the convolution kernel, i.e. the convolution matrix applied on a sliding window of width r in the input sequence, and the characteristic feature of the dilated convolutional neural network is the dilation width δ imposed on the sliding window.
The specific implementation of the relation extraction method is as follows:
a bidirectional gated recurrent unit and a word-level attention mechanism are used to obtain the representation of the input sequence. The input sequence x first obtains its sequence representation through gated recurrent units run in the forward and backward directions; that is, for position t of the input sequence, the method obtains the sequence vector represented by the bidirectional gated recurrent unit through the following formulas:
r_t = σ(W_r x_t + U_r h_{t-1})
u_t = σ(W_u x_t + U_u h_{t-1})
h̃_t = tanh(W_c x_t + U_c (r_t ⊙ h_{t-1}))
h_t = u_t ⊙ h_{t-1} + (1 - u_t) ⊙ h̃_t
where σ denotes the logistic sigmoid function, x_t is the input, and h_{t-1} is the previous hidden state. h_t, r_t, u_t denote the hidden state, the reset gate state, and the update gate state at position t respectively; W_r, W_u, W_c, U_r, U_u, U_c are training parameters of the gated recurrent unit, and ⊙ denotes element-wise multiplication. The vector at position t obtained by the bidirectional gated recurrent unit is h_t, the full concatenation of the forward hidden state vector trained front to back and the backward hidden state vector trained back to front. After obtaining the combined matrix H = [h_1, ..., h_T], sentence-level attention is used to classify the relation; the sentence-level attention mechanism is as follows:
M = tanh(H)
α = softmax(w^T M)
r = Hα^T
the attention mechanism was originally proposed by deep for image classification, which allows "the neural network to pay more attention to relevant parts of the input and less attention to irrelevant parts when performing predictive tasks; whereas in the natural language processing direction, attention mechanisms were originally used in the field of machine translation. As shown in fig. 1, this is a schematic diagram of a general attention mechanism. The core of the attention mechanism is to send a sequence I.e., keys map to the attention weights a of d_k dimension. Where K may be a word vector or a word vector of text. In most cases, another input element q called a Query (Query) is used as a reference in calculating the attention profile. If a query is defined, the attention mechanism will act on input elements that are deemed relevant to the task according to the query; if the query is not defined, the attention mechanism considers the scope as an input element related to the task itself.
where w is a weight vector, α is the attention vector, and r is the sentence feature vector used by the relation extraction method in the final classification training.
The triples are extracted from the results of dependency syntactic analysis as follows: triple extraction rules are set based on the attributive (head-modifier) relation, the post-positioned attributive relation, and the subject-verb-object relation.
The technical scheme of the application is designed and developed for the problems to be solved in natural language processing, relation extraction, and entity recognition. For the scarcity of labeled data, remote supervision is adopted: a large amount of training data containing entities is obtained from web-scale data and further screened with remote supervision noise reduction to guarantee its quality. For the common demand for information extraction across different domains, an information extraction cloud platform method is proposed and realized that opens up every link of remote-supervision-based information extraction, uses a hybrid of template-based and machine-learning-based models for complementary extraction, and reasonably schedules network and computing resources to acquire training text and train the information extraction model. A user can thus conveniently complete information extraction for a specific domain and enrich that domain's knowledge graph to support the various applications built on it.
Drawings
FIG. 1 attention mechanism;
FIG. 2 is a system flow diagram;
FIG. 3 is a data acquisition flow chart;
FIG. 4 request address of Sina news;
FIG. 5 request address of Baidu news;
FIG. 6 is a diagram of a named entity recognition method overall architecture;
FIG. 7 is a diagram of an overall architecture of a relationship extraction method;
FIG. 8 dependency syntax types;
Detailed Description
The following is a preferred embodiment of the present application and a technical solution of the present application is further described with reference to the accompanying drawings, but the present application is not limited to this embodiment.
In order to achieve the above object, the present application designs a complete model construction flow for information extraction in the vertical domain: remote supervision training data acquisition, entity identification, data annotation, remote supervision relation extraction algorithm and rule-based relation extraction algorithm. The information extraction system flow is shown in fig. 2.
Remote supervision is used to acquire data, annotate it, and recognize named entities, which to a certain extent solves the lack of labels in the named entity task and the insufficiency of relational data sets in the relation extraction task. Completing relation triple extraction with a deep-learning-based relation extraction algorithm accurately obtains the triples matching the existing relation list; the rule-based extraction method, in turn, alleviates to a certain extent the error propagation present in pipelines that chain the entity recognition and relation extraction models, and improves extraction accuracy.
Remote supervision training data acquisition:
First, according to the domain selected by the user and a given initial relation set, we acquire the richer relations of the vertical domain and the entity pairs corresponding to each relation from a Chinese knowledge graph. The data acquisition flow is shown in fig. 3.
The person-relation domain is selected as the case domain; it comprises entities such as persons, places, schools, and film and television works, and relations such as nationality, height, and birth date.
In order to crawl data of sufficient quantity and good quality, a detailed data crawling script is designed to guarantee the reliability of the crawled data. For the implementation, the system selects the open-source newspaper3k library on GitHub and the open-source BeautifulSoup library, and builds a complete crawler from constructing the news search request to crawling the corresponding news sentences.
The first task is to construct the access requests for Baidu news and Sina news. As can be seen from the search bar of the news website, the source address of Sina news is "http://search.sina.com.cn", and access requests with the two entities as keywords can be constructed as the addresses shown in FIG. 4 and FIG. 5:
news may then be searched using the entity pairs in the triplet as keywords.
The BeautifulSoup library is used to cyclically traverse the triple information corresponding to each relation and to continuously construct the search URL for the corresponding entity pair, obtaining the corresponding news. Then the web-page text extraction function of the newspaper3k library extracts the related sentences as data for the given relation, thereby completing the goal of acquiring sentences from the internet that involve both entities as data for the corresponding relation.
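As a sketch of this acquisition step (the function names, query format, and example sentences are illustrative assumptions; the actual platform drives newspaper3k and BeautifulSoup against live news pages), the URL construction and the remote-supervision sentence labeling heuristic might look like:

```python
import re
from urllib.parse import quote

def build_search_url(entity1, entity2):
    """Construct a Sina news search request with the entity pair as keywords.

    The query-string format is a simplified assumption; the real requests
    are built from the site's search bar as shown in FIG. 4 and FIG. 5."""
    query = quote(f"{entity1} {entity2}")
    return f"http://search.sina.com.cn/?q={query}&c=news"

def label_sentences(article_text, entity1, entity2, relation):
    """Remote supervision heuristic: every sentence of a crawled article
    that mentions both entities is taken as training data for `relation`."""
    # Split on common Chinese and Western sentence-final punctuation.
    sentences = [s for s in re.split(r"[。！？.!?]", article_text) if s.strip()]
    return [(s.strip(), entity1, entity2, relation)
            for s in sentences
            if entity1 in s and entity2 in s]

url = build_search_url("刘德华", "香港")
samples = label_sentences("刘德华出生于香港。他是一名演员。", "刘德华", "香港", "出生地")
```

Only the first sentence mentions both entities, so only it is labeled with the relation; the second sentence is discarded, which is exactly the filtering the remote supervision step performs.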
In order to extract as many entities as possible from plain-text data, a named entity recognition method is adopted: an information extraction method that performs sequence labeling with a conditional random field on top of a dilated convolutional neural network.
The core idea is similar to the traditional approach of recognizing named entities with a long short-term memory network. Traditional approaches mainly use a long short-term memory network (Long Short Term Memory, LSTM) or gated recurrent unit (Gated Recurrent Unit, GRU) to turn the text into feature information, i.e. each character or word is represented by a vector of the same dimension. The transition probabilities of each character or word state are then computed with a hidden Markov model or a model such as a conditional random field to label each character or word. This procedure is common; its main problem is that an LSTM or GRU is a sequence model, which is not as efficient as a convolutional neural network in resource utilization and multi-GPU performance, although its effect is generally better than that of a plain convolutional neural network.
Here the dilated convolutional neural network replaces the long short-term memory network or gated recurrent unit to complete the feature extraction; its advantage is that it achieves an effect similar to or even better than the LSTM while optimizing resource utilization and speed. The overall architecture of the named entity recognition method is shown in fig. 6.
As shown in fig. 6, the method mainly trains the sequence features with a dilation-based convolutional neural network and then completes the sequence labeling with a conditional random field method. The structure mainly comprises a dilated convolutional neural network module that trains the input sequence features using the character vector of each character, and a conditional random field module trained on the sequence features output by the dilated module; the conditional random field module finally completes the sequence labeling, i.e. named entity recognition.
The key part of the whole method is the dilated convolutional neural network at the front; the latter part is only a linear-chain conditional random field model that uses the Viterbi algorithm to compute the state transition probabilities.
For the conditional random field part, the input sequence is x = [x_1, ..., x_T] and the output labeling sequence for each element is y = [y_1, ..., y_T]. Given the features F(x) of the input sequence x produced by the dilated convolutional neural network, the sequence labeling target designed herein can be represented by formula (1):
y* = argmax_y P(y | F(x)) (1)
On this basis, the sequence labeling using the conditional random field designed by the application can be represented by formula (2):
P(y | x) = (1 / Z_x) * Π_t ψ(y_t, F(x)_t) * φ(y_{t-1}, y_t) (2)
where ψ(y_t, F(x)_t) is a local factor, φ(y_{t-1}, y_t) is a pairwise factor scoring consecutive labels, Z_x is the partition function, and F(x) is the feature representation of the input sequence x. To avoid overfitting, the pairwise factor φ is independent of both the time step t and the input sequence x, and the prediction of the conditional random field is completed by a global search using the Viterbi algorithm.
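The global search over label sequences is standard Viterbi decoding. As an illustrative sketch (the scores below are toy values in log space, not taken from the patent), a linear-chain decoder over local emission scores from F(x) and a position-independent transition matrix can be written as:

```python
def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF with log-space scores.

    emissions[t][j]: local factor score of label j at step t (derived from
    the dilated CNN features F(x)); transitions[i][j]: pairwise factor
    score for moving from label i to label j, shared by all time steps."""
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of a path ending in each label
    back = []                   # back[t][j]: best predecessor of j at step t+1
    for emit in emissions[1:]:
        prev, nxt = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            prev.append(best_i)
            nxt.append(score[best_i] + transitions[best_i][j] + emit[j])
        back.append(prev)
        score = nxt
    # Backtrack from the best final label to recover the full path.
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return path, max(score)

emissions = [[2, 1], [1, 3], [3, 1]]
transitions = [[1, 0], [0, 1]]
path, best = viterbi(emissions, transitions)  # path [0, 0, 0], score 8
```

Because the transition matrix does not depend on t or on x, it matches the position-independent pairwise factor φ of formula (2).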
The present application uses the dilated convolutional neural network for the sequence feature representation. Using the dilated convolutional neural network to represent each element x_t of the input sequence x = [x_1, ..., x_T] and output c_t can be represented by formula (3):
c_t = W_c [x_{t-rδ} ⊕ ... ⊕ x_t ⊕ ... ⊕ x_{t+rδ}] (3)
Here, for convenience of representation, ⊕ denotes the full concatenation of vectors. W_c is the convolution kernel in the traditional sense, i.e. the convolution matrix applied on a sliding window of width r in the input sequence. The characteristic feature of the dilated convolutional neural network is the dilation width δ imposed on the sliding window. Such a dilation width allows the convolutional network to acquire a wider range of input sequence features, in an attempt to match the input features captured by a recurrent neural network.
Evidently the value of the dilation width δ significantly affects the effect of the convolutional neural network. When δ is 1, the network is an ordinary convolutional neural network; the larger the width, the more input features are acquired and, in theory, the better the recognition.
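A minimal sketch of formula (3)'s sliding window, assuming scalar inputs and a 1-D kernel in place of concatenated word vectors and the matrix W_c, shows how the dilation width δ widens the window without adding parameters:

```python
def dilated_conv_step(x, kernel, delta, t):
    """Compute c_t over the window x[t-r*delta .. t+r*delta] sampled every
    `delta` positions, with zero padding outside the sequence. Scalar
    inputs and a 1-D kernel keep the sketch readable; the real layer
    concatenates word vectors and multiplies by the matrix W_c."""
    r = len(kernel) // 2  # half window width
    total = 0.0
    for k in range(-r, r + 1):
        idx = t + k * delta
        value = x[idx] if 0 <= idx < len(x) else 0.0  # zero padding
        total += kernel[k + r] * value
    return total

x = [1.0, 2.0, 3.0, 4.0, 10.0]
ones = [1.0, 1.0, 1.0]
# delta = 1: ordinary convolution, the window at t=2 is {x[1], x[2], x[3]}
c_plain = dilated_conv_step(x, ones, 1, 2)    # 2 + 3 + 4 = 9
# delta = 2: the same kernel now spans {x[0], x[2], x[4]}
c_dilated = dilated_conv_step(x, ones, 2, 2)  # 1 + 3 + 10 = 14

# Stacking layers with dilations 1, 2, 4 grows the receptive field of a
# width-3 kernel to 1 + 2 * (1 + 2 + 4) = 15 positions.
receptive_field = 1 + 2 * sum(d for d in (1, 2, 4))
```

With δ = 1 the step reduces to an ordinary convolution; increasing δ widens the receptive field with the same kernel size, which is the property the paragraph above describes.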
The relation extraction method comprises the following steps:
at present, a relation extraction method of sentence-level attention is adopted. The structure of the method for extracting the sentence-level attention relationship using word vectors as the representation of each word in the sentence is replaced with a method based on a bi-directional gating cyclic unit to represent the input sequence is mainly shown in fig. 7.
The relation extraction method can be specifically divided into 5 parts: 1. the input layer, i.e. the input plain-text sentence; 2. the low-dimensional embedding layer, which maps each word of the input sentence to a low-dimensional vector, e.g. a pre-trained low-dimensional word vector; 3. the high-dimensional embedding layer, which obtains high-dimensional embeddings from the low-dimensional ones through a bidirectional gated recurrent unit; 4. the sentence-level attention layer, whose weight vector generated at each time step is multiplied by the word-level high-dimensional embeddings and combined into a sentence-level feature vector; 5. the softmax relation classification completed with the final sentence-level feature vector.
Specifically, to obtain the representation of the input sequence with the bidirectional gated recurrent unit and the word-level attention mechanism, the input sequence x = [x_1, ..., x_T] first obtains its sequence representation through gated recurrent units run in the forward and backward directions. That is, for position t of the input sequence, the method obtains the sequence vector represented by the bidirectional gated recurrent unit through formulas (4), (5), (6) and (7).
r_t = σ(W_r x_t + U_r h_{t-1}) (4)
u_t = σ(W_u x_t + U_u h_{t-1}) (5)
h̃_t = tanh(W_c x_t + U_c (r_t ⊙ h_{t-1})) (6)
h_t = u_t ⊙ h_{t-1} + (1 - u_t) ⊙ h̃_t (7)
where σ denotes the logistic sigmoid function, x_t is the input, and h_{t-1} is the previous hidden state. h_t, r_t, u_t denote the hidden state, the reset gate state, and the update gate state at position t respectively; W_r, W_u, W_c, U_r, U_u, U_c are training parameters of the gated recurrent unit, and ⊙ denotes element-wise multiplication. The vector at position t obtained by the bidirectional gated recurrent unit is h_t, the full concatenation of the forward hidden state vector trained front to back and the backward hidden state vector trained back to front. After obtaining the matrix H = [h_1, ..., h_T] given by the bidirectional gated recurrent unit, the method uses sentence-level attention to classify the relation; the sentence-level attention mechanism is shown in formulas (8), (9) and (10).
M = tanh(H) (8)
α = softmax(w^T M) (9)
r = Hα^T (10)
where w is a weight vector, α is the attention vector, and r is the sentence feature vector used by the relation extraction method in the final classification training. The final softmax layer then uses the sentence vector given by sentence-level attention and the one-hot-coded relation list to train the final relation classification.
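The computation of formulas (4) through (10) and the final softmax classification can be sketched as follows. This is a simplified illustration with scalar hidden states, the forward/backward concatenation collapsed to a sum, and arbitrary assumed parameter values rather than trained weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step, formulas (4)-(7), with scalar states for readability."""
    r = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev)                # reset gate (4)
    u = sigmoid(p["W_u"] * x_t + p["U_u"] * h_prev)                # update gate (5)
    h_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * (r * h_prev))  # candidate (6)
    return u * h_prev + (1.0 - u) * h_tilde                        # new state (7)

def sentence_vector(xs, p, w):
    """Bidirectional GRU states h_t, then sentence-level attention,
    formulas (8)-(10): M = tanh(H), alpha = softmax(w M), r = H alpha."""
    fwd, h = [], 0.0
    for x_t in xs:                  # forward pass
        h = gru_step(x_t, h, p)
        fwd.append(h)
    bwd, h = [], 0.0
    for x_t in reversed(xs):        # backward pass
        h = gru_step(x_t, h, p)
        bwd.append(h)
    bwd.reverse()
    # Concatenation is collapsed to a sum here to keep states scalar.
    H = [f + b for f, b in zip(fwd, bwd)]
    scores = [w * math.tanh(h_t) for h_t in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alpha = [e / sum(exps) for e in exps]  # attention weights, sum to 1
    return sum(a * h_t for a, h_t in zip(alpha, H)), alpha

def classify(r_vec, weights, bias):
    """Final softmax layer: one logit per relation in the relation list."""
    logits = [wt * r_vec + b for wt, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

params = {"W_r": 0.5, "U_r": 0.1, "W_u": 0.5, "U_u": 0.1, "W_c": 1.0, "U_c": 0.2}
r_vec, alpha = sentence_vector([1.0, -0.5, 2.0], params, w=1.5)
probs = classify(r_vec, [1.0, -1.0, 0.5], [0.0, 0.0, 0.1])
```

The attention weights and the relation probabilities each sum to 1, matching the softmax normalizations of formula (9) and of the final classification layer.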
Extracting algorithm based on rule relation:
Rule-based information extraction methods rely primarily on dependency syntactic analysis. Dependency parsing (Dependency Parsing, DP) obtains the components of a sentence and the relations between them by analyzing grammatical information that can be extracted according to grammar rules, such as the subject-verb and attributive-complement structures that may be present in the sentence.
Finally, ten triple extraction rules, covering dependency relations such as the attribute-head relation, the post-positioned-attributive relation and the subject-predicate relation, are defined in the rule-based information extraction method, and the rule templates are strictly tested and tuned, so that the rule-based method can also produce fairly satisfactory triple extraction results. In this way, a rule-based information extraction method applicable to most relation extraction texts is obtained. Its drawback is that the rules are hand-crafted and can hardly cover all cases, and the relations in the extracted triples may be inaccurate, so the method serves only as a supplement to the deep learning information extraction method.
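One such rule can be sketched as follows; the patent does not disclose its full rule set, so this subject-verb-object rule, the arc format `(head_index, dependent_index, label)` and the labels `"SBV"`/`"VOB"` are illustrative assumptions:

```python
def extract_svo(tokens, arcs):
    """Extract (subject, predicate, object) triples from dependency arcs.
    tokens: list of words; arcs: list of (head, dep, label) index tuples."""
    triples = []
    for head, subj, label in arcs:
        if label != "SBV":                         # subject-verb arc
            continue
        for head2, obj, label2 in arcs:
            if label2 == "VOB" and head2 == head:  # object of the same verb
                triples.append((tokens[subj], tokens[head], tokens[obj]))
    return triples
```

For a parsed sentence such as tokens `["Alice", "founded", "Acme"]` with arcs `[(1, 0, "SBV"), (1, 2, "VOB")]`, the rule yields one triple.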

Claims (2)

1. An implementation method of an information extraction cloud platform, characterized by comprising the following steps:
step one: data acquisition, with the following specific process: first, the user inputs a selected domain and an initial relation set, and a knowledge base comprising the entities and relations in the data is obtained from them; then, a trained remote-supervision acquisition method is adopted to obtain a text corpus through remote supervision; finally, a named entity recognition method is adopted, and the knowledge base is used to annotate the data;
step two: designing the relation extraction method; the sentence-level-attention relation extraction method uses word vectors as the representation of each word in a sentence and represents the input sequence with a method based on bidirectional gated recurrent units, and it can be divided into five parts: an input layer, i.e. the input plain-text sentence; a low-dimensional embedding layer, i.e. mapping each word of the input sentence to a low-dimensional vector; a high-dimensional embedding layer, which obtains a high-dimensional embedding from the low-dimensional embedding through a bidirectional gated recurrent unit; a sentence-level attention layer, whose weight vectors generated at each time step are combined into a sentence-level feature vector by multiplying them with the word-level high-dimensional embeddings; and a softmax layer, which completes the softmax relation classification using the finally obtained sentence-level feature vector;
step three: establishing the relation extraction model and computing and outputting the result; the relation extraction model is a rule-based information extraction method that relies on dependency syntax analysis and extracts triples from its result; the rule-based relation extraction algorithm supplements the second, deep-learning-based method, realizing the combined use of the two extraction methods; the training process of the trained remote-supervision acquisition method comprises: first, according to the domain selected by the user and the given initial relation set, the other relations of the vertical domain and the corresponding entity pairs under those relations are obtained from a well-known Chinese knowledge graph by a data crawling script; the crawling script is implemented with the open-source Newspaper3k and BeautifulSoup libraries on GitHub: it cyclically traverses the triple information corresponding to each relation, continuously constructs search URLs for the corresponding entity pairs to obtain the corresponding news, and then uses the web-page text extraction function of the Newspaper3k library to extract the relevant sentences as data for the given relation, so that sentences from the Internet that mention both entities of a triple are obtained as training data for the corresponding relation;
training sequence features with a dilated convolutional neural network and completing sequence labeling with a conditional random field method; the structure mainly comprises: a dilated convolutional neural network module that trains the sequence features from the word vector of each character, and a conditional random field module that uses the sequence features output by the dilated convolutional neural network module; the conditional random field module finally completes the sequence labeling, i.e. named entity recognition;
the conditional random field module is a linear-chain conditional random field model that computes the state transition probabilities with the Viterbi algorithm; the sequence labeling objective is:
p(y|x) = (1/Z_x) ∏_t ψ_t(y_t, F(x)) φ(y_{t-1}, y_t)
wherein ψ_t is a local factor, φ is a pair factor that scores consecutive labels, Z_x is the partition function, and F(x) are the features of the input sequence x; to avoid overfitting, the conditional random field method makes the φ factor independent of the time step t and the input sequence x; the prediction of the conditional random field is completed by a global search using the Viterbi algorithm, where x is the input sequence and y is the output sequence;
for each element x_t of the input sequence x, a feature representation is produced with the dilated convolutional neural network, outputting c_t:
c_t = W_c [x_{t-rδ} ⊕ ... ⊕ x_t ⊕ ... ⊕ x_{t+rδ}]
wherein ⊕ denotes vector concatenation and W_c is the convolution kernel; i.e., for a sliding window of width r over the input sequence, the distinguishing characteristic of the dilated convolutional neural network is the dilation width δ applied to the sliding window;
the rule-based relation extraction algorithm relies on the dependency parsing technique, obtaining the relationships among all the components of a sentence by analyzing grammatical information that can be extracted from the sentence according to grammar rules, yielding 15 dependency grammar relations; finally, ten triple extraction rules are defined in the rule-based information extraction method, and the rule templates are strictly tested and tuned;
the representation part of the input sequence is obtained with the bidirectional gated recurrent unit and the word-level attention mechanism; the input sequence x first obtains a sequence representation through the gated recurrent unit in the forward and backward directions, i.e., for position t of the input sequence, the method obtains the sequence vector represented by the bidirectional gated recurrent unit through the following formulas:
r_t = σ(W_r x_t + U_r h_{t-1})
u_t = σ(W_u x_t + U_u h_{t-1})
c̃_t = tanh(W_c x_t + U_c (r_t ⊙ h_{t-1}))
h_t = u_t ⊙ h_{t-1} + (1 - u_t) ⊙ c̃_t
wherein σ represents the logistic Sigmoid function, x is the input, h_{t-1} is the previous hidden state, and h_t, r_t, u_t respectively represent the hidden state, the reset-gate state and the forget-gate state at position t; W_r, W_u, W_c, U_r, U_u, U_c are training parameters of the gated recurrent unit, and ⊙ denotes element-wise multiplication; the vector at position t given by the bidirectional gated recurrent unit can be expressed as the vector h_t obtained by concatenating the forward hidden-state vector trained by the front-to-back gated recurrent unit and the backward hidden-state vector trained by the back-to-front gated recurrent unit; after the combined vector H = [h_1, ..., h_T] is obtained, sentence-level attention is used for relation classification, and the sentence-level attention mechanism is as follows:
M = tanh(H)
α = softmax(w^T M)
r = Hα^T
wherein w is a weight vector, α is the attention vector, and r is the sentence feature vector that the relation extraction method uses in the final classification training.
2. The implementation method of the information extraction cloud platform according to claim 1, characterized in that the specific implementation of extracting triples from the result of dependency syntax analysis in the relation extraction method is as follows:
triple extraction rules are defined based on the attribute-head relation, the post-positioned-attributive relation and the subject-verb-object relation.
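The Viterbi decoding used by the conditional random field module of claim 1 can be sketched generically as below; the score shapes (per-position emission scores from the dilated CNN, and a position-independent transition matrix as stated in the claim) are assumptions, not the patent's exact interface:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Global search for the best label sequence in a linear-chain CRF.
    emissions: (T, K) local-factor scores; transitions: (K, K) pair-factor
    scores, independent of the time step and the input sequence."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in label i at t-1, then label j at t
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):      # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```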
CN202010100115.3A 2020-02-18 2020-02-18 Implementation method of information extraction cloud platform Active CN111339407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010100115.3A CN111339407B (en) 2020-02-18 2020-02-18 Implementation method of information extraction cloud platform


Publications (2)

Publication Number Publication Date
CN111339407A CN111339407A (en) 2020-06-26
CN111339407B true CN111339407B (en) 2023-12-05

Family

ID=71183592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010100115.3A Active CN111339407B (en) 2020-02-18 2020-02-18 Implementation method of information extraction cloud platform

Country Status (1)

Country Link
CN (1) CN111339407B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737497B (en) * 2020-06-30 2021-07-20 大连理工大学 Weak supervision relation extraction method based on multi-source semantic representation fusion
CN111984790B (en) * 2020-08-26 2023-07-25 南京柯基数据科技有限公司 Entity relation extraction method
CN112906368B (en) * 2021-02-19 2022-09-02 北京百度网讯科技有限公司 Industry text increment method, related device and computer program product
CN113360642A (en) * 2021-05-25 2021-09-07 科沃斯商用机器人有限公司 Text data processing method and device, storage medium and electronic equipment

Citations (3)

Publication number Priority date Publication date Assignee Title
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
CN110377903A (en) * 2019-06-24 2019-10-25 浙江大学 A kind of Sentence-level entity and relationship combine abstracting method
CN110502749A (en) * 2019-08-02 2019-11-26 中国电子科技集团公司第二十八研究所 A kind of text Relation extraction method based on the double-deck attention mechanism Yu two-way GRU

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN108604227B (en) * 2016-01-26 2023-10-24 皇家飞利浦有限公司 System and method for neural clinical paraphrasing generation




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant